Malware Detection: A Framework For Reverse Engineered Android Applications Through Machine Learning Algorithms
ABSTRACT Today, Android is one of the most used operating systems in smartphone technology. This is the main reason Android has become the favorite target for hackers and attackers. Malicious code is being embedded in Android applications in such a sophisticated manner that detecting and identifying an application as malware has become the toughest job for security providers. In terms of ingenuity and cognition, Android malware has progressed to the point where it is more impervious to conventional detection techniques. Approaches based on machine learning have emerged as a much more effective way to tackle the intricacy and originality of developing Android threats. They function by first identifying current patterns of malware activity and then using this information to distinguish between identified threats and unidentified threats with unknown behavior. This research paper uses reverse engineered Android applications' features and machine learning algorithms to find vulnerabilities present in smartphone applications. Our contribution is twofold. First, we propose a model that incorporates more innovative static feature sets, with the largest current datasets of malware samples, than conventional methods. Second, we use ensemble learning with machine learning algorithms, e.g., AdaBoost and Support Vector Machine (SVM), to improve our model's performance. Our experimental results and findings exhibit 96.24% accuracy in detecting extracted malware from Android applications, with a 0.3 False Positive Rate (FPR). The proposed model incorporates commonly ignored detrimental features such as permissions, intents, Application Programming Interface (API) calls, and so on, trained by feeding a solitary arbitrary feature, extracted by reverse engineering, as an input to the machine.
INDEX TERMS Android applications, benign, feature extraction, malware detection, reverse engineering,
machine learning.
I. INTRODUCTION
To a large degree, it is guaranteed that mobile devices are an integral part of most people's daily lives. Furthermore, Android now controls the vast majority of mobile devices, with Android devices accounting for an average of 80% of the global market share over the past years [1]. With the ongoing expansion of Android to a growing range of smartphones and consumers around the world, malware targeting Android devices has increased as well. Since it is an open-source operating system, the level of danger it poses rises, with malware authors and programmers implementing unwanted permissions, features, and application components in Android apps. The option to expand its capabilities with third-party software is also appealing, but this capability comes with the risk of malicious attacks. As the number of smartphone apps increases, so does the security problem of unnecessary access to different personal resources. As a result, the applications are becoming more insecure and are involved in stealing personal information, SMS fraud, ransomware, etc.
In contrast to static analysis methods such as a manual assessment of AndroidManifest.xml, source files and Dalvik
II. RELATED WORKS
Linux (the Android core) keeps key aspects of the security infrastructure of the operating system. Android displays to the administrator a list of the features sought when an application installs or reinstalls an update, and the program installs itself on the device after access is granted. Figure 3 shows the integrated core parts of the Android architecture. It comprises applications at the top layer and also includes an application framework, libraries or a runtime layer, and a Linux kernel. These levels are further divided into their components, which make up an Android application. The Linux kernel is the key part of Android that provides its OS functionality to phones, and the Dalvik Virtual Machine (DVM) is used to manage a mobile device. Application is the Android architecture's highest layer. Native and third-party apps such as contacts, email, audio, gallery, clock, sports, and so on are located only in this layer. The framework provides the classes often used to develop Android apps. It also handles the user interface and device infrastructure and provides a common specification for hardware entry. To facilitate the development of Android, the platform libraries include many C/C++ core libraries and Java-based libraries such as SSL, libc, Graphics, SQLite, Webkit, Media, Surface Manager, OpenGL, and others. The taxonomy helps the reader, with a logical algorithmic approach, grasp the core surfaces and functionality of the operating system.
The methods proposed in this related work contribute to key aspects such as selected features for classification and a higher predictive rate for malware detection. Certain research has focused on increasing accuracy, while other work has focused on providing a larger dataset, some has been implemented by employing various feature sets, and many studies have combined all of these to improve detection rate efficiency. In [22], the authors offer a system for detecting Android malware apps to aid in the organization of the Android Market. The proposed framework aims to provide a machine learning-based malware detection system for Android to detect malware apps and improve phone users' safety and privacy. This system monitors different permission-based characteristics and events acquired from Android apps and examines these features employing machine learning classifiers to determine whether the program is goodware or malicious. The paper uses two datasets with collectively 700 malware samples and 160 features. Both datasets achieved approximately 91% accuracy with the Random Forest (RF) algorithm. [23] examines 5,560 malware samples, detecting 94% of the malware with minimal false alarms, where the reasons supplied for each detection disclose key features of the identified malware. Another technique [24] exceeds both static and dynamic methods that rely on system calls in terms of resilience. Researchers demonstrated the consistency of the model in attaining maximum classification performance and better accuracy compared to two state-of-the-art peer methods that represent both static and dynamic methodologies, over nine years, through three interrelated assessments with satisfactory malware samples from different sources. The model continuously achieved 97% F1-measure accuracy for identifying applications or categorizing malware. [25] The authors present a unique Android malware detection approach dubbed Permission-based Malware Detection Systems (PMDS) based on a study of 2,950 samples of benign and malicious Android applications. In PMDS, requested permissions are viewed as behavioral markers, and a machine learning model is built on those indicators to detect new potentially dangerous behavior in unknown apps depending on the mix of rights they require. PMDS identifies more than 92-94% of all heretofore unknown malware, with a false positive rate of 1.52-3.93%. The authors of [26] solely use the ensemble learning method Random Forest, a supervised classifier, on Android malware samples with 42 features. Their objective was to assess Random Forest's accuracy in identifying Android application activity as harmful or benign. Dataset 1 is built on 1,330 malicious APK samples and 407 benign ones seen by the author. This is based on the collection of feature vectors for each application. Based on an ensemble learning approach, Congyi proposes a concept in [27] for recognizing and distinguishing Android malware. To begin, a static analysis of the Android Manifest file in the Android Application Package (APK) is done to extract system characteristics such as permission calls, component calls, and intents. Then, to detect malicious apps, they deploy the XGBoost technique, which is an implementation of ensemble learning. Analyzing more than 6,000 Android apps on the Kaggle platform provided the initial data for this experiment. They tested both benign and malicious apps based on 3 feature sets for a testing set of 2,000 samples and used the remaining data to create a training set of 6,315 samples. Additional approaches include [28], an SVM-based malware detection technique for the Android platform that incorporates both dangerous permission combinations and susceptible API calls as elements in the SVM algorithm. The dataset includes 400 Android applications: 200 benign apps from the official Android market and 200 malicious apps from the Drebin dataset. [29] determines whether the program is dangerous and, if so, categorizes it as part of a malware family. They obtain up to 99.82% accuracy with zero false positives for malware detection at a fraction of the computation power of state-of-the-art methods, but incorporate a minimal feature set. The results of [30] demonstrate that deep learning is adequate for classifying Android malware and that it is much more successful when additional training data is available. A permission-based strategy for identifying malware in Android applications is described in [31], which uses filter feature selection algorithms to pick features and implements machine learning algorithms such as Random Forest, SVM, and J48 to classify applications as malware or benign. This research [32] provides feature selection using a Genetic Algorithm (GA) approach for identifying Android malware. For identifying and analyzing Android malware, three alternative classifier techniques with distinct feature subsets were built and compared using GA.
TABLE 1. Relative techniques analysis on the basis of multiple factors in comparison to the proposed approach (PER: Permissions, STR: String, API: Application Programming Interface, INT: Intents, PKG: Package, APP-C: App Components, SR: Services, RS: Receivers).
it's not a convenient approach, and we prefer our system for physical devices in the future. Jadx carries out the modification and evaluation of source code. The system concentrated on trying to hook the byte-level API calls [40]. For our dataset, features from over 100,000 applications are extracted, containing around 56,000 extracted features. Functions and processes of opcode API features are extracted from the disassembled Smali and Manifest files of an APK file. The Smali file is segmented by method, and the Dalvik opcode frequency for every method is determined by scanning Dalvik bytecodes. To verify invocation of a hazardous API in that form, it is also possible to determine the hazardous frequency of an API invocation for each method during the bytecode search. For string functions, strings are selected without the method of isolation from the entire Smali archives [41].
We will never have a predictable response when the number of features inside a dataset exceeds the number of occurrences. In other terms, when we don't have enough data to train our machine on, generating a structure that could capture the association between the predictive variables and the response variable appears problematic.
The system used in this study also incorporates larger feature sets for classification. Although this problem arises in machine learning quite often, to some extent choosing the type of model for detection or classification can strongly mitigate the high dimensionality of the data being used. Support Vector Machine and AdaBoost can handle it relatively better than other algorithms because of their high-dimensional space/hyperplane partitioning. Another consideration for our datasets was the tool used for extracting the given datasets. Androguard implements parsing and analyzing automation to further break down components of application APKs after decompiling and encourages weighting of the data into binary form, making it easy to use relevant data for classification. It uses certain functionality to get useful features from the manifest files of these Android applications, reducing the acquisition of irrelevant features; a minimal sketch of this extraction step is shown after the feature list below. Although the data in this study works significantly well for evaluation, the datasets will need to be upgraded in terms of forthcoming evolving measures.
Certain other authors have presented many tools and proposals to deal with high-dimensional data, such as [42], [43], introducing multiple methods such as filtering and wrapping to enhance robustness.
The feature set of our model includes:
F1 → Permissions
F2 → API Calls
F3 → Intents
F4 → App Components
F5 → Packages
F6 → Services
F7 → Receivers
TABLE 2. Relative techniques analysis on the basis of features and samples collected in comparison to the proposed approach.
1) PERMISSIONS
A permission is a security feature that limits access to certain information on a smartphone, with the role of preserving sensitive data and functions that might be exploited to harm the user's experience. A unique label is assigned to every permission, which typically denotes a restricted operation. The permissions are further categorized into four parts by Google: normal, dangerous, signature, and SignatureOrSystem. For evaluating Android permissions, researchers take a variety of approaches [44]. Standard (also called secure) levels of security permissions, such as VIBRATE and SET_WALLPAPER, are permissions without risk. The Android package installer does not ask the user to approve these permissions. The dangerous security level will pose warnings to the user before installation and will require the user's consent. The signature and SignatureOrSystem security levels cover the riskiest permissions. Only applications signed with the same certificate as the one used to sign the application declaring the permission are allowed to be granted signature permissions [45]. It also acts as a buffer between the hardware and the rest of the stack. A variety of different C/C++ core libraries, such as libc and SSL, are used in the libraries layer. Dalvik virtual machines and key libraries are part of the Android Runtime. The application framework defines classes for developing Android applications, as well as a standardized structure for hardware control and the management of user experience and app properties. API libraries are used for both proprietary and third-party users [46]. Table 3 shows some dangerous permissions that pose problems to the reverse engineered applications.

2) INTENTS
The message delivered among modules such as activities, content providers, broadcast receivers, and services is known as an Android Intent. It is commonly used along with the startActivity() function to start activities, broadcast receivers, and other things. Individual intent counts are exploited as a continuous feature in categorization. To provide more specificity, we divide the list of intents into further sections, such as intents including the phrase (android.net), which are linked to the network manager, intents including (com.android.vending), used for billing transactions, and intents addressing framework components (com.android), which prove to be harmful elements in these apps. A small sketch of this prefix-based grouping follows.
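As an illustration of that grouping, here is a minimal sketch; the prefix list mirrors the phrases mentioned above, while the input list and the counting scheme are hypothetical simplifications of the feature construction.

```python
# Minimal sketch: group declared intent action strings by the prefixes discussed above.
# The input list and the exact grouping rules are hypothetical simplifications.
from collections import Counter

INTENT_PREFIXES = ["android.net", "com.android.vending", "com.android"]

def group_intents(intent_actions):
    counts = Counter()
    for action in intent_actions:
        # Assign the action to the first (most specific) matching prefix.
        for prefix in INTENT_PREFIXES:
            if action.startswith(prefix):
                counts[prefix] += 1
                break
    return counts

print(group_intents([
    "android.net.conn.CONNECTIVITY_CHANGE",
    "com.android.vending.billing.IN_APP_NOTIFY",
    "com.android.launcher.action.INSTALL_SHORTCUT",
]))  # each of the three prefixes is counted once for this toy input
```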
3) API CALLS
Safe APIs are resources that are only made available by the operating system; GPS, camera, SMS, Bluetooth, and network or data are some examples. To make use of such resources, the application must identify them in its manifest [47]. The cost-sensitive APIs are those that can increase cost through their usage, such as SMS, data or network, and NFC. Each version includes these APIs in the OS-controlled set of protected APIs that require the device user's sole permission. API calls that grant sensitive information or device resources are commonly detected in malicious code. These calls are isolated and compiled in a separate feature set so that they might contribute to identifying harmful activity. Table 4 elaborates dangerous API features.

TABLE 4. Frequently deployed malware-sensitive API calls.

4) APP COMPONENTS
The program that requires access or activity, e.g., a path from point A to point B on a route predicated on a user's location from another application, makes a call to its API, stating the data/functionality demands. The other software includes the data/functionality that the first program requested. For privacy reasons, some API features must be declared and not used in these apps. These components relate to broadcast features present in these applications.

5) PACKAGES, SERVICES AND RECEIVERS
The package manifest has always been found in the package's root and provides information about the package, such as its registered name and sequence number. It also specifies crucial data to convey to the user, such as a consumer name for the program that displays in the User Interface (UI). The file format for packages is .json.
According to a publish-subscribe model, Android apps can transmit and receive messages from the Android system and other Android apps. When a noteworthy event occurs, these broadcasts are sent out. The Android system, for example, sends broadcasts when different system events occur, such as the system booting up or the smartphone charging. Individual apps can sign up to receive certain broadcasts [48]. When a broadcast is sent, the system automatically directs it to applications that have signed up to receive that sort of broadcast. Services, unlike activities, do not have a graphical user interface. They are used to build long-running background processes or a complex communications API that other programs may access. In the manifest file, all services are represented by <service> elements, and they allow the developer to invalidate the structure of the application.

C. CLASSIFICATION
The collection of chosen features in the signature database is separated into training and test data and is used to recognize Android malware apps by traditional machine learning techniques [49]. There are three different learning frameworks: supervised learning, unsupervised learning, and reinforcement learning. The supervised learning method, the focus of this paper, comprises algorithms that learn a model from externally provided instances of known data and known results to produce a theoretical model, so that the learned model predicts feedback about previous occurrences over new data [50]. The deployment of ensemble techniques and strong learning classifiers helps the classification of our binary feature sets, resulting in correctly categorized malware and benign samples. We believe that these classification mechanics produce efficient outputs because of their sorting nature. Fig. 6 explains the process of the learning model.
A comparative algorithm selection for our model is based on AdaBoost, Naive Bayes, Decision Tree, K-Nearest Neighbors, Gaussian NB, Random Forest, and Support Vector Machine, performing a relative review that will give an accurate analysis of the best algorithm for the prediction of our model.

1) ALGORITHM CHARACTERISTICS APPRAISAL
The assessment of the suggested algorithms was carried out using Python. FPR and accuracy are used to assess our comparative algorithm trials [51]. These estimates, derived from the following basic factors, are listed further down:
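The basic factors referred to here are the usual confusion-matrix counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). As a standard formulation (not reproduced verbatim from the paper), the two reported measures can be written as:

```latex
% Standard definitions of the two reported measures, assuming the usual
% confusion-matrix factors TP, TN, FP and FN.
\begin{align}
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\mathrm{FPR} &= \frac{FP}{FP + TN}
\end{align}
```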
gain [63]. These applications are used daily, and if they are involved in unnecessary and third-party access, then there is a special need to apply countermeasures to these applications, as this is going to be a major threat in the future.
The figure also depicts the need to measure these threats and devise countermeasures, or at least present models that provide more encoded procedures for these well-known applications. These apps provide a lot of opportunities, but with an increase in private and intellectual property stored in these apps, certain safeguards need to be proposed.

TABLE 5. Sample datasets.
TABLE 6. Datasets ratio (Training & Testing), MalD (MalDroid), DefenseD (DefenseDroid), GD (Generated Dataset), Pre-Pro (Pre-Processing).
V. EXPERIMENTAL RESULTS
In this section, the results of our experimentation are stated.
To start our experimentation discussion, we will elaborate on the basic criteria for performing our implementation successfully, briefly discuss the data collection and the datasets we obtained, and then discuss the actual contribution.
A. EXPERIMENT SETUP
Our environment is based on Windows 8.1 Pro with an Intel Core i5-2450 CPU running at 2.50 GHz. The installed memory (RAM) of the system is 4.00 GB with a 64-bit Operating System (OS) and an x64-based processor.
For the generated dataset, Androguard 3.3.5 (the latest release at the time) is used for decompiling and feature extraction, with the results deployed in regulated .csv files as binary vectors; a short sketch of this assembly step is given below. We have installed Python 3.8.12 on our system for the implementation and execution of the training and testing scripts of the imported machine learning models.
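As an illustration of how such a regulated .csv of binary vectors might be assembled, here is a sketch under assumptions: extract_binary_vector refers to the hypothetical helper sketched earlier, and the folder layout, labels, and file name are illustrative rather than the study's actual dataset structure.

```python
# Minimal sketch: collect per-APK binary vectors into one regulated .csv file.
# extract_binary_vector() is the hypothetical helper sketched earlier; paths and
# labels here are illustrative, not the study's actual dataset layout.
import glob
import pandas as pd

rows = []
for label, folder in [(1, "samples/malware"), (0, "samples/benign")]:
    for apk_path in glob.glob(folder + "/*.apk"):
        vector = extract_binary_vector(apk_path)   # {feature_name: 0/1}
        vector["label"] = label                     # 1 = malware, 0 = benign
        rows.append(vector)

# Missing features default to 0 so every row shares the same binary columns.
df = pd.DataFrame(rows).fillna(0).astype(int)
df.to_csv("features.csv", index=False)
```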
D. PROGRAM PARAMETERS
Our project is based on Python 3.9.7, and we divided our execution into two programs. The first program is written to compare the algorithms for the accuracy check of the respective models, based on AdaBoost, Decision Tree, KNN, SVM, Naive Bayes, and Random Forest, for the comparative analysis. The program uses different import and split functions to train the models and then stores the result in a variable embedded for the testing model. The function sklearn.model_selection is used for accessing the bundles of algorithms, accuracy_score for accuracy readings, pandas to read the database, and NumPy to convert the testing model data into array format. The parameter on the x-axis is the features of the algorithms, and on the y-axis is its label (Figure 19), meaning the accuracy percentage for these algorithms. The x (accuracy of the models) and y (labels of the models) parameters of the program are configured with shuffle = True using the train_test_split function, so each algorithm takes a random permission value from the dataset. Figures 14 and 15 show the import modules and parameter values set in our program.
First, all the algorithms are imported into the program to implement the training data for the model, meaning the machine is trained on the given datasets. The program works so that each algorithm takes up a random binary value of an app from the dataset and stores its feature's accuracy score in another variable. After training the data, the program passes the testing data to be stored in a predictive function. The program is designed to identify the normal and harmful permission features through the dataset binary values (0, 1) and specifies those results in the function pred(). As can be seen in the code below, the program uses a fit() function, which takes the training data as arguments that are fitted using the x and y parameters into testing data for our two models (AdaBoost and SVM). All the variables that were given to each of our algorithms in the program were specified at the end in the variable acc. After executing the program, every algorithm will start accessing the dataset and predicting the dataset values for the Android features. Figures 16 and 17 represent the main key functions for our models AdaBoost and SVM, which are discussed above; a simplified reconstruction of this comparison flow is sketched below.
Figure 17 also explains the predictive procedure of the ensemble model with 1,000 malware sample runs and the given features to train for a single predictive classification output. The same fit() function is used for dataset training. The model is placed for higher weights of the decision trees algorithm within row values and executed in yhat. Accuracy is then computed by taking the mean and standard deviation (mean(n_acc_scores), std(n_acc_scores)) of the binary classification output for malware. Further ahead, Figure 18 shows the plotted assigned value for accuracy after the data is trained on the models.
Figure 19 shows the accuracy percentage for our models, which is 96.24%, and the graph displays the highest correct predictive frequency out of all the algorithms, professing the research work for greater validity. This graph is plotted
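Since Figures 14-17 are not reproduced here, the following is a simplified reconstruction of the comparison flow described above, assuming the scikit-learn API; the CSV file name, column layout, and hyperparameters are assumptions for illustration, not the study's exact code.

```python
# Simplified reconstruction of the described comparison flow (scikit-learn).
# File name, column layout and hyperparameters are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

df = pd.read_csv("features.csv")              # binary feature vectors plus a label column
X, y = df.drop(columns=["label"]), df["label"]

# Shuffled split, as described for the comparison program.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

models = {"AdaBoost": AdaBoostClassifier(n_estimators=100),
          "SVM": SVC(kernel="linear")}

for name, model in models.items():
    model.fit(X_train, y_train)                # fit() on the training data
    pred = model.predict(X_test)               # predictions on the held-out test data
    print(name, "accuracy:", accuracy_score(y_test, pred))

# Ensemble-style appraisal: mean and standard deviation of repeated CV accuracy scores.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_acc_scores = cross_val_score(models["AdaBoost"], X, y, scoring="accuracy", cv=cv)
print("AdaBoost CV accuracy: %.3f (+/- %.3f)" % (n_acc_scores.mean(), n_acc_scores.std()))
```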
TABLE 7. Label values for each algorithm and their accuracy percentage.
FIGURE 21. Prediction function for AdaBoost for testing data from the database.
FIGURE 23. Output [0] representing the malware application (SVM).
FIGURE 26. Orange entries for non-harmful applications in AdaBoost.
A. RESULTS
After the forecast of our models, the results show that the accuracy for our highest predictive systems is 96% and 92%. The proposed model doesn't peak in higher accuracy or predictive rate, but it contributes by introducing enhanced and large feature sets (containing around 56,000 newly extracted features) with the latest API-level application datasets collected in recent years compared to state-of-the-art approaches. Another point of view for a lower predictive rate is the limitation of our sources/environment to process and generate these datasets on our models. The novelty and contributions are explained in Tables 1 and 2.

FIGURE 28. Orange entries for non-harmful applications in SVM.
FIGURE 30. Comparative analysis of malicious and benign in AdaBoost and SVM.
TABLE 9. Relative resources (Pro = Processing), (Acc = Accuracy), (FPR = False Positive Rate).

[24] has somewhat of a similar resource with their results and higher processing, but their sample size is very limited in comparison to our model. A few other studies describe similar technical advantages, thus leaving us to work with restrictive measures. Table 9 presents some key properties to elaborate on similar systems' components.

VII. RESEARCH ISSUES AND CHALLENGES
This section highlights our experiment's prevalent and crucial topics. These hurdles are based on various stages of our work and may be gradually rectified in the work to be undertaken in the future.
1. Features declared mostly on the device are more durable than the features specific to the applications and therefore can usually automate malware detection. The range of Android parameters for processing is rather big and difficult to detect properly if someone does not extract the features properly.
2. There is still a fast increase in the number of apps. Malware apps can potentially always be identified in combination with methods based on AI or machine learning, to make the detection more sophisticated and to make it easier to identify and regulate the app prediction rate.
3. Application behaviours in the malware ecosystem encourage non-emerging threats. Our study doesn't incorporate the rider analysis or behaviour of repackaged malware. The study simply uses the reverse-engineered APK files, passes the given context to AndroGuard, and extracts features in binary vectors. Although this is a major issue and a key challenge with the advancement in Android malware, this approach will be our advanced project to perform differential or effective analysis on reverse-engineered applications, determining the effects of these applications and their datasets.
4. The applications with time induce new features with enhanced malware abilities, which is why we would have to upgrade the system whenever the model's FPR rate after execution increases. The simplest explanation for how to identify if the model is degrading on evolved features is that our datasets are designed in a binary matrix extracted from features that are currently implemented in these applications and not features that will be present in evolved apps in coming years. With new features, we would have to reverse and extract those features to form an updated dataset again to train these classifiers. [66], [67], [68] and [69] discuss this key issue and propose some possible solutions, but for our model and given the resources we have, we only performed for current features. For future work, we will consider model sustainability and how to classify the malware that our system will be able to detect even if the features are not yet implemented.
5. The research mentions the problem of multicollinearity in the introduction, depicting the rise of dependent variables in-between machine learning algorithms, which causes misinterpretation in results. However, this field of study can be taken as future work for further testing of several models handling multicollinearity, because our model itself is already performing high-processing detection schemes to generate accurate results for Android application malware features. We foresee this issue and will incorporate it to produce an efficient solution to the problem. The authors in [70], [71], [72] propose some solutions to tackle this challenge and can help address readers' queries.

A. LIMITATIONS
The technique in this paper is based on binary classification of lightweight code of static feature sets present in the Android manifest file. The three major limitations of our method are:
1. The research doesn't include dynamic or runtime application features. We will consider the potential dynamic aspects of Android applications in the future, including real-time permissions and API requests and the possible features extracted. We will evaluate the behavioural traits of the app using a mixture of dynamic and static evaluation to discover harmful tendencies.
2. Our system lags in future sustainable operative measures, meaning the system will need to be upgraded in terms of forthcoming API levels and malware collection, or in terms of new innovative features present in these Android applications.
3. The constraint of a slow and low-processing environment is another motive for the lower accuracy and predictive measures of our model in comparison to a few other peak detection techniques achieving higher accuracy.

VIII. CONCLUSION
In this research, we devised a framework that can detect malicious Android applications. The proposed technique takes into account various elements of machine learning and achieves 96.24% accuracy in identifying malicious Android applications. We first define and pick functions to capture and analyze Android apps' behavior, leveraging reverse application engineering and AndroGuard to extract features into binary vectors, and then use Python build modules and split-shuffle functions to train the model with benign and malicious datasets. Our experimental findings show that our suggested model has a false positive rate of 0.3 with 96% accuracy in the given environment with enhanced and larger feature and sample sets. The study also discovered that when dealing with classifications and high-dimensional data, ensemble and strong learner algorithms perform comparatively better. The suggested approach is restricted in terms of static analysis, lacks sustainability concerns, and fails to address a key multicollinearity barrier. In the future, we'll consider model resilience in terms of enhanced and dynamic features. The issue of dependent variables or high intercorrelation between machine learning algorithms before employing them is also a promising field.

REFERENCES
[1] Android (GOOG) Just Hit a Record 88% Market Share of All Smartphones—Quartz. Accessed: Jan. 28, 2022. [Online]. Available: https://qz.com/826672/android-goog-just-hit-a-record-88-market-share-of-all-smartphones/
[2] A. O. Christiana, B. A. Gyunka, and A. Noah, ‘‘Android malware detection through machine learning techniques: A review,’’ Int. J. Online Biomed. Eng., vol. 16, no. 2, p. 14, Feb. 2020, doi: 10.3991/ijoe.v16i02.11549.
[3] D. Ghimire and J. Lee, ‘‘Geometric feature-based facial expression recognition in image sequences using multi-class AdaBoost and support vector machines,’’ Sensors, vol. 13, no. 6, pp. 7714–7734, Jun. 2013, doi: 10.3390/s130607714.
[4] R. Wang, ‘‘AdaBoost for feature selection, classification and its relation with SVM, a review,’’ Phys. Proc., vol. 25, pp. 800–807, Jan. 2012, doi: 10.1016/j.phpro.2012.03.160.
[5] J. Sun, H. Fujita, P. Chen, and H. Li, ‘‘Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble,’’ Knowl.-Based Syst., vol. 120, pp. 4–14, Mar. 2017, doi: 10.1016/j.knosys.2016.12.019.
[6] A. Garg and K. Tai, ‘‘Comparison of statistical and machine learning methods in modelling of data with multicollinearity,’’ Int. J. Model., Identificat. Control, vol. 18, no. 4, p. 295, 2013, doi: 10.1504/IJMIC.2013.053535.
[7] C. P. Obite, N. P. Olewuezi, G. U. Ugwuanyim, and D. C. Bartholomew, ‘‘Multicollinearity effect in regression analysis: A feed forward artificial neural network approach,’’ Asian J. Probab. Statist., vol. 6, no. 1, pp. 22–33, Jan. 2020, doi: 10.9734/ajpas/2020/v6i130151.
[8] W. Wang, M. Zhao, Z. Gao, G. Xu, H. Xian, Y. Li, and X. Zhang, ‘‘Constructing features for detecting Android malicious applications: Issues, taxonomy and directions,’’ IEEE Access, vol. 7, pp. 67602–67631, 2019, doi: 10.1109/ACCESS.2019.2918139.
[9] B. Rashidi, C. Fung, and E. Bertino, ‘‘Android malicious application detection using support vector machine and active learning,’’ in Proc. 13th Int. Conf. Netw. Service Manage. (CNSM), Tokyo, Japan, Nov. 2017, pp. 1–9, doi: 10.23919/CNSM.2017.8256035.
[10] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-An, and H. Ye, ‘‘Significant permission identification for machine-learning-based Android malware detection,’’ IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 3216–3225, Jul. 2018, doi: 10.1109/TII.2017.2789219.
[11] G. Suarez-Tangil, J. E. Tapiador, P. Peris-Lopez, and J. Blasco, ‘‘Dendroid: A text mining approach to analyzing and classifying code structures in Android malware families,’’ Exp. Syst. Appl., vol. 41, no. 4, pp. 1104–1117, Mar. 2014, doi: 10.1016/j.eswa.2013.07.106.
[12] M. Magdum, ‘‘Permission based mobile malware detection system using machine learning,’’ Techniques, vol. 14, no. 6, pp. 6170–6174, 2015.
[13] M. Qiao, A. H. Sung, and Q. Liu, ‘‘Merging permission and API features for Android malware detection,’’ in Proc. 5th IIAI Int. Congr. Adv. Appl. Informat. (IIAI-AAI), Kumamoto, Japan, Jul. 2016, pp. 566–571, doi: 10.1109/IIAI-AAI.2016.237.
[14] D. O. Sahin, O. E. Kural, S. Akleylek, and E. Kilic, ‘‘New results on permission based static analysis for Android malware,’’ in Proc. 6th Int. Symp. Digit. Forensic Secur. (ISDFS), Antalya, Turkey, Mar. 2018, pp. 1–4, doi: 10.1109/ISDFS.2018.8355377.
[15] A. Mahindru and A. L. Sangal, ‘‘MLDroid—Framework for Android malware detection using machine learning techniques,’’ Neural Comput. Appl., vol. 33, no. 10, pp. 5183–5240, May 2021, doi: 10.1007/s00521-020-05309-4.
[16] X. Su, D. Zhang, W. Li, and K. Zhao, ‘‘A deep learning approach to Android malware feature learning and detection,’’ in Proc. IEEE Trustcom/BigDataSE/ISPA, Tianjin, China, Aug. 2016, pp. 244–251, doi: 10.1109/TrustCom.2016.0070.
[17] K. A. Talha, D. I. Alper, and C. Aydin, ‘‘APK auditor: Permission-based Android malware detection system,’’ Digit. Invest., vol. 13, pp. 1–14, Jun. 2015, doi: 10.1016/j.diin.2015.01.001.
[18] A. Mahindru and P. Singh, ‘‘Dynamic permissions based Android malware detection using machine learning techniques,’’ in Proc. 10th Innov. Softw. Eng. Conf., Jaipur, India, Feb. 2017, pp. 202–210, doi: 10.1145/3021460.3021485.
[19] U. Pehlivan, N. Baltaci, C. Acarturk, and N. Baykal, ‘‘The analysis of feature selection methods and classification algorithms in permission based Android malware detection,’’ in Proc. IEEE Symp. Comput. Intell. Cyber Secur. (CICS), Orlando, FL, USA, Dec. 2014, pp. 1–8, doi: 10.1109/CICYBS.2014.7013371.
[20] M. Kedziora, P. Gawin, M. Szczepanik, and I. Jozwiak, ‘‘Malware detection using machine learning algorithms and reverse engineering of Android Java code,’’ Int. J. Netw. Secur. Appl., vol. 11, no. 1, pp. 1–14, Jan. 2019, doi: 10.5121/ijnsa.2019.11101.
[21] X. Liu and J. Liu, ‘‘A two-layered permission-based Android malware detection scheme,’’ in Proc. 2nd IEEE Int. Conf. Mobile Cloud Comput., Services, Eng., Oxford, U.K., Apr. 2014, pp. 142–148, doi: 10.1109/MobileCloud.2014.22.
[22] Permission-Based Android Malware Detection | Semantic Scholar. Accessed: Oct. 31, 2021. [Online]. Available: https://www.semanticscholar.org/paper/Permission-Based-Android-Malware-Detection-Aung-Zaw/c8576b5df33813fe8938cbb19e35217ee21fc80b
[23] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, and K. Rieck, ‘‘Drebin: Effective and explainable detection of Android malware in your pocket,’’ presented at the Netw. Distrib. Syst. Secur. Symp., San Diego, CA, USA, 2014, doi: 10.14722/ndss.2014.23247.
[24] H. Cai, N. Meng, B. G. Ryder, and D. Yao, ‘‘DroidCat: Effective Android malware detection and categorization via app-level profiling,’’ IEEE Trans. Inf. Forensics Security, vol. 14, no. 6, pp. 1455–1470, Jun. 2019, doi: 10.1109/TIFS.2018.2879302.
[25] P. Rovelli and Ý. Vigfússon, ‘‘PMDS: Permission-based malware detection system,’’ in Information Systems Security, vol. 8880, A. Prakash and R. Shyamasundar, Eds. Cham, Switzerland: Springer, 2014, pp. 338–357, doi: 10.1007/978-3-319-13841-1_19.
[26] M. S. Alam and S. T. Vuong, ‘‘Random forest classification for detecting Android malware,’’ in Proc. IEEE Int. Conf. Green Comput. Commun. IEEE Internet Things IEEE Cyber, Phys. Social Comput., Beijing, China, Aug. 2013, pp. 663–669, doi: 10.1109/GreenCom-iThings-CPSCom.2013.122.
[27] D. Congyi and S. Guangshun, ‘‘Method for detecting Android malware based on ensemble learning,’’ in Proc. 5th Int. Conf. Mach. Learn. Technol., Beijing, China, Jun. 2020, pp. 28–31, doi: 10.1145/3409073.3409084.
[28] W. Li, J. Ge, and G. Dai, ‘‘Detecting malware for Android platform: An SVM-based approach,’’ in Proc. IEEE 2nd Int. Conf. Cyber Secur. Cloud Comput., New York, NY, USA, Nov. 2015, pp. 464–469, doi: 10.1109/CSCloud.2015.50.
[29] G. Suarez-Tangil, S. K. Dash, M. Ahmadi, J. Kinder, G. Giacinto, and L. Cavallaro, ‘‘DroidSieve: Fast and accurate classification of obfuscated Android malware,’’ in Proc. 7th ACM Conf. Data Appl. Secur. Privacy, Scottsdale, AZ, USA, Mar. 2017, pp. 309–320, doi: 10.1145/3029806.3029825.
[30] Z. Yuan, Y. Lu, and Y. Xue, ‘‘Droiddetector: Android malware characterization and detection using deep learning,’’ Tsinghua Sci. Technol., vol. 21, no. 1, pp. 114–123, Feb. 2016, doi: 10.1109/TST.2016.7399288.
[31] S. Ilham, G. Abderrahim, and B. A. Abdelhakim, ‘‘Permission based malware detection in Android devices,’’ in Proc. 3rd Int. Conf. Smart City Appl., Tetouan, Morocco, Oct. 2018, pp. 1–6, doi: 10.1145/3286606.3286860.
[32] O. Yildiz and I. A. Doğru, ‘‘Permission-based Android malware detection system using feature selection with genetic algorithm,’’ Int. J. Softw. Eng. Knowl. Eng., vol. 29, no. 2, pp. 245–262, Feb. 2019, doi: 10.1142/S0218194019500116.
[33] J. Garcia, M. Hammad, B. Pedrood, A. Bagheri-Khaligh, and S. Malek, ‘‘Obfuscation-resilient, efficient, and accurate detection and family identification of Android malware,’’ Dept. Comput. Sci., George Mason Univ., Fairfax, VA, USA, Tech. Rep. GMU-CS-TR-2015-10, 2015, vol. 202.
[34] A. Senawi, H.-L. Wei, and S. A. Billings, ‘‘A new maximum relevance-minimum multicollinearity (MRmMC) method for feature selection and ranking,’’ Pattern Recognit., vol. 67, pp. 47–61, Jul. 2017, doi: 10.1016/j.patcog.2017.01.026.
[35] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, and T. Matsui, ‘‘Best subset selection for eliminating multicollinearity,’’ J. Oper. Res. Soc. Jpn., vol. 60, no. 3, pp. 321–336, 2017, doi: 10.15807/jorsj.60.321.
[36] A. Farrell, G. Wang, S. A. Rush, J. A. Martin, J. L. Belant, A. B. Butler, and D. Godwin, ‘‘Machine learning of large-scale spatial distributions of wild turkeys with high-dimensional environmental data,’’ Ecol. Evol., vol. 9, no. 10, pp. 5938–5949, May 2019, doi: 10.1002/ece3.5177.
[37] S. Niu, R. Huang, W. Chen, and Y. Xue, ‘‘An improved permission management scheme of Android application based on machine learning,’’ Secur. Commun. Netw., vol. 2018, pp. 1–12, Oct. 2018, doi: 10.1155/2018/2329891.
[38] C. L. P. M. Hein, ‘‘Permission based malware protection model for Android application,’’ presented at the Int. Conf. Adv. Eng. Technol., Mar. 2014, doi: 10.15242/IIE.E0314102.
[39] G. L. Scoccia, S. Ruberto, I. Malavolta, M. Autili, and P. Inverardi, ‘‘An investigation into Android run-time permissions from the end users’ perspective,’’ in Proc. 5th Int. Conf. Mobile Softw. Eng. Syst., Gothenburg, Sweden, May 2018, pp. 45–55, doi: 10.1145/3197231.3197236.
[40] P. Topark-Ngarm, ‘‘Identifying Android malware using machine learning based upon both static and dynamic features,’’ M.S. thesis, Victoria Univ. Wellington, Wellington, New Zealand, 2014, p. 87. [Online]. Available: https://ecs.wgtn.ac.nz/foswiki/pub/Main/IanWelch/pacharawit-thesis.pdf
[41] N. Milosevic, A. Dehghantanha, and K.-K. R. Choo, ‘‘Machine learning aided Android malware classification,’’ Comput. Electr. Eng., vol. 61, pp. 266–274, Jul. 2017, doi: 10.1016/j.compeleceng.2017.02.013.
[42] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, Feature Selection for High-Dimensional Data. Cham, Switzerland: Springer, 2015, doi: 10.1007/978-3-319-21858-8.
[43] B. Pes, ‘‘Ensemble feature selection for high-dimensional data: A stability analysis across multiple domains,’’ Neural Comput. Appl., vol. 32, no. 10, pp. 5951–5973, May 2020, doi: 10.1007/s00521-019-04082-3.
[44] A. Hamidreza and N. Mohammed, ‘‘Permission-based analysis of Android applications using categorization and deep learning scheme,’’ in Proc. MATEC Web Conf., vol. 255, 2019, p. 05005, doi: 10.1051/matecconf/201925505005.
[45] T. Boksasp and E. Utnes, ‘‘Android apps and permissions: Security and privacy risks,’’ M.S. thesis, Dept. Telematics, Norwegian Sci. Technol., Trondheim, Norway, 2012, p. 143. [Online]. Available: https://ntnuopen.ntnu.no/ntnu-xmlui/bitstream/handle/11250/262677/566356_FULLTEXT01.pdf?sequence=1
[46] N. Yadav, A. Sharma, and A. Doegar, ‘‘A survey on Android malware detection,’’ Int. J. New Technol. Res., vol. 2, no. 12, p. 7, 2016.
[47] F. I. Abro, ‘‘Investigating Android permissions and intents for malware detection,’’ City Univ., London, U.K., Tech. Rep., 2014, p. 5 and 169.
[48] M. Magdum and S. K. Wagh, ‘‘Permission based Android malware detection system using machine learning approach,’’ Int. J. Comput. Sci. Inf. Secur., vol. 14, no. 6, Jun. 2016.
[49] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, ‘‘A multimodal deep learning method for Android malware detection using various features,’’ IEEE Trans. Inf. Forensics Security, vol. 14, no. 3, pp. 773–788, Mar. 2019, doi: 10.1109/TIFS.2018.2866319.
[50] H. A. Alatwi, ‘‘Android malware detection using category-based machine learning classifiers,’’ M.S. thesis, Rochester Inst. Technol., Rochester, NY, USA, 2016, p. 62. [Online]. Available: https://scholarworks.rit.edu/theses/9069/
[51] P. Basavaraju and A. S. Varde, ‘‘Supervised learning techniques in mobile device apps for Androids,’’ Dept. Comput. Sci., Montclair State Univ., Montclair, NJ, USA, Tech. Rep., 2017, p. 12, vol. 19, no. 2.
[52] R. N. Romli, M. F. Zolkipli, and M. Z. Osman, ‘‘Efficient feature selection analysis for accuracy malware classification,’’ J. Phys., Conf. Ser., vol. 1918, no. 4, Jun. 2021, Art. no. 042140, doi: 10.1088/1742-6596/1918/4/042140.
[53] J. Abah, O. V. Waziri, M. B. Abdullahi, U. M. Arthur, and O. S. Adewale, ‘‘A machine learning approach to anomaly-based detection on Android platforms,’’ Int. J. Netw. Secur. Appl., vol. 7, no. 6, pp. 15–35, Nov. 2015, doi: 10.5121/ijnsa.2015.7602.
[54] I. K. Aksakalli, ‘‘Using convolutional neural network for Android malware detection,’’ Dept. Comput. Eng., Erzurum, Turkey, 2019.
[55] I. Martín, J. A. Hernández, A. Muñoz, and A. Guzmán, ‘‘Android malware characterization using metadata and machine learning techniques,’’ Secur. Commun. Netw., vol. 2018, pp. 1–11, Jul. 2018, doi: 10.1155/2018/5749481.
[56] S. Fallah and A. J. Bidgoly, ‘‘Benchmarking machine learning algorithms for Android malware detection,’’ Jordanian J. Comput. Inf. Technol., vol. 5, no. 3, p. 15, 2019.
[57] X. Jiang, B. Mao, J. Guan, and X. Huang, ‘‘Android malware detection using fine-grained features,’’ Sci. Program., vol. 2020, pp. 1–13, Jan. 2020, doi: 10.1155/2020/5190138.
[58] H. Yuan, Y. Tang, W. Sun, and L. Liu, ‘‘A detection method for Android application security based on TF-IDF and machine learning,’’ PLoS ONE, vol. 15, no. 9, Sep. 2020, Art. no. e0238694, doi: 10.1371/journal.pone.0238694.
[59] A. M. García, ‘‘Machine learning techniques for Android malware detection and classification,’’ Ph.D. thesis, Auton. Univ. Madrid, Madrid, Spain, 2019, p. 170. [Online]. Available: https://dialnet.unirioja.es/servlet/tesis?codigo=221389
[60] S. Y. Yerima, M. K. Alzaylaee, and S. Sezer, ‘‘Machine learning-based dynamic analysis of Android apps with improved code coverage,’’ EURASIP J. Inf. Secur., vol. 2019, no. 1, p. 4, Dec. 2019, doi: 10.1186/s13635-019-0087-1.
[61] Y. Dong, ‘‘Android malware prediction by permission analysis and data mining,’’ M.S. thesis, Dept. Comput. Inf. Sci., Univ. Michigan-Dearborn, Dearborn, MI, USA, 2017, p. 71. [Online]. Available: https://deepblue.lib.umich.edu/bitstream/handle/2027.42/136197/YouchaoDong_Thesis_0327.pdf%3Fsequence%3D1%26isAllowed%3Dy
[62] D. V. Priya and P. Visalakshi, ‘‘Detecting Android malware using an improved filter based technique in embedded software,’’ Microprocessors Microsyst., vol. 76, Jul. 2020, Art. no. 103115, doi: 10.1016/j.micpro.2020.103115.
[63] A. Hemalatha and D. S. S. Brunda, ‘‘Detection of mobile malwares using improved deep convolutional neural network,’’ vol. 7, no. 14, p. 7, 2020.
[64] S. Mahdavifar, A. F. A. Kadir, R. Fatemi, D. Alhadidi, and A. A. Ghorbani, ‘‘Dynamic Android malware category classification using semi-supervised deep learning,’’ in Proc. IEEE Int. Conf. Dependable, Autonomic Secure Comput., Int. Conf. Pervasive Intell. Comput., Int. Conf. Cloud Big Data Comput., Int. Conf. Cyber Sci. Technol. Congr. (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, Aug. 2020, pp. 515–522, doi: 10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00094.
[65] Android Malware Detection | Kaggle. Accessed: Nov. 14, 2021. [Online]. Available: https://www.kaggle.com/defensedroid/android-malware-detection
[66] X. Fu and H. Cai, ‘‘On the deterioration of learning-based malware detectors for Android,’’ in Proc. IEEE/ACM 41st Int. Conf. Softw. Eng., Companion Proc. (ICSE-Companion), Montreal, QC, Canada, May 2019, pp. 272–273, doi: 10.1109/ICSE-Companion.2019.00110.
[67] K. Xu, Y. Li, R. Deng, K. Chen, and J. Xu, ‘‘DroidEvolver: Self-evolving Android malware detection system,’’ in Proc. IEEE Eur. Symp. Secur. Privacy (EuroS&P), Stockholm, Sweden, Jun. 2019, pp. 47–62, doi: 10.1109/EuroSP.2019.00014.
[68] H. Cai, ‘‘Assessing and improving malware detection sustainability through app evolution studies,’’ ACM Trans. Softw. Eng. Methodol., vol. 29, no. 2, pp. 1–28, Apr. 2020, doi: 10.1145/3371924.
[69] X. Zhang, Y. Zhang, M. Zhong, D. Ding, Y. Cao, Y. Zhang, M. Zhang, and M. Yang, ‘‘Enhancing state-of-the-art classifiers with API semantics to detect evolved Android malware,’’ in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., New York, NY, USA, Oct. 2020, pp. 757–770, doi: 10.1145/3372297.3417291.
[70] A. Katrutsa and V. Strijov, ‘‘Comprehensive study of feature selection methods to solve multicollinearity problem according to evaluation criteria,’’ Expert Syst. Appl., vol. 76, pp. 1–11, Jun. 2017, doi: 10.1016/j.eswa.2017.01.048.
[71] R. Grewal, J. A. Cote, and H. Baumgartner, ‘‘Multicollinearity and measurement error in structural equation models: Implications for theory testing,’’ Marketing Sci., vol. 23, no. 4, pp. 519–529, Nov. 2004, doi: 10.1287/mksc.1040.0070.
[72] M. S. Devi, A. Poornima, J. Kosanam, and T. Hari S. Prashanth, ‘‘Outlier multicollinearity free fish weight prediction using machine learning,’’ Mater. Today, Proc., p. 7, Mar. 2021, doi: 10.1016/j.matpr.2021.02.773.

MUNAM ALI SHAH received the B.Sc. and M.Sc. degrees in computer science from the University of Peshawar, Pakistan, in 2001 and 2003, respectively, the M.S. degree in security technologies and applications from the University of Surrey, U.K., in 2010, and the Ph.D. degree from the University of Bedfordshire, U.K., in 2013. Since July 2004, he has been associated with the Department of Computer Science, COMSATS University Islamabad, Pakistan. He is the author of more than 225 research articles published in international conferences and journals. His research interests include the IoT protocol design, QoS, and security issues in wireless communication systems and applications of machine learning. He received the Best Paper Award of the International Conference on Automation and Computing, in 2012.

CARSTEN MAPLE (Member, IEEE) is currently a Professor of cyber systems engineering at the WMG's Cyber Security Centre (CSC), University of Warwick. He is also the director of research in cyber security working with organizations in key sectors, such as manufacturing, healthcare, financial services, and the broader public sector, to address the challenges presented by today's global cyber environment. His interests include information security and trust and authentication in distributed systems. He is a member of several professional societies, including the Council of Professors and Heads of Computing (CPHC), whose remit is to promote public education in computing and its applications and to provide a forum for those responsible for management and research in university computing departments. He is also an elected member of the Committee of this body. He is an Education Advisor for TIGA, the trade association representing the U.K. games industry. He is a fellow of the British Computer Society and the Chartered Institute for IT. He is a Chartered IT professional. He also holds two Professorships in China, including a position at one of the top two control engineering departments in China.

MUHAMMAD KAMRAN ABBASI received the Ph.D. degree in computer science from the University of Bedfordshire, U.K. He is currently working as an Associate Professor with the Department of Distance Continuing and Computer Education, University of Sindh. His research interests include unsupervised machine learning, informatics, and educational technology.