Malware Detection: A Framework For Reverse Engineered Android Applications Through Machine Learning Algorithms

Received 30 December 2021, accepted 27 January 2022, date of publication 4 February 2022, date of current version 29 August 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3149053
Malware Detection: A Framework for Reverse

Engineered Android Applications Through
Machine Learning Algorithms
BEENISH UROOJ 1 , MUNAM ALI SHAH 1 , CARSTEN MAPLE 2, (Member, IEEE),
MUHAMMAD KAMRAN ABBASI3 , AND SIDRA RIASAT 1
1 Department of Computer Science, COMSATS University Islamabad, Islamabad 45550, Pakistan
2 WMG, University of Warwick, Coventry CV4 7AL, U.K.
3 Department of Distance Continuing and Computer Education, University of Sindh, Hyderabad 76080, Pakistan
Corresponding author: Carsten Maple ([email protected])
This work was supported by UKRI through the grants EP/R007195/1 (Academic Centre of Excellence in Cyber Security
Research - University of Warwick), EP/N510129/1 (The Alan Turing Institute) and EP/S035362/1 (PETRAS, the National Centre of
Excellence for IoT Systems Cybersecurity).
ABSTRACT Today, Android is one of the most used operating systems in smartphone technology. This
is the main reason, Android has become the favorite target for hackers and attackers. Malicious codes are
being embedded in Android applications in such a sophisticated manner that detecting and identifying an
application as a malware has become the toughest job for security providers. In terms of ingenuity and
cognition, Android malware has progressed to the point where they’re more impervious to conventional
detection techniques. Approaches based on machine learning have emerged as a much more effective way
to tackle the intricacy and originality of developing Android threats. They function by first identifying
current patterns of malware activity and then using this information to distinguish between identified threats
and unidentified threats with unknown behavior. This research paper uses Reverse Engineered Android
applications’ features and Machine Learning algorithms to find vulnerabilities present in Smartphone
applications. Our contribution is twofold. Firstly, we propose a model that incorporates more innovative
static feature sets with the largest current datasets of malware samples than conventional methods. Secondly,
we have used ensemble learning with machine learning algorithms i.e., AdaBoost, Support Vector Machine
(SVM), etc. to improve our model’s performance. Our experimental results and findings exhibit 96.24%
accuracy to detect extracted malware from Android applications, with a 0.3 False Positive Rate (FPR).
The proposed model incorporates ignored detrimental features such as permissions, intents, Application
Programming Interface (API) calls, and so on, trained by feeding a solitary arbitrary feature, extracted by
reverse engineering as an input to the machine.
INDEX TERMS Android applications, benign, feature extraction, malware detection, reverse engineering,
machine learning.
I. INTRODUCTION system, the level of danger it poses, with malware authors and
To this degree, it is guaranteed that mobile devices are programmers implementing unwanted permissions, features
an integral part of most people’s daily lives. Furthermore, and application components in Android apps. The option
Android now controls the vast majority of mobile devices, to expand its capabilities with third-party software is also
with Android devices accounting for an average of 80% of the appealing, but this capability comes with the risk of malicious
global market share over the past years [1]. With the ongoing attacks. When the number of smartphone apps increases,
plan of Android to a growing range of smartphones and con- so does the security problem with unnecessary access to
sumers around the world, malware targeting Android devices different personal resources. As a result, the applications
has increased as well. Since it is an open-source operating are becoming more insecure, and they are stealing personal
information, SMS frauds, ransomware, etc.
The associate editor coordinating the review of this manuscript and In contrast to static analysis methods such as a manual
approving it for publication was Tony Thomas. assessment of AndroidManifest.xml, source files and Dalvik
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

VOLUME 10, 2022 For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ 89031
B. Urooj et al.: Malware Detection: Framework for Reverse Engineered Android Applications
Byte Code and the complex analysis of a managed environ-

ment to study the way it treats a program, Machine Learn-
ing includes learning the fundamental rules and habits of
the positive and malicious settings of apps and then data-
enabling. The static attributes derived from an application are
extensively used in machine learning methodologies and the
tedious task of this can be relieved if the static features of
reverse-engineered Android Applications are extracted and
use machine learning SVM algorithm, logistic progression, FIGURE 1. Android security framework.
ensemble learning and other algorithms to help train the
model for prediction of these malware applications [2].
Machine learning employs a range of methodologies for outlets like file-sharing sites or third-party app stores. The
data classification. SVM is a strong learner that plots each malware virus problem has become so severe that 97 % of
data item as a point in n-dimensional space (where n denotes all Smartphone malware now targets Android phones. In a
the number of features you have), with the value of each fea- year, about 3.25 million new malware Android applications
ture becoming the vector value. Afterward, it performs classi- are discovered as the growth of smartphones increases. his
fication by locating the hyperplane that best distinguishes the is roughly equivalent to introducing a new malware version
two groups, thereby improving the recognition properties of of Android every few seconds [10]. The primary aim of
any two parameters. Conversely, boosting or ensemble tech- mobile malware is to gain entrance to user data saved on the
niques like Adaboost assigns higher weights to improve the computer and user information used in confidential financial
behavior of misclassified variables in conjunction with other activities, such as banking. Infected file extensions, files
machine learning algorithms. If combined along with weak received via Bluetooth, links to infected code in SMS, and
classifiers, our preliminary model benefits from deploying MMS application links are all ways that mobile malware
such models since they have a high degree of precision or can propagate [11]. There are some strategies for locating
classification. References [3], [4] and [5], supports classifiers apps that need additional features. Using these approaches,
in their system models to find the highest accuracy. Although it should be easy to assess whether the applications labelled
using ensemble or strong classifiers can cause problems like as suspicious and requiring extra authorization are malicious.
multicollinearity, which in a regression model, occurs when Static analysis methodologies are the most fundamental
two or more independent variables are strongly associated of all approaches. Until operating programs, the manifest
with one another. In multivariate regression, this indicates that file and source codes are examined [12]. For many machine
one regression analysis may be forecasted from another inde- learning tasks, such as enhancing predictive performance or
pendent variable. This scope of the study can be presented simplifying complicated learning problems, ensemble learn-
as a detection journal analysis itself and can present several ing is regarded as the most advanced method. It enhances
experimentations and results based on machine learning mod- a single model’s prediction performance by training several
els [6], [7]. models and combining their predictions. Boosting, bagging,
In the latest versions of the Android operating system (OS), and random forest are examples of common ensemble learn-
any app that requires access privileges may ask the OS for ing techniques [13]. In summary, the main contributions of
permission, and the OS will ask the user whether they want our study are as follows:
to approve or decline the request through a pop-up option. 1) We present a novel subset of features for static detection
Many studies have been conducted on the effectiveness of of Android malware, which consists of seven addi-
this resource management strategy. Research shows that con- tional selected feature sets such as (Permissions, App-
sumers make decisions by granting access to all requests to Components, Method Tags, Intents, Packages, API
applications [8]. In comparison, more than 70% of Android Calls, and Services/Receivers) that are using around
mobile applications request permission that isn’t required or 56000 features from these categories. On a collec-
is not needed in the app in the first place. A chess game tion of more than 500k benign and malicious Android
that asks for photographs or requests for SMS and phone call applications and the highest malware sample set than
permits, or loads unwanted packages is an example of an extra any state-of-the-art approach, we assess their stability.
requested authorization. So, trying to set an app’s vindictive- The results obtain a detection increase in accuracy to
ness and not understanding the app is a tough challenge. As a 96.24 % with 0.3% false positives.
result, successful malicious app monitoring will provide extra 2) With the additional features, we have trained six
information to customers to assist them and defend them from classifier models or machine learning algorithms
information disclosure [9]. Figure 1 elaborates the android and also implemented a Boosting ensemble learning
risk framework through the Google Play platform, which is approach (AdaBoost) with a Decision Tree based on
then manually configured by the android device developers. the binary classification to enhance our prediction rate.
In contrast to other smartphone operating systems, such as 3) Our model is trained on the latest and large time
iOS, Android requires users to access apps from untrusted aware samples of malware collected within recent years
89032 VOLUME 10, 2022

FIGURE 2. Static binary matrix extraction.
including the latest Android API level than state-of-the-

art approaches.
This research paper incorporates binary vector mapping for
classification by allocating 0 to malicious applications and
1 for non-harmful and for predictive analysis of each appli-
cation fed to the model implemented in the study. The tech-
nique eases the process by reducing fault predictive errors.
Figure 2 shows the procedure for a better understanding
of the concept applied later in our study. The paper passes
both the categories of applications through static analysis
and then is further processed for feature extraction. We pre-
sented features in 0’s and 1’s after extraction. Matrix displays
the extraction characteristics of each application used in the FIGURE 3. Taxonomy of android architecture.
dataset.
There are major issues to be addressed to incorporate our
strategy. High measurements of the features will make it
difficult to identify malware in many real-world Android of the functions (attributes) that aren’t helpful in the study
applications. Certain features overlap with innocuous apps are omitted from the data collection because of this. The
and malware [14]. In comparison, the vast number of features remaining features are chosen by weighing the representa-
will cause high throughput computing. Therefore, we can tional strength of all the dataset’s features [19]. Parsing tools
learn from the features directly derived from Android apps, can help learn which permissions, packages or services an
the most popular and significant features. The paper imple- application offers by analyzing the AndroidManifest.xml file,
ments prediction models and various computer ensemble such as permission android.permission.call phone, which
teaching strategies to boost and enhance accuracy to resolve allows an application to misuse calling abilities. The paper
this problem [15]. Feature selection is an essential step in all elaborates exactly what sort of sensitive API the authors
machine-based learning approaches. The optimum collection could name by decoding the classes.dex file with the Jadx-
of features will not only help boost the outcomes of tests but gui disassembler [20]. In certain cases, including two permis-
will also help to reduce the compass of most machine-based sions in a single app can signify the app’s possible malicious
learning algorithms [16]. attacks. For example, an application with RECEIVE SMS
Studies have extensively suggested three separate methods and WRITE SMS permissions can mask or interfere with
for identifying android malware: static, interactive meaning receiving text messages [21] or applying sensitive API such
dynamically, and synthetic or hybrid. Static analysis tech- as sendTextMessage can also be harmful and lead to fraud
niques look at the code without ever running it, so they’re and stealing.
a little sluggish if carried out manually and have to face Until we started our main idea of the project. The fact
a lot of false positives [17]. Data obfuscation and com- explained that Android applications pose a lot of threats
plex code loading are both significant pitfalls of the tech- to its user because of the unnecessary programs compiled
nique. That is why automated operation helps to achieve inside them and explained why it is necessary to automate
reliability, accuracy, and lesser time utilization [18]. Reverse the process of static analysis for the efficient detection of
engineer Android applications and extract features and do malware applications based on the extracted features. The
static analysis from them without having to execute them. rest of the paper is planned as follows. Related works are
This method entails examining the contents of two files: examined in Section II. Section III presents the design details
AndroidManifest.xml and classes.dex and working on the file of the proposed model. Section IV elaborates the assessment
with the.apk extension. Feature selection techniques and clas- findings and future threats. The experiments and results will
sification algorithms are two crucial areas of feature-based be dilated and performed in Sections V and VI. Section VII
types of fraudulent applications. Feature filtering methods includes our research issues, recommendations, and conclu-
are used to reduce the dimension size of a dataset. Any sions for the future.
VOLUME 10, 2022 89033

II. RELATED WORKS from different sources. Model continuously achieved 97%
Linux (Android core) keeps key aspects of the security infras- F1-measure accuracy for identifying applications or catego-
tructure of the operating system. The Android displays to rizing malware. [25] The authors present a unique Android
the administrator a list of features, sought to reinstall an malware detection approach dubbed Permission-based Mal-
update. The program installs itself on the computer after ware Detection Systems (PMDS) based on a study of
they issue access. Figure 3 shows the integrated core parts 2950 samples of benign and malicious Android applica-
of Android architecture. It comprises applications at the top tions. In PMDS, requested permissions are viewed as behav-
layer and also includes an application framework, libraries ioral markers, and a machine learning model is built on
or a Runtime layer, and a Linux kernel. These levels are fur- those indicators to detect new potentially dangerous behav-
ther divided into their components, which make an Android ior in unknown apps depending on the mix of rights they
Application. The Linux Kernel is the key part of Android require. PMDS identifies more than 92–94% of all heretofore
that provides its OS functionality to phones, and the Dalvik unknown malware, with a false positive rate of 1.52–3.93%.
Virtual Machine (DVM) is to manage a mobile device. Appli- The authors of this article [26] solely use the machine learn-
cation is the Android architecture’s highest layer. Native ing ensemble learning method Random Forest supervised
and third-party apps such as contacts, email, audio, gallery, classifier on Android feature malware samples with 42 fea-
clock, sports, and so on are located only in this layer. This tures respectively. Their objective was to assess Random
framework gets the classes often used to develop Android Forest’s accuracy in identifying Android application activity
apps. It also handles the user interface and device infras- as harmful or benign. Dataset 1 is built on 1330 malicious
tructure and provides a common specification for hardware apk samples and 407 benign ones seen by the author. This is
entry. To facilitate the development of Android, the Platform based on the collection of feature vectors for each application.
Libraries include many C/C++ core libraries and Java-based Based on an ensemble learning approach, Congyi proposes a
libraries such as SSL, libc, Graphics, SQLite, Webkit, Media, concept in [27] for recognizing and distinguishing Android
Surface Manager, OpenGL, and others. The taxonomy helps malware. To begin, a static analysis of the Android Manifest
understand the viewer with a logical algorithmic approach for file in the Android Application Package (APK) is done to
grasping the core surfaces and functionality of the operating extract system characteristics such as permission calls, com-
system. ponent calls, and intents. Then, to detect malicious apps, they
The methods proposed in this related work contribute deploy the XGBoost technique, which is an implementation
to key aspects such as selected features for classification of ensemble learning. Analyzing more than 6,000 Android
and a higher predictive rate for malware detection. Cer- apps on the Kaggle platform provided the initial data for
tain research has focused on increasing accuracy, while this experiment. They tested both benign and malicious apps
others have focused on providing a larger dataset, some based on 3 feature sets for a testing set of 2,000 samples
have been implemented by employing various feature sets, and used the remaining data to create a training set of 6,315
and many studies have combined all of these to improve samples. Additional approaches include [28], an SVM-based
detection rate efficiency. In [22], the authors offer a sys- malware detection technique for the Android platform that
tem for detecting Android malware apps to aid in the orga- incorporates both dangerous permission combinations and
nization of the Android Market. The proposed framework susceptible API calls as elements in the SVM algorithm. The
aims to provide a machine learning-based malware detection dataset includes 400 Android applications, which included
system for Android to detect malware apps and improve 200 benign apps from the official Android market and
phone users’ safety and privacy. This system monitors dif- 200 malicious apps from the Drebin dataset. [29] Determines
ferent permission-based characteristics and events acquired whether the program is dangerous and, if so, categorizes it as
from Android apps and examines these features employing part of a malware family. They obtain up to 99.82 % accuracy
machine learning classifiers to determine if the program is with zero false positives for malware detection at a fraction
goodware or malicious. The paper uses two datasets with of the computation power of state-of-the-art methods but
collectively 700 malware samples and 160 features. Both incorporate a minimal feature set. The results of [30] demon-
datasets achieved approximately 91% accuracy with Ran- strate that deep learning is adequate for classifying Android
dom Forest (RF) Algorithm. [23] Examines 5,560 malware malware and that it is much more successful when additional
samples, detecting 94 % of the malware with minimal false training data is available. A permission-based strategy for
alarms, where the reasons supplied for each detection dis- identifying malware in Android applications is described
close key features of the identified malware. Another tech- in [31], which uses filter feature selection algorithms to pick
nique [24] exceeds both static and dynamic methods that rely features and implements machine learning algorithms such
on system calls in terms of resilience. Researchers demon- as Random Forest, SVM, and J48 to classify applications
strated the consistency of the model in attaining maximum as malware or benign. This research [32] provides a feature
classification performance and better accuracy compared to selection using the Genetic algorithm (GA) approach for
two state-of-the-art peer methods that represent both static identifying Android malware. For identifying and analyzing
and dynamic methodologies over for nine years through three Android malware, three alternative classifier techniques with
interrelated assessments with satisfactory malware samples distinct feature subsets were built and compared using GA.
89034 VOLUME 10, 2022

Another technique achieves satisfactory accuracy but there

FPR is very high with limited samples [33].
One of the important matters that has not been considered
by any of the studies is the sustainability of the model after the
advancement of applications. This issue is still a challenge for
our research as well. The model’s ability to classify will grad-
ually decrease over time when new features or evolved appli-
cations are created. Only [29] and [26] specify this issue and
introduce it as a drift concept, describing the low performance
of their systems after some time. Our research doesn’t imple-
FIGURE 4. Reverse engineering APK files architecture.
ment this problem as well, but we suggested some potential
studies to initiate solutions for models’ sustainability in the
research issues and challenges section. Another matter that
could arise in the field of implementing machine learning
algorithms is the ‘‘Multicollinearity Problem’’ which we have
discussed in the introduction section. This subject arises
due to the algorithms being dependent on multiple variables
embedded in these machine learning or deep learning models.
Although it is one of the rising issues in the area and could
be present in our study it would constitute better as separate
research. Our model is already incorporating a wide range of
evaluations and analysis of Android applications features sets
but this would be a great opportunity to further enhance the
models for future use. There are relevant studies that support
alleviating this challenge by detecting the model’s dependen-
cies in terms of comparing multiple models together and then FIGURE 5. Taxonomy of android manifest.
calculating the greater impact of the highest given model.
Authors in references [34], [35], [36] consider different tales
concerning different machine learning models to highlight present in the APK files and how they are when they are
and find out the measures for different model scenarios. reverse engineered by using a disassembler, in our case Jadx-
Tables 1 and 2 elaborates on the novelty of our approach gui. Fig.4 shows the process of apk file disassembly.
and compare state-of-the-art methodologies in several cate-
gories. Table 1 focuses on the key novel categories in terms 1) ANDROIDMANIFEST.XML
of malware samples, feature sets, the method proposed, accu- In the root folder of any reverse-engineered application, there
racy, false-positive rate, the level of API (increased complex must be an android Manifest.xml file. The Manifest file gives
application behavior) and system environment for data pro- essential information to the Mobile application, which is
cessing. It also explains that our sample set and feature set required by the framework before executing any code for the
is larger and achieve satisfactory accuracy with 0.3% FPR, app. The authorization process should protect the applica-
depicting the lowest false positives other than DroidSieve. tion’s key elements, which include the Operation, Service,
Our contribution lands on the upgraded API levels with large Content Provider, and Broadcast Receivers. These results
sample sizes including enhanced feature sets to detect mal- mainly accomplished by affiliating these components with
ware. Table 2 elaborates a more in-depth approach and shows the relevant element in its manifest definition and making
the key features present in the proposed and other approaches Android dynamically implement the features in the closely
with also the time awareness of the data being collected. associated contexts [28].
Fig. 5 shows the taxonomy of the Android manifest compo-
A. REVERSE ENGINEERED APPLICATIONS nents, which contain all the requested permissions, packages,
CHARACTERISTICS intents and features for extraction.
As for Android apps, various apps have various functionali-
ties. If the app is to use the device tools, you must specify the B. FEATURE SET EXTRACTION
corresponding allowances in the Android Manifest format. Using feature filtering decreases the dimensions of data col-
Different program forms, therefore, have different declara- lection by deleting functions that are not useful for study.
tions of prior approval [37], [38]. System static analysis also We chose the characteristics based on their capability to
identifies an application as malicious or benevolent. In clas- display all data sets. Enhanced efficiency by reducing the
sification, they make rational choices using features. The dataset size and the hours wasted on the classification process
article shows the taxonomy diagram for the features present introduces an effective function selection process. Our pro-
in Android applications [39]. It comprises all the components cess does not support a revamped Android emulator, because
VOLUME 10, 2022 89035

TABLE 1. Relative techniques analysis on basis of multiple factors in comparison to proposed approach (PER: Permissions, STR: String, API: Application
Programming Interface, INT: Intents, PKG: Package, APP-C: App Components, SR: Services, RS: Receivers).
it’s not a convenient approach and we preferred our sys- impact the high dimensionality of the data being used. Sup-
tem for physical devices in the future. Jadx carries out the port vector machine and AdaBoost can handle relatively
modification and evaluation of source code. The system con- well than other algorithms because of their high dimensional
centrated on trying to hook the byte-level API calls [40]. space/hyperplane sectioning. Another suspension for our
For our dataset, features from over 1, 00,000 applications datasets was the tool used for extracting the given datasets.
are extracted containing around 56000 extracted features. Androguard implements parsing and analyzing automation
Functions and processes of opcode API features are removed to further break down components of application apk’s after
from the disassembled Smali and Manifest files of an APK decompiling and encourages weighting of the data into
file. The Smali file, segmented by the process and the opcode binary, making it easy to use relevant data for classifica-
frequency of Dalvik for every method, is determined by tion. It uses certain functionality to get useful features from
scanning Dalvik Bytecodes. To verify invocation of a haz- manifest files of these Android applications reducing the
ardous API in that form, it is also possible to determine the acquiring irrelevant features. Although the data in this study
hazardous frequency of an API invocation for each method works significantly well for evaluation, however, the datasets
during the byte code search. For string functions, strings are will be needed to upgrade in terms of forthcoming evolving
selected without the method of isolation from the entire Smali measures.
archives [41]. Certain other authors have presented many tools and pro-
We will never have a predictable response when the posals to deal with high dimensional data such as [42], [43],
number of features inside a dataset exceeds the num- inducing multiple methods such as filtering wrapping to
ber of occurrences. In other terms, when we don’t have enhance robustness.
enough data to train our machine on, generating a struc- The feature set of our model includes:
ture that could capture the association between both F1 → Permissions
the predictive variables and responses variable appears F2 → API Calls
problematic. F3 → Intents
The system used in this study also incorporates larger F4 → App Components
feature sets for classification. Although this problem arises F5 → Packages
in machine learning quite often to some extent choosing F6 → Services
the type of model for detection or classification can highly F7 → Receivers
89036 VOLUME 10, 2022

TABLE 2. Relative techniques analysis on basis of features and sample collected in comparison to proposed approach.
1) PERMISSIONS API libraries are used for both proprietary and third-party
Permission is a security feature that limits access to certain users [46]. Table 3 shows some dangerous permissions that
information on smartphone, with the role of preserving sen- pose problems to the reverse engineered applications.
sitive data and functions that might be exploited to harm
the user’s experience. A unique label is assigned for every 2) INTENTS
permit, which typically denotes a limited operation. The per- The message delivered among modules such as activities,
missions are further categorized into four parts by Google: content providers, broadcast receivers, and services is known
normal, dangerous, signature, and SignatureOrSystem. For as Android Intent. It’s commonly used along with the star-
evaluating Android permissions, researchers take a variety of tActivity() function to start activities, broadcast receivers,
methods [44]. Standard (also called secure) levels of secu- and other things. Individual intent counts are exploited as
rity permissions, such as VIBRATE and SET WALLPAPER, a continuous feature in categorization. To provide more
are permissions without risk. Android kit installer will not specificity, we divide the list of intents into further sec-
allow the user to approve these permissions. The dangerous tions, such as intentions including the phrases (android.net),
security standard will pose warnings to the user before imple- which are linked to the network manager, intents including
mentation and will require the user’s consent. The signature (com.android.vending), for billing transactions, and intents
and symbol Security stages of SignatureOrSystem cover the addressing framework components (com.android) and prov-
riskiest permits. Only applications with the same certificate, ing to be harmful elements in these apps.
as the certificate used to sign the request declaring approval,
are allowed to sign signature permissions [45]. It also acts as 3) API CALLS
a buffer in the middle of hardware and the rest of the stack. Safe APIs are tools that are only available by the operating
A variety of different C/C++ core libraries, such as libc system. GPS, camera, SMS, Bluetooth, and network or data
and SSL, are being used in libraries. Dalvik virtual machines are some examples. To make use of such resources, the
and key libraries are part of the Android Run Time. App application must identify them in its manifest [47]. The Cost-
Model defines classes for developing Android applications, sensitive APIs are those that can increase cost through their
as well as a standardized structure for hardware control usages, such as SMS, data or network, and NFC. Each version
and the management of user experience and app property. includes these APIs in the OS-controlled set of protected
VOLUME 10, 2022 89037

TABLE 3. Dangerous permissions (Malware probability).
TABLE 4. Frequently deployed malware sensitive API Calls. When a broadcast is sent, the system automatically directs
it to applications that have signed up to receive that sort of
broadcast. Services, unlike activities, do not have a graphi-
cal user interface. They’re used to build long-running back-
ground processes or a complex communications API that
other programs may access. In the manifest file, all services
are represented by < service > elements and they allow the
APIs that require the device’s user’s sole permission. API developer to invalidate the structure of the application.
calls that grant sensitive information or device resources
are commonly detected in malicious codes. These calls are C. CLASSIFICATION
isolated and compiled in a different feature set so they might The collection of chosen features in the signature database,
contribute to harmful activity. Table 4 elaborates dangerous separated into training and test data, and is used to recog-
API features: nize android malware apps by traditional machine learning
techniques [49]. There are three different computer frame-
4) API COMPONENTS works: supervised learning, unsupervised learning, and rein-
The program that requires access or activity e.g., a path from forcement learning. The supervised learning method is the
point A to point B on a route predicated on a user’s location focus of this paper, comprises algorithms that learn a model
from another application makes a call to its API, stating from externally provided instances of known data and known
the data/functionality demands. The other software includes results to produce a theoretical model so that the learned
the data/functionality that the first program requested. For model predicts feedback about previous occurrences over
privacy reasons, some API features must be declared and new data [50]. The deployment of ensemble techniques and
not used in these apps. These components relate to broadcast strong learning classifiers helps classification of our binary
features present in these applications. feature sets, resulting in correctly categorized malware and
benign samples. We believe that these classification mechan-
5) PACKAGES, SERVICES AND RECEIVERS ics produce efficient outputs because of their sorting nature.
The package manifest has always been found in the package’s Fig. 6 explains the process of the learning model.
root and provides information about the package, such as A comparative algorithm selection for our model based
its registered name and sequence number. It also specifies on AdaBoost, Naive Bayes, Decision Tree classifier,
crucial data to convey to the user, such as a consumer name K-Neighbor, Gaussian NB, Random forest classifier, and
for the program that displays in the User Interface (UI). The Support Vector Machine performing a relative review which
file format is in .json for packages. will give an accurate analysis of the algorithm for the predic-
According to a publication process model, Android apps tion of our model.
can transmit and receive messages from the Android system
and other Android apps. When a noteworthy event occurs, 1) ALGORITHM CHARACTERISTICS APPRAISAL
these broadcasts are sent out. The Android system, for exam- The assessment of suggested algorithms was carried out using
ple, sends broadcasts when different system events occur, Python. The use of FPR and Accuracy assess our compara-
such as the system booting up or the smartphone charging. tive algorithms trials [51]. These estimates, derived from the
Individuals can sign up to receive certain broadcasts [48]. following basic factors, are listed further down:
89038 VOLUME 10, 2022

FIGURE 6. Machine learning process.
• Accuracy: Accuracy is one criterion being used to eval-

uate classification techniques. True Positive (TP) refers
to the number of malicious apps which were misclas- FIGURE 7. Flow analysis of our research.
sified as malicious, and False Negative (FN) identifies
the number of safe applications which were misiden- benign, a general scenario and likewise. This is an example of
tified as malicious. The number True Negative (TN) linear classification, but not all classifications are this basic,
measures the truly benign applications and FN denotes and functional groups are needed to differentiate between
the number of irregular apps that were wrongly labelled groups [59], [60].
as normal [52].
• False Positive Rate: Determines the measuring factor of III. PROPOSED METHODOLOGY
a model’s ability to identify correct apps or the model’s The major goal of our research is to determine which criteria
ability to generate FP. are most helpful in detecting malware in cell phones, partic-
(TP)m,b + (TN )m,b ularly those running Android. We have taken up the task to
(Accuracy)m,b = (1) train up to six machine learning algorithms such as AdaBoost,
All samples
Support Vector Machine, Decision Tree, KNN, Navies Bayes
(FP)m,b
(FPR)m,b = (2) and Random Forest techniques and classify these machine
(FP)m,b + (TP)m,b learning algorithms accurately. The methodology section is
Equations (1) and (2) demonstrate the accuracy of the false divided in two sections; Pre-Processing (explaining the pre-
detection rate measuring formula applied to calculate the requisite processing) and the Proposed Model (Model func-
Detection Rate (DR) and precision whereas variables (m, tionalities and components).
b) represent the malicious or benign applications w.r.t. True
Positive (TP), True Negative (TN) and False Positive (FP). A. PRE-PROCESSING
Accuracy of the classification dataset, which contains both APK files from numerous apps were included in the resulting
benevolent and malicious applications, our models define datasets (containing malware and benign characteristics). A
a hyperplane that divides both categories with the largest Jadx-Gui decompiler is then used to reverse engineer the
probability. One class is synonymous with ransomware and apk files to extract features from the Android manifest file’s
the other with friendly applications [53]. The authors then feature set for further processing. These stages are regarded
assumed the research data to be unknown applications, which as pre-processes from before real assessments and are essen-
are classified by projecting them to subspace to determine tial parts before any kind of testing and training using any
if they are on the malicious or friendly side of the hyper- predictive models.
plane [54]. Then, using our model will correlate all the regres- Androguard, an open-source tool that extracts prioritised
sion findings to their original reports to assess the proposed features from files and converts them into binary values,
model’s malware identification accuracy [55]. Static features is used to extract features. For labelling the false or accurate
make for a pleasing accuracy and precision of more than android application, we employ binary search techniques, i.e.,
90%. What’s more noteworthy is that defining the usage of 1 or 0 for benign and 1 or 0 for malware. Figure 7 shows our
API calls in a single part of the Android platform allows for technique’s pre-processing framework and flow structures,
the creation of the most representative function space or the which must be accomplished before the classifiers are tested.
resources where malicious and benign can be distinguished The operations embedded in the rectangle are to be deter-
more easily [56], [57]. If the amount of the classification tar- mined beforehand, ensuring efficient data collection. The
get is greater than the probability estimates, the classification main role in this is by the decompiler and extractor which
target of the testing data is then calculated as that label [58]. improves and eases the model’s data classification efficiency
The objects are Blue or Red; the dividing lines identify the for detection of malware applications. Although our study
border, so an object on the right side is called blue, meaning discusses the challenge of multi-collinearity and the use of
VOLUME 10, 2022 89039

FIGURE 8. Proposed methodology of our system.
high dimensional data being implemented, we have discussed

the better output for high-dimensional data in our feature
extracted section but the issue of collinearity still stands and
can be done as a novel contribution as future work.
Succeeding the extraction process and the use of efficient
datasets accommodating useful features, the testing and train-
ing are administered. For our model, a comparative approach
will be adopted based on Naive Bayes, Decision Tree classi-
fier, K-Neighbour, Gaussian NB, Random Forest classifier,
Support Vector Machine and AdaBoost. The comparison
evaluation will provide an accurate assessment of the algo-
rithm used to forecast our model. The installation package is
a ZIP-compressed bundle of files that includes the manifest FIGURE 9. Training model Processing.
file (AndroidManifest.xml) and classes.dex. The manifest
are ready for classification by predictive models. We adopted
file describes an Android application, namely the activities,
Python to create a machine algorithm classification perfor-
services, broadcast receivers, and content providers that make
mance program after collecting the dataset, and then we’ll
up the system. The methodology and the classification are
employ the best accurate algorithms to train our models for
explained before in the related work section. The next section
malware and benign detection. The system’s approach and
describes the model functionality.
its operation are detailed in figure 8, which depicts the whole
B. PROPOSED MODEL methodology of our model and algorithm learning phase with
The model gathers information from many Android applica- the training model processing for detection.
tions (Google Play). These reverse-engineered (decompiled Figure 9 shows us the training cycle of the program and
through Jadx-Gui) apps are then subjected to static analysis how the model first is constructed and then evaluated. Then
to extract features. Our suggested approach in figure 8, for further on the data is cycled towards testing and that is the
the training phase, uses the retrieved characteristics to create data fed to the trained model for further prediction analysis
vector mapping parsed through Androguard. The contribution of the android applications.
is indicated by the proposed feature section that encompasses The future threats and predictions pointed out in the next
nearly 56,000 extracted features from the feature set seen in section state insecure android applications which contain
figure 8. Those collected features are then composed in a form unnecessary permissions, and opt for an easy way for an
of a dataset .csv file, stating the benign and malware prop- attacker to steal private data or launch major attacks, and later
erties in 1 or 0. After we generate the datasets, the features on, present the methodology of our research.
89040 VOLUME 10, 2022

FIGURE 10. Graph of application threat increase by 5%.
IV. FUTURE THREATS AND PREDICTION

By 2020, mobile applications will be installed onto consumer
devices over 205 billion times. Statistics by Marketing Land
FIGURE 11. Increase in android malware statistics.
suggest that 57 percent of the overall digital content time is
spent on mobile devices. Our daily activities always depend
on social networking, bank transfers, business operations,
and mobile managed services applications. Accommodating
over two billion individuals, almost 40% of the world’s total
population, Juniper Sources point to the number of those
using mobile banking services. These predictions and future
threats are based on theoretical data collected through exten-
sive survey of journals, online forums and research articles.
Developers devote close attention to the development of
software to provide us a comfortable and seamless experi-
ence and when someone enthusiastically installs these mobile
applications requiring personal information, the user stops
thinking about the security consequences. This is the reason
people don’t even look closely at the permissions or the
feature updates being asked by the applications [61]. They
simply download the application they want and, when asked
for installation, they overlook everything else and start using
the app. Most of these applications never even ask the consent FIGURE 12. Third-party well-known dangerous apps increase
of the consumer and the hackers are using their information from 2013 to 2020.
without their knowledge. The future threat rises, at the end of
that if nothing is done on time, these applications will increase
2020 and beginning of 2021:
up to a higher number in the future.
• 70% of Google Play Store applications require access to According to multiple tech reviews, each one published
one more ‘‘dangerous permission and packages, up from in 2021, states that according to research of 2,500 top-of-
66.6% in Q12020, which is a 5 percent raise’’. 69.4% of the-line and rising applications, over two portions of the
applications for children (13 years of age) claim at least most popular Android applications on Google Play request
one risky permit up from 68.8% in 2020 (a 1 percent excessive user permissions and access. These allow apps,
rise). among other unwanted behaviors, to launch harmful scripts
• Over 2.3 million applications altogether, over 2.1 mil- and access messages unnecessarily with unwanted features
lion applications for children need at least one harmful inbuilt [62]. They stated that with the increase in usage of
authorization. application components and features and also the release of
Figure 10 shows the percent hit in 2020, proceeding to new Android frameworks and APIs each year. It is most likely
2021 on both the application for permission criteria. As per that threats are surely to increase by 15% from 5%. The aver-
these statistics, the predicted rate in the coming years (till age Android user has roughly 80 applications loaded, thus at
2025) proposes that there could be a grave danger because least one app on the phone demands additional authorization
of these unnecessary access as per each level of the Android on the phone. It is likely that excessive authorizations may
API. Figure 10 shows the representation of both the factors, jeopardize user data and privacy or even allow device hacks.
application for everyone and the other for application kids for Figure 11 elaborates the dangerous malware increase till
the year 2019-2020. The graph shows the increase in 5% of 2020 with every newer version of API Level. Figure 12 shows
the applications with dangerous malware. This takes a great the most rising apps from 2016 to 2021 and the percent-
deal of application security and also depicts the futuristic way age of dangerous permissions, packages these applications
VOLUME 10, 2022 89041

gain [63]. These applications are used daily and if they are TABLE 5. Sample datasets.
involved in unnecessary and third-party access, then there is a
special need to apply countermeasures on these applications,
as this is going to be a major threat in the future.
Also, the Figure depicts the need to measure these threats
and devise countermeasures or at least present models to
provide more encoded procedures to carry out for these well-
known applications. These apps provide a lot of opportuni-
ties, but with an increase in private and intellectual property TABLE 6. Datasets ratio (Training & Testing), MalD (MalDroid), DefenseD
(DefenseDroid), GD (Generated Dataset), Pre-Pro (Pre-Processing).
stored in these apps, certain anecdotes need to be proposed.
V. EXPERIMENTAL RESULTS
In this section, the results of our experimentation are stated.
To start our experimentation discussion, we will elaborate on
the basic criteria for performing our implementation success-
fully and will also briefly discuss the data collection or the
dataset that we got and then further converse about the actual
contribution part.
A. EXPERIMENT SETUP
Our environment is based on Windows 8.1 Pro with Intel , R
Core (MT) i5-2450 CPU, 2.50 GHz as a processor. The
installed memory (RAM) of the system is 4.00 GB with a
64-bit Operating System (OS), x-64 based processor.
For the generated dataset Androguard 3.3.5 (latest release)
is used for decompiling and feature extraction, deployed
in regulated .csv files in binary vectors. We have installed
Python 3.8.12 (version 3.8) on our system for the implemen-
tation and execution of training and testing scripts of imported
machine learning models.
B. DATASET FIGURE 13. Boosting mechanism.

Three different datasets are used for our implementation,
mainly apps belonging to Google Play. The static features The next subsection elaborates the discussion and presen-
of our first two datasets containing API calls, permissions, tation of the programs for our machine learning algorithms.
intents, packages, receivers and services were collected from
MalDroid [64] and DefenseDroid [65] which includes around C. MACHINE LEARNING ALGORITHM AND ENSEMBLE
14,000 malware samples. The model also uses a third dataset LEARNING
of around 6000 malware samples and 2421 benign samples Six models have been selected to experiment with two strong
using our own generated application’s dataset. Applications classifiers (AdaBoost, SVM and Random Forest). The model
in the datasets were randomly selected from Google Play executes upon KNN, NB, RBF, Decision Tree, SVM and we
and then reverse-engineered by the Jadx-GUI tool to acquire have also performed AdaBoost with Decision Tree by cal-
their APK’s. The features present in our own selected appli- culating the weighted error of the Decision tree based on its
cations are then extracted using Androguard into binary data. data points. As the input parameters are not jointly optimized,
All the datasets from different platforms are combined to Adaboost is less prone to overfitting. Adaboost can help you
incorporate our multiple features sets more than state-of-the- to increase data performance of existing weak classifiers.
art approaches (explained in table 5) in a single training to After the higher weight of all the wrongly misclassified data
achieve higher accuracy and classification of malware. The points is rightly classified, the model can enhance model
datasets are first trained on every algorithm for comparative accuracy. Figure 13 shows the functioning of the boosting
classification analysis. After the accuracy of the algorithms technique.
are evaluated, the dataset is again trained and tested on the Since, there is a distinct boundary between two categories,
higher-performing algorithms to use as a feed, based on the ensemble methods and SVM perform rather well enough
features, inserted into the database and our model will then when dealing with clear aligned datasets following adequate
forecast the output for a given android application extracted extraction processes. Another significant benefit of the SVM
features. Table 6 represents the datasets training and testing Algorithm is that it can handle high-dimensional data, which
ratio and number of columns before and after pre-processing. comes in handy when it comes to its use and application in
89042 VOLUME 10, 2022

FIGURE 14. Representation of the modules of our program.
FIGURE 16. Fit and pred function for SVM.
FIGURE 17. Predictive measures for AdaBoost.

FIGURE 15. Program parameters and split functions.
the Machine Learning sector. As seen in the diagram above,

AdaBoost’s greater weighted property aids our weak learner
(Decision Trees) with achieving higher accuracy and wider
consumption for misclassified binary feature inputs. FIGURE 18. Results stored to acc variable and plotted by plt.bar function.
D. PROGRAM PARAMETERS
Our project is based on Python 3.9.7 and divided our execu- missions features through the dataset binary values (0.1) and
tion into two programs. The first program, written to compare specifies those results in function pred () As you can see in
the algorithms for the accuracy check of respective mod- the code below, the program uses a fit () function, which takes
els, based on AdaBoost, Decision Tree, KNN, SVM, Naive the training data as arguments that are fitted using the x and
Bayes, and Random Forest for the comparative analysis. The y parameters into testing data for our two models (AdaBoost
program uses different import and split functions to train the and SVM). All the variables were specified at the end that
models and then stores the result in a variable embedded was given to each of our algorithms in the program to the
for the testing model. The function sklearn.model_selection, variable acc. After executing the program, every algorithm
used for accessing the bundles of algorithms, accuracy_score will start accessing the dataset and start predicting the dataset
for accuracy readings, pandas to read the database, and value for the android features. Figures 16 and 17 represent
NumPy to convert the testing model data into rr format. the main key functions for our models AdaBoost and SVM,
The parameter on the x-axis is the features of the algo- which are discussed above.
rithms and on the y-axis is its label (figure 19), meaning the Figure 17 also explains the predictive procedure of the
accuracy percentage for these algorithms. The x (accuracy ensemble model with 1000 malware sample runs and given
of the models) and y (labels of the models) parameters of features to train for a single predictive classification output.
the program are configured to shuffle = True using the The same fit() function is used for dataset training. The
test_train_split function, so each algorithm takes a random model is placed for higher weights of decision trees algorithm
permission value from the dataset. Figures 14 and 15 show within row values and executed in yhat. Accuracy is then
the import modules and parameters values set in our program. accomplished by declaring the mean and standard deviation
First, all the algorithms are imported into the program (mean (n_acc_scores), std (n)acc_score))) for the binary clas-
to implement the training data for the model, meaning the sification output of malware. Further ahead, Figure 18 shows
machine is training based on the given datasets. The program the plotted assigned value for accuracy after the data is trained
will work as each algorithm will take up random binary value on the models.
of an app from the dataset and execute its feature’s accuracy Figure 19 shows the accuracy percentage for our models
score in another variable. After training the data, the program which is 96.24% and the graph displays the highest correct
passes the testing data to store into a predictive function. The predictive frequency out of all the algorithms, professing
program is designed to identify the normal and harmful per- the research work for greater validity. This graph is plotted
VOLUME 10, 2022 89043

TABLE 7. Shows the label values for each algorithm and their accuracy
percentage.
FIGURE 19. Models accuracy percentage w.r.t label.
by training the algorithms on the datasets to verify which

algorithm can classify the application’s features accurately.
Program 1 (python script for models accuracy) is scripted
to import all of the algorithms and execute them one by
one on these datasets to train the algorithms, producing the
most precise values after testing. In the case of AdaBoost,
we trained Decision Tree first on the dataset and then used
those classified values to train on the higher weights using
AdaBoost. AdaBoost takes those classified samples and fea-
tures used by decision trees and generates higher weights for
correct results after training on those features again. (x, y) are
the stored values by decision trees which are given as input
values for AdaBoost to enhance accuracy, hence the model
with the highest accuracy in fig. 19. This program performs
in a way that when all the models are done training, the script
FIGURE 20. Prediction function for SVM for testing data for the database.
generates a graph using the plt.bar command to display the
algo that classifies most applications correctly. Figure 19 and
Table 7 show the accuracy and the label value that depicts
the training data each algorithm randomly took and trained Following the import of the trained models, the
its model for. random_state = 0 and the testing data = 0.25 for the
algorithms. The import of sklearn.preprocessing_normalize
VI. MODEL PRECISION EVALUATION function, which takes samples separately according to the
After training the datasets on algorithms and achieving Normalize unit. Every set of data with one component or per-
accuracy percentage, individually developed another pro- haps more (each data matrix row), rescaled separately from
gram that uses the properties of the previous code to help other samples to the standard. The program also imports the
execute and predict the application state according to the function sklearn.features_extraction.text which transforms
input from the dataset. For this program, the algorithm with a text data array into a token count matrix and at the very
greater prediction capabilities is imported, i.e. AdaBoost end declares the accuracy score of these algorithms by using
and SVM using the function sklearn imports linear_svc and sklearn.metrics function, implementing loss, score, and util-
sklearn.ensemble import AdaBoost. The database stores input ity functions to quantify performance in the categorization of
features into the rr python module as a feeding factor for the the feature sets. Parameters for this program are the same as
trained models and designated 1 for the benign applications the previous program, but to fix features on every algorithm,
and 0 for the malware application, meaning the app which the x type is dedicated to the trained models for features
uses unnecessary features will give the output 0, helping the and y type for the prediction of the applications. So when
use understand that this is a malicious app. This will work in a the program executes it will work in the same manner and
way that, when the program executes, the algorithms will take this time gives us the precision value instead of the plotted
the input from the database and then categorize the features accuracy percentage of the algorithms and at last, the program
based on what we trained the algorithm upon. So, if there will print out the pred () function value which was declared
are malware applications fed as an input to the database, the to the model’s testing data. Figures 20 and 21 indicate the
trained model will predict the outcome and label the state of consideration of AdaBoost and SVM prediction for features
the application. extracted for single feature input.
89044 VOLUME 10, 2022

FIGURE 21. Prediction function for AdaBoost for testing data for the
database.
FIGURE 22. Output [1] representing the benign application (SVM).
FIGURE 23. Output [0] representing the malware application (SVM). FIGURE 26. Orange entries for hon-harmful applications in AdaBoost.
FIGURE 24. Output [1] representing the benign application (AdaBoost).
FIGURE 25. Output [0] representing the malware application (AdaBoost).
Further ahead, the prediction results of the program are

discussed. As the code executes, the models will take the
features from the dataset that was provided for a single FIGURE 27. Black entries for harmful applications in AdaBoost.
application. The result displayed in Figure 22 shows that it’s
a benign application. When permission features, again fed
as input the Figure 23 shows that it is a malware applica-
tion based on the features the highly trained models draw
out. In the same manner, the database is fed with feature
binary values and the model will predict the result in 1 or 0.
Figures 16 and 17 elaborate on the predictive function which
will allow AdaBoost and SVM to predict the basis of the
applications on the feeding input. Figures 22, 23, 24 and 25
are output screenshots of 1 showing benign and 0 for harmful
applications with random application features for respective
models.
A. RESULTS
After the forecast of our models, results show that the accu-
racy for our highest predictive systems is 96% and 92%. The
proposed model doesn’t peak in higher accuracy or predictive
FIGURE 28. Orange entries for non-harmful applications in SVM.
rate but it contributes by introducing enhanced and large fea-
ture sets (containing around 56000 newly extracted features)
with the latest API level applications datasets collected in sources/environment to process and generate these datasets
recent years than state-of-the-art approaches. Another point on our models. The novelty and contributions are explained
of view for a less predictive rate is the limitation of our in Tables 1 and 2.
VOLUME 10, 2022 89045

TABLE 8. Experimental results (AdaBoost and SVM), Selected, specify

features selected in the model, MalD (MalDroid), DefenseD
(DefenseDroid), GD (Generated Dataset), FPR (False Positive Rate), Acc
(Accuracy).
FIGURE 29. Black entries for Harmful applications in SVM.
Figures 26, 27, 28 and 29 show the runs performed on

the datasets on our trained model. The applications in orange We use Accuracy and FPR as evaluation markers in this
indicate not harmful apps and only passes sensitive features project. Precision is computed as the percentage of true
over the line, which doesn’t pose that much of a threat for the harmful samples in the malware tagged by the detection sys-
application, but it still shows the model issue for indicating tem, showing the system’s capacity to discriminate malware
true negatives for zero apps. The applications in black indi- properly in the field of malware detection. False Positive
cate harmful applications and the false positive rate (FPR) Rate (FPR) is the criteria to judge the model’s performance
of this category which falls over the non-harmful apps is in terms of establishing how many true indications a model
about 3-4 applications in case of AdaBoost and 6-7 in case of gives. Below are the experimental results in quantitative
SVM in our system for 1000 runs, as shown in figures above measures, presented in table 8, which explains the points
achieved with 96% and 92% accuracy of AdaBoost and SVM. based on accuracy, false positive rate and their predictive
All four figures are plotted in a hyperplane which describes measures after testing on binary input for 1000 runs on our
the applications classifications in two sections i.e. Harmful 2 higher predictive models depending on testing and training
and Non-harmful applications. The above line represents the of mixed datasets containing features and malware samples.
harmful apps section (Black and Red) and applications lying The operational speed advantage of AdaBoost is not apparent
below the line indicated non-harmful applications. The plot- when adopting the datasets for classification and prediction.
ted hyperplanes help in understanding the prediction appli- However, given AdaBoost structural features with parallel
cations perspective as shown in fig 27 and 29 showing suc- learning, we anticipate it will perform better while computing
cessful classification above the line and 3-4 apps below line bigger data sets. We reached the same conclusion after we
indicating misclassifications. The same process is for non- analyzed a much bigger data set with over 500,000 apps.
harmful apps in orange colors (fig 26, 28) and the above line In table 8, both models are compared and trained on
shows misclassifications but they don’t pose serious threats. datasets and specify the accuracy, FPR and features used and
The Forthcoming is the comparative review of both mali- selected corresponding to the composing samples. The FPR is
cious and benign applications of our models and experi- also presented in figures 26 to 28 above, specifying the calcu-
mental results with accumulative accuracy and FPR. The lative measures through a hyperplane. The accuracy and false
purpose to plot a comparative graph of malware detection is positives have been measured by the equation described in
to understand the relative perspective of both our parameters. section IV in algorithm characteristics for the number of runs
Figure 30 represents a comparative analysis of both models of the model. Results show 96.24% as the highest accuracy
in terms of malicious and benign applications. Triangles in for the model after experimentation and false-positive rate of
red represent the classification and detection of AdaBoost 0.3% in the case of the ensemble approach.
and in the square, the SVM is displayed. The graph shows a Related works explain the originality of our model and
malware section angle for the executive runs performed and exhibit the novel features and sample size. To conclude our
the values above the hyperplane shows the category of Non- model still lack fewer percentages in terms of accurate detec-
Harmful apps. The 0.7 misclassification rate of SVM and tion. To justify this fact, table 9 presents some properties of
0.3 of AdaBoost is plotted with malware applications falling similar studies with higher performance rates, indicating such
into the true positive category. elements which elaborated the efficiency of our system.
Nevertheless, the models perform with 96.24% accuracy [29] This model has exceptional computational/processing
by accurately predicting the applications categories. power with a much stronger environment to test and train
89046 VOLUME 10, 2022
FIGURE 30. Comparative analysis of malicious and benign in Adaboost and SVM.
TABLE 9. Relative resources (Pro = Processing), (Acc = Accuracy), (FPR = combination with methods based on AI or machine
False Positive Rate).
learning, such as inept learning, to make the detection
more sophisticated to make it easier to identify and
regulate app prediction rate.
3. Application behaviours in the malware ecosys-
tem encourage non-emerging threats. Our study
doesn’t incorporate the rider analysis or behaviour
of repackaged malware. The study simply uses the
reverse-engineered apk files and extracts the given
context to the AndroGuard and extracts features in
binary vectors. Although this is a major issue and a key
challenge with the advancement in Android malware.
This approach will be our advanced project to perform
differential or effective analysis on reverse applica-
tions, determining the effects of these applications and
their datasets. [24] Has somewhat of a similar resource with their results.
higher processing but their sample size is very limited in 4. The applications with time induce new features with
comparison to our model. A few other studies describe similar enhanced malware abilities which is why we would
technical advantages, thus, leaving us to work with restrictive have to upgrade the system whenever the model’s FPR
measures. Table 9 presents some key properties to elaborate rate after execution increases. The simplest explana-
on similar systems’ components. tion for how to identify if the model is degrading on
evolved features is that our datasets are designed in
VII. RESEARCH ISSUES AND CHALLENGES binary matrix extracted from features that are currently
This section highlights our experiment’s prevalent and crucial implemented in these applications and not features that
topics. These hurdles are based on various stages of our work will be present in evolved apps in coming years. With
and maybe gradually rectified in the work to be undertaken new features, we would have to reverse and extract
in the future. those features to form an updated dataset again to train
1. Features declared mostly on the device are more on these classifiers. 66], [67], [68] and [69] discuss
durable than the features specific to the applications the possible solutions for this key issue and propose
and therefore can usually automate malware detection. some possible solutions but for our model and given the
The range of android parameters for processing is resource we have only performed for current features.
rather big and difficult to detect properly if someone For future work, we will consider model sustainability
does not extract the features properly. and how to classify the malware that our system will
2. There is still a fast increase in the number of apps. be able to detect even if the features are not yet imple-
Malware apps can always be identified in potential in mented.
VOLUME 10, 2022 89047

5. The research mentions the problem of multicollinearity machine algorithms before employing them is also a promis-
in the introduction, depicting the rise of dependent vari- ing field.
ables in-between machine learning algorithms which
REFERENCES
cause interpretation in results. However, this field of [1] Android (GOOG) Just Hit a Record 88% Market Share of All
study can be taken as a future work for further testing Smartphones—Quartz. Accessed: Jan. 28, 2022. [Online]. Available:
https://qz.com/826672/android-goog-just-hit-a-record-88-market-share-
of several models handling multicollinearity because
of-all-smartphones/
our model itself is already performing high processing [2] A. O. Christiana, B. A. Gyunka, and A. Noah, ‘‘Android malware detection
detection schemes to generate accuracy for Android through machine learning techniques: A review,’’ Int. J. Online Biomed.
Eng., vol. 16, no. 2, p. 14, Feb. 2020, doi: 10.3991/ijoe.v16i02.11549.
applications features malware. We will foresee this [3] D. Ghimire and J. Lee, ‘‘Geometric feature-based facial expression recog-
issue and incorporate it to produce an efficient solution nition in image sequences using multi-class AdaBoost and support vec-
to the problem. Authors in [70], [71], [72] proposes tor machines,’’ Sensors, vol. 13, no. 6, pp. 7714–7734, Jun. 2013, doi:
10.3390/s130607714.
some solutions to tackle this challenge and can help
[4] R. Wang, ‘‘AdaBoost for feature selection, classification and its relation
understand viewers queries. with SVM, a review,’’ Phys. Proc., vol. 25, pp. 800–807, Jan. 2012, doi:
10.1016/j.phpro.2012.03.160.
A. LIMITATIONS [5] J. Sun, H. Fujita, P. Chen, and H. Li, ‘‘Dynamic financial distress prediction
with concept drift based on time weighting combined with Adaboost sup-
The technique in this paper is based on binary classification of port vector machine ensemble,’’ Knowl.-Based Syst., vol. 120, pp. 4–14,
lightweight code of static feature sets present in the Android Mar. 2017, doi: 10.1016/j.knosys.2016.12.019.
manifest file. The three major limitations of our method are: [6] A. Garg and K. Tai, ‘‘Comparison of statistical and machine learning meth-
ods in modelling of data with multicollinearity,’’ Int. J. Model., Identificat.
1. The research doesn’t include dynamic or runtime appli- Control, vol. 18, no. 4, p. 295, 2013, doi: 10.1504/IJMIC.2013.053535.
[7] C. P. Obite, N. P. Olewuezi, G. U. Ugwuanyim, and D. C. Bartholomew,
cation features. We will consider the potential dynamic ‘‘Multicollinearity effect in regression analysis: A feed forward artifi-
aspects of Android applications in the future, including cial neural network approach,’’ Asian J. Probab. Statist., vol. 6, no. 1,
real-time permissions and API requests and possible pp. 22–33, Jan. 2020, doi: 10.9734/ajpas/2020/v6i130151.
[8] W. Wang, M. Zhao, Z. Gao, G. Xu, H. Xian, Y. Li, and X. Zhang, ‘‘Con-
features extracted. We will evaluate the behavioural structing features for detecting Android malicious applications: Issues,
traits of the app using a mixture of dynamic and static taxonomy and directions,’’ IEEE Access, vol. 7, pp. 67602–67631, 2019,
evaluation to discover harmful tendencies. doi: 10.1109/ACCESS.2019.2918139.
[9] B. Rashidi, C. Fung, and E. Bertino, ‘‘Android malicious application detec-
2. Our system lags in future sustainable operative mea- tion using support vector machine and active learning,’’ in Proc. 13th Int.
sures, meaning the system will need to be upgraded in Conf. Netw. Service Manage. (CNSM), Tokyo, Japan, Nov. 2017, pp. 1–9,
terms of forthcoming API levels and malware collec- doi: 10.23919/CNSM.2017.8256035.
[10] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-An, and H. Ye, ‘‘Significant
tion or terms of new innovative features present in these permission identification for machine-learning-based Android malware
Android applications. detection,’’ IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 3216–3225,
3. The constraint of a slow and low processing environ- Jul. 2018, doi: 10.1109/TII.2017.2789219.
[11] G. Suarez-Tangil, J. E. Tapiador, P. Peris-Lopez, and J. Blasco, ‘‘Dendroid:
ment is another motive for less accuracy and predictive A text mining approach to analyzing and classifying code structures in
measures of our model in comparison to a few other Android malware families,’’ Exp. Syst. Appl., vol. 41, no. 4, pp. 1104–1117,
peak detection techniques achieving higher accuracy. Mar. 2014, doi: 10.1016/j.eswa.2013.07.106.
[12] M. Magdum, ‘‘Permission based mobile malware detection system using
machine learning,’’ Techniques, vol. 14, no. 6, pp. 6170–6174, 2015.
VIII. CONCLUSION [13] M. Qiao, A. H. Sung, and Q. Liu, ‘‘Merging permission and API fea-
tures for Android malware detection,’’ in Proc. 5th IIAI Int. Congr. Adv.
In this research, we devised a framework that can detect Appl. Informat. (IIAI-AAI), Kumamoto, Japan, Jul. 2016, pp. 566–571, doi:
malicious Android applications. The proposed technique 10.1109/IIAI-AAI.2016.237.
takes into account various elements of machine learning and [14] D. O. Sahin, O. E. Kural, S. Akleylek, and E. Kilic, ‘‘New results on
permission based static analysis for Android malware,’’ in Proc. 6th Int.
achieves a 96.24% in identifying malicious Android appli- Symp. Digit. Forensic Secur. (ISDFS), Antalya, Turkey, Mar. 2018, pp. 1–4,
cations. We first define and pick functions to capture and doi: 10.1109/ISDFS.2018.8355377.
analyze Android apps’ behavior, leveraging reverse appli- [15] A. Mahindru and A. L. Sangal, ‘‘MLDroid—Framework for Android
malware detection using machine learning techniques,’’ Neural Comput.
cation engineering and AndroGuard to extract features into Appl., vol. 33, no. 10, pp. 5183–5240, May 2021, doi: 10.1007/s00521-
binary vectors and then use python build modules and split 020-05309-4.
shuffle functions to train the model with benign and malicious [16] X. Su, D. Zhang, W. Li, and K. Zhao, ‘‘A deep learning approach
to Android malware feature learning and detection,’’ in Proc. IEEE
datasets. Our experimental findings show that our suggested Trustcom/BigDataSE/ISPA, Tianjin, China, Aug. 2016, pp. 244–251, doi:
model has a false positive rate of 0.3 with 96% accuracy in 10.1109/TrustCom.2016.0070.
the given environment with an enhanced and larger feature [17] K. A. Talha, D. I. Alper, and C. Aydin, ‘‘APK auditor: Permission-based
Android malware detection system,’’ Digit. Invest., vol. 13, pp. 1–14,
and sample sets. The study also discovered that when dealing Jun. 2015, doi: 10.1016/j.diin.2015.01.001.
with classifications and high-dimensional data, ensemble and [18] A. Mahindru and P. Singh, ‘‘Dynamic permissions based Android
strong learner algorithms perform comparatively better. The malware detection using machine learning techniques,’’ in Proc. 10th
Innov. Softw. Eng. Conf., Jaipur, India, Feb. 2017, pp. 202–210, doi:
suggested approach is restricted in terms of static analy- 10.1145/3021460.3021485.
sis, lacks sustainability concerns, and fails to address a key [19] U. Pehlivan, N. Baltaci, C. Acarturk, and N. Baykal, ‘‘The analysis
multicollinearity barrier. In the future, we’ll consider model of feature selection methods and classification algorithms in permis-
sion based Android malware detection,’’ in Proc. IEEE Symp. Comput.
resilience in terms of enhanced and dynamic features. The Intell. Cyber Secur. (CICS), Orlando, FL, USA, Dec. 2014, pp. 1–8, doi:
issue of dependent variables or high intercorrelation between 10.1109/CICYBS.2014.7013371.
89048 VOLUME 10, 2022
[20] M. Kedziora, P. Gawin, M. Szczepanik, and I. Jozwiak, ‘‘Malware detec- [38] C. L. P. M. Hein, ‘‘Permission based malware protection model for Android
tion using machine learning algorithms and reverse engineering of Android application,’’ presented at the Int. Conf. Adv. Eng. Technol., Mar. 2014,
Java code,’’ Int. J. Netw. Secur. Appl., vol. 11, no. 1, pp. 1–14, Jan. 2019, doi: 10.15242/IIE.E0314102.
doi: 10.5121/ijnsa.2019.11101. [39] G. L. Scoccia, S. Ruberto, I. Malavolta, M. Autili, and P. Inver-
[21] X. Liu and J. Liu, ‘‘A two-layered permission-based Android mal- ardi, ‘‘An investigation into Android run-time permissions from the
ware detection scheme,’’ in Proc. 2nd IEEE Int. Conf. Mobile Cloud end users’ perspective,’’ in Proc. 5th Int. Conf. Mobile Softw. Eng.
Comput., Services, Eng., Oxford, U.K., Apr. 2014, pp. 142–148, doi: Syst., Gothenburg, Sweden, May 2018, pp. 45–55, doi: 10.1145/3197231.
10.1109/MobileCloud.2014.22. 3197236.
[22] Permission-Based Android Malware Detection | Semantic Scholar. [40] P. Topark-Ngarm, ‘‘Identifying Android malware using machine learning
Accessed: Oct. 31, 2021. [Online]. Available: https://www. based upon both static and dynamic features,’’ M.S. thesis, Victoria Univ.
semanticscholar.org/paper/Permission-Based-Android-Malware- Wellington, Wellington, New Zealand, 2014, p. 87. [Online].
Detection-Aung-Zaw/c8576b5df33813fe8938cbb19e35217ee21fc80b Available: https://ecs.wgtn.ac.nz/foswiki/pub/Main/IanWelch/pacharawit-
[23] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, and K. Rieck, ‘‘Drebin: thesis.pdf
Effective and explainable detection of Android malware in your pocket,’’ [41] N. Milosevic, A. Dehghantanha, and K.-K. R. Choo, ‘‘Machine learning
presented at the Netw. Distrib. Syst. Secur. Symp., San Diego, CA, USA, aided Android malware classification,’’ Comput. Electr. Eng., vol. 61,
2014, doi: 10.14722/ndss.2014.23247. pp. 266–274, Jul. 2017, doi: 10.1016/j.compeleceng.2017.02.013.
[24] H. Cai, N. Meng, B. G. Ryder, and D. Yao, ‘‘DroidCat: Effective Android [42] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, Feature
malware detection and categorization via app-level profiling,’’ IEEE Trans. Selection for High-Dimensional Data. Cham, Switzerland: Springer, 2015,
Inf. Forensics Security, vol. 14, no. 6, pp. 1455–1470, Jun. 2019, doi: doi: 10.1007/978-3-319-21858-8.
10.1109/TIFS.2018.2879302. [43] B. Pes, ‘‘Ensemble feature selection for high-dimensional data: A stability
[25] P. Rovelli and Ý. Vigfússon, ‘‘PMDS: Permission-based malware detection analysis across multiple domains,’’ Neural Comput. Appl., vol. 32, no. 10,
system,’’ in Information Systems Security, vol. 8880, A. Prakash and pp. 5951–5973, May 2020, doi: 10.1007/s00521-019-04082-3.
R. Shyamasundar, Eds. Cham, Switzerland: Springer, 2014, pp. 338–357, [44] A. Hamidreza and N. Mohammed, ‘‘Permission-based analysis of Android
doi: 10.1007/978-3-319-13841-1_19. applications using categorization and deep learning scheme,’’ in Proc.
[26] M. S. Alam and S. T. Vuong, ‘‘Random forest classification for MATEC Web Conf., vol. 255, 2019, p. 05005, doi: 10.1051/matec-
detecting Android malware,’’ in Proc. IEEE Int. Conf. Green Comput. conf/201925505005.
Commun. IEEE Internet Things IEEE Cyber, Phys. Social Comput., [45] T. Boksasp and E. Utnes, ‘‘Android apps and permissions: Security
Beijing, China, Aug. 2013, pp. 663–669, doi: 10.1109/GreenCom- and privacy risks,’’ M.S. thesis, Dept. Telematics, Norwegian Sci.
iThings-CPSCom.2013.122. Technol., Trondheim, Norway, 2012, p. 143. [Online]. Available:
[27] D. Congyi and S. Guangshun, ‘‘Method for detecting Android malware https://ntnuopen.ntnu.no/ntnu-xmlui/bitstream/handle/11250/262677/
based on ensemble learning,’’ in Proc. 5th Int. Conf. Mach. Learn. 566356_FULLTEXT01.pdf?sequence=1
Technol., Beijing, China, Jun. 2020, pp. 28–31, doi: 10.1145/3409073. [46] N. Yadav, A. Sharma, and A. Doegar, ‘‘A survey on Android malware
3409084. detection,’’ Int. J. New Technol. Res., vol. 2, no. 12, p. 7, 2016.
[28] W. Li, J. Ge, and G. Dai, ‘‘Detecting malware for Android platform:
[47] F. I. Abro, ‘‘Investigating Android permissions and intents for malware
An SVM-based approach,’’ in Proc. IEEE 2nd Int. Conf. Cyber Secur.
detection,’’ City Univ., London, U.K., Tech. Rep., 2014, p. 5 and 169.
Cloud Comput., New York, NY, USA, Nov. 2015, pp. 464–469, doi:
[48] M. Magdum and S. K. Wagh, ‘‘Permission based Android malware detec-
10.1109/CSCloud.2015.50.
tion system using machine learning approach,’’ Int. J. Comput. Sci. Inf.
[29] G. Suarez-Tangil, S. K. Dash, M. Ahmadi, J. Kinder, G. Giacinto,
Secur., vol. 14, no. 6, Jun. 2016.
and L. Cavallaro, ‘‘DroidSieve: Fast and accurate classification of
[49] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, ‘‘A multimodal deep
obfuscated Android malware,’’ in Proc. 7th ACM Conf. Data Appl.
learning method for Android malware detection using various features,’’
Secur. Privacy, Scottsdale, AZ, USA, Mar. 2017, pp. 309–320, doi:
IEEE Trans. Inf. Forensics Security, vol. 14, no. 3, pp. 773–788, Mar. 2019,
10.1145/3029806.3029825.
doi: 10.1109/TIFS.2018.2866319.
[30] Z. Yuan, Y. Lu, and Y. Xue, ‘‘Droiddetector: Android malware char-
acterization and detection using deep learning,’’ Tsinghua Sci. Tech- [50] H. A. Alatwi, ‘‘Android malware detection using category-
nol., vol. 21, no. 1, pp. 114–123, Feb. 2016, doi: 10.1109/TST.2016. based machine learning classifers,’’ M.S. thesis, Rochester Inst.
7399288. Technol., Rochester, NY, USA, 2016, p. 62. [Online]. Available:
[31] S. Ilham, G. Abderrahim, and B. A. Abdelhakim, ‘‘Permission https://scholarworks.rit.edu/theses/9069/
based malware detection in Android devices,’’ in Proc. 3rd Int. [51] P. Basavaraju and A. S. Varde, ‘‘Supervised learning techniques in mobile
Conf. Smart City Appl., Tetouan, Morocco, Oct. 2018, pp. 1–6, doi: device apps for Androids,’’ Dept. Comput. Sci., Montclair State Univ.,
10.1145/3286606.3286860. Montclair, NJ, USA, Tech. Rep., 2017, p. 12, vol. 19, no. 2.
[32] O. Yildiz and I. A. Doğru, ‘‘Permission-based Android malware detec- [52] R. N. Romli, M. F. Zolkipli, and M. Z. Osman, ‘‘Efficient feature
tion system using feature selection with genetic algorithm,’’ Int. J. selection analysis for accuracy malware classification,’’ J. Phys., Conf.
Softw. Eng. Knowl. Eng., vol. 29, no. 2, pp. 245–262, Feb. 2019, doi: Ser., vol. 1918, no. 4, Jun. 2021, Art. no. 042140, doi: 10.1088/1742-
10.1142/S0218194019500116. 6596/1918/4/042140.
[33] J. Garcia, M. Hammad, B. Pedrood, A. Bagheri-Khaligh, and S. Malek, [53] J. Abah, O. V. Waziri, M. B. Abdullahi, U. M. Arthur, and O. S. Adewale,
‘‘Obfuscation-resilient, efficient, and accurate detection and family identi- ‘‘A machine learning approach to anomaly-based detection on Android
fication of Android malware,’’ Dept. Comput. Sci., George Mason Univ., platforms,’’ Int. J. Netw. Secur. Appl., vol. 7, no. 6, pp. 15–35, Nov. 2015,
Fairfax, VA, USA, Tech. Rep. GMU-CS-TR-2015-10, 2015, vol. 202. doi: 10.5121/ijnsa.2015.7602.
[34] A. Senawi, H.-L. Wei, and S. A. Billings, ‘‘A new maximum relevance- [54] I. K. Aksakalli, ‘‘Using convolutional neural network for Android malware
minimum multicollinearity (MRmMC) method for feature selection detection,’’ Dept. Comput. Eng., Erzurum, Turkey, 2019.
and ranking,’’ Pattern Recognit., vol. 67, pp. 47–61, Jul. 2017, doi: [55] I. Martín, J. A. Hernández, A. Muñoz, and A. Guzmán, ‘‘Android
10.1016/j.patcog.2017.01.026. malware characterization using metadata and machine learning tech-
[35] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, and niques,’’ Secur. Commun. Netw., vol. 2018, pp. 1–11, Jul. 2018, doi:
T. Matsui, ‘‘Best subset selection for eliminating multicollinearity,’’ 10.1155/2018/5749481.
J. Oper. Res. Soc. Jpn., vol. 60, no. 3, pp. 321–336, 2017, doi: [56] S. Fallah and A. J. Bidgoly, ‘‘Benchmarking machine learning algorithms
10.15807/jorsj.60.321. for Android malware detection,’’ Jordanian J. Comput. Inf. Technol., vol. 5,
[36] A. Farrell, G. Wang, S. A. Rush, J. A. Martin, J. L. Belant, A. B. Butler, and no. 3, p. 15, 2019.
D. Godwin, ‘‘Machine learning of large-scale spatial distributions of wild [57] X. Jiang, B. Mao, J. Guan, and X. Huang, ‘‘Android malware detection
turkeys with high-dimensional environmental data,’’ Ecol. Evol., vol. 9, using fine-grained features,’’ Sci. Program., vol. 2020, pp. 1–13, Jan. 2020,
no. 10, pp. 5938–5949, May 2019, doi: 10.1002/ece3.5177. doi: 10.1155/2020/5190138.
[37] S. Niu, R. Huang, W. Chen, and Y. Xue, ‘‘An improved permission [58] H. Yuan, Y. Tang, W. Sun, and L. Liu, ‘‘A detection method for
management scheme of Android application based on machine learn- Android application security based on TF-IDF and machine learning,’’
ing,’’ Secur. Commun. Netw., vol. 2018, pp. 1–12, Oct. 2018, doi: PLoS ONE, vol. 15, no. 9, Sep. 2020, Art. no. e0238694, doi: 10.1371/
10.1155/2018/2329891. journal.pone.0238694.
VOLUME 10, 2022 89049

[59] A. M. García, ‘‘Machine learning techniques for Android malware detec- MUNAM ALI SHAH received the B.Sc. and
tion and classification,’’ Ph.D. thesis, Auton. Univ. Madrid, Madrid, M.Sc. degrees in computer science from the Uni-
Spain, 2019, p. 170. [Online]. Available: https://dialnet.unirioja.es/ versity of Peshawar, Pakistan, in 2001 and 2003,
servlet/tesis?codigo=221389 respectively, the M.S. degree in security technolo-
[60] S. Y. Yerima, M. K. Alzaylaee, and S. Sezer, ‘‘Machine learning- gies and applications from the University of Sur-
based dynamic analysis of Android apps with improved code cover- rey, U.K., in 2010, and the Ph.D. degree from the
age,’’ EURASIP J. Inf. Secur., vol. 2019, no. 1, p. 4, Dec. 2019, doi: University of Bedfordshire, U.K., in 2013. Since
10.1186/s13635-019-0087-1.
July 2004, he has been associated with the Depart-
[61] Y. Dong, ‘‘Android malware prediction by permission analysis and data
ment of Computer Science, COMSATS University
mining,’’ M.S. thesis, Dept. Comput. Inf. Sci., Univ. Michigan-Dearborn,
Dearborn, MI, USA, 2017, p. 71. [Online]. Available:https://deepblue.lib. Islamabad, Pakistan. He is the author of more than
umich.edu/bitstream/handle/2027.42/136197/YouchaoDong_Thesis_ 225 research articles published in international conferences and journals.
0327.pdf%3Fsequence%3D1%26isAllowed%3Dy His research interests include the IoT protocol design, QoS, and security
[62] D. V. Priya and P. Visalakshi, ‘‘Detecting Android malware using issues in wireless communication systems and applications of machine
an improved filter based technique in embedded software,’’ Micropro- learning. He received the Best Paper Award of the International Conference
cessors Microsyst., vol. 76, Jul. 2020, Art. no. 103115, doi: 10.1016/ on Automation and Computing, in 2012.
j.micpro.2020.103115.
[63] A. Hemalatha and D. S. S. Brunda, ‘‘Detection of mobile malwares using
improved deep convolutional neural network,’’ vol. 7, no. 14, p. 7, 2020.
[64] S. Mahdavifar, A. F. A. Kadir, R. Fatemi, D. Alhadidi, and A. A. Ghorbani,
‘‘Dynamic Android malware category classification using semi-supervised
deep learning,’’ in Proc. IEEE Int. Conf. Dependable, Autonomic Secure
CARSTEN MAPLE (Member, IEEE) is currently
Comput., Int. Conf. Pervasive Intell. Comput., Int. Conf. Cloud Big Data
Comput., Int. Conf. Cyber Sci. Technol. Congr. (DASC/PiCom/CBDCom/ a Professor of cyber systems engineering at the
CyberSciTech), Calgary, AB, Canada, Aug. 2020, pp. 515–522, doi: WMG’s Cyber Security Centre (CSC), University
10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00094. of Warwick. He is also the director of research
[65] Android Malware Detection | Kaggle. Accessed: Nov. 14, 2021. [Online]. in cyber security working with organizations in
Available: https://www.kaggle.com/defensedroid/android-malware-detec key sectors, such as manufacturing, healthcare,
tion financial services, and the broader public sector to
[66] X. Fu and H. Cai, ‘‘On the deterioration of learning-based malware address the challenges presented by today’s global
detectors for Android,’’ in Proc. IEEE/ACM 41st Int. Conf. Softw. Eng., cyber environment. His interests include informa-
Companion Proc. (ICSE-Companion), Montreal, QC, Canada, May 2019, tion security and trust and authentication in dis-
pp. 272–273, doi: 10.1109/ICSE-Companion.2019.00110. tributed systems. He is a member of several professional societies, including
[67] K. Xu, Y. Li, R. Deng, K. Chen, and J. Xu, ‘‘DroidEvolver: Self- the Council of Professors and Heads of Computing (CPHC) whose remit is
evolving Android malware detection system,’’ in Proc. IEEE Eur. Symp. to promote public education in computing and its applications and to provide
Secur. Privacy (EuroS P), Stockholm, Sweden, Jun. 2019, pp. 47–62, doi: a forum for those responsible for management and research in university
10.1109/EuroSP.2019.00014.
computing departments. He is also an elected member to the Committee
[68] H. Cai, ‘‘Assessing and improving malware detection sustainability
of this body. He is an Education Advisor for TIGA’s the trade association
through app evolution studies,’’ ACM Trans. Softw. Eng. Methodol., vol. 29,
no. 2, pp. 1–28, Apr. 2020, doi: 10.1145/3371924. representing the U.K. games industry. He is a fellow of the British Computer
[69] X. Zhang, Y. Zhang, M. Zhong, D. Ding, Y. Cao, Y. Zhang, M. Zhang, Society and the Chartered Institute for IT. He is a Chartered IT professional.
and M. Yang, ‘‘Enhancing state-of-the-art classifiers with API semantics He also holds two Professorships in China, including a position at one of the
to detect evolved Android malware,’’ in Proc. ACM SIGSAC Conf. Comput. top two control engineering departments in China.
Commun. Secur., New York, NY, USA, Oct. 2020, pp. 757–770, doi:
10.1145/3372297.3417291.
[70] A. Katrutsa and V. Strijov, ‘‘Comprehensive study of feature selec-
tion methods to solve multicollinearity problem according to evalua-
tion criteria,’’ Expert Syst. Appl., vol. 76, pp. 1–11, Jun. 2017, doi:
10.1016/j.eswa.2017.01.048. MUHAMMAD KAMRAN ABBASI received the
[71] R. Grewal, J. A. Cote, and H. Baumgartner, ‘‘Multicollinearity and mea-
Ph.D. degree in computer science from the Univer-
surement error in structural equation models: Implications for theory
sity of Bedfordshire, U.K. He is currently working
testing,’’ Marketing Sci., vol. 23, no. 4, pp. 519–529, Nov. 2004, doi:
10.1287/mksc.1040.0070. as an Associate Professor with the Department
[72] M. S. Devi, A. Poornima, J. Kosanam, and T. Hari S. Prashanth, ‘‘Outlier of Distance Continuing and Computer Education,
multicollinearity free fish weight prediction using machine learning,’’ University of Sindh. His research interests include
Mater. Today, Proc., p. 7, Mar. 2021, doi: 10.1016/j.matpr.2021.02.773. unsupervised machine learning, informatics, and
educational technology.
BEENISH UROOJ received the bachelor’s degree

in computer science from COMSATS University
Islamabad, Wah Campus, Pakistan, in 2019, where SIDRA RIASAT received the bachelor’s degree
she is currently pursuing the master’s degree in in computer science from Fatima Jinnah Women
information security with the Department of Com- University. She is currently pursuing the master’s
puter Science. She is also working as a part-time degree in information security with the Depart-
Graphic Designer and a Freelancer. Her confer- ment of Computer Science, COMSATS Univer-
ence paper about Security in SCADA Systems was sity, Islamabad. Her research interests include
declared runner up in best developmental research cyber security, block chain smart cities, and
in 2021 (soon to be published). Her research inter- SCADA networks.
ests include cyber security, threat hunting, and security in industrial control
systems (ICS).
89050 VOLUME 10, 2022

Malware Detection: A Framework For Reverse Engineered Android Applications Through Machine Learning Algorithms

Uploaded by

Copyright:

Available Formats

Malware Detection: A Framework For Reverse Engineered Android Applications Through Machine Learning Algorithms

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Malware Detection: A Framework For Reverse Engineered Android Applications Through Machine Learning Algorithms

Uploaded by

Copyright:

Available Formats

Received 30 December 2021, accepted 27 January 2022, date of publication 4 February 2022, date of current version 29 August 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3149053

Malware Detection: A Framework for Reverse

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

Byte Code and the complex analysis of a managed environ-

89032 VOLUME 10, 2022

FIGURE 2. Static binary matrix extraction.

including the latest Android API level than state-of-the-

VOLUME 10, 2022 89033

89034 VOLUME 10, 2022

Another technique achieves satisfactory accuracy but there

VOLUME 10, 2022 89035

89036 VOLUME 10, 2022

VOLUME 10, 2022 89037

TABLE 3. Dangerous permissions (Malware probability).

89038 VOLUME 10, 2022

FIGURE 6. Machine learning process.

• Accuracy: Accuracy is one criterion being used to eval-

VOLUME 10, 2022 89039

FIGURE 8. Proposed methodology of our system.

high dimensional data being implemented, we have discussed

89040 VOLUME 10, 2022

FIGURE 10. Graph of application threat increase by 5%.

IV. FUTURE THREATS AND PREDICTION

VOLUME 10, 2022 89041

B. DATASET FIGURE 13. Boosting mechanism.

89042 VOLUME 10, 2022

FIGURE 14. Representation of the modules of our program.

FIGURE 16. Fit and pred function for SVM.

FIGURE 17. Predictive measures for AdaBoost.

the Machine Learning sector. As seen in the diagram above,

VOLUME 10, 2022 89043

FIGURE 19. Models accuracy percentage w.r.t label.

by training the algorithms on the datasets to verify which

89044 VOLUME 10, 2022

FIGURE 22. Output [1] representing the benign application (SVM).

FIGURE 24. Output [1] representing the benign application (AdaBoost).

FIGURE 25. Output [0] representing the malware application (AdaBoost).

Further ahead, the prediction results of the program are

VOLUME 10, 2022 89045

TABLE 8. Experimental results (AdaBoost and SVM), Selected, specify

FIGURE 29. Black entries for Harmful applications in SVM.

Figures 26, 27, 28 and 29 show the runs performed on

VOLUME 10, 2022 89047

VOLUME 10, 2022 89049

BEENISH UROOJ received the bachelor’s degree

89050 VOLUME 10, 2022

You might also like