
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CYBERNETICS 1

DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection

Suleiman Y. Yerima, Member, IEEE, and Sakir Sezer, Member, IEEE

Abstract—Android malware has continued to grow in volume and complexity, posing significant threats to the security of mobile devices and the services they enable. This has prompted increasing interest in employing machine learning to improve Android malware detection. In this paper, we present a novel classifier fusion approach based on a multilevel architecture that enables effective combination of machine learning algorithms for improved accuracy. The framework (called DroidFusion) generates a model by training base classifiers at a lower level and then applies a set of ranking-based algorithms on their predictive accuracies at the higher level in order to derive a final classifier. The induced multilevel DroidFusion model can then be utilized as an improved accuracy predictor for Android malware detection. We present experimental results on four separate datasets to demonstrate the effectiveness of our proposed approach. Furthermore, we demonstrate that the DroidFusion method can also effectively enable the fusion of ensemble learning algorithms for improved accuracy. Finally, we show that the prediction accuracy of DroidFusion, despite only utilizing a computational approach in the higher level, can outperform stacked generalization, a well-known classifier fusion method that employs a meta-classifier approach in its higher level.

Index Terms—Android malware detection, classifier fusion, ensemble learning, machine learning, mobile security, stacked generalization.

Manuscript received June 3, 2017; revised September 11, 2017; accepted November 11, 2017. This work was supported by the U.K. Engineering and Physical Sciences Research Council through the Centre for Secure Information Technologies (CSIT-2) under Grant EP/N508664/1. This paper was recommended by Associate Editor P. P. Angelov. (Corresponding author: Suleiman Y. Yerima.)
S. Y. Yerima was with the Centre for Secure Information Technologies, Queen's University Belfast, Belfast BT3 9DT, Northern Ireland. He is now with the Faculty of Technology, De Montfort University, Leicester LE1 9BH, U.K. (e-mail: [email protected]).
S. Sezer is with the Centre for Secure Information Technologies, Queen's University Belfast, Belfast, Northern Ireland (e-mail: [email protected]).
This paper has supplementary downloadable multimedia material available at http://ieeexplore.ieee.org provided by the authors.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCYB.2017.2777960

I. INTRODUCTION

IN RECENT years, Android has become the leading mobile operating system with a substantially higher percentage of the global market share. Over 1 billion Android devices have been sold, with an estimated 65 billion app downloads from Google Play alone [1]. The growth in the popularity of Android and the proliferation of third party app markets have also made it a popular target for malware. Last year, McAfee reported that there were more than 12 million Android malware samples, with nearly 2.5 million new samples discovered every year [2]. Android malware can be embedded in a variety of applications such as banking apps, gaming apps, lifestyle apps, educational apps, etc. These malware-infected apps can then compromise security and privacy by allowing unauthorized access to privacy-sensitive information, rooting devices, turning devices into remotely controlled bots, etc.

Zero-day Android malware have the ability to evade traditional signature-based defences. Hence, there is an urgent need to develop more effective detection methods. Recently, machine learning-based methods are increasingly being applied to Android malware detection. However, classifier fusion approaches have not been as extensively explored as they have been in other domains like network intrusion detection.

In this paper, we present and investigate a novel classifier fusion approach that utilizes a multilevel architecture to increase the predictive power of machine learning algorithms. The framework, called DroidFusion, is designed to induce a classification model for Android malware detection by training a number of base classifiers at the lower level. A set of ranking-based algorithms are then utilized to derive combination schemes at the higher level, one of which is selected to build a final model. The framework is capable of leveraging not only traditional singular learning algorithms like decision trees or naive Bayes, but also ensemble learning algorithms like random forest, random subspace, boosting, etc., for improved classification accuracy.

In order to demonstrate the effectiveness of the DroidFusion approach, we performed extensive experiments on four datasets derived by extracting features from two publicly available and widely used malware sample collections (i.e., the Android Malgenome project [3] and DREBIN [4]) and a collection of samples provided by Intel Security (formerly McAfee). The unique contributions of this paper can be summarized as follows.
1) We propose a novel general-purpose classifier fusion approach (DroidFusion) and present its evaluation on four different datasets. DroidFusion can be applied to not only traditional learners but also ensemble learners.
2) We propose four ranking-based algorithms that enable classifier fusion within the DroidFusion framework. The algorithms are utilized in building a final improved classification model for Android malware detection.
3) We present the results of extensive experiments to demonstrate the effectiveness of our proposed approach.
2168-2267 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

The results of experiments with singular classifiers and ensemble classifiers are presented.
4) Furthermore, we present results of a performance comparison of DroidFusion with stacked generalization (or stacking), a well-known classifier fusion method that is also based on a multilevel architecture.
5) Datasets that we created from the feature extraction process with DREBIN and Malgenome project malware samples are released in the supplementary material.

The rest of this paper is structured as follows. Section II discusses related work, while Section III presents the DroidFusion framework. The investigation methodology is presented in Section IV, while Section V presents results with analyses and discussion. Finally, the conclusion is given in Section VI.

II. RELATED WORK

In this section, we review related work on machine learning-based Android malware detection. Static and/or dynamic analysis is used to extract model training features, and both methods have pros and cons. Static analysis is prone to obfuscation [5], but is generally faster and less resource intensive than dynamic analysis. Dynamic analysis is resistant to obfuscation but can be hampered by anti-virtualization [6]–[9] and code coverage limitations [10], [34].

A. Static Analysis With Traditional Classifiers

Recent Android malware detection works that employ machine learning with static features include the following. DroidMat [11] proposed applying k-means and k-nearest neighbor (k-NN) algorithms based on static features from permissions, intents, and application program interface (API) calls, to classify apps as benign or malware. Arp et al. [4] proposed SVM based on permissions, API calls, network access, etc. for lightweight on-device detection. Yerima et al. [12], [14] proposed an eigenspace analysis approach, as well as random forest ensemble learning models. The machine learning-based detection proposed in those papers was based on API calls, intents, permissions, and embedded commands. Varsha et al. [15] investigated SVM, random forest, and rotation forests on three datasets; their detection method employed static features extracted from the manifest and application executable files.

Sharma and Dash [16] utilized API calls and permissions to build naive Bayes and k-NN-based detection systems. In [17], API classes were used with random forest, J48, and SVM classifiers. Wang et al. [18] evaluated the usefulness of risky permissions for malware detection using SVM, decision trees, and random forest. DAPASA [19] focused on detecting malware piggybacked onto benign apps by utilizing sensitive subgraphs to construct five features depicting invocation patterns. The features are fed into machine learning algorithms, i.e., random forest, decision tree, k-NN, and PART, with random forest yielding the best detection performance.

Cen et al. [20] proposed a detection method based on API calls from decompiled code and permissions. Their proposed method applies a probabilistic discriminative model based on regularized logistic regression (RLR). RLR is compared to SVM, decision tree, k-NN, and naive Bayes with information priors and hierarchical mixture of naive Bayes.

Wang et al. [52] applied logistic regression, linear SVM, decision tree, and random forest with static analysis for the detection of malicious apps. They utilized app-specific static features and platform-specific static features for training the machine learning algorithms. The authors reported a maximum true positive rate (TPR) of 96% and a false positive rate (FPR) of 0.06% with the logistic regression classifier, based on experiments conducted on 18 363 malware apps and 217 619 benign apps.

Other research papers that have investigated static features with machine learning for Android malware detection include [21]–[23], [45], [47], [48], and [54].

B. Dynamic and Hybrid Analysis With Traditional Classifiers

Some of the detection methods utilized dynamic features with machine learning, for example AntiMalDroid [24]. AntiMalDroid is a dynamic analysis behavior-based malware detection framework that uses logged behavior sequences as features with SVM. DroidDolphin [25] also employed SVM with dynamically obtained features. Afonso et al. [26] utilized dynamic API calls and system call traces and investigated SVM, J48, IBk (an instance-based classifier), BayesNet K2, BayesNet TAN, random forest, and naive Bayes. Alzaylaee et al. [27] investigated SVM, naive Bayes, PART, random forest, J48, multilayer perceptron (MLP), and simple logistic by comparing their performances on real phones versus emulators using dynamically obtained features. Ni et al. [46] proposed a real-time malicious behavior detection system that records API calls, permission uses, and other real-time features such as user operations. In their paper, they used SVM and naive Bayes algorithms for detection with these run-time features.

Mahindru and Singh [53] extracted 123 dynamic permissions from 11 000 Android applications, which were subsequently applied to several individual machine learning classifiers including naive Bayes, decision tree, random forest, simple logistic, and k-star. In their experiments, simple logistic was found to perform marginally better than the others, but the malware classification accuracies of random forest, decision tree (J48), and simple logistic were comparable.

Other works, such as MARVIN [28], adopt a hybrid static and dynamic feature-based approach with machine learning (SVM and an L2-regularized linear classifier). MARVIN assesses the risk associated with unknown Android apps in the form of a malice score ranging from 0 to 10. Similarly, Su et al. [49] adopted a hybrid static and dynamic feature approach by performing experiments on 1200 (900 clean and 300 malware) samples. Several machine learning algorithms were investigated including Bayes net, naive Bayes, k-NN, J48, and SVM. The best overall accuracy of 91.1% was attained with SVM.

C. Android Malware Detection With Classifier Fusion

Previous works in intrusion detection systems such as [29]–[32] investigated classifier fusion for improving detection accuracy. This method is also being applied

YERIMA AND SEZER: DroidFusion: NOVEL MULTILEVEL CLASSIFIER FUSION APPROACH FOR ANDROID MALWARE DETECTION 3

to the detection of Android malware. For example, Milosevic et al. [50] investigated a classifier fusion approach with static analysis based on Android permissions and source code-based analysis. They used SVM, C4.5 decision trees, random tree, random forests, JRip, and linear regression classifiers. The authors experimented with ensembles that contained odd combinations of three and five classifiers using the majority voting fusion method. The best fusion model achieved an accuracy rate of 95.6% using the source-code-based features. However, the number of samples used in the experiments was limited (387 samples for the permissions-based experiments and 368 for the source code-based analysis).

Yerima et al. [13] compared several classifier fusion methods, i.e., majority vote, product of probabilities, maximum probability, and average of probabilities, using J48, naive Bayes, PART, RIDOR, and simple logistic classifiers. The classifiers were trained with static features extracted from 6863 app samples, and in the experiments presented, the fused models performed better than the single classifiers.

Wang et al. [51] extracted 11 types of static features and employed multiple classifiers in a majority vote fusion approach. The classifiers include SVM, k-NN, naive Bayes, classification and regression tree (CART), and random forest. Their experiments on 116 028 app samples showed more robustness with the majority voting ensemble than with the individual base classifiers.

Idrees et al. [55] utilized permissions and intents as features to train machine learning models and applied classifier fusion for improved performance. Their experiments were performed on 1745 app samples, starting with a performance comparison between MLP, decision table, decision tree, random forest, naive Bayes, and sequential minimal optimization classifiers. The decision table, MLP, and decision tree classifiers were then combined using three schemes: 1) average of probabilities; 2) product of probabilities; and 3) majority voting. Coronado-De-Alba et al. [33] proposed and investigated a classifier fusion method based on random forest and random committee ensemble classifiers. Their approach embeds random forest within random committee to produce a meta-ensemble model. The meta-model outperformed the individual classifiers in experiments performed with 1531 malware and 1531 benign samples. Table I summarizes papers that have investigated classifier fusion for Android malware detection.

TABLE I
OVERVIEW OF SOME OF THE PAPERS THAT APPLY CLASSIFIER FUSION FOR ANDROID MALWARE DETECTION. NB = NAIVE BAYES; SL = SIMPLE LOGISTIC; LR = LINEAR REGRESSION; DT = DECISION TREE; VP = VOTED PERCEPTRON; AVE P = AVERAGE OF PROBABILITIES; PROD P = PRODUCT OF PROBABILITIES; AND MAX P = MAXIMUM PROBABILITY
(Table body not reproduced in this extraction.)

In contrast to all of the existing Android malware detection works, this paper proposes a novel classifier fusion approach that utilizes four ranking-based algorithms within a multilevel framework (DroidFusion). We evaluated DroidFusion extensively and compared its performance to stacking and other classifier fusion methods. Next, we present DroidFusion.

III. DROIDFUSION: GENERAL PURPOSE FRAMEWORK FOR CLASSIFIER FUSION

The DroidFusion framework consists of a multilevel architecture for classifier fusion. It is designed as a general purpose classifier fusion system, so that it can be applied to both traditional singular classifiers and ensemble classifiers (which themselves employ a base classifier, usually to produce different randomly induced models that are subsequently combined). At the lower level, the (DroidFusion) base classifiers are trained on a training set using a stratified N-fold cross-validation technique to estimate their relative predictive accuracies. The outcomes are utilized by four different ranking-based algorithms (in the higher layer) that define certain criteria for the selection and subsequent combination of a subset (or all) of the applicable base classifiers. The outcomes of the ranking algorithms are combined in pairs in order to find the strongest pair, which is subsequently used to build the final DroidFusion model (after testing against an unweighted parallel combination of the base classifiers).

A. DroidFusion Model Construction

The model building, i.e., training, process is distinct from the prediction or testing phase, as the former utilizes a training-validation set to build a multilevel ensemble classifier which is then evaluated on a separate test set in the latter phase. Fig. 1 illustrates the two-level architecture of DroidFusion. It shows the training paths (solid arrows) and the testing/prediction path (dashed arrows). First, at the lower level, each base classifier undergoes an N-fold cross-validation-based estimate of class performance accuracies. Let the N-fold cross-validated predictive accuracies for K base classifiers be expressed by P_base, a K-tuple of the class accuracies of the K base classifiers

P_base = {[P_1m, P_1b], [P_2m, P_2b], ..., [P_Km, P_Kb]}.   (1)

The elements of P_base are applied to the ranking-based algorithms: the average accuracy-based (AAB) ranking scheme, the class differential-based (CDB) ranking scheme, the ranked aggregate of per class performance-based (RAPC) scheme, and the ranked aggregate of average accuracy and class differential-based (RACD) scheme, described later in Section III-B. Let X be the total number of instances, with M malware and B benign instances, where the M instances possess a label L = 1 denoting malware and the B instances from X possess a label L = 0
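The construction of P_base in (1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the out-of-fold (N-fold cross-validated) predictions of each base classifier are already available, and the function name and array layout are our own choices.

```python
def estimate_pbase(fold_preds, labels):
    """Build P_base as in eq. (1): one [P_km, P_kb] pair per base classifier.

    fold_preds : list of K lists, each holding the out-of-fold (cross-
                 validated) 0/1 predictions of one base classifier
    labels     : true labels, 1 = malware, 0 = benign
    """
    pbase = []
    for preds in fold_preds:
        mal = [p for p, y in zip(preds, labels) if y == 1]
        ben = [p for p, y in zip(preds, labels) if y == 0]
        p_km = sum(1 for p in mal if p == 1) / len(mal)  # malware-class accuracy
        p_kb = sum(1 for p in ben if p == 0) / len(ben)  # benign-class accuracy
        pbase.append([p_km, p_kb])
    return pbase
```

For example, for two classifiers over four instances, `estimate_pbase([[1, 1, 0, 0], [1, 0, 1, 0]], [1, 1, 0, 0])` yields `[[1.0, 1.0], [0.5, 0.5]]`: the first classifier is perfect on both classes, the second is 50% accurate on each.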

Fig. 1. DroidFusion 2-layer model architecture.

denoting benign. All X instances are also represented by feature vectors with f binary representations, where f is the number of features extracted from the given app. The features in the vectors take on 0 or 1, representing the absence or presence of the given feature. Additionally, after the N-fold cross-validation process (as shown in Fig. 1), a set of K-tuple class predictions is derived for every instance x, given by

V(x) = {v_1, v_2, ..., v_K},   ∀k ∈ {1, ..., K}.   (2)

Note that v_1, v_2, ..., v_K could be crisp predictions or probability estimates from the base classifiers. Adding the original (known) class label, l, we obtain

V̇(x) = {v_1, v_2, ..., v_K, l},   ∀k ∈ {1, ..., K}, l ∈ {0, 1}.   (3)

P_base and V̇(x), ∀x ∈ X will be utilized in the level-2 computation during the DroidFusion model construction. Let us denote the set of four ranking-based schemes by S = {S1, S2, S3, S4}. The pairwise combinations of the elements of S will result in six possibilities

φ = {S1S2, S1S3, S1S4, S2S3, S2S4, S3S4}.   (4)

Our goal is to select the best pair of ranking-based schemes from S, and if its performance exceeds that of an unweighted combination of the original base classifiers, it will be selected to construct the final DroidFusion model. In the event that the unweighted combination performance is greater, DroidFusion will be configured to apply a majority vote (or average of probabilities) of the base classifiers in the final constructed model. In order to estimate the accuracy performance of each scheme in S or each pairwise combination in set φ, a reclassification of the X instances (in the training-validation set) is performed for each scheme or pair of schemes. The reclassification is accomplished using V̇(x), x ∈ X, based on the criteria defined by the schemes in S using P_base. Each scheme in S derives a set of Z weights that will be applied with V̇(x), x ∈ X for every instance during the reclassification process.

Let ω_i, i ∈ {1, ..., Z}, Z ≤ K be the set of weights derived for a particular scheme in S. Then, to reclassify an instance x according to the scheme's criterion, its class prediction will be given by

C_Sj(x) = 1 : if (Σ_{i=1}^{Z} ω_i v_i) / (Σ_{i=1}^{Z} ω_i) ≥ 0.5
          0 : otherwise,   ∀j ∈ {1, 2, 3, 4}.   (5)

Hence, the benign class accuracy performance for the given scheme is calculated from

P_Sj^ben = [Σ_{x=1}^{X} (C_Sj(x) + 1) | C_Sj(x) = 0, l(x) = 0] / B   (6)

where B is the number of benign instances, while the malware accuracy performance is calculated from

P_Sj^mal = [Σ_{x=1}^{X} C_Sj(x) | C_Sj(x) = 1, l(x) = 1] / (X − B).   (7)

Thus, the average performance accuracy is simply

Ṗ_Sj = [B · P_Sj^ben + (X − B) · P_Sj^mal] / X.   (8)

Likewise, to determine the performance of each pairwise combination in φ: let ω_i, i ∈ {1, ..., Z}, Z ≤ K be the first set of weights derived for the first scheme in the pair, and let μ_i, i ∈ {1, ..., Z}, Z ≤ K be those derived for the second scheme in the pair. Then, to reclassify the X instances in the
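Equations (5)–(8) amount to a weighted vote over the top-Z base-classifier predictions followed by per-class scoring. The sketch below is our own illustration of that computation under the paper's definitions; function names and data layout are illustrative, not from the paper.

```python
def reclassify(votes, weights):
    """Eq. (5): weighted 0/1 vote over the top-Z base-classifier predictions
    for a single instance."""
    s = sum(w * v for w, v in zip(weights, votes)) / sum(weights)
    return 1 if s >= 0.5 else 0

def scheme_accuracies(votes_per_instance, weights, labels):
    """Eqs. (6)-(8): benign, malware, and weighted-average accuracies of a
    scheme over the training-validation set (labels: 1 = malware, 0 = benign)."""
    preds = [reclassify(v, weights) for v in votes_per_instance]
    ben = [p for p, y in zip(preds, labels) if y == 0]
    mal = [p for p, y in zip(preds, labels) if y == 1]
    p_ben = sum(1 for p in ben if p == 0) / len(ben)        # eq. (6)
    p_mal = sum(1 for p in mal if p == 1) / len(mal)        # eq. (7)
    b, x = len(ben), len(labels)
    return p_ben, p_mal, (b * p_ben + (x - b) * p_mal) / x  # eq. (8)
```

For instance, with weights (3, 2, 1) assigned as in eq. (20), votes (1, 0, 1) give (3 + 1)/6 ≈ 0.67 ≥ 0.5 and hence class 1, while votes (0, 1, 0) give 2/6 ≈ 0.33 and hence class 0.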

training-validation set according to the combination pair, the class prediction of each instance x will be given by

C_SjSn(x) = 1 : if (Σ_{i=1}^{Z} ω_i v_i + Σ_{i=1}^{Z} μ_i v_i) / (Σ_{i=1}^{Z} ω_i + Σ_{i=1}^{Z} μ_i) ≥ 0.5
            0 : otherwise,   ∀j ∈ {1, 2, 3, 4}, ∀n ∈ {1, 2, 3, 4}, j ≠ n, SjSn ≡ SnSj.   (9)

Therefore, computing benign class accuracy and malware class accuracy will utilize

P_SjSn^ben = [Σ_{x=1}^{X} (C_SjSn(x) + 1) | C_SjSn(x) = 0, l(x) = 0] / B   (10)

and

P_SjSn^mal = [Σ_{x=1}^{X} C_SjSn(x) | C_SjSn(x) = 1, l(x) = 1] / (X − B)   (11)

respectively. The average performance accuracy for the pairwise schemes will then be given by

Ṗ_SjSn = [B · P_SjSn^ben + (X − B) · P_SjSn^mal] / X,   ∀j ∈ {1, 2, 3, 4}, ∀n ∈ {1, 2, 3, 4}, j ≠ n, SjSn ≡ SnSj.   (12)

Equivalently, the unweighted majority vote class prediction for instance x is given by

C_mv(x) = 1 : if (Σ_{k=1}^{K} v_k) / K ≥ 0.5
          0 : otherwise,   ∀k ∈ {1, ..., K}.   (13)

Hence, the benign class accuracy performance for the unweighted scheme will be given by

P_mv^ben = [Σ_{x=1}^{X} (C_mv(x) + 1) | C_mv(x) = 0, l(x) = 0] / B.   (14)

Likewise, the malware class accuracy performance for the unweighted scheme is given by

P_mv^mal = [Σ_{x=1}^{X} C_mv(x) | C_mv(x) = 1, l(x) = 1] / (X − B).   (15)

Finally, the average accuracy performance for the unweighted scheme is given by

Ṗ_mv = [B · P_mv^ben + (X − B) · P_mv^mal] / X.   (16)

After all the reclassifications are completed and the average accuracies computed, the applicable scheme that will be utilized to construct the DroidFusion model is selected thus

arg max_φ (Ṗ_φ),   φ = {S1S2, S1S3, S1S4, S2S3, S2S4, S3S4, mv}.   (17)

Suppose that the S1 and S3 pair is selected by the operation in (17); then the class of a given unlabeled instance during testing or unknown-instance prediction (during model deployment) will be computed by (9) with j = 1 and n = 3. Next, we describe the four ranking-based algorithms underpinning the schemes in set S that utilize P_base to accomplish all of the above described DroidFusion level-2 steps.

B. Proposed Ranking-Based Algorithms

The design of our proposed algorithms is influenced by the observation that most typical classifiers perform differently for the two classes. That is, class accuracy performances for benign and malware are very rarely equal in magnitude. The proposed ranking-based algorithms include the following.
1) An AAB ranking scheme.
2) A CDB ranking scheme.
3) An RAPC-based scheme.
4) An RACD-based scheme.

1) Average Accuracy-Based Ranking Scheme: With the AAB method, the ranking is designed to be directly proportional to the average prediction accuracies across the classes. In this case, base classifiers with larger overall accuracy performance will rank higher. AAB does not take into account how well a base classifier performs for a particular class. Let AAB be the first scheme, S1, from set S. The algorithm is summarized as follows.

Let P_base be the set of performance accuracies P_k,c ∈ P_base of the K base classifiers. If m denotes malware and b benign, then the average accuracy of the kth base classifier is given by

a_k = 0.5 × Σ_{c=m,b} P_k,c,   k ∈ {1, ..., K}, 0 < P_k,c ≤ 1.   (18)

Let A ← a_k, ∀k ∈ {1, ..., K} be a set of the average predictive accuracies, to which a ranking function Rank_desc(.) is applied

Ā ← Rank_desc(A).   (19)

Thus, Ā contains an ordered ranking of the level-1 base classifiers' average predictive accuracies in descending order. Next, the top Z rankings are utilized in weight assignments as follows:

ω_1 = Z, ω_2 = Z − 1, ..., ω_Z = 1,   Z ≤ K.   (20)

Thus, the AAB class prediction C(x) for instance x in the training-validation set is given by (5), or by (9) when used in a pairwise combination with another scheme.

2) Class Differential-Based Ranking Scheme: With the CDB method, the ranking is directly proportional to the average predictive accuracy and inversely proportional to the absolute value of the performance difference between the classes. Assuming a binary classification problem, this approach will be less likely to favor the decision from a base classifier that exhibits much higher accuracy in one class over the other, but will assign larger weights to good classifiers that perform relatively well in both classes. The CDB procedure is described as follows.

Suppose the CDB method is taken as scheme S2. Let the average accuracy of each base classifier be given by a_k in (18), and define D̄ with cardinality K as a set of ordered rankings in descending order of magnitude. Calculate d_k, proportional to the average accuracies and inversely proportional to the absolute difference of the interclass accuracies

d_k = a_k / |P_k,m − P_k,b|,   k ∈ {1, ..., K}.   (21)

Let D ← d_k, ∀k ∈ {1, ..., K} be a set of the d_k values, to which the ranking function Rank_desc(.) is applied

D̄ ← Rank_desc(D).   (22)
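The AAB and CDB scores of eqs. (18) and (21), and the weight assignment of eq. (20), can be sketched as follows. This is our own minimal illustration, not the authors' code; in particular, the small epsilon guarding a zero class gap in the CDB score is our addition and is not specified in the paper.

```python
def top_z_weights(scores, z):
    """Eq. (20): rank scores in descending order; the best classifier gets
    weight Z, the next Z-1, ..., the Z-th gets weight 1.  Returns a map
    classifier_index -> weight for the top-Z classifiers."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    return {k: z - i for i, k in enumerate(order[:z])}

def aab_scores(pbase):
    """Eq. (18): a_k = 0.5 * (P_km + P_kb)."""
    return [0.5 * (pm + pb) for pm, pb in pbase]

def cdb_scores(pbase, eps=1e-9):
    """Eq. (21): d_k = a_k / |P_km - P_kb|.  eps avoids division by zero
    when both class accuracies are equal (our guard, not in the paper)."""
    return [a / (abs(pm - pb) + eps)
            for a, (pm, pb) in zip(aab_scores(pbase), pbase)]
```

For pbase = [[0.95, 0.65], [0.78, 0.78]], AAB ranks classifier 0 first (average 0.80 vs. 0.78), whereas CDB ranks classifier 1 first because its class gap is zero, illustrating how the two schemes can disagree.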

With D̄ containing the ordered rankings of dk values, the top Then, for each base classifier, aggregate the values and apply
Z rankings are also utilized to assigned weights according the ranking function Rankdesc (.)
to (20). Thus, the S2 = CDB class prediction for an instance

x is determined from (5). Whenever S2 = CDB is used (in hk ← Ak + Gk


, Ak ∈ Ā, Gk ∈ Ḡ, ∀k ∈ {1, . . . , K} (29)
conjunction with another scheme) within a pair in the set H ← hk
expressed by (4), then (9) will be used for the class prediction H̄ ← Rankdesc (H). (30)
of the instance.
3) Ranked Aggregate of Per Class Accuracies-Based Thus, H̄ is the set containing the ranked values of H in
Scheme: In the RAPC method, the ranking is directly pro- descending order of magnitude. The top Z rankings are then
portional to the sum of the initial per class rankings of the used according to (20) to assign the weights.
accuracies of the base classifiers. This method is more likely
to assign a larger weight to a base classifier that performs very
C. Model Complexity
well in both classes. RAPC is summarized as follows.
With F̄ defined as the set of ordered rankings with cardi- As mentioned earlier, the base classifiers initial accuracies
nality K, given the initial performance accuracies of Pk,c of are estimated using a stratified N-fold cross-validation tech-
the K base classifiers nique. This procedure will be performed only once during

training (on the training-validation set) and the preliminary


Pm ← Pk,c where c = b
, k ∈ {1, . . . , K}, c ∈ {m, b}. (23) predictions for all x instances in X for every base classifier
Pb ← Pk,c where c = m will be determined from the procedure. The configurations
We then apply the ranking function Rankdesc (.) to both (weights) computed from each algorithm is applied together

with these initial (base classifier) predictions to reclassify
P̄m ← Rankdesc (Pm ) each instance accordingly. Since level-2 training of instances
(24)
P̄b ← Rankdesc (Pb ). requires only reclassification using V̇(x), ∀x ∈ X, the time
complexity for utilizing R level-2 algorithms to predict the
The per-class rankings for each base classifier are aggregated
classes of X instances using (5) will be given by O(RX). The
and then ranked again
pairwise class predictions also involve reclassification, thus

fk ← P̄k,m + P̄k,b the complexity involved for predicting the class


 of X instances
, ∀k ∈ {1, . . . , K} (25) using (9) will be given by O(JX), where J = R2 . Likewise, for
F ← fk
F̄ ← Rankdesc (F). (26) the unweighted majority vote the complexity will be O(X) as
reclassification is involved also. Since we utilize unweighted
Finally, from the set F̄ comprising k ordered values of F, majority vote and pairwise combinations for final model build-
we select the top Z rankings and use them to assign weights ing using (17) the total training time complexity in level-2 is
according to (20). Suppose the RAPC scheme is taken as therefore given by O(X)+O(JX) = O((J+1)X) where J = R2
S3, we can determine the class prediction for an instance x from (5). If S3 = RAPC is used (in conjunction with another scheme) within a pair in the set expressed by (4), then (9) will be employed for the class prediction of the instance.

4) Ranked Aggregate of Average Accuracy and Class Differential Scheme: With RACD, the ranking is directly proportional to the sum of the initial rankings of the average performance accuracies and the initial rankings of the difference in performance between the classes. This method is designed to assign a larger weight to the base classifiers with good initial overall accuracy that also have a relatively smaller difference in performance between the classes. The algorithm is described as follows.

Suppose we take the RACD method as scheme S4 and define a set H̄ for ordered values with cardinality K. Given A, the set of computed average accuracies for each base classifier (determined in the AAB scheme), compute the class differential for each corresponding classifier as follows:

gk ← |Pk,m − Pk,b|, k ∈ {1, . . . , K}. (27)

Define G ← {gk}, ∀k ∈ {1, . . . , K}, as the ordered set of gk values, to which a ranking function Rankascen(·) is applied to rank gk in ascending order of magnitude:

Ḡ ← Rankascen(G). (28)

… for the R level-2 ranking-based algorithms.

IV. INVESTIGATION METHODOLOGY

A. Automated Static Analyzer for Feature Extraction

The features used in the experimental evaluation of the DroidFusion system are obtained using an automated static analysis tool developed in Python. The tool enables us to extract permissions and intents from the application manifest file after decompiling it with AXMLprinter2 (a library for decompiling Android manifest files). In addition, API calls are extracted by reverse engineering the .dex files by means of the Baksmali disassembler. The static analyzer also searches for dangerous Linux commands in the application files and checks for the presence of embedded .dex, .jar, .so, and .exe files within the application. Previous work [35] has shown that this set of static application attributes provides discriminative features for machine learning-based Android malware detection; hence, we utilized them for the DroidFusion experiments. Furthermore, while extracting API calls, third-party libraries are excluded using the list of popular ad libraries obtained from [36]. Fig. 2 shows an overview of the feature extraction process using our static app analyzer. The features are represented in binary form and labeled with class values in all the datasets.
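As an illustration of this final encoding step, the sketch below maps a set of extracted attributes onto a fixed, ordered binary feature vector. The attribute names and the helper function are hypothetical examples, not taken from the authors' tool:

```python
# Hypothetical sketch of the binary encoding step: each app is represented
# as a 0/1 vector over a fixed, ordered feature set. The attribute names
# below are illustrative only, not the paper's actual feature list.

FEATURE_SET = [
    "android.permission.SEND_SMS",           # permission from the manifest
    "android.intent.action.BOOT_COMPLETED",  # intent filter action
    "TelephonyManager.getDeviceId",          # API call found in the .dex code
    "chmod",                                 # potentially dangerous Linux command
    "embedded_exe",                          # embedded .exe file detected
]

def to_binary_vector(extracted_attributes, feature_set=FEATURE_SET):
    """Encode one app: a feature is 1 if the attribute was found, else 0."""
    found = set(extracted_attributes)
    return [1 if feature in found else 0 for feature in feature_set]

# An app found to request SEND_SMS and to invoke chmod:
vector = to_binary_vector({"android.permission.SEND_SMS", "chmod"})
```

A row of such values, plus a class label, is what each instance in the datasets described below consists of.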
Fig. 2. Overview of the Python-based static analyzer for automated feature extraction.

B. Feature Selection

Feature ranking and selection is usually applied for dimensionality reduction, which in turn lowers model computational cost. The study in this paper utilized four datasets for evaluating DroidFusion. One of the datasets is derived from feature reduction of an initial set of 350 features down to 100 by applying the information gain (IG) feature ranking approach to rank the features and then selecting the top n features. IG evaluates the features by calculating the IG achieved by each feature. Specifically, given a feature X, IG is expressed as

IG = E(X) − E(X/Y) (31)

where E(X) and E(X/Y) represent the entropy of the feature X before and after observing the feature Y, respectively. The entropy of feature X is given by

E(X) = − Σx∈X p(x) log2(p(x)) (32)

where p(x) is the marginal probability density function for the random variable X. Similarly, the entropy of X relative to Y is given by [38]

E(X/Y) = − Σy∈Y p(y) Σx∈X p(x|y) log2(p(x|y)) (33)

where p(x|y) is the conditional probability of x given y. The higher the reduction of the entropy of feature X, the greater the significance of the feature.

C. Model Evaluation Metrics

The following performance metrics are considered in the evaluation of the models.
1) TPR: The ratio of correctly classified malicious apps to the total number of malicious apps. This is given by

TPR = TP / (TP + FN) (34)

where true positives (TP) is the number of correct predictions of malware classification and FN (false negatives) is the number of misclassified malware instances in the set. TPR is also synonymous with recall and sensitivity.
2) FPR: The ratio of incorrectly classified benign instances to the total number of benign instances, given by

FPR = FP / (TN + FP) (35)

where false positives (FP) is the number of incorrect predictions of benign classifications while TN (true negatives) is the number of correct predictions of benign instances.
3) Precision: Also known as the positive predictive rate, it is calculated as follows:

Precision = TP / (TP + FP). (36)

4) F-Measure: This metric combines precision and recall as follows:

FM = (2 · precision · recall) / (precision + recall). (37)

In [20], it has been shown that (especially for unbalanced datasets) F-measure is a better metric than the area under the receiver operating characteristic curve, which uses values of TPR and FPR to plot a graph for different thresholds. Thus, in our experiments we utilize F-measure as the main indicator of predictive power. Note that precision and recall can be calculated for both the malware and benign classes. Hence, if Fm and Fb are the F-measures for the malware and benign classes, respectively, while Nm and Nb are the number of instances in each class, the combined metric known as weighted F-measure is the sum of the F-measures weighted by the number of instances in each class, given by

WFM = (Fm · Nm + Fb · Nb) / (Nm + Nb). (38)

5) Time Taken to Test the Model: This is the time in seconds to test a constructed model on the testing set. All models were evaluated on a Windows 7 Enterprise 64-bit PC with 32 GB of RAM and an Intel Xeon CPU with 3.10-GHz speed.

D. Datasets Description

The experiments performed to evaluate DroidFusion were done using four datasets from three collections of Android app samples. Table II shows the details of each of the datasets. The first one (Malgenome-215) consists of feature vectors from 3799 app samples, where 2539 were benign and 1260 were malware samples from the Android malware genome project [3], a reference malware sample collection widely used by the malware research community. This dataset contains 215 features. The second dataset (Drebin-215) also consists of vectors of 215 features, from 15 036 app samples; of these, 9476 were benign samples while the remaining 5560 were malware samples from the Drebin project [4]. The Drebin samples are also publicly available and widely used in the research community. Both the Drebin-215 and Malgenome-215 datasets are made available in the supplementary material. The final two datasets come from the same source of samples. These are McAfee-350 and McAfee-100 in the table.
TABLE II
DATASETS USED FOR THE DROIDFUSION EVALUATION EXPERIMENTS

TABLE III
MALGENOME-215 TRAIN-VALIDATION SET RESULTS AND LEVEL-2 ALGORITHM-BASED RANKINGS FOR THE BASE CLASSIFIERS (5 = HIGHEST RANK AND 1 = LOWEST)

TABLE IV
MALGENOME-215 TRAIN-VALIDATION SET LEVEL-2 COMBINATION SCHEMES INTERMEDIATE RESULTS

They both have 36 183 instances of feature vectors derived from 13 805 malware samples and 22 378 benign samples made available to us by Intel Security. Dataset #3 has 350 features, while dataset #4 has the top 100 features with the largest IG from the original 350 features in dataset #3. In the experiments presented, datasets #1–#3 are used to investigate DroidFusion with singular base classifiers, while dataset #4 is used to study the fusion of ensemble base classifiers with DroidFusion. Note that all of the features were extracted using our static app analysis tool described in Section IV-A.

V. RESULTS AND DISCUSSION

In this section, we present and discuss the results of four sets of experiments performed to evaluate DroidFusion performance. We utilized the open source Waikato Environment for Knowledge Analysis (WEKA) toolkit [37] to implement and evaluate DroidFusion. Feature ranking and reduction of dataset #3 into dataset #4 was also done with WEKA. In all the experiments we set K = 5, i.e., five base classifiers are utilized. Also, we take N = 10 and Z = 3 for the cross-validation and weight assignments, respectively. In the first three sets of experiments, nonensemble base classifiers were used: J48, REPTree, voted perceptron, and random tree. The random tree learner was used to build two separate classifier models using different configurations, i.e., random tree-100 and random tree-9. With random trees, the number of variables selected at each split during tree construction is a configuration parameter which by default (in WEKA) is given by log2 f + 1, where f is the number of features (# variables = 9 for f = 350 with the McAfee-350 dataset). The same configuration is used in the Drebin-215 and Malgenome-215 experiments for consistency. Thus, selecting 100 and 9 for random tree-100 and random tree-9, respectively, results in two different base classifier models. Random tree, REPTree, J48, and voted perceptron were selected as example base classifiers (out of 12 base classifiers) because of their combined accuracy and training time performance as determined from a preliminary investigation; a different set of learning algorithms can be used with DroidFusion since it is designed to be general-purpose and not specific to a particular type of machine learning algorithm.

A. Performance of DroidFusion With the Malgenome-215 Dataset

In order to evaluate DroidFusion on the Malgenome-215 dataset, we split the dataset into two parts, one for testing and another for training-validation. The ratio was training-validation: 80% and testing: 20%. The stratified tenfold cross-validation approach was used to construct the DroidFusion model using the training-validation set. Table III shows the per-class accuracies of each of the five base classifiers resulting from tenfold cross-validation on the training-validation set. The subsequent rankings determined from AAB, CDB, RAPC, and RACD are also presented. Each of the algorithms induced a different set of rankings from the base classifier accuracies. After applying (9) to the instances in the training-validation set and computing the accuracies with (10)–(12), we obtained the performances of the pairwise combinations of the level-2 algorithms as shown in Table IV.

The results in Table IV clearly depict the overall performance improvement achieved by the level-2 combination schemes over the individual base classifiers. From Table III, J48 has the best malware recall of 0.975, but its recall for the benign class is 0.983. On the other hand, voted perceptron had the best recall of 0.991 for the benign class, but its recall for the malware class is 0.971 (on the training-validation set). On the training-validation set, the best combination is AAB+RAPC (i.e., the S1S3 pair), having 0.984 recall for the malware class, 0.992 recall for the benign class, and a weighted F-measure of 0.9893. J48 and voted perceptron had weighted F-measures of 0.9804 and 0.9843, respectively. These were below all of the weighted F-measures achieved by the combination schemes shown in Table IV. Hence, these intermediate training-validation set results already show the capability of the DroidFusion approach to produce stronger models from the weaker base classifiers.

After the model has been built with the help of the training-validation set, the full DroidFusion model (featuring AAB+RAPC in level-2) was evaluated on the test set. For comparison, the base classifier models were retrained on the complete training-validation set and then tested on the same test set. The results are shown in Table V. Fig. 3 is a graph of the respective weighted F-measures. The results of DroidFusion are also compared to those of three classifier
TABLE V
MALGENOME-215: COMPARISON OF DROIDFUSION WITH BASE CLASSIFIERS AND TRADITIONAL COMBINATION SCHEMES ON TEST SET

TABLE VI
DREBIN-215 TRAIN-VALIDATION SET RESULTS AND LEVEL-2 ALGORITHM-BASED RANKINGS FOR THE BASE CLASSIFIERS (5 = HIGHEST RANK AND 1 = LOWEST)

TABLE VII
DREBIN-215 TRAIN-VALIDATION SET LEVEL-2 COMBINATION SCHEMES INTERMEDIATE RESULTS

combination methods: 1) majority vote; 2) maximum probability; and 3) average of probabilities [13], and a meta-learning method known as multischeme. The multischeme approach evaluates a given number of base classifiers in order to select the best model. In WEKA, it can be configured to use cross-validation or to build its model on the entire training set. In our experiments we selected the tenfold cross-validation configuration for the multischeme learner to enable a comparative evaluation equivalent to DroidFusion. Time T(s) depicts the testing time on the entire set of instances in the test set for each of the methods in Table V.

Fig. 3. Weighted F-measure results from the Malgenome-215 dataset experiments.

On the test set, random tree-100 recorded the best weighted F-measure out of the five base classifiers. Table V shows that higher precision, recall (for both classes), and a larger weighted F-measure were obtained with DroidFusion compared to all of the base classifiers. DroidFusion also performed better than multischeme and all three combination schemes. These results with the Malgenome-215 dataset demonstrate the effectiveness of the DroidFusion approach.

B. Performance of DroidFusion With the Drebin-215 Dataset

In this section, we present the evaluation of DroidFusion on the Drebin-215 dataset. Table VI shows the predictive accuracies on the five nonensemble base classifiers on the training-validation set during DroidFusion model training. The split ratios for the training-validation and testing sets were 90%:10%, and the tenfold cross-validation procedure was utilized during training. The rankings induced by the AAB, CDB, RAPC, and RACD algorithms are also shown. Again, applying (9) to the instances in the training-validation set and computing accuracies with (10)–(12), the performances of the pairwise combinations of the level-2 algorithms are shown in Table VII.

From Table VI, random tree-100 had the best recall rate for the malware class (i.e., 0.968) while J48 had the best recall rate for the benign class (0.983). On the training-validation set, the weighted F-measure for random tree-100 was 0.9762, while that of J48 was 0.9741. Looking at Table VII, all of the combination schemes had better weighted F-measures (than the base classifiers), indicating accuracy performance enhancement potential at this stage. The best combination is the RAPC+RACD (S3S4 pair) scheme, whose configuration is selected to build the final DroidFusion model.

After the full DroidFusion model was built, it was then evaluated on the test set. The base classifiers were retrained on the entire training-validation set and tested on the test set for comparison. The results are presented in Table VIII, where random tree-100 can be seen to have the best weighted F-measure (0.9824) out of the five base classifiers. The DroidFusion model recorded the best precision and recall (for both classes) compared to the base classifiers, resulting in a weighted F-measure of 0.9872. Fig. 4 illustrates the graph of F-measures for the test set results. DroidFusion can be seen to also perform better than majority vote, maximum probability, average of probabilities, and multischeme. These results clearly demonstrate the effectiveness of the DroidFusion approach.
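The weighted F-measure used as the main accuracy indicator in these comparisons follows directly from (34)–(38). A minimal sketch, with made-up counts rather than figures from the tables:

```python
def f_measure(tp, fp, fn):
    """F-measure from raw counts, combining precision (36) and recall (34) as in (37)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f_measure(fm_malware, n_malware, fm_benign, n_benign):
    """Class-size-weighted F-measure, as in (38)."""
    return (fm_malware * n_malware + fm_benign * n_benign) / (n_malware + n_benign)

# Illustrative counts only: 95 of 100 malware and 190 of 200 benign classified correctly.
fm_m = f_measure(tp=95, fp=10, fn=5)    # malware treated as the positive class
fm_b = f_measure(tp=190, fp=5, fn=10)   # benign treated as the positive class
wfm = weighted_f_measure(fm_m, 100, fm_b, 200)
```

Because the two per-class F-measures are weighted by class size, WFM is dominated by the majority class, which is why it is a more informative single number than accuracy on the unbalanced datasets used here.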
TABLE VIII
DREBIN-215: COMPARISON OF DROIDFUSION WITH BASE CLASSIFIERS AND TRADITIONAL COMBINATION SCHEMES ON TEST SET

TABLE IX
MCAFEE-350 TRAIN-VALIDATION SET RESULTS AND LEVEL-2 ALGORITHM-BASED RANKINGS FOR THE BASE CLASSIFIERS (5 = HIGHEST RANK AND 1 = LOWEST)

TABLE X
MCAFEE-350 TRAIN-VALIDATION SET LEVEL-2 COMBINATION SCHEMES INTERMEDIATE RESULTS

Fig. 4. Weighted F-measure results from the Drebin-215 dataset experiments.

C. Performance of DroidFusion With the McAfee-350 Dataset

In this section, the results of experiments on the McAfee-350 dataset are presented. The same split ratios for training-validation/testing and the procedures applied in the previous experiment were adopted. The rankings from AAB, CDB, RAPC, and RACD are shown in Table IX alongside the per-class accuracy performances on the validation set that induced the rankings. Just like in the previous experiments, we apply (9) to the instances in the training-validation set and compute the accuracies with (10)–(12). The resulting performances of the pairwise combinations of the level-2 algorithms are shown in Table X.

From Table IX (training-validation set results for the base classifiers), J48 had the best benign class recall of 0.973 amongst the five base classifiers. Random tree-100 had the best malware class recall of 0.948 out of the five base classifiers. J48 had the highest weighted F-measure of 0.9684. This is less than the weighted F-measure of all combination schemes (shown in Table X) except the AAB+CDB scheme, which had a weighted F-measure of 0.9618. These intermediate results of the DroidFusion approach demonstrate the potential performance improvement obtainable in the final model.

Table XI shows the results of the base classifiers and the final DroidFusion model on the test set. The table and the graphs in Fig. 5 clearly show that DroidFusion increases performance accuracy over the single-algorithm base classifiers. The DroidFusion results are equal to those of majority vote and average of probabilities, but better than those of the maximum probability and multischeme methods. This is because (17) selected mv as the strongest classifier over any of the pairs, based on the computations on the initial N-fold cross-validation predictions of the base classifiers. The mv scheme in this case achieved a W-FM of 0.9735, compared to 0.9724 obtained by CDB+RAPC (S2S3 pair) and RAPC+RACD (S3S4 pair). Therefore, DroidFusion was configured to use (13)–(16) on the test set. However, if either of the strongest pairs had been used, it would result in a weighted F-measure performance of 0.9777 on the test set, which still surpasses the weighted F-measure from maximum probability (0.9423) and those of the five original base classifiers. These results once again confirm the effectiveness of the proposed DroidFusion approach. In the next section, we present results obtained from experiments investigating ensemble learners as base classifiers.

D. Performance of DroidFusion With the McAfee-100 Dataset Using Ensemble Learners As Base Classifiers

In this section, we present the results of experiments performed to investigate the feasibility of utilizing DroidFusion to enhance accuracy performance by combining ensemble classifiers rather than traditional singular classifiers. Ensemble learners have been shown to perform well in classification problems [14], [33]. Our goal is to investigate whether, by using DroidFusion for fusion of ensemble classifiers, further accuracy improvements can be achieved. For our ensemble learning-based experiments, we reduced the number of features from 350 down to 100 using the IG feature ranking technique (31)–(33). The ensemble learners considered as example
TABLE XI
MCAFEE-350: COMPARISON OF DROIDFUSION WITH BASE CLASSIFIERS AND TRADITIONAL COMBINATION SCHEMES ON TEST SET

TABLE XII
MCAFEE-100 TRAIN-VALIDATION SET RESULTS AND LEVEL-2 ALGORITHM-BASED RANKINGS FOR THE (ENSEMBLE) BASE CLASSIFIERS (5 = HIGHEST RANK AND 1 = LOWEST)

TABLE XIII
MCAFEE-100 TRAIN-VALIDATION SET LEVEL-2 COMBINATION SCHEMES INTERMEDIATE RESULTS

Fig. 5. Weighted F-measure results from the McAfee-350 dataset experiments.

Fig. 6. Weighted F-measure results from the McAfee-100 dataset experiments with ensemble base classifiers.

base classifiers include: random forest [39], AdaBoost [40] (with a random tree base classifier), random committee (with a random tree base classifier), random subspace [41] (with a random tree base classifier), and random subspace with a REPTree base classifier. Note that the two random subspace learners with different base classifiers yield completely different models. In terms of the number of iterations for the ensemble learners, the configurations used were: AdaBoost (25 iterations); random forest, random committee, and random subspace (ten iterations each). Our choice of random tree as the base learner for the ensemble (base) classifiers comes from our preliminary experiments (omitted due to space constraints), which also confirm the previous suggestion that it produces the strongest classifiers for most ensemble methods [42]. In the preliminary experiments, it was also found that by taking the top 100 features only a marginal drop in performance was observed for the ensemble base classifiers. Hence, this enabled us to undertake the experiments with ensemble classifiers using a significantly reduced dimension while using the same number of instances (i.e., 36 183).

Table XII shows the accuracy performance of the five ensemble models used as the DroidFusion base classifiers on the training-validation set instances (using tenfold cross-validation). The corresponding AAB, CDB, RAPC, and RACD rankings are also depicted. Similar to the previous experiments, the level-2 combination schemes' performance improves on that of the individual ensemble classifiers. This is also indicative of the potential performance improvement obtainable when the final model is constructed. In this case, AAB+RAPC (S1S3 pair) is the recommended configuration, as seen from the Table XIII results.

In Table XIV, the test set results of the ensemble classifiers and those of DroidFusion are given. The results of multischeme, majority vote, average of probabilities, and maximum probability are also shown. DroidFusion improves benign recall rates over all of the ensemble models in the base classifier level. The overall weighted F-measure of DroidFusion is the highest, as shown in Table XIV and the Fig. 6 graphs. This shows that the DroidFusion approach can also be effectively applied for fusion of ensemble classifiers.
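To make the ranking-based fusion idea concrete, the sketch below derives vote weights from validation-accuracy ranks and uses them in a weighted vote. This is a simplified illustration of the general principle only (with made-up accuracies), not the paper's AAB, CDB, RAPC, or RACD algorithms:

```python
def rank_weights(validation_accuracies):
    """Assign each base classifier a weight equal to its accuracy rank
    (1 = lowest accuracy, ..., K = highest accuracy)."""
    order = sorted(range(len(validation_accuracies)),
                   key=lambda i: validation_accuracies[i])
    weights = [0] * len(validation_accuracies)
    for rank, idx in enumerate(order, start=1):
        weights[idx] = rank
    return weights

def fuse_predictions(predictions, weights):
    """Weighted vote over binary class predictions (1 = malware, 0 = benign)."""
    malware_score = sum(w for p, w in zip(predictions, weights) if p == 1)
    benign_score = sum(w for p, w in zip(predictions, weights) if p == 0)
    return 1 if malware_score >= benign_score else 0

# Hypothetical validation accuracies for five base classifiers:
weights = rank_weights([0.962, 0.951, 0.974, 0.948, 0.970])
decision = fuse_predictions([1, 0, 1, 0, 0], weights)
```

The point of rank-derived weights (as opposed to a trained meta classifier) is that level-2 fusion reduces to a purely computational step over the validation results, which is the property DroidFusion exploits.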
TABLE XIV
MCAFEE-100: COMPARISON OF DROIDFUSION WITH (ENSEMBLE) BASE CLASSIFIERS AND TRADITIONAL COMBINATION SCHEMES ON TEST SET

TABLE XV
DROIDFUSION VERSUS STACKED GENERALIZATION FOR THE FOUR DATASETS

TABLE XVI
ANALYSIS OF APP PROCESSING TIME

Fig. 7. DroidFusion versus stacking results (weighted F-measure).

E. Performance of DroidFusion Versus Stacked Generalization

Stacked generalization [43] has a similar (multilevel) architecture to DroidFusion. It is also a well-known framework for classifier fusion which has been extensively studied and applied to many machine learning problems. For this reason, we compared our proposed approach to the stacked generalization method. One noticeable difference between our approach and stacked generalization is that instead of training with a meta-learner in level-2, we utilized a computational approach where ranking algorithms are used to combine the outcomes of the lower-level classifiers. We used the StackingC implementation in WEKA, which uses a linear regression meta classifier in level-2. Note that this is considered to be the most effective stacked generalization configuration [44] (given that any learning algorithm can be chosen as the meta classifier). The StackingC learner is also configured to use tenfold cross-validation when combining the base learners.

Applying the stacked generalization algorithm to the same base classifiers and with the same four datasets, the results are given in Fig. 7 and Table XV. From Fig. 7, the comparative weighted F-measure results for the four datasets show that StackingC achieved a better performance only in the case of the Malgenome-215 dataset. On all the other three datasets, DroidFusion performed better. A notable advantage of DroidFusion over stacking is that it provides a wider range of criteria for weighting and fusion of base classifiers through the use of four separate algorithms; by contrast, stacking (with a linear regression meta classifier) effectively combines classifiers based on only one criterion, i.e., weighting the base classifiers according to their relative strengths (overall performance accuracies) [44].

F. Analysis of Time Performance

As mentioned earlier, the app processing to extract features was done using our bespoke Python-based tool described in Section IV-A. Table XVI presents an overview of app processing time estimates. This is dependent on the size of the app, which can range between a few kilobytes and several megabytes. Hence, the average unzipping and disassembly time was 0.739 s, while the average time to analyze the manifest and extract permissions, intents, etc. was 0.0048 s. The rest of the processing involves mining the disassembled files and scanning for other attributes. This took on average 6.4 s. The total average processing time for the apps was therefore approximately 7.145 s. During the experiments the feature vectors were fed into trained models for testing. The DroidFusion model testing times were 0.07 s (for 759 instances), 0.38 s (for 1503 instances), 7.02 s (for 3618 instances), and 0.22 s (for 3618 instances) in the four sets of experimental results presented earlier. These figures clearly illustrate the scalability of the static features-based solution, with only an average of just over 7 s required to process an app and classify it using a trained DroidFusion model. Thus, it is feasible in practice to
deploy the system for scenarios requiring rapid analyses for large-scale vetting or screening of apps.

Note that although this paper is based on specific static features, classifiers trained from other types of features can also be combined using DroidFusion. Basically, DroidFusion is agnostic to the feature engineering process.

G. Limitations of DroidFusion

Although the proposed general-purpose DroidFusion approach has been demonstrated empirically to enable improved accuracy performance by classifier fusion, there is scope for further improvement. The current DroidFusion design is aimed at binary classification. Future work could investigate extending the algorithms in the DroidFusion framework to handle multiclass problems.

VI. CONCLUSION

In this paper, we proposed a novel general-purpose multilevel classifier fusion approach (DroidFusion) for Android malware detection. The DroidFusion framework is based on four proposed ranking-based algorithms that enable higher-level fusion using a computational approach rather than the traditional meta classifier training that is used, for example, in stacked generalization. We empirically evaluated DroidFusion using four separate datasets. The results presented demonstrate its effectiveness for improving performance using both nonensemble and ensemble base classifiers. Furthermore, we showed that our proposed approach can outperform stacked generalization whilst utilizing only computational processes for model building rather than training a meta classifier at the higher level.

REFERENCES

[1] Smartphone OS Market Share Worldwide 2009-2015 Statistics, Statista, Hamburg, Germany, 2017. [Online]. Available: https://www.statista.com/statistics/263453/global-market-share-held-by-smartphone-operating-systems
[2] McAfee Labs Threat Predictions Report, McAfee Labs, Santa Clara, CA, USA, Mar. 2016.
[3] Y. Zhou and X. Jiang, “Dissecting Android malware: Characterization and evolution,” in Proc. IEEE Symp. Security Privacy (SP), San Francisco, CA, USA, May 2012, pp. 95–109.
[4] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, “Drebin: Efficient and explainable detection of Android malware in your pocket,” in Proc. 20th Annu. Netw. Distrib. Syst. Security Symp. (NDSS), San Diego, CA, USA, Feb. 2014, pp. 1–15.
[5] A. Apvrille and R. Nigam. (Jul. 2014). Obfuscation in Android Malware and How to Fight Back, Virus Bulletin. Accessed: Sep. 2017. [Online]. Available: https://www.virusbulletin.com/virusbulletin/2014/07/obfuscation-android-malware-and-how-fight-back
[6] Y. Jing, Z. Zhao, G.-J. Ahn, and H. Hu, “Morpheus: Automatically generating heuristics to detect Android emulators,” in Proc. 30th Annu. Comput. Security Appl. Conf. (ACSAC), New Orleans, LA, USA, Dec. 2014, pp. 216–225.
[7] T. Vidas and N. Christin, “Evading Android runtime analysis via sandbox detection,” in Proc. 9th ACM Symp. Inf. Comput. Commun. Security, Kyoto, Japan, Jun. 2014, pp. 447–458.
[8] T. Petsas, G. Voyatzis, E. Athanasopoulos, M. Polychronakis, and S. Ioannidis, “Rage against the virtual machine: Hindering dynamic analysis of Android malware,” in Proc. 7th Eur. Workshop Syst. Security (EuroSec), Amsterdam, The Netherlands, Apr. 2014, p. 5.
[9] F. Matenaar and P. Schulz. (Aug. 2012). Detecting Android Sandboxes. Accessed: Nov. 2017. [Online]. Available: http://www.dexlabs.org/blog/btdetect
[10] S. R. Choudhary, A. Gorla, and A. Orso, “Automated test input generation for Android: Are we there yet?” in Proc. 30th IEEE/ACM Int. Conf. Autom. Softw. Eng. (ASE), Nov. 2015, pp. 429–440.
[11] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu, “DroidMat: Android malware detection through manifest and API calls tracing,” in Proc. 7th Asia Joint Conf. Inf. Security (Asia JCIS), 2012, pp. 62–69.
[12] S. Y. Yerima, S. Sezer, and I. Muttik, “Android malware detection: An eigenspace analysis approach,” in Proc. Sci. Inf. Conf. (SAI), London, U.K., Jul. 2015, pp. 1236–1242.
[13] S. Y. Yerima, S. Sezer, and I. Muttik, “Android malware detection using parallel machine learning classifiers,” in Proc. 8th Int. Conf. Next Gener. Mobile Apps Services Technol. (NGMAST), Oxford, U.K., Sep. 2014, pp. 37–42.
[14] S. Y. Yerima, S. Sezer, and I. Muttik, “High accuracy Android malware detection using ensemble learning,” IET Inf. Security, vol. 9, no. 6, pp. 313–320, Nov. 2015.
[15] M. V. Varsha, P. Vinod, and K. A. Dhanya, “Identification of malicious Android app using manifest and opcode features,” J. Comput. Virol. Hacking Tech., vol. 13, no. 2, pp. 125–138, 2017.
[16] A. Sharma and S. K. Dash, “Mining API calls and permissions for Android malware detection,” in Cryptology and Network Security. Cham, Switzerland: Springer Int., 2014, pp. 191–205.
[17] P. P. K. Chan and W.-K. Song, “Static detection of Android malware by using permissions and API calls,” in Proc. Int. Conf. Mach. Learn. Cybern., vol. 1. Lanzhou, China, Jul. 2014, pp. 82–87.
[18] W. Wang et al., “Exploring permission-induced risk in Android applications for malicious application detection,” IEEE Trans. Inf. Forensics Security, vol. 9, no. 11, pp. 1869–1882, Nov. 2014.
[19] M. Fan et al., “DAPASA: Detecting Android piggybacked apps through sensitive subgraph analysis,” IEEE Trans. Inf. Forensics Security, vol. 12, no. 8, pp. 1772–1785, Aug. 2017.
[20] L. Cen, C. S. Gates, L. Si, and N. Li, “A probabilistic discriminative model for Android malware detection with decompiled source code,” IEEE Trans. Depend. Secure Comput., vol. 12, no. 4, pp. 400–412, Jul./Aug. 2015.
[21] Westyarian, Y. Rosmansyah, and B. Dabarsyan, “Malware detection on Android smartphones using API class and machine learning,” in Proc. Int. Conf. Elect. Eng. Informat. (ICEEI), Aug. 2015, pp. 294–297.
[22] F. Idrees and M. Rajarajan, “Investigating the Android intents and permissions for malware detection,” in Proc. 10th IEEE Int. Conf. Wireless Mobile Comput. Netw. Commun. (WiMob), Oct. 2014, pp. 354–358.
[23] B. Kang, S. Y. Yerima, S. Sezer, and K. McLaughlin, “N-gram opcode analysis for Android malware detection,” Int. J. Cyber Situational Awareness, vol. 1, no. 1, pp. 231–254, Nov. 2016.
[24] M. Zhao, F. Ge, T. Zhang, and Z. Yuan, “AntiMalDroid: An efficient SVM-based malware detection framework for Android,” in Communications in Computer and Information Science, vol. 243, C. Liu, J. Chang, and A. Yang, Eds. Heidelberg, Germany: Springer, 2011, pp. 158–166.
[25] W.-C. Wu and S.-H. Hung, “DroidDolphin: A dynamic Android malware detection framework using big data and machine learning,” in Proc. ACM Conf. Res. Adapt. Convergent Syst. (RACS), Towson, MD, USA, 2014, pp. 247–252.
[26] V. M. Afonso, M. F. de Amorim, A. R. A. Grégio, G. B. Junquera, and P. L. de Geus, “Identifying Android malware using dynamically obtained features,” J. Comput. Virol. Hacking Tech., vol. 11, no. 1, pp. 9–17, 2014.
[27] M. K. Alzaylaee, S. Y. Yerima, and S. Sezer, “EMULATOR vs REAL PHONE: Android malware detection using machine learning,” in Proc. 3rd ACM Int. Workshop Security Privacy Anal. (IWSPA), Scottsdale, AZ, USA, Mar. 2017, pp. 65–72.
[28] M. Lindorfer, M. Neugschwandtner, and C. Platzer, “MARVIN: Efficient and comprehensive mobile app classification through static and dynamic analysis,” in Proc. IEEE 39th Annu. Comput. Softw. Appl. Conf. (COMPSAC), 2015, pp. 422–433.
[29] D. Gaikwad and R. Thool, “DAREnsemble: Decision tree and rule learner based ensemble for network intrusion detection system,” in Proc. 1st Int. Conf. Inf. Commun. Technol. Intell. Syst., 2016, pp. 185–193.
[30] A. Balon-Perlin and B. Gambäck, “Ensembles of decision trees for network intrusion detection systems,” Int. J. Adv. Security, vol. 6, nos. 1–2, pp. 62–77, 2013.
[31] M. Panda and M. R. Patra, “Ensembling rule based classifiers for detecting network intrusions,” in Proc. Int. Conf. Adv. Recent Technol. Commun. Comput., 2009, pp. 19–22, doi: 10.1109/ARTCom.2009.121.
[32] A. Zainal, M. A. Maarof, S. M. Shamsuddin, and A. Abraham, "Ensemble of one-class classifiers for network intrusion detection system," in Proc. 4th Int. Conf. Inf. Assurance Security, 2008, pp. 180–185, doi: 10.1109/IAS.2008.35.
[33] L. D. Coronado-De-Alba, A. Rodriguez-Mota, and P. J. Escamilla-Ambrosio, "Feature selection and ensemble of classifiers for Android malware detection," in Proc. 8th IEEE Latin Amer. Conf. Commun. (LATINCOM), Nov. 2016, pp. 1–6.
[34] M. K. Alzaylaee, S. Y. Yerima, and S. Sezer, "Improving dynamic analysis of Android apps using hybrid test input generation," in Proc. Int. Conf. Cyber Security Protect. Digit. Services (Cyber Security), London, U.K., Jun. 2017, pp. 1–8.
[35] Y. Aafer, W. Du, and H. Yin, "DroidAPIMiner: Mining API-level features for robust malware detection in Android," in Proc. 9th Int. Conf. Security Privacy Commun. Netw. (SecureComm), Sydney, NSW, Australia, Sep. 2013, pp. 86–103.
[36] T. Book, A. Pridgen, and D. S. Wallach, "Longitudinal analysis of Android ad library permissions," in Proc. Mobile Security Technol. Conf. (MoST), San Francisco, CA, USA, May 2013.
[37] M. Hall et al., "The WEKA data mining software: An update," ACM SIGKDD Explor. Newslett., vol. 11, no. 1, pp. 10–18, Jun. 2009.
[38] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ, USA: Wiley, 2006, p. 41.
[39] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[40] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proc. 13th Int. Conf. Mach. Learn., Bari, Italy, 1996, pp. 148–156.
[41] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832–844, Aug. 1998.
[42] T. K. Ho, "Random decision forests," in Proc. 3rd Int. Conf. Document Anal. Recognit., 1995, pp. 278–282.
[43] D. H. Wolpert, "Stacked generalization," Neural Netw., vol. 5, no. 2, pp. 241–259, 1992.
[44] K. M. Ting and I. H. Witten, "Issues in stacked generalization," J. Artif. Intell. Res., vol. 10, no. 1, pp. 271–289, Jan. 1999.
[45] T. Ban, T. Takahashi, S. Guo, D. Inoue, and K. Nakao, "Integration of multi-modal features for Android malware detection using linear SVM," in Proc. 11th Asia Joint Conf. Inf. Security, 2016, pp. 141–146.
[46] Z. Ni, M. Yang, Z. Ling, J.-N. Wu, and J. Luo, "Real-time detection of malicious behavior in Android apps," in Proc. Int. Conf. Adv. Cloud Big Data (CBD), Chengdu, China, 2016, pp. 221–227.
[47] Z. Wang, J. Chai, S. Chen, and W. Li, "DroidDeepLearner: Identifying Android malware using deep learning," in Proc. IEEE 37th Sarnoff Symp., Newark, NJ, USA, 2016, pp. 160–165.
[48] S. Wu, P. Wang, X. Li, and Y. Zhang, "Effective detection of Android malware based on the usage of data flow APIs and machine learning," Inf. Softw. Technol., vol. 75, pp. 17–25, Jul. 2016.
[49] M.-Y. Su, J.-Y. Chang, and K.-T. Fung, "Machine learning on merging static and dynamic features to identify malicious mobile apps," in Proc. 9th Int. Conf. Ubiquitous Future Netw. (ICUFN), Milan, Italy, Jul. 2017, pp. 863–867.
[50] N. Milosevic, A. Dehghantanha, and K.-K. R. Choo, "Machine learning aided Android malware classification," Comput. Elect. Eng., vol. 61, pp. 266–274, Jul. 2017.
[51] W. Wang, Y. Li, X. Wang, J. Liu, and X. Zhang, "Detecting Android malicious apps and categorizing benign apps with ensemble of classifiers," Future Gener. Comput. Syst., vol. 78, pp. 987–994, Jan. 2017.
[52] X. Wang et al., "Characterizing Android apps' behavior for effective detection of malapps at large scale," Future Gener. Comput. Syst., vol. 75, pp. 30–45, Oct. 2017.
[53] A. Mahindru and P. Singh, "Dynamic permissions based Android malware detection using machine learning techniques," in Proc. 10th Innov. Softw. Eng. Conf., Jaipur, India, Feb. 2017, pp. 202–210.
[54] M. Yang, S. Wang, Z. Ling, Y. Liu, and Z. Ni, "Detection of malicious behavior in Android apps through API calls and permission uses analysis," Concurrency Comput. Pract. Exp., vol. 29, no. 19, 2017, Art. no. e4172, doi: 10.1002/cpe.4172.
[55] F. Idrees, M. Rajarajan, M. Conti, T. M. Chen, and Y. Rahulamathavan, "PIndroid: A novel Android malware detection system using ensemble learning methods," Comput. Security, vol. 68, pp. 36–46, Jul. 2017.

Suleiman Y. Yerima (M'04) received the B.Eng. degree (First Class) in electrical and computer engineering from the Federal University of Technology, Minna, Nigeria, the M.Sc. degree (with distinction) in personal, mobile, and satellite communications from the University of Bradford, Bradford, U.K., and the Ph.D. degree in mobile computing and communications from the University of South Wales, Pontypridd, U.K. (formerly, the University of Glamorgan), in 2009.
He is a Senior Lecturer of cyber security with De Montfort University, Leicester, U.K. He was a Research Fellow with the Centre for Secure Information Technologies, Queen's University Belfast, Belfast, Northern Ireland, where he led the mobile security research theme from 2012 to 2017. He was a member of the Mobile Computing, Communications and Networking Research Group with the University of Glamorgan from 2005 to 2009. From 2010 to 2012, he was with the U.K.–India Advanced Technology Centre of Excellence in Next Generation Networks, Systems and Services, University of Ulster, Coleraine, Northern Ireland.
Dr. Yerima is a member of the IAENG professional societies. He is also a Certified Information Systems Security Professional and a Certified Ethical Hacker. He was the recipient of the 2017 IET Information Security premium (best paper) award.

Sakir Sezer (M'00) received the Dipl.-Ing. degree in electrical and electronic engineering from RWTH Aachen University, Aachen, Germany, and the Ph.D. degree from Queen's University Belfast, Belfast, Northern Ireland, in 1999.
He is currently the Secure Digital Systems Research Director and the Head of network security research with the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast. He is also the Cofounder and the CTO of Titan IC Systems, Belfast. His research is leading major (patented) advances in the field of high-performance content processing and is currently commercialized by Titan IC Systems. He has co-authored over 120 conference and journal papers in the areas of high-performance networks, content processing, and system on chip.
Prof. Sezer was a recipient of a number of prestigious awards, including InvestNI, Enterprise Ireland, and InterTradeIreland innovation and enterprise awards, and the InvestNI Enterprise Fellowship. He is a member of the IEEE International System-on-Chip Conference Executive Committee.