
Deep Android Malware Detection

McLaughlin, N., Martinez del Rincon, J., Kang, B., Yerima, S., Miller, P., Sezer, S., Safaeisemnani, Y., Trickel, E., Zhao, Z., Doupé, A., & Ahn, G.-J. (2017). Deep Android Malware Detection. In Proceedings of the ACM Conference on Data and Applications Security and Privacy (CODASPY 2017). Association for Computing Machinery. https://doi.org/10.1145/3029806.3029823

Published in:
Proceedings of the ACM Conference on Data and Applications Security and Privacy (CODASPY) 2017

Document Version:
Peer reviewed version



Deep Android Malware Detection


Niall McLaughlin∗, Jesus Martinez del Rincon, BooJoong Kang, Suleiman Yerima, Paul Miller, Sakir Sezer
Centre for Secure Information Technologies (CSIT)
Queen's University Belfast, UK

Yeganeh Safaei, Erik Trickel, Ziming Zhao, Adam Doupé, Gail-Joon Ahn
Center for Cybersecurity and Digital Forensics
Arizona State University, USA

∗Corresponding author: [email protected]

ABSTRACT

In this paper, we propose a novel Android malware detection system that uses a deep convolutional neural network (CNN). Malware classification is performed based on static analysis of the raw opcode sequence from a disassembled program. Features indicative of malware are automatically learned by the network from the raw opcode sequence, thus removing the need for hand-engineered malware features. The training pipeline of our proposed system is much simpler than existing n-gram based malware detection methods, as the network is trained end-to-end to jointly learn appropriate features and to perform classification, thus removing the need to explicitly enumerate millions of n-grams during training. The network design also allows the use of long n-gram like features, not computationally feasible with existing methods. Once trained, the network can be efficiently executed on a GPU, allowing a very large number of files to be scanned quickly.

CCS Concepts

•Security and privacy → Malware and its mitigation; Software and application security; •Computing methodologies → Neural networks;

Keywords

Malware Detection, Android, Deep Learning

1. INTRODUCTION

Malware detection is a growing problem, especially in mobile platforms. Given the proliferation of mobile devices and their associated app-stores, the volume of new applications is too large to manually examine each application for malicious behavior. Malware detection has traditionally been based on manually examining the behavior and/or de-compiled code of known malware programs in order to design malware signatures by hand. This process does not easily scale to large numbers of applications, especially given the static nature of signature based malware detection, meaning that new malware can be designed to evade existing signatures. Consequently, there has recently been a large volume of work on automatic malware detection using ideas from machine learning. Various methods have been proposed based on examining the dynamic application behavior [18, 21], requested permissions [14, 16, 19] and the n-grams present in the application byte-code [7, 11, 10]. However, many of these methods are reliant on expert analysis to design the discriminative features that are passed to the machine learning system used to make the final classification decision.

Recently, convolutional networks have been shown to perform well on a variety of tasks related to natural language processing [12, 26]. In this work we investigate the application of convolutional networks to malware detection by treating the disassembled byte-code of an application as a text to be analyzed. This approach has the advantage that features are automatically learned from raw data, and hence removes the need for malware signatures to be designed by hand. Our proposed malware detection method is computationally efficient as training and testing time is linearly proportional to the number of malware examples. The detection network can be run on a GPU, which is now a standard component of many mobile devices, meaning a large number of malware files can be scanned per-second. In addition, we expect that as more training data is provided the accuracy of malware detection will improve, because neural networks have been shown to have a very high learning capacity, and hence can benefit from very large training-sets [20].

Our proposed malware detection method takes inspiration from existing n-gram based methods [7, 11, 10], but unlike existing methods there is no need to exhaustively enumerate a large number of n-grams during training. This is because the convolutional network can intrinsically learn to detect n-gram like signatures by learning to detect sequences of opcodes that are indicative of malware. In addition, our proposed method allows very long n-gram type signatures to be discovered, which would be impractical if explicit enumeration of all n-grams was required. The malware signatures found by the proposed method may be complementary to those discovered by hand, as the automated system will have different strengths and biases from human analysts; therefore they could be valuable for use in conjunction with conventional malware signature databases.
Once our system has been trained, large numbers of files can be efficiently scanned using a GPU implementation, and given that new malware is constantly appearing, a useful feature of our proposed method is that it can be re-trained with new malware samples to adapt to the changing malware environment.

2. RELATED WORK

2.1 Malware Detection

Learning based approaches using hand-designed features have been applied extensively to both dynamic [18, 21] and static [23, 22, 25] malware detection. A variety of similar approaches to static malware detection have used manually derived features, such as API calls, intents, permissions and commands, with different classifiers such as support vector machine (SVM) [5], Naive Bayes, and k-Nearest Neighbor [19]. Malware detection approaches have also been proposed that use static features derived exclusively from the permissions requested by the application [14, 16].

In contrast with approaches using high-level hand-designed features, n-gram based malware detection uses sequences of low-level opcodes as features. The n-gram features can be used to train a classifier to distinguish between malware and benign software [10]. Perhaps surprisingly, even a 1-gram based feature, which is simply a histogram of the number of times each opcode is used, can distinguish malware from benign software [7]. The length of the n-gram used [10] and the number of n-gram sequences used in classification [7] can both have an effect on the accuracy of the classifier. However, increasing either parameter can massively increase the computational resources needed [7], which is clearly a disadvantage of standard n-gram based malware detection approaches. N-gram methods also require feature selection to reduce the length of the feature-vector, which would otherwise be millions of elements long in the case of long n-grams. In this work we propose a method that allows very long n-gram features to be used, and allows an n-gram classifier to be trained in a much more efficient manner, based on neural networks.

2.2 Neural Networks

Recently, convolutional neural networks (CNNs) have shown state-of-the-art performance for object recognition in images [20] and natural language processing (NLP) [12]. In NLP, local patterns of symbols, known as n-grams, have been used as features for a variety of tasks [27]. It has recently been shown that if sufficient training data is available, very deep CNNs can outperform traditional NLP methods [26] across a range of text classification tasks. We postulate that static malware analysis has much in common with NLP, as the analysis of the disassembled source code of a given program can be understood as a form of textual processing. Therefore, techniques such as CNNs have huge potential to be applied in the field of malware detection.

A variety of approaches to malware detection using other neural network architectures have been proposed. Several of the proposed methods are based on learning which sequences of operating system calls or API calls are indicative of malware [15, 9, 8] during dynamic analysis. The existing neural network based approaches to malware detection differ from our proposed method as they make use of a virtual machine to capture dynamic behavioural features [15, 9, 8]. This may prove problematic given that malware is often designed to detect when it is being run in a virtual environment in order to evade detection. Other existing neural network based malware detection methods use hand-designed features, which may not be the optimal way to detect malware [17]. We will attempt to address the limitations of existing neural network based malware detection methods, by using a novel static analysis method based on a CNN architecture that automatically learns an appropriate feature representation from raw data.

In this work we apply convolutional neural networks to the problem of malware detection. The CNN learns to detect patterns in the disassembled byte-code of applications that are indicative of malware. Our approach has several advantages over existing methods of malware detection, such as those based on high-level hand-designed features and those based on detection of n-grams. Scalability and performance are major drawbacks of existing n-gram based approaches, as the length of the feature vector grows rapidly when increasing the n-gram length. In contrast, our approach eliminates the need for counting and storing millions of n-grams during training and can learn longer n-grams than conventional methods used for malware detection. The improved efficiency makes it possible to use our proposed method with larger datasets, where the use of traditional methods would be intractable. Our whole system is jointly optimized to perform feature extraction and classification simultaneously by showing the system a large number of labeled samples. This removes the need for hand-designed features, as features are automatically learned during supervised network training, and removes the need for an ad-hoc pipeline consisting of feature-extraction, feature-selection and classification, as feature extraction and classification are optimized together. The existence of a fully end-to-end system also saves time when the system is presented with new malware to be recognized, as the network can easily be updated by simply increasing the size of the training-set, which may also improve its overall accuracy. Finally, the features discovered by our method may be different from, and complementary to, those discovered by manual analysis.

3. METHOD

In this work we propose a malware detection method that uses a convolutional network to process the raw Dalvik byte-code of an Android application. The overall structure of the malware detection network is shown in Fig. 2. In the following section we will first explain how an Android application is disassembled to give a sequence of raw Dalvik byte-codes, and then explain how this byte-code sequence is processed by the convolutional network.

3.1 Disassembly of Android Application

In our system, the preprocessing of an application consists of disassembling the application and extracting opcode sequences for static malware analysis, as shown in Fig. 1. An Android application is an apk file, which is a compressed file containing the code files, the AndroidManifest.xml file, and the application resource files. A code file is a dex file that can be transformed into smali files, where each smali file represents a single class and contains the methods of such a class. Each method contains instructions and each instruction consists of a single opcode and multiple operands. We disassemble each application using baksmali [1] to obtain the smali files that contain the human-readable Dalvik byte-code of the application, then extract the opcode sequence from each method, discarding the operands. As the result of the preprocessing we obtain all the opcode sequences from all the classes of the application. The opcode sequences from all classes are then concatenated to give a single sequence of opcodes representing the whole application.
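To make this preprocessing step concrete, the following is a minimal Python sketch of the disassembly and opcode extraction. It is illustrative rather than the authors' pipeline: it assumes a baksmali jar is available (the exact command-line flags depend on the baksmali version) and uses a simplified heuristic that treats the first token of every instruction line inside a .method block as the opcode mnemonic.

```python
# Illustrative sketch (not the authors' code): disassemble an APK with baksmali
# and concatenate per-method opcode sequences, discarding the operands.
import re
import subprocess
from pathlib import Path

def disassemble(apk_path: str, out_dir: str = "smali_out") -> Path:
    # baksmali >= 2.2 can disassemble an APK/DEX directly into .smali files;
    # older versions use slightly different flags.
    subprocess.run(
        ["java", "-jar", "baksmali.jar", "disassemble", apk_path, "-o", out_dir],
        check=True,
    )
    return Path(out_dir)

def extract_opcodes(smali_dir: Path) -> list:
    opcodes = []
    in_method = False
    for smali_file in sorted(smali_dir.rglob("*.smali")):
        for line in smali_file.read_text(errors="ignore").splitlines():
            line = line.strip()
            if line.startswith(".method"):
                in_method = True
            elif line.startswith(".end method"):
                in_method = False
            elif in_method and line and not line.startswith((".", ":", "#")):
                # First token of an instruction line is the opcode mnemonic,
                # e.g. "invoke-virtual {v0}, ..." -> "invoke-virtual".
                opcodes.append(re.split(r"\s+", line, maxsplit=1)[0])
    return opcodes

# opcode_seq = extract_opcodes(disassemble("example.apk"))
```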
Figure 1: Work-flow of how an Android application is disassembled to produce an opcode sequence.

3.2 Network Architecture

3.2.1 Opcode Embedding Layer

Let X = {x_1 ... x_n} be a sequence of opcode instructions encoded as one-hot vectors, where x_n is the one-hot vector for the n'th opcode in the sequence. To form a one-hot vector we associate each opcode with a number in the range 1 to D. In the case of Dalvik, where there are currently 218 defined opcodes, D = 218 [2]. The one-hot vector x_n is a vector of zeros, of length D, with a '1' in the position corresponding with the n'th opcode's integer mapping. Any operands associated with the opcodes were discarded during disassembly and preprocessing, meaning malware classification is based only on patterns in the sequence of opcodes. Opcodes in X are projected into an embedding space by multiplying each one-hot vector by a weight matrix, W_E ∈ R^{D×k}, where k is the dimensionality of the embedding space, as follows

    p_i = x_i W_E    (1)

After projection of all opcodes in X, the program is represented by a matrix, P, of size n × k, where each row, p_i, corresponds to the representation of opcode x_i. The weights in W_E, and hence the representation for each opcode, are initialized randomly at first and then updated by back-propagation during training along with the rest of the network's parameters. The purpose of representing the program as a list of one-hot vectors and then projecting into an embedding space is that it allows the network to learn an appropriate representation for each opcode as a vector in a k-dimensional continuous vector space, R^k, where relationships between opcodes can be represented. The embedding space may encode semantic information; for example, during training the network may discover that certain opcodes have similar meanings or perform equivalent operations, and hence should be treated similarly by deeper network layers for classification purposes. This can be achieved by projecting those opcodes to nearby points in the embedding space, while very different opcodes will be projected to distant points. The number of dimensions used in the embedding space may influence the network's ability to perform such semantic mapping, hence using more dimensions may, up to a point, give the network greater flexibility in learning the expected highly non-linear mapping from sequences of opcodes to classification decisions.

3.2.2 Convolutional Layers

In our proposed network we use one or more convolutional layers, numbered from 1 to L, where l refers to the l'th convolutional layer. The first convolutional layer receives the n × k program embedding matrix P as input, while deeper convolutional layers receive the output of the previous convolutional layer as input. Each convolutional layer has m_l filters, which are of size s_1 × k in the first layer, and of size s_l × m_{l−1} in deeper layers. This means filters in the first layer can potentially detect sequences of up to s_1 opcodes. During the forward pass of an example through a convolutional layer, each of the m_l convolutional filters produces an activation map a_{l,m} of size n × 1, which can be stacked together to produce a matrix, A_l, of size n × m_l. Note that before applying the convolutional filters we zero-pad the start and end of the input by s_l/2 to ensure that the length of the output matrix from the convolutional layer is the same as the length of its input. The convolution of the first layer filters with the program embedding matrix P can be denoted as follows

    a_{l,m} = relu(Conv(P, W_{l,m}) + b_{l,m})    (2)

    A_l = [a_{l,1} | a_{l,2} | ... | a_{l,m_l}]    (3)

where W_{l,m} and b_{l,m} are the respective weight and bias parameters of the m'th convolutional filter of convolutional layer l, where Conv represents the mathematical operation of convolution of the filter with the input, and where the rectified linear activation function, relu(x) = max{0, x}, is used. In deeper layers the convolution operation is similar, however we replace the input matrix P in Eq. 2 by the output matrix from the previous convolutional layer, A_{l−1}. Given the output matrix A_L from the final convolutional layer, max-pooling [27] is then used over the program length dimension as follows

    f = [max(a_{L,1}) | max(a_{L,2}) | ... | max(a_{L,m_L})]    (4)

to give a vector f of length m_L, which contains the maximum activation of each convolutional filter over the program length. Using max-pooling over the length of the opcode sequence allows a program of arbitrary length to be represented by a fixed-length feature vector. Moreover, selecting the maximum activation of each convolutional filter using max-pooling also focuses the attention of the classification layer on the parts of the opcode sequence that are most relevant to the classification task.
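The embedding and convolutional stages above map directly onto standard deep-learning primitives. The following is a minimal sketch using the hyperparameter values reported later in Section 4 (k = 8, 64 filters of length 8); it is written in PyTorch for readability, whereas the authors' own implementation used the Torch scientific computing environment [4], so treat it as a re-expression of the idea rather than the original code.

```python
# Illustrative PyTorch sketch of Eqs. (1)-(4); not the authors' implementation.
import torch
import torch.nn as nn

D = 218   # number of Dalvik opcodes (vocabulary size)
k = 8     # embedding dimension
m1 = 64   # number of filters in the first convolutional layer
s1 = 8    # filter length, i.e. the longest opcode "n-gram" a filter can match

class OpcodeFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(D, k)                 # Eq. (1): rows of W_E
        self.conv = nn.Conv1d(k, m1, kernel_size=s1,
                              padding=s1 // 2)          # Eqs. (2)-(3), zero-padded
        self.relu = nn.ReLU()

    def forward(self, opcode_ids):                      # (batch, n) integer opcode ids
        P = self.embed(opcode_ids)                      # (batch, n, k)
        A = self.relu(self.conv(P.transpose(1, 2)))     # (batch, m1, ~n)
        f = A.max(dim=2).values                         # Eq. (4): max over program length
        return f                                        # (batch, m1), fixed length

# features = OpcodeFeatures()(torch.randint(0, D, (4, 1000)))  # -> shape (4, 64)
```

Because the max is taken over the length dimension, programs of any length produce a feature vector of the same size, which is what allows whole applications to be compared by the classifier.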
Figure 2: Malware Detection Network Architecture. (Diagram: opcode sequence → embedding layer → convolutional layers → max-pooling over the program length → fully-connected hidden layer → softmax classification.)

3.2.3 Classification Layers

Finally, the resulting vector f is passed to a multi-layer perceptron (MLP), which consists of a fully-connected hidden layer and a fully-connected output layer. The purpose of the MLP is to output the probability that the current example is malware. The use of the MLP with a hidden layer allows high-order relationships between the features extracted by the convolutional layer to be detected [6] and used for classification. We can write the hidden layer as follows

    z = relu(W_h f + b_h)    (5)

where W_h, b_h are the parameters of the fully-connected hidden layer, and where the rectified linear activation function is used. Finally, the output, z, from the MLP is passed to a soft-max classifier function, which gives the probability that program X is malware, denoted as follows

    p(y = i | z) = exp(w_i^T z + b_i) / Σ_{i'=1}^{I} exp(w_{i'}^T z + b_{i'})    (6)

where w_i and b_i denote the parameters of the classifier for class i ∈ I, and the label y indicates whether the current sample is either malware or benign. The softmax classifier outputs the normalized probability of the current sample belonging to each class. As malware classification is a two class problem (benign/malware), i.e. I = 2, z is a two element vector. Other applications, such as the problem of malware family classification, could be targeted by increasing the number of classes, I, to be equal to the number of malware families to be classified.

3.3 Learning Process

Given the above definitions, the cost function to be minimized during training for a batch of b training samples, {X^(1) ... X^(b)}, can be written as follows

    C = −(1/b) Σ_{j=1}^{b} Σ_{i=1}^{I} 1{y^(j) = i} log p(y^(j) = i | z^(j))    (7)

where z^(j) is the vector output after applying the neural network to training example X^(j), where y^(j) is the provided correct label for the example X^(j), and where 1{x} is an indicator function that is 1 if its argument x is true and is 0 otherwise. The cost is dependent on both the parameters of the neural network, Θ, i.e. the weights and biases across all layers (W_E, W_{l,m}, b_{l,m}, W_h, b_h, w_i, and b_i), and on the current training sample. The objective during training is to update the network's parameters, which are initialized randomly before training begins, to reduce the cost. This update is performed stochastically by computing the gradient of the cost function with respect to the parameters, ∂C/∂Θ, given the current batch of samples, and using this gradient to update the parameters after every batch to reduce the cost as follows

    Θ^(t+1) = Θ^(t) − α ∂C/∂Θ    (8)

where α is a small positive real number called the learning rate. During training the network is repeatedly presented with batches of training samples in randomized order until the parameters converge.

To deal with an imbalance in the number of training samples available for the malware and benign classes, the gradients used to update the network parameters are weighted depending on the label of the current training sample. This helps to reduce classifier bias towards predicting the more populous class. Let the number of malware samples in the training-set be M and the number of benign samples in the training-set be B. Assuming there are more samples of benign software than malware, the weight for malware samples is 1 − M/(M + B) and the weight for benign samples is M/(M + B), i.e. the gradients are weighted in inverse proportion to the number of samples for each class.
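As a concrete illustration of the classification layers and the class-weighted update, the sketch below (PyTorch, continuing the earlier OpcodeFeatures example) expresses the per-class weights as a weight vector passed to the cross-entropy loss, which is the usual library equivalent of weighting the per-sample gradients. Note that Eq. (8) shows plain gradient descent, while the experiments in Section 4 used RMSProp [3]; the sketch follows the experimental setup, and the dataset counts used for M and B are only an example.

```python
# Illustrative sketch of the classification head and class-weighted training step
# (Eqs. (5)-(8)); hyperparameters follow Section 4. Not the authors' code.
import torch
import torch.nn as nn

class MalwareClassifier(nn.Module):
    def __init__(self, num_filters=64, hidden=16, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_filters, hidden), nn.ReLU(),   # Eq. (5)
            nn.Linear(hidden, num_classes),              # logits for Eq. (6)
        )

    def forward(self, f):
        return self.mlp(f)   # softmax is folded into the cross-entropy loss below

# Per-class weights: benign gets M/(M+B), malware gets 1 - M/(M+B) = B/(M+B).
M, B = 2475, 3627                                      # e.g. the 'Large Dataset' counts
weights = torch.tensor([M / (M + B), B / (M + B)])     # index 0 = benign, 1 = malware
loss_fn = nn.CrossEntropyLoss(weight=weights)          # weighted form of Eq. (7)

features = OpcodeFeatures()                            # from the previous sketch
classifier = MalwareClassifier()
opt = torch.optim.RMSprop(
    list(features.parameters()) + list(classifier.parameters()), lr=1e-2)

def train_step(opcode_ids, labels):
    opt.zero_grad()
    loss = loss_fn(classifier(features(opcode_ids)), labels)
    loss.backward()                                    # dC/dTheta for this mini-batch
    opt.step()                                         # parameter update, cf. Eq. (8)
    return loss.item()
```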
Note that a consideration when designing our proposed architectures was to keep the number of parameters relatively low, in order to help prevent over-fitting given the relatively small number of training samples usually available. A typical deep network may have millions of parameters [20], while our malware detection network has only tens of thousands of parameters, which drastically reduces the need for large numbers of training samples.

4. RESULTS

In order to evaluate the performance of our approach a set of experiments was designed. The architecture used in all experiments had only a single convolutional layer. This architecture was used because the available datasets have a relatively small number of training samples, which means that networks with large numbers of parameters could be prone to over-fitting. Convolutional networks with only a single convolutional layer have been shown to perform well on natural language text classification tasks [27]. In this architecture, the remaining hyperparameters, such as the dimension of the embedding space and the length and number of convolutional filters, are set empirically using 10-fold cross validation on the validation-sets of the small and large datasets. The resulting values are an 8-dimensional embedding space, 64 convolutional filters of length 8, and 16 neurons in the hidden fully connected layer.

Our experiments were carried out on three different datasets. The first dataset consists of malware from the Android Malware Genome project [28] and has been widely used [10, 11]. This dataset has a total of 2123 applications, of which 863 are benign and 1260 are malware from 49 different malware families. Labels are provided for the malware family of each sample. The benign samples in this dataset were collected from the Google Play store and have been checked using VirusTotal to ascertain that they were highly probable to be malware free. We refer to this dataset as the 'Small Dataset'.

The second dataset was provided by McAfee Labs (now Intel Security) and comes from the vendor's internal repository of Android malware. After discarding empty files or files that are less than 8 opcodes long, the dataset contains 2475 malware samples and 3627 benign samples. This dataset does not include malware family labels and may include malware and/or benign applications present in the small dataset. Hence, to ensure training hygiene, i.e. to ensure we do not train on the testing-set, the network is trained and tested on each dataset separately without cross-contamination. We refer to this dataset as the 'Large Dataset'.

We also have an additional dataset provided by McAfee Labs containing approximately 18,000 Android programs, which was collected more recently than the first two datasets. This was used for testing the final system after setting the hyper-parameters using the smaller datasets. After discarding short files, the dataset contains 9268 benign files and 9902 malware files. We refer to this dataset as the 'V. Large Dataset'.

Each dataset was split into 90% for training and validation and the remaining 10% was held-out for testing. Care was taken to ensure that the ratio of positive to negative samples in the validation and testing sets was the same as in the dataset as a whole.

Results are reported using the mean of the classification accuracy, precision, recall and f-score. The key indicator of performance is f-score, because the number of samples in the malware and benign classes is not equal. In this situation, classification accuracy is too influenced by the number of samples in each class. For example, if the majority of samples were of class x, and the classifier simply reported x in all cases, the classification accuracy would be high, although the classifier would not be useful. However, given the same conditions, the f-score, which is based on the precision and recall, would be low.

Our neural network software was developed using the Torch scientific computing environment [4]. During training the network parameters were optimized using RMSProp [3] with a learning rate of 1e-2, for 10 epochs, using a mini-batch size of 16. The network weights were randomly initialized using the default Torch initialization. We used an Nvidia GTX 980 GPU for development of the network, and training the network to perform malware classification takes around 25 minutes on the large dataset (which contains approximately 6000 example programs). Once the network has been trained our implementation can classify approximately 3000 files per-second on the GPU.

4.1 Computational Efficiency

In this experiment we compare the computational efficiency of our proposed malware classification system with our implementation of a conventional n-gram based malware classification system [10]. Note that when reporting the results we do not include the time taken to disassemble the malware files as this is constant for both systems. The results in Table 2 are presented in terms of both the average time to reach a classification decision for a single malware file, and the corresponding average number of programs that can be classified per second.

It can be seen from Table 2 that our system can produce a much higher number of malware classification decisions per second than the n-gram based system. The n-gram based system also experiences exponential slow-down as the length of the n-gram features is increased. This severely limits the use of longer n-grams, which are necessary for improved classification accuracy. Our proposed system is not limited in the same way, and in fact, the features extracted by the first layer of the CNN can be thought of as n-grams where n = 8. Use of such features with a conventional n-gram based system would be much too computationally expensive. Our proposed neural network system is implemented on a desktop GPU, specifically an Nvidia GTX-980, however it could easily be moved to the GPU of a mobile device, allowing for fast and efficient malware classification of Android applications.

Finally, the memory usage required to execute the trained neural network is constant. Increasing the length or number of convolutional filters, or increasing the number of training examples, linearly increases memory usage. Whereas with n-gram based systems, increasing the training-set size dramatically increases the number of unique n-grams and hence the memory usage. For instance, with the small dataset there are 213 unique 1-grams, 1891 unique 2-grams, and 286471 unique 3-grams. This means our proposed neural network based system is also more efficient in terms of memory usage during training.
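The following short sketch illustrates the bookkeeping that an explicit n-gram approach requires and that our network avoids: the unique-n-gram counts quoted above (213, 1891 and 286471 for the small dataset) are the sizes such a set would reach for n = 1, 2, 3, and it keeps growing with n and with the training-set size. The corpus variable is a stand-in for the per-application opcode lists produced by the preprocessing step.

```python
# Illustrative sketch of why explicit n-gram features blow up: count the unique
# opcode n-grams that an n-gram model would have to enumerate and store.
def unique_ngrams(opcode_seqs, n):
    seen = set()
    for seq in opcode_seqs:                      # one opcode list per application
        for i in range(len(seq) - n + 1):
            seen.add(tuple(seq[i:i + n]))
    return len(seen)

# corpus = [extract_opcodes(...), ...]           # hypothetical preprocessed corpus
# for n in (1, 2, 3, 8):
#     print(n, unique_ngrams(corpus, n))         # feature-vector length grows rapidly
```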
Classification System | Feature Types | Benign | Malware | Acc. | Prec. | Recall | F-score
Ours (Small DS) | CNN applied to raw opcodes | 863 | 1260 | 0.98 | 0.99 | 0.95 | 0.97
Ours (Large DS) | CNN applied to raw opcodes | 3627 | 2475 | 0.80 | 0.72 | 0.85 | 0.78
Ours (V. Large DS) | CNN applied to raw opcodes | 9268 | 9902 | 0.87 | 0.87 | 0.85 | 0.86
n-grams (Small DS) | opcode n-grams (n=1) | 863 | 1260 | 0.95 | 0.95 | 0.95 | 0.95
n-grams (Small DS) | opcode n-grams (n=2) | 863 | 1260 | 0.98 | 0.98 | 0.98 | 0.98
n-grams (Small DS) | opcode n-grams (n=3) | 863 | 1260 | 0.98 | 0.98 | 0.98 | 0.98
n-grams (Large DS) | opcode n-grams (n=1) | 3627 | 2475 | 0.80 | 0.81 | 0.80 | 0.80
n-grams (Large DS) | opcode n-grams (n=2) | 3627 | 2475 | 0.81 | 0.83 | 0.82 | 0.82
n-grams (Large DS) | opcode n-grams (n=3) | 3627 | 2475 | 0.82 | 0.83 | 0.82 | 0.82
DroidDetective [13] | Perms. combination | 741 | 1260 | 0.96 | 0.89 | 0.96 | 0.92
Yerima [23] | API calls, Perms., intents, cmnds | 1000 | 1000 | 0.91 | 0.94 | 0.91 | 0.92
Jerome [10] | opcode n-grams | 1260 | 1246 | - | - | - | 0.98
Yerima [25]* | API calls, Perms., intents, cmnds | 2925 | 3938 | 0.97 | 0.98 | 0.97 | 0.97
Yerima (2) [24]* | API calls, Perms., intents, cmnds | 2925 | 3938 | 0.96 | 0.96 | 0.96 | 0.96

Table 1: Malware classification results for our system on both the small and large datasets compared with results from the literature. Results from the literature marked with a (*) use malware from the McAfee Labs dataset, i.e. our large dataset, while all others use malware sampled from the Android Malware Genome project [28] dataset, i.e. our small dataset.

System | Time per program (s) | Programs per second
Ours | 0.000329 | 3039.8
1-gram | 0.000569 | 1758.3
2-gram | 0.010711 | 93.4
3-gram | 0.172749 | 5.8

Table 2: Comparing the time taken to reach a classification decision and the number of programs that can be classified per second, for our proposed neural network system and a conventional n-gram based system.

4.2 Classification Accuracy

In this experiment, the network's performance is measured in terms of accuracy. The network was trained using the complete training and validation set, then tested on the held-out test-set that was not seen during hyper-parameter tuning. We compare the performance of our proposed system with our own implementation of an n-gram based malware detection method [10]. For both datasets we measured the performance of this system using 1, 2 and 3-gram features. The same training and testing samples were used for both systems in order to allow for direct comparison of their performance. The results for the small, large and v. large datasets are shown in Table 1. We have endeavored to select papers from the literature that use similar Android malware datasets to give as fair a comparison as possible.

In the small dataset our proposed method clearly achieves state-of-the-art performance, and is comparable to methods such as [10] and [23]. It achieves better performance than our baseline n-gram system with 1-gram features and near identical performance to the baseline with 2 and 3-gram features.

The large dataset is more challenging due to the greater variability of malware present. Our system achieves similar performance to the baseline n-gram system, while having far greater computational efficiency (see Section 4.1). Although other methods have achieved better performance on similar tests, they make use of additional outside information such as the application's requested permissions or API calls [25]. In contrast, our proposed method needs only the raw opcodes, which avoids the need for features manually designed by domain experts. Moreover, our proposed method has the advantage over existing methods of being very computationally efficient, as it is capable of classifying approximately 3000 files per-second.

The results on the v. large dataset, which was obtained from the same source as the large dataset and hence likely shares similar characteristics, show that our system's performance improves as more training data is provided. This phenomenon has been observed when training neural networks in other domains, where performance is highly correlated with the number of training samples. We expect that these results can be further improved given greater quantities of training data, which will also allow more complex network architectures to be explored. Unfortunately, comparisons with the baseline n-gram system on the v. large dataset were not possible due to the computational cost associated with the n-gram method.

4.3 Learning Curves

In this experiment we aim to understand the system's performance as a function of the quantity of training data, with the aim of predicting how its performance is likely to change if more training data were to be made available.

This experiment was performed on the V. Large dataset. As in previous experiments, the dataset is split into training and validation sets. Throughout the experiment the validation-set remains fixed. An artificially reduced size training-set is constructed by randomly sub-sampling from the complete set of training examples. The network is then trained from scratch on this reduced size training-set, and the system's performance is measured on both the training and validation sets. This process is repeated for several different sizes of training-set, ranging from a small number of examples up to the complete set of all training-examples. The system's performance on the validation-set and training-set are then plotted as a function of the training-set size. Performance is recorded in terms of 1 − f-score, meaning that perfect performance would produce a value of zero.
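The protocol just described can be summarised in a few lines. The sketch below is only a schematic of the procedure, with train_model and f_score standing in for the full training and evaluation pipeline, and the subset sizes chosen arbitrarily for illustration.

```python
# Illustrative sketch of the learning-curve protocol: train from scratch on
# progressively larger random subsets and record 1 - F-score on a fixed
# validation set.
import random

def learning_curve(train_set, val_set, sizes, seed=0):
    rng = random.Random(seed)
    curve = []
    for size in sizes:                          # e.g. (100, 1000, 10000, ...)
        subset = rng.sample(train_set, size)    # artificially reduced training-set
        model = train_model(subset)             # hypothetical: train from scratch
        curve.append((size,
                      1.0 - f_score(model, subset),    # training-set error
                      1.0 - f_score(model, val_set)))  # validation-set error
    return curve
```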
In figure 3, we can see that when only a small number of training-examples are provided, training-set performance is perfect, while validation-set performance is very poor. This is to be expected as with such a small number of training-examples the system will over-fit to the training-set and the learned parameters will not generalize to the unseen validation-set. However, as more training-examples are provided the validation-set error decreases, showing that the system has learned to generalize from the training-set. We can predict from the learning curves in figure 3 that if more training-examples were to be provided, the validation-set error would continue to decrease.

These results suggest that our system benefits from larger quantities of training-data, as expected with neural networks [20]. They also show that the poor performance on the 'Large Dataset', which was obtained from the same source as the 'V. Large Dataset' and hence shares similar characteristics, is caused by lack of data. This is indicated by the gap between the validation and testing-set errors when only approximately 6000 training examples are provided.

Figure 3: Learning curves for the Validation-set and Training-set as the number of training examples is varied. Note the log-scale on the x-axis. (Plot: error, 1 − F-score, against the number of training examples from roughly 10^1 to 10^4, with curves for validation-set and training-set error.)

4.4 Realistic Testing

In order to assess the potential of our proposed classification technique in realistic environments we apply our trained network to a completely new dataset. This allows us to demonstrate the real-world potential of our classification technique when applied to an unknown and realistic dataset at a bigger scale. The network used in this experiment was trained on the V. Large dataset, introduced in Section 4.

Our new dataset consists of 96,412 benign apps and 24,103 malware apps. The benign apps were randomly selected from the Google Play store, and were collected during July and August 2016. To represent a distinct set of malicious apps, we used another dataset containing known malware apps, including those from the Android Malware Genome project [28], but removing the ones overlapping with the training set of the network.

Approximately 1 TB of APKs were used in this experiment. The APKs were converted to opcode sequences using a cloud architecture consisting of 29 machines running in parallel, in a process which took around 11 hours. Classification of the opcode sequences was performed using an Nvidia GTX 1080 GPU, and took an hour to complete.

Note that for this experiment we assume that all APKs in the Google Play dataset are benign, and all the APKs in the malicious dataset are malicious. Of course, this may be a naive assumption, as it is possible for malicious apps to exist on Google Play.

Cross validation testing was performed on our new dataset. In each cross validation fold approximately 24,000 malware applications and 24,000 benign applications were used. Therefore, in order to present all applications to the network, four-fold cross validation was used. The results of this experiment are reported in Table 3.

Classification System | Acc. | Prec. | Recall | F-score
Ours | 0.69 | 0.67 | 0.74 | 0.71

Table 3: Malware classification results of our system tested on an independent dataset of benign and malware Android applications.

We can see from the results in Table 3 that although the f-score is lower than in previous experiments, our system has the potential to work in realistic environments. This is because our new testing dataset is much larger than the one used for training the network and contains greater variability of applications. The results of this experiment show that the network has learned features with the ability to generalise to realistic data. In future work we hope to take advantage of our new dataset to explore more complex network architectures that can be learned given more training data.

5. CONCLUSIONS

In this paper we have presented a novel Android malware detection system based on deep neural networks. This innovative application of deep learning to the field of malware analysis has shown good performance and potential in comparison with other state-of-the-art techniques, and has been validated on four different Android malware datasets. Our system is capable of simultaneously learning to perform feature extraction and malware classification given only the raw opcode sequences of a large number of labeled malware samples. The main advantages of our system are that it removes the need for hand-engineered malware features, it is much more computationally efficient than existing n-gram based malware classification systems, and it can be implemented to run on the GPU of mobile devices.

As future work, we would like to extend our methodology to both dynamic and static malware analysis on different platforms. Our proposed method is general enough that it could be applied to other types of malware analysis with only minor changes to the network architecture. For instance, the network could process sequences of instructions produced by dynamic analysis software. Similarly, by changing the disassembly preprocessing step the same network architecture could be applied to malware analysis on different platforms.

Another open problem for malware classification, which may allow networks with more parameters, and hence greater discriminative power, to be used, is data augmentation. Data augmentation is a way to artificially increase the size of the training-set, by slightly modifying existing training-examples. The transformations used in data augmentation are usually chosen to simulate variations that occur in real world data, but which may not be extensively covered by the available training-set. We would like to investigate the design of data-augmentation schemes appropriate to malware detection.

6. REFERENCES

[1] Baksmali. https://github.com/JesusFreke/smali. Accessed: 2015-02-15.
[2] Dalvik bytecode. https://source.android.com/devices/tech/dalvik/dalvik-bytecode.html. Accessed: 2015-02-01.
[3] RMSProp. www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Slide 29.
[4] Torch. http://torch.ch/.
[5] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck. Drebin: Effective and explainable detection of Android malware in your pocket. In NDSS, 2014.
[6] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[7] G. Canfora, F. Mercaldo, and C. A. Visaggio. Mobile malware detection using op-code frequency histograms. In Proc. of Int. Conf. on Security and Cryptography (SECRYPT), 2015.
[8] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu. Large-scale malware classification using random projections and neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE Int. Conf. on, pages 3422–3426, 2013.
[9] O. E. David and N. S. Netanyahu. DeepSign: Deep learning for automatic malware signature generation and classification. In Neural Networks (IJCNN), 2015 Int. Joint Conf. on, pages 1–8, 2015.
[10] Q. Jerome, K. Allix, R. State, and T. Engel. Using opcode-sequences to detect malicious Android applications. In Communications (ICC), 2014 IEEE Int. Conf. on, pages 914–919, 2014.
[11] B. Kang, B. Kang, J. Kim, and E. G. Im. Android malware classification method: Dalvik bytecode frequency analysis. In Proc. of the 2013 Research in Adaptive and Convergent Systems, pages 349–350, 2013.
[12] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[13] S. Liang and X. Du. Permission-combination-based scheme for Android mobile malware detection. In Communications (ICC), 2014 IEEE Int. Conf. on, pages 2301–2306, 2014.
[14] X. Liu and J. Liu. A two-layered permission-based Android malware detection scheme. In Mobile Cloud Computing, Services and Engineering (MobileCloud), 2014 2nd IEEE Int. Conf. on, pages 142–148, 2014.
[15] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas. Malware classification with recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE Int. Conf. on, pages 1916–1920, 2015.
[16] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, P. G. Bringas, and G. Álvarez. PUMA: Permission usage to detect malware in Android. In Int. Joint Conf. CISIS'12-ICEUTE'12-SOCO'12, pages 289–298, 2013.
[17] J. Saxe and K. Berlin. Deep neural network based malware detection using two dimensional binary program features. In 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pages 11–20, Oct 2015.
[18] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss. "Andromaly": a behavioral malware detection framework for Android devices. Journal of Intelligent Information Systems, 38(1):161–190, 2012.
[19] A. Sharma and S. K. Dash. Mining API calls and permissions for Android malware detection. In Cryptology and Network Security, pages 191–205, 2014.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[21] X. Su, M. C. Chuah, and G. Tan. Smartphone dual defense protection framework: Detecting malicious applications in Android markets. In Mobile Ad-hoc and Sensor Networks (MSN), 2012 Eighth Int. Conf. on, pages 153–160, 2012.
[22] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu. DroidMat: Android malware detection through manifest and API calls tracing. In Information Security (Asia JCIS), 2012 7th Asia Joint Conf. on, pages 62–69, 2012.
[23] S. Y. Yerima, S. Sezer, G. McWilliams, and I. Muttik. A new Android malware detection approach using Bayesian classification. In Advanced Information Networking and Applications (AINA), 2013 IEEE 27th Int. Conf. on, pages 121–128, 2013.
[24] S. Y. Yerima, S. Sezer, and I. Muttik. Android malware detection: An eigenspace analysis approach. In Science and Information Conference (SAI), 2015, pages 1236–1242, 2015.
[25] S. Y. Yerima, S. Sezer, and I. Muttik. High accuracy Android malware detection using ensemble learning. Information Security, IET, 9(6):313–320, 2015.
[26] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
[27] Y. Zhang and B. Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
[28] Y. Zhou and X. Jiang. Dissecting Android malware: Characterization and evolution. In Security and Privacy (SP), 2012 IEEE Symp. on, pages 95–109, 2012.
