McLaughlin, N., Martinez del Rincon, J., Kang, B., Yerima, S., Miller, P., Sezer, S., Safaeisemnani, Y., Trickel,
E., Zhao, Z., Doupé, A., & Joon Ahn, G. (2017). Deep Android Malware Detection. In Proceedings of the ACM
Conference on Data and Applications Security and Privacy (CODASPY) 2017. Association for Computing
Machinery. https://doi.org/10.1145/3029806.3029823
Published in:
Proceedings of the ACM Conference on Data and Applications Security and Privacy (CODASPY) 2017
Document Version:
Peer reviewed version
Publisher rights
© 2017 The Authors.
This work is made available online in accordance with the publisher’s policies. Please refer to any applicable terms of use of the publisher.
Niall McLaughlin, Jesus Martinez del Rincon, BooJoong Kang, Suleiman Yerima,
Paul Miller, Sakir Sezer
Centre for Secure Information Technologies (CSIT)
Queen's University Belfast, UK
Yeganeh Safaei, Erik Trickel, Ziming Zhao, Adam Doupe, Gail Joon Ahn
Center for Cybersecurity and Digital Forensics
Arizona State University, USA
[Figure: network architecture. An embedding layer ($W_E$) feeds convolutional layers (filters with parameters $w_{l,m}$, $b_{l,m}$ and activations $a_{l,1}$ to $a_{l,m}$), followed by a max pooling layer, a hidden fully connected layer ($W_h$, $b_h$) with output $z$, and softmax classification ($w_i$, $b_i$) giving the class $y$.]
sification. We can write the hidden layer as follows

$$ z = \mathrm{relu}(W_h f + b_h) \tag{5} $$

where $W_h$, $b_h$ are the parameters of the fully-connected hidden layer, and where the rectified linear activation function is used. Finally, the output, $z$, from the MLP is passed to a soft-max classifier function, which gives the probability that program $X$ is malware, denoted as follows

$$ p(y = i \mid z) = \frac{\exp(w_i^T z + b_i)}{\sum_{i'=1}^{I} \exp(w_{i'}^T z + b_{i'})} \tag{6} $$

where $w_i$ and $b_i$ denote the parameters of the classifier for class $i \in I$, and the label $y$ indicates whether the current sample is either malware or benign. The softmax classifier outputs the normalized probability of the current sample belonging to each class. As malware classification is a two-class problem (benign/malware), $I = 2$ and $z$ is a two-element vector. Other applications, such as the problem of malware family classification, could be targeted by increasing the number of classes, $I$, to be equal to the number of malware families to be classified.

3.3 Learning Process

Given the above definitions, the cost function to be minimized during training for a batch of $b$ training samples, $\{X^{(1)} \ldots X^{(b)}\}$, can be written as follows

$$ C = -\frac{1}{b} \sum_{j=1}^{b} \sum_{i=1}^{I} 1\{y^{(j)} = i\} \log p(y^{(j)} = i \mid z^{(j)}) \tag{7} $$

where $z^{(j)}$ is the vector output after applying the neural network to training example $X^{(j)}$, where $y^{(j)}$ is the provided correct label for the example $X^{(j)}$, and where $1\{x\}$ is an indicator function that is 1 if its argument $x$ is true and is 0 otherwise. The cost is dependent on both the parameters of the neural network, $\Theta$, i.e. the weights and biases across all layers ($W_E$, $w_{l,m}$, $b_{l,m}$, $W_h$, $b_h$, $w_i$, and $b_i$), and on the current training sample. The objective during training is to update the network's parameters, which are initialized randomly before training begins, to reduce the cost. This update is performed stochastically by computing the gradient of the cost function with respect to the parameters, $\frac{\partial C}{\partial \Theta}$, given the current batch of samples, and using this gradient to update the parameters after every batch to reduce the cost as follows

$$ \Theta^{(t+1)} = \Theta^{(t)} - \alpha \frac{\partial C}{\partial \Theta} \tag{8} $$

where $\alpha$ is a small positive real number called the learning rate. During training the network is repeatedly presented with batches of training samples in randomized order until the parameters converge.

To deal with an imbalance in the number of training samples available for the malware and benign classes, the gradients used to update the network parameters are weighted depending on the label of the current training sample. This helps to reduce classifier bias towards predicting the more populous class. Let the number of malware samples in the training-set be $M$ and the number of benign samples in the training-set be $B$. Assuming there are more samples of benign software than malware, the weight for malware samples is $1 - M/(M + B)$ and the weight for benign samples is $M/(M + B)$, i.e. the gradients are weighted in inverse proportion to the number of samples for each class.
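The hidden layer, softmax classifier, cost, parameter update and class weighting described in this section can be sketched in a few lines of plain Python. This is a minimal illustration and not the Torch implementation used for the experiments: the helper names (`affine`, `weighted_cost`, `sgd_step` and so on), the toy layer sizes and the toy weights are all assumptions made for the example. The class weight is applied to each sample's cost term, which scales that sample's gradient by the same factor; this is assumed to be equivalent to weighting the gradients directly. The Large Dataset totals stand in for the training-set counts $M$ and $B$.

```python
import math

BENIGN, MALWARE = 0, 1  # hypothetical label encoding for the two classes

def affine(W, x, b):
    """Return W x + b for a weight matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def relu(v):
    """Rectified linear activation used by the hidden layer (Eq. 5)."""
    return [max(0.0, x) for x in v]

def softmax(scores):
    """Normalized class probabilities (Eq. 6); the max is subtracted for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_weights(num_malware, num_benign):
    """Weights from the text: malware 1 - M/(M+B), benign M/(M+B)."""
    frac = num_malware / (num_malware + num_benign)
    return {MALWARE: 1.0 - frac, BENIGN: frac}

def weighted_cost(probs, labels, weights):
    """Eq. 7 with each sample's term scaled by its class weight (an assumption)."""
    terms = [weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return -sum(terms) / len(labels)

def sgd_step(theta, grad, alpha):
    """Eq. 8: move every parameter against its gradient, scaled by alpha."""
    return [t - alpha * g for t, g in zip(theta, grad)]

# Toy forward pass: 3 pooled features -> 2 hidden units -> 2 classes (made-up sizes).
f = [0.5, -1.0, 2.0]
z = relu(affine([[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]], f, [0.0, 0.1]))
p = softmax(affine([[1.0, -1.0], [-1.0, 1.0]], z, [0.0, 0.0]))

# Class weights using the Large Dataset totals as stand-ins for M and B.
w = class_weights(num_malware=2475, num_benign=3627)
cost = weighted_cost([p], [MALWARE], w)
```

With these counts the malware weight is larger than the benign weight, so errors on the rarer class contribute more to each update.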
Note that a consideration when designing our proposed architectures was to keep the number of parameters relatively low, in order to help prevent over-fitting given the relatively small number of training samples usually available. A typical deep network may have millions of parameters [20], while our malware detection network has only tens of thousands of parameters, which drastically reduces the need for large numbers of training samples.

4. RESULTS

In order to evaluate the performance of our approach a set of experiments was designed. The architecture used in all experiments had only a single convolutional layer. This architecture was used because the available datasets have a relatively small number of training samples, which means that networks with large numbers of parameters could be prone to over-fitting. Convolutional networks with only a single convolutional layer have been shown to perform well on natural language text classification tasks [27]. In this architecture, the remaining hyperparameters, such as the dimension of the embedding space and the length and the number of convolutional filters, are set empirically using 10-fold cross validation on the validation-set of the small and large dataset. The resulting values are an 8-dimensional embedding space, 64 convolutional filters of length 8, and 16 neurons in the hidden fully connected layer.

Our experiments were carried out on three different datasets. The first dataset consists of malware from the Android Malware Genome project [28] and has been widely used [10, 11]. This dataset has a total of 2123 applications, of which 863 are benign and 1260 are malware from 49 different malware families. Labels are provided for the malware family of each sample. The benign samples in this dataset were collected from the Google Play store and have been checked using VirusTotal to ascertain that they were highly probable to be malware free. We refer to this dataset as the 'Small Dataset'.

The second dataset was provided by McAfee Labs (now Intel Security) and comes from the vendor's internal repository of Android malware. After discarding empty files or files that are less than 8 opcodes long, the dataset contains 2475 malware samples and 3627 benign samples. This dataset does not include malware family labels and may include malware and/or benign applications present in the small dataset. Hence, to ensure training hygiene, i.e. to ensure we do not train on the testing-set, the network is trained and tested on each dataset separately without cross-contamination. We refer to this dataset as the 'Large Dataset'.

We also have an additional dataset provided by McAfee Labs containing approximately 18,000 Android programs, which was collected more recently than the first two datasets. This was used for testing the final system after setting the hyper-parameters using the smaller datasets. After discarding short files, the dataset contains 9268 benign files and 9902 malware files. We refer to this dataset as the 'V. Large Dataset'.

Each dataset was split into 90% for training and validation and the remaining 10% was held-out for testing. Care was taken to ensure that the ratio of positive to negative samples in the validation and testing sets was the same as in the dataset as a whole.

Results are reported using the mean of the classification accuracy, precision, recall and f-score. The key indicator of performance is f-score, because the number of samples in the malware and benign classes is not equal. In this situation, classification accuracy is too influenced by the number of samples in each class. For example, if the majority of samples were of class x, and the classifier simply reported x in all cases, the classification accuracy would be high, although the classifier would not be useful. However, given the same conditions, the f-score, which is based on the precision and recall, would be low.

Our neural network software was developed using the Torch scientific computing environment [4]. During training the network parameters were optimized using RMSProp [3] with a learning rate of 1e-2, for 10 epochs, using a mini-batch size of 16. The network weights were randomly initialized using the default Torch initialization. We used an Nvidia GTX 980 GPU for development of the network, and training the network to perform malware classification takes around 25 minutes on the large dataset (which contains approximately 6000 example programs). Once the network has been trained, our implementation can classify approximately 3000 files per second on the GPU.

4.1 Computational Efficiency

In this experiment we compare the computational efficiency of our proposed malware classification system with our implementation of a conventional n-gram based malware classification system [10]. Note that when reporting the results we do not include the time taken to disassemble the malware files as this is constant for both systems. The results in Table 2 are presented in terms of both the average time to reach a classification decision for a single malware file, and the corresponding average number of programs that can be classified per second.

It can be seen from Table 2 that our system can produce a much higher number of malware classification decisions per second than the n-gram based system. The n-gram based system also experiences exponential slow-down as the length of the n-gram features is increased. This severely limits the use of longer n-grams, which are necessary for improved classification accuracy. Our proposed system is not limited in the same way, and in fact, the features extracted by the first layer of the CNN can be thought of as n-grams where n = 8. Use of such features with a conventional n-gram based system would be much too computationally expensive. Our proposed neural network system is implemented on a desktop GPU, specifically an Nvidia GTX 980; however, it could easily be moved to the GPU of a mobile device, allowing for fast and efficient malware classification of Android applications.

Finally, the memory usage required to execute the trained neural network is constant. Increasing the length or number of convolutional filters, or increasing the number of training examples, linearly increases memory usage. In contrast, with n-gram based systems, increasing the training-set size dramatically increases the number of unique n-grams and hence memory usage. For instance, with the small dataset there are 213 unique 1-grams, 1891 unique 2-grams, and 286471 unique 3-grams. This means our proposed neural network based system is also more efficient in terms of memory usage during training.

4.2 Classification Accuracy
| Classification System | Feature Types | Benign | Malware | Acc. | Prec. | Recall | F-score |
|---|---|---|---|---|---|---|---|
| Ours (Small DS) | CNN applied to raw opcodes | 863 | 1260 | 0.98 | 0.99 | 0.95 | 0.97 |
| Ours (Large DS) | CNN applied to raw opcodes | 3627 | 2475 | 0.80 | 0.72 | 0.85 | 0.78 |
| Ours (V. Large DS) | CNN applied to raw opcodes | 9268 | 9902 | 0.87 | 0.87 | 0.85 | 0.86 |
| n-grams (Small DS) | opcode n-grams (n=1) | 863 | 1260 | 0.95 | 0.95 | 0.95 | 0.95 |
| n-grams (Small DS) | opcode n-grams (n=2) | 863 | 1260 | 0.98 | 0.98 | 0.98 | 0.98 |
| n-grams (Small DS) | opcode n-grams (n=3) | 863 | 1260 | 0.98 | 0.98 | 0.98 | 0.98 |
| n-grams (Large DS) | opcode n-grams (n=1) | 3627 | 2475 | 0.80 | 0.81 | 0.80 | 0.80 |
| n-grams (Large DS) | opcode n-grams (n=2) | 3627 | 2475 | 0.81 | 0.83 | 0.82 | 0.82 |
| n-grams (Large DS) | opcode n-grams (n=3) | 3627 | 2475 | 0.82 | 0.83 | 0.82 | 0.82 |
| DroidDetective [13] | Perms. combination | 741 | 1260 | 0.96 | 0.89 | 0.96 | 0.92 |
| Yerima [23] | API calls, Perms., intents, cmnds. | 1000 | 1000 | 0.91 | 0.94 | 0.91 | 0.92 |
| Jerome [10] | opcode n-grams | 1260 | 1246 | - | - | - | 0.98 |
| Yerima [25]* | API calls, Perms., intents, cmnds. | 2925 | 3938 | 0.97 | 0.98 | 0.97 | 0.97 |
| Yerima (2) [24]* | API calls, Perms., intents, cmnds. | 2925 | 3938 | 0.96 | 0.96 | 0.96 | 0.96 |

Table 1: Malware classification results for our system on both the small and large datasets compared with results from the literature. Results from the literature marked with a (*) use malware from the McAfee Labs dataset, i.e. our large dataset, while all others use malware sampled from the Android Malware Genome project [28] dataset, i.e. our small dataset.
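The F-score column of Table 1 can be related to the precision and recall columns with a short calculation. The sketch below assumes the F-score is the standard harmonic mean of precision and recall (F1), which the text does not state explicitly, and recomputes it from the rounded values in the three "Ours" rows. It also illustrates, with a made-up 90/10 class mix, why a classifier that always reports the majority class gets high accuracy but a low F-score when the score is averaged over both classes (the averaging scheme is also an assumption). The function names are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-score) for the three "Ours" rows of Table 1.
rows = {
    "Small DS": (0.99, 0.95, 0.97),
    "Large DS": (0.72, 0.85, 0.78),
    "V. Large DS": (0.87, 0.85, 0.86),
}
recomputed = {name: round(f_score(p, r), 2) for name, (p, r, _) in rows.items()}

def per_class_f(labels, preds, cls):
    """F-score for one class, treating that class as the positive label."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_score(precision, recall)

# Made-up imbalanced test set and a classifier that always answers "benign".
labels = ["benign"] * 90 + ["malware"] * 10
preds = ["benign"] * len(labels)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
mean_f = (per_class_f(labels, preds, "benign")
          + per_class_f(labels, preds, "malware")) / 2
```

Under these assumptions the recomputed values agree with the reported F-scores for the three rows to two decimal places, while the always-benign classifier scores well on accuracy and poorly on the class-averaged F-score.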
| System | Time per program (s) | Programs per second |
|---|---|---|
| Ours | 0.000329 | 3039.8 |
| 1-gram | 0.000569 | 1758.3 |
| 2-gram | 0.010711 | 93.4 |
| 3-gram | 0.172749 | 5.8 |

Table 2: Comparing the time taken to reach a classification decision and number of programs that can be classified per second, for our proposed neural network system and a conventional n-gram based system.

In this experiment, the network's performance is measured in terms of accuracy. The network was trained using the complete training and validation set, then tested on the held-out test-set that was not seen during hyper-parameter tuning. We compare the performance of our proposed system with our own implementation of an n-gram based malware detection method [10]. For both datasets we measured the performance of this system using 1, 2 and 3-gram features. The same training and testing samples were used for both systems in order to allow for direct comparison of their performance. The results for the small, large and v. large datasets are shown in Table 1. We have endeavored to select papers from the literature that use similar Android malware datasets to give as fair a comparison as possible.

In the small dataset our proposed method clearly achieves state-of-the-art performance, and is comparable to methods such as [10] and [23]. It achieves better performance than our baseline n-gram system with 1-gram features and near identical performance to the baseline with 2 and 3-gram features.

The large dataset is more challenging due to the greater variability of malware present. Our system achieves similar performance to the baseline n-gram system, while having far greater computational efficiency (see Section 4.1). Although other methods have achieved better performance on similar tests, they make use of additional outside information such as the application's requested permissions or API calls [25]. In contrast, our proposed method needs only the raw opcodes, which avoids the need for features manually designed by domain experts. Moreover, our proposed method has the advantage over existing methods of being very computationally efficient, as it is capable of classifying approximately 3000 files per second.

The results on the v. large dataset, which was obtained from the same source as the large dataset and hence likely shares similar characteristics, show that our system's performance improves as more training data is provided. This phenomenon has been observed when training neural networks in other domains, where performance is highly correlated with the number of training samples. We expect that these results can be further improved given greater quantities of training data, which will also allow more complex network architectures to be explored. Unfortunately, comparisons with the baseline n-gram system on the v. large dataset were not possible due to the computational cost associated with the n-gram method.

4.3 Learning Curves

In this experiment we aim to understand the system's performance as a function of the quantity of training data, with the aim of predicting how its performance is likely to change if more training data were to be made available.

This experiment was performed on the V. Large dataset. As in previous experiments, the dataset is split into training and validation sets. Throughout the experiment the validation-set remains fixed. An artificially reduced size training-set is constructed by randomly sub-sampling from the complete set of training examples. The network is then trained from scratch on this reduced size training-set, and the system's performance is measured on both the training and validation sets. This process is repeated for several different sizes of training-set, ranging from a small number of examples up to the complete set of all training-examples. The system's performance on the validation-set and training-set is then plotted as a function of the training-set size. Performance is recorded in terms of 1 − f-score, meaning that
perfect performance would produce a value of zero.

In Figure 3, we can see that when only a small number of training-examples are provided, training-set performance is perfect, while validation-set performance is very poor. This is to be expected as with such a small number of training-examples the system will over-fit to the training-set and the learned parameters will not generalize to the unseen validation-set. However, as more training-examples are provided the validation-set error decreases, showing that the system has learned to generalize from the training-set. We can predict from the learning curves in Figure 3 that if more training-examples were to be provided, the validation-set error would continue to decrease.

These results suggest that our system benefits from larger quantities of training-data, as expected with neural networks [20]. They also show that the poor performance on the 'Large Dataset', which was obtained from the same source as the 'V. Large Dataset' and hence shares similar characteristics, is caused by lack of data. This is indicated by the gap between the validation and testing-set errors when only approximately 6000 training examples are provided.

[Figure 3 plot: Error (1 - F-score) on the y-axis against Number of Training Examples on the x-axis (log scale, 10^1 to 10^4), with one curve for Validation-set error and one for Training-set error.]

Figure 3: Learning curves for the Validation-set and Training-set as the number of training examples is varied. Note the log-scale on the x-axis.

4.4 Realistic Testing

In order to assess the potential of our proposed classification technique in realistic environments we apply our trained network to a completely new dataset. This allows us to demonstrate the real-world potential of our classification technique when applied to an unknown and realistic dataset at a bigger scale. The network used in this experiment was trained on the V. Large dataset, introduced in Section 4.

Our new dataset consists of 96,412 benign apps and 24,103 malware apps. The benign apps were randomly selected from the Google Play store, and were collected during July and August 2016. To represent a distinct set of malicious apps, we used another dataset containing known malware apps, including those from the Android Malware Genome project [28], but removing the ones overlapping with the training set of the network.

Approximately 1 TB of APKs were used in this experiment. The APKs were converted to opcode sequences using a cloud architecture consisting of 29 machines running in parallel, in a process which took around 11 hours. Classification of the opcode sequences was performed using an Nvidia GTX 1080 GPU, and took an hour to complete.

Note that for this experiment we assume that all APKs in the Google Play dataset are benign, and all the APKs in the malicious dataset are malicious. Of course, this may be a naive assumption, as it is possible for malicious apps to exist on Google Play.

Cross validation testing was performed on our new dataset. In each cross validation fold approximately 24,000 malware applications and 24,000 benign applications were used. Therefore, in order to present all applications to the network, four-fold cross validation was used. The results of this experiment are reported in Table 3.

| Classification System | Acc. | Prec. | Recall | F-score |
|---|---|---|---|---|
| Ours | 0.69 | 0.67 | 0.74 | 0.71 |

Table 3: Malware classification results of our system tested on an independent dataset of benign and malware Android applications.

We can see from the results in Table 3 that although the f-score is lower than previous experiments, our system has the potential to work in realistic environments. This is because our new testing dataset is much larger than the one used

5. CONCLUSIONS

In this paper we have presented a novel Android malware detection system based on deep neural networks. This innovative application of deep learning to the field of malware analysis has shown good performance and potential in comparison with other state-of-the-art techniques, and has been validated on four different Android malware datasets. Our system is capable of simultaneously learning to perform feature extraction and malware classification given only the raw opcode sequences of a large number of labeled malware samples. The main advantages of our system are that it removes the need for hand-engineered malware features, it is much more computationally efficient than existing n-gram based malware classification systems, and it can be implemented to run on the GPU of mobile devices.

As future work, we would like to extend our methodology to both dynamic and static malware analysis on different platforms. Our proposed method is general enough that it could be applied to other types of malware analysis with only minor changes to the network architecture. For instance, the network could process sequences of instructions produced by dynamic analysis software. Similarly, by changing the disassembly preprocessing step the same network architecture could be applied to malware analysis on different platforms.

Another open problem for malware classification, which may allow networks with more parameters, and hence greater discriminative power, to be used, is data augmentation. Data augmentation is a way to artificially increase the size of the training-set, by slightly modifying existing training-examples.
The transformations used in data augmentation are usually chosen to simulate variations that occur in real world data, but which may not be extensively covered by the available training-set. We would like to investigate the design of data-augmentation schemes appropriate to malware detection.

6. REFERENCES

[1] Baksmali. https://github.com/JesusFreke/smali. Accessed: 2015-02-15.
[2] Dalvik bytecode. https://source.android.com/devices/tech/dalvik/dalvik-bytecode.html. Accessed: 2015-02-01.
[3] RMSProp. www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Slide 29.
[4] Torch. http://torch.ch/.
[5] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck. Drebin: Effective and explainable detection of android malware in your pocket. In NDSS, 2014.
[6] C. M. Bishop. Neural networks for pattern recognition. Oxford university press, 1995.
[7] G. Canfora, F. Mercaldo, and C. A. Visaggio. Mobile malware detection using op-code frequency histograms. In Proc. of Int. Conf. on Security and Cryptography (SECRYPT), 2015.
[8] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu. Large-scale malware classification using random projections and neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE Int. Conf. on, pages 3422–3426, 2013.
[9] O. E. David and N. S. Netanyahu. Deepsign: Deep learning for automatic malware signature generation and classification. In Neural Networks (IJCNN), 2015 Int. Joint Conf. on, pages 1–8, 2015.
[10] Q. Jerome, K. Allix, R. State, and T. Engel. Using opcode-sequences to detect malicious android applications. In Communications (ICC), 2014 IEEE Int. Conf. on, pages 914–919, 2014.
[11] B. Kang, B. Kang, J. Kim, and E. G. Im. Android malware classification method: Dalvik bytecode frequency analysis. In Proc. of the 2013 Research in Adaptive and Convergent Systems, pages 349–350, 2013.
[12] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[13] S. Liang and X. Du. Permission-combination-based scheme for android mobile malware detection. In Communications (ICC), 2014 IEEE Int. Conf. on, pages 2301–2306, 2014.
[14] X. Liu and J. Liu. A two-layered permission-based android malware detection scheme. In Mobile Cloud Computing, Services and Engineering (MobileCloud), 2014 2nd IEEE Int. Conf. on, pages 142–148, 2014.
[15] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas. Malware classification with recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE Int. Conf. on, pages 1916–1920, 2015.
[16] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, P. G. Bringas, and G. Álvarez. Puma: Permission usage to detect malware in android. In Int. Joint Conf. CISIS'12-ICEUTE'12-SOCO'12, pages 289–298, 2013.
[17] J. Saxe and K. Berlin. Deep neural network based malware detection using two dimensional binary program features. In 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pages 11–20, Oct 2015.
[18] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss. "Andromaly": a behavioral malware detection framework for android devices. Journal of Intelligent Information Systems, 38(1):161–190, 2012.
[19] A. Sharma and S. K. Dash. Mining api calls and permissions for android malware detection. In Cryptology and Network Security, pages 191–205. 2014.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[21] X. Su, M. C. Chuah, and G. Tan. Smartphone dual defense protection framework: Detecting malicious applications in android markets. In Mobile Ad-hoc and Sensor Networks (MSN), 2012 Eighth Int. Conf. on, pages 153–160, 2012.
[22] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu. Droidmat: Android malware detection through manifest and api calls tracing. In Information Security (Asia JCIS), 2012 7th Asia Joint Conf. on, pages 62–69, 2012.
[23] S. Y. Yerima, S. Sezer, G. McWilliams, and I. Muttik. A new android malware detection approach using bayesian classification. In Advanced Information Networking and Applications (AINA), 2013 IEEE 27th Int. Conf. on, pages 121–128, 2013.
[24] S. Y. Yerima, S. Sezer, and I. Muttik. Android malware detection: An eigenspace analysis approach. In Science and Information Conference (SAI), 2015, pages 1236–1242, 2015.
[25] S. Y. Yerima, S. Sezer, and I. Muttik. High accuracy android malware detection using ensemble learning. Information Security, IET, 9(6):313–320, 2015.
[26] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
[27] Y. Zhang and B. Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
[28] Y. Zhou and X. Jiang. Dissecting android malware: Characterization and evolution. In Security and Privacy (SP), 2012 IEEE Symp. on, pages 95–109, 2012.