RNN For Malware Detection
1, March 2019
ISSN 2074-8523
Abstract
1 Introduction
In this century, the development of network technology has changed people's
lives, allowing them to access information from around the world more easily and
effectively than in previous years. At the same time, however, a great number of
cyber threats such as malware are also actively evolving. Malicious software,
commonly known as malware, is a continuing problem and has become a major threat
to computer users, businesses, corporations and even governments. The amount of
malware increases year by year, and malware is becoming more complex and
sophisticated. It is harmful and can cause unwanted loss or privacy invasion, as
it compromises the confidentiality, integrity and availability of private data
without the user's permission. Consequently, with the rise of malware-related
cyber incidents, many researchers have studied various techniques to detect it.
Machine learning techniques such as Support Vector Machines (SVM), k-Nearest
Neighbor (k-NN), Naïve Bayes (NB), Random Forests (RF) and Neural Networks (NN)
are prominently explored for malware classification [1]–[4]. Several researchers
combine multiple machine learning classifiers, claiming that a hybrid of
multiple classifiers [5] performs better than a single classifier. Andrew H.
Sung combined SVM with an Artificial Neural Network (ANN) and reduced the number
of features for better performance, but the result was indistinguishable from
the result obtained with the original features [6]. Similarly, Maheshkumar
Subhnani et al., who combined a Multilayer Perceptron (MLP), k-Means and a
decision tree [7] to detect malware, showed that detection and false alarm rates
improved. Advanced machine learning has also been applied to malware detection,
since it is convenient for extracting more information from datasets. Even
though unsupervised machine learning such as clustering seems to be preferred by
researchers for understanding malware [8][9], supervised machine learning is
also preferred when it comes to correct labeling [10]. Dahl et al. [11]
highlighted the use of supervised machine learning to classify labeled malware
samples by combining random projection and NN techniques.
In the past few years, there has been noticeable work involving Deep Neural
Networks (DNN) in classifying malware [1], [11]–[13]. Saxe et al. [1] utilized a
feed-forward NN for static analysis. However, as the focus was on static
analysis and dealing with binaries of executable files, satisfactory input for
the classification was not achieved. Motivated by Pascanu et al. [12], who
learnt malware through a language model, Athiwaratkun et al. [14] and Kolosnjaji
et al. [13] expanded the research by using a Recurrent Neural Network (RNN) to
enhance malware sequence classification. In addition, a combination of RNN with
MLP was applied in [12] to learn malware and benign files through a language
model and form a feature representation. In that work, the MLP served as the
output classifier while the RNN worked as the feature extractor. The use of
temporal max pooling helped to produce the best result in processing long
sequences of temporal features but
Mudzfirah Abdul Halim et al.
2 Background
In this section, we briefly explain the fundamentals of neural network models.
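[Figure: artificial neuron with inputs x_1 ... x_n, weights w_1 ... w_n, sum and transfer functions, and output]
[Figure: recurrent neural network unfolded over time, with inputs x_t, hidden layer states s_t, outputs y_t and weight matrices U, W and V]
[Figure: LSTM memory cell with gates f, i, g and o, sigmoid (σ) and tanh activations]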
equation (4), $w_{xf}$ represents the weight that maps input $x$ to the forget
gate $f$. The term $w_{sf} s_{t-1}$ maps the hidden state value from the previous
cell to the forget gate, while $w_{cf} c_{t-1}$ maps the value of the previous
memory cell.
These gates play a big role in information control. Equation (8) shows the
procedure for updating the cell state: the previous cell value is forgotten
through gate $f$ and the current cell value is input through gates $i$ and $g$.
$$c_t = f * c_{t-1} + i * g \quad (8)$$
The new cell value is passed through an activation function and gated by $o$,
producing the new output, which is known as the hidden state value. This process
can be formulated as in equation (9):
$$s_t = o * \tanh(c_t) \quad (9)$$
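Equations (8) and (9) can be illustrated with a minimal Python sketch. It assumes the gate activations f, i, g and o have already been computed and are supplied as plain lists; the function name and the numeric values are hypothetical and are not taken from the paper.

```python
import math

def lstm_state_update(f, i, g, o, c_prev):
    """Apply equations (8) and (9) element-wise to lists of equal length."""
    # Equation (8): keep part of the old cell value, add part of the candidate.
    c_t = [fk * ck + ik * gk for fk, ik, gk, ck in zip(f, i, g, c_prev)]
    # Equation (9): the output gate scales the squashed cell value.
    s_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return c_t, s_t

# Hypothetical gate activations for a two-unit cell.
f = [1.0, 0.0]        # unit 0 keeps its memory, unit 1 forgets it
i = [0.0, 1.0]        # unit 0 ignores the candidate, unit 1 writes it
g = [0.5, -0.5]       # candidate values
o = [1.0, 1.0]        # output gate fully open
c_prev = [2.0, 2.0]   # previous cell state

c_t, s_t = lstm_state_update(f, i, g, o, c_prev)
print(c_t, s_t)
```

In this toy case the first unit carries its previous cell value forward unchanged, while the second unit discards it and stores the candidate value instead.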
[Figure: ReLU function (input versus output)]
The Drebin dataset does not provide numerical values for either malware or
benign applications, so pre-processing was needed. To make the samples usable
for training our model, we first detected the features using a common tokenizer,
which splits text into individual words. The first word of every line in each
sample was chosen and listed. Based on this feature list, it was found that a
few samples had two further sets of features: service providers and service
receivers. The service provider feature declares the service providers of an
application, while the service receiver feature declares its service receivers.
A service is a component that runs in the background until it stops itself and
does not interact with the user directly. Malware is likely to request these two
features more frequently than benign applications do. Thus, these two were
chosen as additional features for evaluating the malware classification model.
For feature extraction, the Bag of Words (BOW) model was applied. In this BOW
model, a specific list of words was retrieved and unimportant words were removed
from the document. It helps document retrieval by matching the chosen words in
the list and counting their frequency. All gathered features were stored in a
.txt file for further analysis. Fig. 5 presents pseudocode for the BOW model
that transforms word sequences into numerical sequences. The model read every
line in the documents and ignored characters and words that were not in the
feature dictionary. Words that matched the list of features were retrieved and
the frequency of their occurrence was counted.
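The counting step described above can be sketched in Python as follows. This is an illustrative sketch rather than the pseudocode of Fig. 5: the feature names in FEATURES, the tokenizing regular expression and the sample lines are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical feature dictionary; the exact names used in the paper are not listed.
FEATURES = ["permission", "intent", "api_call", "service_provider", "service_receiver"]

def bow_vector(lines, features=FEATURES):
    """Count how often each dictionary feature appears as the first word of a line."""
    counts = Counter()
    for line in lines:
        # Split the line into words made of letters and underscores.
        tokens = re.findall(r"[A-Za-z_]+", line)
        # Only the first word is considered; words outside the dictionary are ignored.
        if tokens and tokens[0] in features:
            counts[tokens[0]] += 1
    return [counts[name] for name in features]

# Hypothetical lines from one application sample.
sample = [
    "permission::android.permission.INTERNET",
    "permission::android.permission.SEND_SMS",
    "service_receiver::com.example.BootReceiver",
    "unknown_tag::ignored",
]
print(bow_vector(sample))
```

Each application is thus reduced to a fixed-length vector of counts, one entry per dictionary feature, which is the numerical form the classifiers require.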
In addition, because the numbers of malware and benign samples were uneven, new
balanced training and test datasets were generated. 2779 random samples were
chosen from the benign set and 2779 from the malware set. All of these samples
were then combined into a new dataset.
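A minimal sketch of this balancing step is shown below. The function name, the fixed seed and the final shuffle are assumptions added for the example; the labels follow the convention used in this paper (0 for benign, 1 for malware).

```python
import random

def balanced_dataset(benign, malware, n_per_class, seed=0):
    """Draw n_per_class random samples from each class and label them 0 or 1."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    chosen = [(sample, 0) for sample in rng.sample(benign, n_per_class)]
    chosen += [(sample, 1) for sample in rng.sample(malware, n_per_class)]
    rng.shuffle(chosen)  # assumption: mix the two classes before splitting
    return chosen

# Hypothetical sample identifiers; the smaller class limits n_per_class.
benign = [f"benign_{k}" for k in range(10)]
malware = [f"malware_{k}" for k in range(4)]
dataset = balanced_dataset(benign, malware, n_per_class=4)
print(len(dataset), sum(label for _, label in dataset))
```

With the numbers reported here, the call would use n_per_class=2779, giving 5558 samples in total.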
55 Recurrent Convolutional Network
3.2 CNN-LSTM
Fig. 6 shows an overview of the CNN-LSTM detection model, in which a CNN is
stacked on top of an LSTM. The idea behind this combined model is that feature
filtering is done first, before temporal modelling is performed. The dataset
passes through the CNN in the first layer for feature reduction, and the result,
with a smaller feature dimension, is then fed into the LSTM layer for learning
data sequences. Lastly, the output features from the LSTM are classified by an
MLP classifier.
[Fig. 6: CNN-LSTM detection model. Feature representation (F1 to F10) → CNN layer (convolutional layer, max pooling) → Flatten() → LSTM layer → MLP layer with Dense(1) → Output]
3.3 LSTM-CNN
Fig. 7 shows an overview of the LSTM-CNN detection model. The LSTM is placed as
the first layer to learn temporal data from the BOW representation. The
architecture can be seen as a deep architecture through time steps, with LSTM
memory cells producing output sequences. This model learns end-to-end features
from malware
feature sequences, where they are extracted layer by layer. Temporal modeling is
performed first using the LSTM. One LSTM layer with memory cells is used to
remember all feature inputs. The LSTM output is then passed to the CNN to reduce
the feature variation. The CNN uses a one-dimensional convolution layer with 5x1
feature filters shared across the space. Max pooling of size 4x1 is then
performed on the convolved output.
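The effect of a 5x1 filter followed by 4x1 max pooling on a one-dimensional sequence can be sketched in plain Python. The stride of 1, the absence of padding, the non-overlapping pooling windows, the input length of 90 and the filter weights are all assumptions made for illustration; they are not stated in the text.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(signal, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

# Stand-in for the outputs of a 90-unit LSTM layer (hypothetical values).
lstm_output = [float(i % 7) for i in range(90)]
kernel = [0.2] * 5  # hypothetical 5x1 filter; real weights are learned

convolved = conv1d(lstm_output, kernel)
pooled = max_pool1d(convolved, 4)
print(len(convolved), len(pooled))
```

Under these assumptions a sequence of length 90 shrinks to 86 values after convolution and to 21 after pooling, which is the kind of feature reduction the CNN stage provides.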
[Fig. 7: LSTM-CNN detection model. Feature representation (F1 to F10) → LSTM layer → CNN layer (convolutional layer, max-pooling layer) → Flatten() → MLP layer with Dense(1) → Output]
datasets: one from the MLP and another from the feature dataset for validation.
The evaluator uses the dataset from the MLP classifier, which contains the
output produced by the learning algorithm, and validates it against the feature
dataset to determine the maliciousness of each sample. The decision is then
compared with the provided label to check the accuracy of the model.
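The comparison of decisions with labels can be sketched as below. The decision threshold of 0.5 on the classifier output, the function name and the example scores are assumptions made for illustration.

```python
def accuracy(scores, labels, threshold=0.5):
    """Turn classifier scores into 0/1 decisions and compare them with labels."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical classifier outputs and true labels (0 = benign, 1 = malware).
scores = [0.91, 0.12, 0.67, 0.40, 0.08]
labels = [1, 0, 1, 1, 0]
print(accuracy(scores, labels))
```

Here four of the five decisions match their labels; the fourth sample is a malware sample scored below the threshold and is therefore missed.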
3.4 Training
Before the final model was designed, the malware detector was optimized on a
computer running Windows 7 with an Intel(R) Core(TM) i3-2350M 2.30GHz processor
and 6GB RAM. The models share almost the same optimization parameters.
Parameters affect the performance of a model. Tunable parameter values include
batch size, optimizer, learning rate, number of epochs, number and size of
layers, and activation function. Among all these parameters, however, the
researchers found that the learning rate (lr) of the optimizer and the number of
neurons have the greatest impact. Therefore, the parameter experiment was run by
varying the values of lr, the number of neurons and the number of epochs. The
Adam optimizer was used because it computes individual adaptive learning rates
for different parameters: it maintains a learning rate for each parameter and
adapts each one separately as learning unfolds. As a result, the best values of
lr, number of neurons and number of epochs for LSTM are 0.01, 90 and 60
respectively.
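The per-parameter adaptation performed by Adam can be illustrated with a minimal sketch of a single update step. The lr of 0.01 matches the value reported above; beta1, beta2 and eps are the conventional Adam defaults and are assumed here, as are the function name and the gradient values.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place and return the updated parameters."""
    for k in range(len(params)):
        # Running averages of the gradient and of its square, kept per parameter.
        m[k] = beta1 * m[k] + (1 - beta1) * grads[k]
        v[k] = beta2 * v[k] + (1 - beta2) * grads[k] ** 2
        # Bias correction for the zero-initialized averages.
        m_hat = m[k] / (1 - beta1 ** t)
        v_hat = v[k] / (1 - beta2 ** t)
        # The step is scaled individually by each parameter's gradient history.
        params[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

params = [1.0, 1.0]
grads = [10.0, 0.1]            # hypothetical gradients of very different size
m, v = [0.0, 0.0], [0.0, 0.0]
adam_step(params, grads, m, v, t=1)
print(params)
```

Although the two gradients differ by a factor of 100, both parameters move by roughly lr on the first step, which shows how the effective step size is adapted separately for each parameter.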
4 Experiment
This section presents the experimental results and evaluation on the Drebin
dataset. Before the experiment was conducted, the dataset was transformed into
vectors using the BOW model. Each vector contained 10 features, with label 0 for
benign and 1 for malware. For the hidden layers, LSTM and CNN architectures were
applied, and the sigmoid function of the MLP was applied at the output layer.
To evaluate the performance of the proposed model, the focus was first set on
the optimized parameters. The optimized number of neurons differed between
algorithms. The number of neurons for MLP was set in the range of 40 to 60. For
LSTM, the number of neurons was set to 90, since the accuracy of LSTM improved
as the number of neurons grew; the memory structure in LSTM learns better when
the number of neurons increases. CNN, on the other hand, needed only 50 to
achieve its best performance. In terms of epochs, LSTM required a smaller
number than MLP and CNN, namely 60. In comparison, MLP and CNN needed more
epochs than LSTM to achieve their best detection accuracy: 180 for MLP and 130
for CNN. The three algorithms responded differently to the number of epochs:
LSTM improved almost steadily, while MLP and CNN behaved in the opposite way.
The only optimized parameter shared by all algorithms was the learning rate; all
algorithms preferred lr=0.01. These optimized parameters were then used in the
stacked CNN-LSTM and LSTM-CNN models. Both shared the same parameters but
differed in layer order: the CNN-LSTM model stacked CNN on top of the LSTM layer
to handle the spatial problem of the dataset, while LSTM-CNN placed LSTM in the
first layer to deal with temporal data. The optimized values used for these
combined models were lr=0.01, 90 LSTM neurons, 40 CNN filters and 110 epochs.
For better evaluation, the performance of this research's model was compared
with other NN models, with each algorithm set to its optimized parameters so
that detection performance could be observed and compared fairly. The detection
results are tabulated in Table 3 and illustrated in Fig. 8.
Table 3 Comparison of detection accuracy of NN models

Neural Network Model    Accuracy (%) ± std
MLP                     94.73 ± 0.64
CNN                     87.91 ± 2.57
LSTM                    95.90 ± 0.34
CNN-LSTM                96.76 ± 1.24
LSTM-CNN                98.53 ± 0.24
[Fig. 8: Detection accuracy (%) of MLP, CNN, LSTM, CNN-LSTM and LSTM-CNN; vertical axis from 80 to 100]
Among the common NN models, LSTM showed the best accuracy while CNN showed the
lowest. These results arise from the different architectures of the algorithms.
The basic architecture of both LSTM and CNN is the MLP, but they are constrained
with special structures. Besides, both CNN and LSTM were invented to exploit
temporal invariance in detection. However, in the context of temporal sequences,
LSTM outperformed MLP and CNN by accurately detecting 95.90% of malware, while
MLP and CNN detected 94.73% and 87.91% respectively. The main difference between
CNN and both MLP and LSTM is the layer of
convolution and pooling, which coalesces input data using a learned function.
This means that, in CNN-LSTM, only the input features selected by the CNN were
passed as new input to the LSTM classifier. The LSTM used its memory cells to
process and connect all information to produce an output. However, this was not
as powerful as detection by the LSTM-CNN model, which improved malware detection
further. The CNN-LSTM classifier achieved higher accuracy than the plain CNN,
LSTM and MLP. This is because the CNN at the first layer selects the best
features before passing them to the LSTM, which learns from the extracted
features. The main difference between the CNN-LSTM and LSTM-CNN models was which
algorithm came first. The detection rate was 98.53% for LSTM-CNN and 96.76% for
CNN-LSTM, both higher than that of the Drebin detection scheme itself, which was
94%. In LSTM-CNN, the LSTM layer was the first to receive input; it processes
and stores information not only from the current input but also from previous
inputs. The new output produced by the LSTM was then fed into the CNN to be
convolved, where the features of the input were extracted.
A t-test is used to measure whether the proposed model differs significantly
from the others. The p-value (probability value) for accuracy is calculated
under the assumption that the null hypothesis is true. The null hypothesis in
this work is that there is no difference between the proposed detection models
and the existing detection models. If the p-value is less than 0.05, the null
hypothesis is rejected, meaning that there is a significant difference between
the proposed models and the existing models. The LSTM-CNN detection model
significantly outperforms the standard MLP (p < 0.0001), CNN (p < 0.0001) and
LSTM (p < 0.0001). The results support this study's motivation to combine neural
network algorithms in order to handle spatial and temporal problems better.
However, the order in which the layers are stacked played a huge role in
detection performance. When the LSTM layer was stacked before the CNN,
information accumulated from all inputs was generated before being sent to the
CNN for feature extraction, which gave better accuracy. When the CNN came first,
the sequence information from the inputs was lost and the LSTM was not fully
utilized.
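As a rough illustration of such a test, the sketch below computes a t statistic from the means and standard deviations in Table 3. The text does not say which t-test variant was used or how many runs were averaged, so Welch's unequal-variance form and a sample size of 10 per model (taken from the 10 sub-datasets mentioned in the conclusion) are assumptions.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    var_a = std_a ** 2 / n_a  # squared standard error of mean A
    var_b = std_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

N_RUNS = 10  # assumption: one accuracy value per sub-dataset

# LSTM-CNN versus LSTM, using the accuracy and std values from Table 3.
t, df = welch_t(98.53, 0.24, N_RUNS, 95.90, 0.34, N_RUNS)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A t statistic of this size with roughly 16 degrees of freedom corresponds to a very small p-value, in line with the reported p < 0.0001. Converting t to an exact p-value needs the t-distribution, which is not in the Python standard library; a statistics package would be required for that step.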
5 Conclusion
This paper reported a study of two malware detection models that combine LSTM
with CNN, and evaluated the models on the Drebin dataset. Two additional
features were extracted from the Drebin dataset, and a new numerical vector
dataset was generated using the BOW model, which destroyed the spatial
information of the data. Previous work showed that CNN is well known for its
feature extraction structure and that LSTM is highly capable of temporal
modeling. Hence, this study combined LSTM with CNN and MLP to see whether LSTM
can help in dealing with the spatiotemporal data provided by the BOW
representation.
LSTM and CNN were combined as the feature classifier, with MLP as the output
classifier. The effect of changing parameter values on detection performance was
then observed; 10 new sub-datasets were generated for testing and evaluation. It
was found that LSTM-CNN outperformed MLP, CNN, LSTM and CNN-LSTM in detecting
malware. This indicates that the hybrid NN scheme was more accurate and well
suited to sequential data. Extending the combination with LSTM and observing its
effect on malware detection will be a good subject for further research. In
future work, more complex modifications of the LSTM, using a larger number of
samples and more parameter tuning, can also be explored. With these
modifications, it is hoped that the LSTM-CNN model will detect malware more
accurately.
ACKNOWLEDGEMENTS
The authors would like to thank the Universiti Kebangsaan Malaysia (UKM) and
Ministry of Higher Education (MoHe) for their support in making this project
possible. This work was supported by the Fundamental Research Grant (FRGS) with
grant number FRGS/1/2016/ICT02/UKM/02/05.
References
[1] J. Saxe & K. Berlin. (2015). Deep neural network based malware detection
using two dimensional binary program features. In 10th International
Conference on Malicious and Unwanted Software (MALWARE) (pp. 11–20).
[2] W. Huang & J. W. Stokes. (2016). MtNet: A multi-task neural network for
dynamic malware classification. In Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics) (vol. 9721, pp. 399–418).
[3] W. Wang, M. Zhu, X. Zeng, X. Ye, & Y. Sheng. (2017). Malware traffic
classification using convolutional neural network for representation
learning. In 2017 International Conference on Information Networking
(ICOIN) (pp. 712–717).
[4] Y. Liao & V. R. Vemuri. (2002). Use of k-nearest neighbor classifier for
intrusion detection. Comput. Secur. (vol. 21, no. 5, pp. 439–448).
[5] Y. Deng & Y. Zhong. (2013). Keystroke Dynamics User Authentication
Based on Gaussian Mixture Model and Deep Belief Nets. ISRN Signal
Process. (vol. 2013, pp. 1–7).
[6] A. H. Sung & S. Mukkamala. (2003). Identifying Important Features for
Intrusion Detection Using Support Vector Machines and Neural Networks.