Malware Detection With LSTM Using Opcode Language
Renjie Lu,
University of Chinese Academy of Sciences
Beijing, China
Email: [email protected]
arXiv:1906.04593v1 [cs.CR] 10 Jun 2019
Corresponding author: [email protected]

Abstract: With the rapid growth of the Internet and the software industry, more and more malware variants are designed to perform various malicious activities. Traditional signature-based detection methods cannot detect variants of malware. In addition, most behavior-based methods require a secure and isolated environment to perform malware detection, and this environment is vulnerable to contamination. In this paper, by analogy with natural language processing, we propose a novel and efficient approach to static malware analysis that automatically learns the opcode sequence patterns of malware. We propose modeling malware as a language and assess the feasibility of this approach. First, we use the disassembly tool IDA Pro to obtain the opcode sequence of each malware sample. Then a word embedding technique is used to learn the feature vector representation of each opcode. Finally, we propose a two-stage LSTM model for malware detection, which uses two LSTM layers and one mean-pooling layer to obtain the feature representations of the opcode sequences. We perform experiments on a dataset that includes 969 malware and 123 benign files. For malware detection and malware classification, the evaluation results show that our proposed method achieves an average AUC of 0.99 and an average AUC of 0.987 in the best case, respectively.

Index Terms: Malware detection and classification, Static analysis, Opcode language, Long short-term memory

I. INTRODUCTION

Malicious software, referred to as malware, is designed to perform various malicious activities, such as stealing private information, gaining root authority, and disabling targeted hosts. With the rapid growth of the Internet and the software industry, more and more malware variants are emerging almost everywhere. According to a 2018 McAfee threats report [1], the total number of malware samples has grown almost 34% over the past quarters to more than 774 million samples. The number of malware samples continues to increase, so malware detection remains an attractive and meaningful problem.

A large body of research has been published on how to detect malware. Malware detection can be treated simply as a binary classification problem. Traditional anti-virus software usually relies on a static signature-based detection method [2], which has a significant limitation: minor changes in malware can change the signature, so malware can easily evade signature-based detection through encryption, obfuscation, or packing. Zero-day malware can also evade this detection approach. Dynamic analysis is not susceptible to code obfuscation techniques [3], so it is a more effective malware detection method. Dynamic behavior-based malware detection methods [4] [5] usually need a secure and controlled environment, such as a virtual machine, simulator, or sandbox. The behavior analysis is then performed using information about the interaction with the environment, such as API calls and DLL calls. Although these techniques have been widely studied, they have also been shown to be insufficiently efficient when applied to large datasets [6]. Dynamic behavior-based malware detection methods are quite time-consuming and require considerable care to protect the operating environment from contamination.

At present, a number of malware detection methods combined with machine learning techniques have been proposed. Reference [7] first proposed a malware detection method using data mining techniques, which uses three different types of static features: PE header, string sequence, and byte sequence. Kolter and Maloof [8] proposed using n-grams instead of byte sequences and compared the performance of naive Bayes, decision trees, and support vector machines for malware detection. Later, artificial neural networks [9] [10] were also used for malware detection. There are also some novel ideas for malware detection: both [11] and [12] use image processing techniques to detect malware. These previous works have achieved good detection performance. However, most of these methods manually extract the malware features that are used to train a machine learning classifier.

To reduce the cost of manual feature engineering, in this paper we propose a novel and efficient method to detect whether a Windows executable file is malware. First, we use the disassembly tool IDA Pro to obtain the assembly format file of every executable file. Next, we develop an algorithm to extract the opcode sequence from each assembly format file. Then, as in natural language processing (NLP), word embedding technology [13] is used to learn the feature vector representation of each opcode, and long short-term memory (LSTM) [14] is used to automatically learn the opcode sequence patterns of malware. To increase the invariance of the local feature representation, we also introduce a mean-pooling layer after the second LSTM layer. To verify the effectiveness of our proposed method, we run a series of experiments on a dataset that includes 969 malware and 123 benign files. In the experimental section, we evaluate the effect of the second LSTM layer on malware
detection performance, and we also make a detailed performance comparison with other related work. For malware detection and malware classification, the evaluation results show that our proposed method achieves an average AUC of 0.99 and an average AUC of 0.987, respectively.

In summary, we make the following contributions in this paper:
• We present a novel and efficient malware detection approach, which uses word embedding technology and LSTM to automatically learn the opcode sequence patterns of malware. It can greatly reduce the cost of manual feature engineering.
• We propose and implement a two-stage LSTM model for malware detection, which uses two LSTM layers and one mean-pooling layer to automatically obtain a comprehensive feature representation of malware.
• We run a series of evaluation experiments covering malware detection and malware classification. The experimental results demonstrate the effectiveness of our proposed method.

The rest of this paper is organized as follows. Related work on neural networks is discussed in Section II. Section III describes the proposed malware detection framework. Experiments and evaluation are presented in Section IV. Section V concludes the paper and discusses future work.

II. RELATED WORK

A. Word Embedding

The Recurrent Neural Network Based Language Model (RNNLM) [15] is a language model built on a recurrent neural network, which can predict the next word from the previous input. Later, Mikolov [13] proposed the CBOW and Skip-gram language models in "Efficient Estimation of Word Representations in Vector Space". The CBOW model predicts the current word from the given context. In contrast, the Skip-gram model predicts the context from the given current word. Each word is then converted to a feature vector that stores the semantic information of the word, and the correlation between words can also be calculated using these feature vectors.

B. Recurrent Neural Network

A neural network (NN) is a mathematical model that consists of many layers of neurons. The Recurrent Neural Network (RNN) is a typical NN structure with a special memory unit that can retain the state information of the previous hidden layer. RNNs show good results in various fields that use sequential data, such as natural language processing and speech recognition. A large amount of research using RNNs for malware detection has been published. EI-Bakry [16] proposed that a time delay neural network could be used for malware classification, but the paper did not carry out any experiments to validate the idea. Pascanu et al. [17] proposed a malware detection method using an RNN. However, [17] uses API calls as the original feature for malware detection. Tobiyama et al. [18] proposed a malware detection method with a deep neural network using process behavior. However, after the vanishing and exploding gradient problem was discovered, RNNs became unpopular until the LSTM was proposed. LSTM [14] is a special type of RNN that can greatly mitigate the vanishing and exploding gradient problem.

III. MALWARE DETECTION METHODOLOGY

In this section, we introduce the proposed malware detection approach, which is similar to natural language processing, in detail. As shown in Figure 1, the malware detection methodology can be divided into a data processing stage and a modeling stage. In the data processing stage, we first use the disassembly tool IDA Pro, which converts an executable file into an Intel x86 assembly format file. Next, we develop an algorithm to extract the opcode sequence from each assembly format file. In the modeling stage, word embedding technology is used to learn the correlation between opcodes and to obtain the feature vector representation of each opcode. Then, a two-stage LSTM model is used to learn the opcode sequence patterns of each sample and to generate the predictive model. Finally, we use the predictive model to perform malware detection on the testing set in order to evaluate its performance.

A. Data Processing

To obtain the features of an opcode sequence, we need to extract the opcode sequence from each assembly format file. Typically, this type of assembly format file contains four basic predefined segments: the .text segment, the .idata segment, the .rdata segment, and the .data segment. Since only the .text segment stores program instructions and the remaining segments are data segments, we consider only the contents of the .text segment when extracting the opcode sequence. We also find some meaningless opcodes, such as 'dd', 'db', and 'align'. To obtain opcodes that are actually useful for malware detection, we filter out these meaningless opcodes. The resulting opcode sequence reflects the program execution logic of the corresponding executable file. The pseudocode of this opcode extraction algorithm is shown in Algorithm 1.

Algorithm 1 Extract Opcode Sequence
Input: Each assembly format file
Output: Corresponding opcode sequence
pattern ← predefined matching pattern for extracting opcode
filter ← {'align', 'dd', 'db', ...}
file ← open(assembly format file)
for eachline in file do
    if eachline starts with '.text' then
        result ← match(pattern, eachline)
        if result is not null and result not in filter then
            add result to corresponding opcode sequence
        end if
    end if
end for
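Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the paper does not give its matching pattern, so the regular expression below and the assumed line layout (segment name, colon, hexadecimal address, whitespace, mnemonic) are hypothetical, as are the names `OPCODE_PATTERN`, `PSEUDO_OPCODES`, and `extract_opcodes`.

```python
import re

# Hypothetical pattern (assumption): an IDA-style listing line such as
# ".text:00401000                 push    ebp"
OPCODE_PATTERN = re.compile(r"^\.text:[0-9A-Fa-f]+\s+([a-z][a-z0-9]*)(?=\s|$)")

# Pseudo-opcodes named in the paper as meaningless; the paper's list is open-ended.
PSEUDO_OPCODES = {"align", "dd", "db"}


def extract_opcodes(lines):
    """Return the opcode sequence found in the .text lines of a listing."""
    opcodes = []
    for line in lines:
        # Only the .text segment stores program instructions.
        if not line.startswith(".text"):
            continue
        match = OPCODE_PATTERN.match(line)
        if match and match.group(1) not in PSEUDO_OPCODES:
            opcodes.append(match.group(1))
    return opcodes


# Made-up sample lines for illustration only.
sample = [
    ".text:00401000                 push    ebp",
    ".text:00401001                 mov     ebp, esp",
    ".text:00401003                 align 10h",
    ".data:00403000                 dd 0",
    ".text:00401010                 retn",
]
print(extract_opcodes(sample))  # ['push', 'mov', 'retn']
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.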
[Figure 1: malware and benign executables → IDA Pro → assembly format file → Algorithm 1 → opcode sequence file → word embedding (Skip-gram/CBOW) → two-stage LSTM → detection model, applied to the testing set to label each file as malware or benign.]
Fig. 1. The overview of the malware detection methodology. The data processing stage includes the conversion of each executable file to an .asm file and the extraction of the opcode sequence. The modeling stage consists of the word embedding and the generation of the malware detection model.
B. Opcode Representations in Vector Space

Common word representations in NLP are the one-hot representation, bag-of-words, and n-grams. However, the largest defect of these local representations is that any two words are isolated from each other, so the representations cannot reflect the semantic correlation between words. We use a word embedding technique to automatically learn the feature vector representation of each opcode. As shown in Figure 2(a), the Skip-gram model tries to predict the context of an input opcode according to the word window size. In contrast, the CBOW model predicts the current word from the given context, as shown in Figure 2(b). If the word window size is set to i, then i is the maximum distance between the current opcode and a predicted opcode.

[Figure 2: two panels, each with INPUT, PROJECTION and OUTPUT layers over the opcodes Op(t-i), ..., Op(t), ..., Op(t+i).]
Fig. 2. Two word embedding techniques: Skip-gram and CBOW.

Because the model cannot directly process an opcode in string form, each opcode is first converted to a one-hot representation. We therefore compute frequency statistics over all opcode sequence files and filter out low-frequency opcodes to build an opcode vocabulary. The resulting vocabulary contains 391 distinct opcodes. Accordingly, the one-hot representation of an opcode is a 391-dimensional vector that contains only one non-zero element, such as [0, 0, 0, 1, 0, ..., 0], and each opcode gets a unique one-hot representation.

Then we use the Gensim Python library [19] for word embedding to obtain the feature vector representation of each opcode. After multiple experimental evaluations, we set the dimension of the feature vector to 100. In this paper, we use the CBOW model to implement our proposed malware detection method. In the experimental evaluation, we discuss in detail the impact of different word window sizes and different word embedding techniques (Skip-gram and CBOW) on malware detection accuracy. Next, we use a two-stage LSTM model to learn a comprehensive feature representation of the entire opcode sequence.

C. Feature Representation by LSTM

1) Long short-term memory: As a typical and improved recurrent neural network, long short-term memory (LSTM) [14] is suitable for processing and predicting time series. The LSTM model introduces a new structure called a memory cell, which is composed of three main elements (an input gate, a forget gate, and an output gate) that control the transmission of information, as shown in Figure 3. Because of this special structure, LSTM can alleviate the vanishing and exploding gradient problem.

[Figure 3: a memory cell with its input and output.]
Fig. 3. Illustration of a standard memory cell.

Let $x_t$ denote the input to the memory cell at time $t$; let $W_i$, $W_f$, $W_c$, $W_o$, $U_i$, $U_f$, $U_c$, $U_o$ and $V_o$ be weight matrices; let $b_i$, $b_f$, $b_c$ and $b_o$ be bias vectors. Formally, the equations below describe how a memory cell is updated at time $t$.

• First, we calculate the value of the forget gate $f_t$, the value of the input gate $i_t$, and the candidate state of the memory cell $\tilde{C}_t$.

\[ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{1} \]
\[ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{2} \]
\[ \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{3} \]

• Second, given the value of the input gate, the value of the forget gate, and the candidate state $\tilde{C}_t$, we can calculate the new state of the memory cell, $C_t$, at time $t$.

\[ C_t = i_t * \tilde{C}_t + f_t * C_{t-1} \tag{4} \]

• Third, with the new state of the memory cell, we can calculate the value of the output gate and the output of the memory cell.
\[ o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o) \tag{5} \]
\[ h_t = o_t * \tanh(C_t) \tag{6} \]

In the above formulas, $\sigma$ is the logistic sigmoid function, so the values of the gating vectors $i_t$, $f_t$ and $o_t$ lie in [0, 1]; $\tanh$ is the hyperbolic tangent function and $*$ is the pointwise multiplication operation.
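The update in Eqs. (1)-(6) can be traced step by step with a small standard-library sketch. It is only an illustration under stated assumptions: the parameter dictionary keys (`Wf`, `Uf`, `bf`, ..., `Vo`), the helper names `affine`, `lstm_step` and `zero_params`, and the all-zero demonstration weights are hypothetical and are not taken from the paper, which does not describe its implementation at this level.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b, V=None, c=None):
    """Compute W x + U h + b (plus V c when V is given) on plain lists."""
    out = []
    for r in range(len(b)):
        s = b[r]
        s += sum(W[r][k] * x[k] for k in range(len(x)))
        s += sum(U[r][k] * h[k] for k in range(len(h)))
        if V is not None:
            s += sum(V[r][k] * c[k] for k in range(len(c)))
        out.append(s)
    return out


def lstm_step(x_t, h_prev, c_prev, p):
    """One memory-cell update; comments give the matching equation numbers."""
    f_t = [sigmoid(z) for z in affine(p["Wf"], x_t, p["Uf"], h_prev, p["bf"])]        # (1)
    i_t = [sigmoid(z) for z in affine(p["Wi"], x_t, p["Ui"], h_prev, p["bi"])]        # (2)
    c_tilde = [math.tanh(z) for z in affine(p["Wc"], x_t, p["Uc"], h_prev, p["bc"])]  # (3)
    c_t = [i * g + f * c for i, g, f, c in zip(i_t, c_tilde, f_t, c_prev)]            # (4)
    o_t = [sigmoid(z) for z in
           affine(p["Wo"], x_t, p["Uo"], h_prev, p["bo"], p["Vo"], c_t)]              # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                                # (6)
    return h_t, c_t


def zero_params(n_in, n_hid):
    """All-zero parameters (demonstration only, not trained weights)."""
    def mat(rows, cols):
        return [[0.0] * cols for _ in range(rows)]
    p = {}
    for gate in "fico":
        p["W" + gate] = mat(n_hid, n_in)
        p["U" + gate] = mat(n_hid, n_hid)
        p["b" + gate] = [0.0] * n_hid
    p["Vo"] = mat(n_hid, n_hid)
    return p


params = zero_params(n_in=2, n_hid=2)
h_t, c_t = lstm_step([0.3, -0.7], [0.0, 0.0], [1.0, -1.0], params)
print(c_t)  # [0.5, -0.5]
```

With zero weights every gate evaluates to sigmoid(0) = 0.5 and the candidate state is tanh(0) = 0, so Eq. (4) simply halves the previous cell state, which makes the data flow easy to check by hand.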
2) Two-stage LSTM: We compute detailed statistics on the frequency of each opcode that appears in the dataset. The statistical results are shown in Figure 4 and Figure 5. Figure 4 presents the average number of opcodes for each type of sample and the 10 most used opcodes in the dataset. Figure 5 presents the 10 most frequent opcodes for each type of sample.

[Figure 4: bar chart; x-axis: Family (Worm, Adware, Backdoor, Trojan, Download, Benign); y-axis: Average of Opcodes; inset bar chart over Opcode for the top 10 opcodes.]
Fig. 4. The average of opcodes for each type of samples and Top 10 opcodes in the dataset.

[Figure 5: per-family bar charts of the most frequent opcodes; recoverable panel titles: Worm, Adware, Download, Benign.]

[Figure 6: NLP side: word, sequence, article, linked by positional and context relationships; malware detection side: instruction, function, assembly instruction file, linked by positional and mutual call relationships.]
Fig. 6. An equivalent analogy between NLP and malware detection.

[...] article. Similarly, the positional relationship between instructions constitutes an assembly function, and the mutual calls between functions form an assembly instruction file. Although instructions consist of opcodes and operands, in this paper we use only the opcodes, which represent specific operational behaviors, in place of full instructions. The experimental results also show that this analogy is indeed feasible.

Therefore, we propose a two-stage LSTM model for malware detection, which can handle opcode sequences of different lengths. Figure 7 shows the structure of the two-stage LSTM. We use two LSTM layers and one mean-pooling layer to obtain the feature representation of each opcode sequence file. In this paper, every opcode is represented as a 100-dimensional vector. The input of the first LSTM layer is all the word embeddings in the opcode sequence, and its output is the function vector representation. All the function vectors are then the input of the second LSTM layer.

In addition, we add a mean-pooling layer after the second LSTM layer, which can enhance the invariance of feature [...]
[Figure 7: structure of the two-stage LSTM; recoverable labels: feature vector representation, file representation.]
Fig. 7. The model consists of two LSTM hidden layers, a mean-pooling layer and a softmax layer.
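The pooling and output stages named in the Fig. 7 caption can be illustrated with a short standard-library sketch. It assumes, hypothetically, that the pooled vector is mapped to class scores by a single linear layer with weights `W` and biases `b` before the softmax of Eq. (7); the paper does not spell out this mapping, and the function names and numbers below are made up for illustration.

```python
import math


def mean_pool(hidden_states):
    """Average the hidden vectors over all time steps (any sequence length)."""
    steps = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / steps for d in range(dim)]


def softmax(v):
    """S_i = exp(v_i) / sum_j exp(v_j); subtracting max(v) avoids overflow
    and does not change the result."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden_states, W, b):
    """Mean-pool, apply an assumed linear layer, then softmax over classes."""
    pooled = mean_pool(hidden_states)
    scores = [sum(w * p for w, p in zip(row, pooled)) + bias
              for row, bias in zip(W, b)]
    return softmax(scores)


# Two time steps of 2-dimensional hidden states (made-up values).
states = [[1.0, 0.0], [3.0, 2.0]]
pooled = mean_pool(states)  # [2.0, 1.0]
probs = classify(states, W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
print(pooled, probs)
```

Because the mean is taken over however many hidden states there are, the pooled vector has a fixed size regardless of sequence length, which is what lets a fixed-size classifier sit on top of variable-length opcode sequences.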
[...] as follows, where $N$ is the number of executable file classes and $S_i$ is the probability of belonging to the $i$-th class.

\[ S_i = \frac{\exp(v_i)}{\sum_{j=1}^{N} \exp(v_j)} \tag{7} \]

[...] choose 70% samples as training set and choose 30% samples as testing set.

TABLE I
THE DETAILS ON DATASET
[...] LSTM model and other related malware detection approaches. The related approaches include the convolutional neural network (CNN), the RNN, and the multilayer perceptron (MLP). The two-stage LSTM model performs significantly better than the RNN for both binary classification and multi-classification. It is also better than the MLP for both tasks, and slightly better than the CNN for both tasks. Although the performance of the CNN and the two-stage LSTM model is comparable, training the CNN model takes much longer than training the two-stage LSTM model, because a CNN model usually has a very large number of parameters.

Based on the experimental results, we conclude that the proposed method performs very well on both malware detection and malware classification.

[Figure 10: four ROC panels; x-axis: False Positive Rate; y-axis: True Positive Rate. Legend areas for Malware and Benign: 0.99 and 0.98 in the two upper panels (panel titles not recovered), 0.89 in the RNN panel, 0.97 in the MLP panel.]
Fig. 10. Binary-classification: the performance comparison of the two-stage LSTM model and other related malware detection approaches.
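For readers unfamiliar with the metric, the area values reported for the ROC curves can be computed directly from classifier scores. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one). The function name `roc_auc` and the labels and scores are made up for illustration and are not data from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied scores count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# 1 = malware, 0 = benign; scores are hypothetical model outputs.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))  # 0.8333
```

The pairwise loop is quadratic in the number of samples, which is fine for a dataset of about a thousand files; a sort-based rank computation would be the usual choice for larger sets.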
TABLE III
DETECTION ACCURACY OF DIFFERENT WORD WINDOW SIZES AND DIFFERENT WORD EMBEDDING TECHNIQUES

[ROC figure: panels titled two-stage LSTM and CNN; y-axis: True Positive Rate.]

[6] P. Li, L. Liu, D. Gao, and M. K. Reiter, “On challenges in evaluating malware clustering,” in International Workshop on Recent Advances in