API Call Based Malware Detection Approach Using Recurrent Neural Network-LSTM
1 Introduction
The word ‘malware’ is a blend of two words, ‘malicious’ and ‘software’. It
refers to programs that are deliberately designed to penetrate high-value
systems and gain unauthorized access undetected, with the purpose of stealing
data, disrupting business to the point of standstill, or holding systems
hostage for ransom. The form and features of malware evolve continuously,
which demands that the technology
© Springer Nature Switzerland AG 2020
A. Abraham et al. (Eds.): ISDA 2018, AISC 940, pp. 87–99, 2020.
https://doi.org/10.1007/978-3-030-16657-1_9
88 J. Mathew and M. A. Ajay Kumara
used to detect and stop it also keep pace and stay ahead of the curve.
Current-generation malware is mostly clickless, lives off the victim's own
system, and increasingly includes worming components¹. In [1], a case study was
conducted on the WannaCry ransomware (which in 2017 infected 400,000 machines
in 150 countries in a short time) to automatically identify its distinguishing
features from system logs; API calls played a big role in that analysis.
Static analysis extracts information without executing the sample, by
analyzing all possible execution paths in the binary code. It is vulnerable to
code obfuscation. Behavior-based dynamic analysis [13] executes the binary
sample in a virtual machine environment. API calls provide the interface
between user programs and the Operating System (OS) by requesting services
from the OS kernel. An API provides a set of well-defined commands that
programs invoke. An effective means of defending against rapidly evolving
attacks is to deploy solutions that can learn and recognize common behaviors
and elements that continue to be reused. A well-informed way to track behavior
is to analyze system calls. In this paper, the proposed approach treats system
calls as words and system call sequences as sentences. The idea is that, just
as a meaningful arrangement of words makes a sensible sentence, API calls and
their order give us meaningful information for studying and predicting malware
and benign behavior.
In a landscape where conventional statistical Machine Learning (ML) techniques
[2,5,6] are used for most malware detection and analysis, the main
contribution of this paper is that it uses API call sequences, in which the
order of the data points matters, to train an advanced RNN architecture, the
LSTM model, to distinguish malware from benign programs. By bringing together
Natural Language Processing (NLP) algorithms and deep learning models,
evaluating on a recent dataset, and achieving acceptable results, this paper
seeks to show that an RNN-LSTM realization of a malware detection model based
on API call sequences is worthy of further study, as it offers greater scope
in terms of resilience and interpretation.
2 Related Work
Malware classification and detection have attracted many research ideas and
models over the years. Though many early works such as [2–6] attempted
behavioral analysis, they were unable to self-learn patterns because all of
them relied on conventional machine learning techniques. In an attempt to
recognize malware functions based on the system calls they relied on, Ki et al.
[10] used sequence alignment algorithms to extract common call sequences.
However, the computing cost of sequence alignment algorithms was too high.
Bringing NLP developments into API call sequence analysis, Tran et al. [11]
introduced doc2vec, N-gram and TF-IDF methods along with conventional ML
algorithms. Due to the lack of samples, the model was evaluated only on
classifying malware into classes. With over 96% accuracy, the NLP algorithms
proved their worth. A large share of malware detection work targets Android,
such as Sugunan et al. [16], but it used conventional ML techniques.
¹ The 3 Biggest Malware Trends to Watch in 2018. https://www.securityweek.com.
The current trend in malware analysis demands models that can learn patterns,
sequences and other informative features. RNNs provide that learning edge.
Pascanu et al. [12] pioneered the use of RNNs with a hybrid model that combined
Echo State Networks as the RNN model with an ML classifier. The model was meant
to predict the next API call. Though the accuracy of the model is good, the
length of the API call trace was custom-constrained. Kolosnjaji et al. [7]
implemented a combination of Convolutional Neural Network (CNN) and
feed-forward layers for malware classification and detection through static
analysis of portable executable files and series of opcodes, achieving recall
and F1-scores of 92%, but the input was not exactly sequential because
convolutional filters were used. Thus the model was still not
sequence-learning and remained vulnerable to advanced code obfuscation and
detection evasion techniques. Nearly closing the gap, Tobiyama et al. [8]
focused on detecting malware based on process behavior (logged operations) in
infected terminals. A combination of RNN and CNN was used for feature
extraction and malware detection, but the RNN was used only to extract the
features and the trained classifier remained a CNN.
Wang et al. [14], however, designed a multi-task model using an RNN that could
identify malware based on API call sequence analysis and also make the results
more interpretable. While that work focused mainly on classifying malware into
classes, effective detection of both malware and benign programs from a mix
remains an open opportunity. Xiao et al. [15] built a new classifier based on
LSTM networks to detect malware from Android call sequences. The classifier
had two LSTMs, trained on benign and malware applications respectively. Given
a new sequence as input, similarity scores were calculated from both models
and used to classify it. The model achieved good accuracy, but it used two
engines and was specific to Android malware detection rather than Windows or
other platforms.
Considering the insights obtained from these related works, this paper aims
to study the effectiveness of an RNN combined with N-gram and TF-IDF
techniques in classifying and detecting unknown malware based on its API call
sequence.
3 Background
Every API call is a specific action performed by a malware or benign program
on the running system, e.g. creation, reading, writing or deletion of files or
registry keys. We choose API call sequences as features because the order of
the calls shows how a malware or benign program behaves with the operating
system. Malware and benign programs have their own specific API call patterns,
or unique orders of calls.
$$h_t = f(h_{t-1}, x_t) \qquad (1)$$
where $f$ is the activation function and $x_t$ is the input at time step $t$.
At each $t$, the RNN performs the same calculations with the same shared
parameters on different inputs in the sequence. Figure 1 shows the unfolded
RNN with U, V and W being the input, output and state weights respectively.
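The recurrence in (1) can be sketched in a few lines of Python. This is a minimal illustration only: it assumes $f$ is tanh, reuses the U, W, V naming of Fig. 1, and the weight values and dimensions are made up for the example rather than taken from the paper.

```python
import math

def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def rnn_forward(xs, U, W, V):
    """Run a plain RNN over the input vectors xs.

    U: input weights, W: state weights, V: output weights (as in Fig. 1).
    Assumes f = tanh. Returns the final hidden state and all per-step outputs.
    """
    h = [0.0] * len(W)  # initial hidden state h_0
    outputs = []
    for x in xs:
        # h_t = f(h_{t-1}, x_t): the same U and W are shared across all steps
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(p) for p in pre]
        outputs.append(matvec(V, h))
    return h, outputs

# Hypothetical weights: 2 hidden units, 3 input features, 1 output unit
U = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W = [[0.1, 0.4], [-0.3, 0.2]]
V = [[1.0, -1.0]]
xs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # three 1-hot encoded inputs
h_final, outputs = rnn_forward(xs, U, W, V)
```

Because the loop simply iterates over `xs`, the same parameters handle sequences of any length.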
One of the major advantages of an RNN is that the lengths of the input
and output sequences can be different. As explained in [14], through the
whole process of training an RNN to learn sequences, say to take in sequence
$X = (x_1, x_2, x_3, \ldots, x_T)$ and output sequence
$Y = (y_1, y_2, y_3, \ldots, y_T)$, the RNN will effectively learn the
conditional distribution $p(Y \mid X)$, i.e. mathematically,
$$p(y_1, y_2, \ldots, y_T \mid x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(y_t \mid h_{t-1}, y_1, y_2, \ldots, y_{t-1}) \qquad (2)$$
Once the output is generated, it is compared to the true value and an error
is computed. The error is then back-propagated through the network to update
the weights, and the network is thus trained. If we represent a sentence as
$W_1^N$, containing $N$ words $w_i$, then in the RNN the probability of
generating the sentence $W_1^N$ can be represented as
$$p(W_1^N) = \prod_{m=2}^{N} p(w_m \mid W_1^{m-1}) \qquad (3)$$
In standard RNNs, the derivatives of the tanh and sigmoid functions approach 0
at both ends. Saturated units therefore have near-zero gradients and drive the
gradients in earlier layers of the unrolled network towards 0 during
back-propagation. As a result, the RNN eventually fails to learn long-range
dependencies. This problem of vanishing gradients (and, similarly, exploding
gradients) in RNNs is addressed by an advanced RNN architecture called LSTM.
The repeating module in an LSTM has four interacting neural network layers.
Regulated by structures called gates, the LSTM can remove information from or
add information to the cell state as required. The gates are composed of a
sigmoid neural net layer and a pointwise multiplication operation. As the
sigmoid layer outputs values between 0 and 1, the gates decide how much of
each component should be let through.
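To make the gating mechanism concrete, the following is a minimal sketch of one time step of the standard LSTM formulation (forget, input and output gates plus a tanh candidate layer). It is not the paper's TensorFlow implementation; the parameter layout, the helper names and the random weights are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params maps "f", "i", "o", "g" to a (weights, bias) pair acting on the
    concatenation [h_prev, x]; this layout is an illustrative assumption.
    """
    z = h_prev + x  # list concatenation of previous hidden state and input
    f = [sigmoid(a) for a in affine(*params["f"], z)]    # forget gate, in (0, 1)
    i = [sigmoid(a) for a in affine(*params["i"], z)]    # input gate, in (0, 1)
    o = [sigmoid(a) for a in affine(*params["o"], z)]    # output gate, in (0, 1)
    g = [math.tanh(a) for a in affine(*params["g"], z)]  # candidate cell values
    # Pointwise multiplications decide how much old state to keep and new to add
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hypothetical sizes and random weights, for demonstration only
rng = random.Random(0)
hidden, n_in = 2, 3
params = {
    name: ([[rng.uniform(-1, 1) for _ in range(hidden + n_in)]
            for _ in range(hidden)], [0.0] * hidden)
    for name in "fiog"
}
h, c = [0.0] * hidden, [0.0] * hidden
for x in ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
```

The four `affine` calls correspond to the four interacting layers of the repeating module; the three sigmoid outputs are the gates.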
4 Proposed Method
The main focus of this work is to realize and study the use of RNN-LSTM to
detect malware by interpreting program behavior from API system call sequences
using a language model. The features extracted from the API call sequences are
used to classify a program as malware or benign. Figure 2 shows the design of
our proposed system. The model is first trained to learn the API call sequence
patterns. It is then tested on a test dataset unseen by the model to evaluate
its performance. The steps followed in our method are explained in the
forthcoming subsections.
the accuracy. Feature selection reduces the dimensions of the original feature
vector, allowing ML techniques to function effectively. The TF-IDF technique
is explored in this paper. Accuracy greater than 90% was achieved by using
only a small portion (i.e. the top TF-IDF-ranked sequences, as discussed in
Sect. 6) of the total 10-gram sequences generated.
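As an illustration of how N-gram sequences such as the 10-grams mentioned above might be generated from a call trace, here is a minimal sliding-window sketch. The window stride of one call and the sample trace are assumptions for the example, not details confirmed by the paper.

```python
def call_ngrams(trace, n=10):
    """Return every contiguous n-gram (as a tuple) from a list of API call names.

    Uses a sliding window with a stride of one call (an assumed choice).
    A trace shorter than n yields an empty list.
    """
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

# Hypothetical short trace; n=3 keeps the example readable
trace = ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtClose"]
grams = call_ngrams(trace, n=3)
```

With this trace, `grams` holds two 3-grams, each preserving the original call order.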
IDF is given in (6), where $sf(s, c)$ is the number of sequences in which the
system call $c$ appears. It is the logarithmically scaled ratio of the number
of sequences in the corpus $S$ to the number of sequences in which system call
$c$ appears.
$$\mathrm{idf}(c, S) = \log \frac{1 + |S|}{1 + sf(s, c)} \qquad (6)$$
$$\mathrm{tfidf}(c, s, S) = \mathrm{tf}(c, s) \cdot \mathrm{idf}(c, S) \qquad (7)$$
In (7), the TF-IDF value increases in proportion to the frequency of $c$ in a
sequence and decreases with the log of the number of sequences in the corpus
that contain $c$.
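Equations (6) and (7) can be sketched as follows. Two things here are assumptions rather than details from the text: the term frequency is taken to be the count of $c$ in $s$ divided by the length of $s$, and a sequence is ranked by the sum of the TF-IDF values of its distinct calls. The natural logarithm is also assumed.

```python
import math
from collections import Counter

def tf(c, s):
    """Term frequency: share of sequence s occupied by call c (assumed definition)."""
    return Counter(s)[c] / len(s)

def idf(c, S):
    """Eq. (6): log((1 + |S|) / (1 + sf)), where sf counts sequences containing c."""
    sf = sum(1 for s in S if c in s)
    return math.log((1 + len(S)) / (1 + sf))

def tfidf(c, s, S):
    """Eq. (7): product of term frequency and inverse document frequency."""
    return tf(c, s) * idf(c, S)

def rank_sequences(S, top_n):
    """Keep the top_n sequences by summed TF-IDF of their distinct calls.

    The summation is a hypothetical scoring rule chosen for this sketch.
    """
    scored = [(sum(tfidf(c, s, S) for c in set(s)), s) for s in S]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_n]]

# Hypothetical corpus of three 3-gram sequences
S = [
    ("NtOpenFile", "NtReadFile", "NtClose"),
    ("NtOpenFile", "NtWriteFile", "NtClose"),
    ("NtOpenFile", "NtDeleteFile", "NtDeleteFile"),
]
top = rank_sequences(S, top_n=1)
```

In this corpus `NtOpenFile` appears in every sequence, so its IDF is zero and it contributes nothing to any score, while the rarer, repeated `NtDeleteFile` pushes the third sequence to the top.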
1-Hot Vectorization. After the sequences are ranked by TF-IDF, only the
top-ranked sequences proceed further. This reduces the load on the RNN-LSTM
engine, removes redundancies and helps keep the model unbiased. The 1-hot
vector encoding method is used to convert these call grams into a numerical
representation that can be fed into the neural network. The list (or
vocabulary) of unique system calls in the dataset forms a dictionary. The
length of the vector that represents each system call is the size of the
vocabulary. Each vector is a binary series of 1s and 0s in which only the
position corresponding to the position of the call in the dictionary holds a 1
and the rest of the vector is 0.
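The encoding described above can be sketched as follows. The function names and the choice to sort the vocabulary alphabetically are assumptions made for this illustration.

```python
def build_vocabulary(sequences):
    """Map each unique call name to a fixed index (sorted for reproducibility)."""
    calls = sorted({call for seq in sequences for call in seq})
    return {call: idx for idx, call in enumerate(calls)}

def one_hot(call, vocab):
    """Return a vector of len(vocab) with a single 1 at the call's index."""
    vec = [0] * len(vocab)
    vec[vocab[call]] = 1
    return vec

def encode_sequence(seq, vocab):
    """Encode a call sequence as a list of 1-hot vectors, preserving order."""
    return [one_hot(call, vocab) for call in seq]

# Hypothetical top-ranked call grams
sequences = [("NtOpenFile", "NtReadFile"), ("NtOpenFile", "NtClose")]
vocab = build_vocabulary(sequences)
encoded = encode_sequence(sequences[0], vocab)
```

Here the vocabulary has three entries, so every call becomes a vector of length three.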
5 Implementation
5.1 Dataset Preparation
In this work, we used API system call traces from 3000 malware samples and
various benign traces, obtained from the Hacking and Countermeasure Research
Lab² and [17] respectively. While the malware traces were extracted after
executing the samples in a dynamic environment and made available by Kim [10],
the benign traces were obtained using a native Windows API tracer called
NtTrace³.
Fig. 3. RNN-LSTM
Metric                        Formula
Accuracy                      (TP + TN)/(TP + FP + TN + FN)
True Positive Rate (Recall)   TP/(TP + FN)
False Positive Rate           FP/(FP + TN)
Precision                     TP/(TP + FP)
F1 Score                      2 * (Recall * Precision)/(Recall + Precision)
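The metrics in the table can be computed directly from confusion-matrix counts. The sketch below is a plain transcription of those formulas; the function name and the example counts are hypothetical and are not results from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from confusion-matrix counts.

    No guard against zero denominators is included; callers are assumed to
    pass counts for which every denominator is positive.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)                      # true positive rate
    false_positive_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * (recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts, with malware as the positive class
metrics = detection_metrics(tp=90, fp=10, tn=80, fn=20)
```

For these made-up counts the accuracy is 0.85 and the precision is 0.9.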
As can be seen in Fig. 5, the model became good at detecting malware, with a
top average recall of 97%. Like the recall score, the overall average accuracy
of the model increased as more top TF-IDF sequences were used, reaching as
high as 92% for the top 10K sequences (Fig. 4). With accuracy and recall both
improving as the number of top TF-IDF training sequences increases, the model
is not only able to distinguish malware from benign programs with good
accuracy but is also becoming particularly good at detecting malware when it
is present in the test sample. With parameter optimization and more training,
this performance could be targeted to exceed 97%.
[Figure: average test accuracy (%) plotted against the top N TF-IDF sequences used for training the model (N from 0 to 10000)]
While it is always interesting to look at the error rate of a model in any
evaluation, it is less suitable in this paper, as the model needs to be
evaluated on how well it detects both malware and benign programs and not only
one of them. Figure 6 presents the F1 Scores and, most importantly, their
complement (1-F1), which represents the error of the LSTM model.
The F1 Score, which is the harmonic mean of Recall and Precision, gives us
better visibility. The average F1 Score follows the same trend as accuracy and
Recall and peaks at about 90% for the top 10K sequences, while the error
complement (1-F1) falls as the number of top TF-IDF sequences selected grows
from 2,500 to 10,000. That is to say, the model is learning the API system
call behavioural patterns of both benign programs and malware better and more
quickly.
It should be noted that 50,000 iterations were used for all the analyses.
A higher number of training sequences would normally call for a higher number
of training iterations. The comparison has therefore not been entirely fair to
the designed RNN-LSTM model, owing to the varying ratio of sequences per
iteration. Yet the model still provides higher accuracy and a lower error rate
for a higher number of sequences. This suggests that increasing the iterations
further would not only increase true detections but also decrease false alarms
and misses.
[Figure: True Positive Rate (Recall), average %, plotted against the top N TF-IDF sequences used for training the model (N from 0 to 10000)]
[Figure: average % F1 Score and 1-F1 Score. F1 Score values shown: 90.19, 89.95, 82.20, 76.93; 1-F1 Score values shown: 23.07, 17.80, 9.81, 10.05]
7 Future Work
The current model is evaluated using features recommended by only one
technique, TF-IDF. The results obtained inspire confidence that the parameters
can be optimized to raise the accuracy of the proposed model further.
8 Conclusion
In this work, we presented a behaviour-based classification of unknown malware
that leverages the RNN-LSTM technique on API call sequences. We used N-grams
for feature extraction and TF-IDF for feature selection of the sequences. The
novel contribution of this work is the incorporation of RNN-LSTM into malware
detection for effective analysis of unknown malware. The evaluated system is
built on Google's TensorFlow framework. The training dataset was ensured to
contain zero redundancy and an equal share of top malware and benign
sequences. It should be noted that the system realized in this paper is a
learning model that continuously seeks to recognize not only known patterns
but also evolving, unknown patterns of malware. This is what gives the
proposed model a strong footing over existing conventional statistics-based
machine learning techniques. The promising results, with a highest average
accuracy of 92% and average recall of 97%, encourage further exploration of
malware detection based on a combination of an RNN-LSTM model and API call
sequences.
References
1. Chen, Q., Bridges, R.A.: Automated behavioral analysis of malware: a case study
of WannaCry ransomware (2017). arXiv:1709.08753v1 [cs.CR], Cryptography and
Security
2. Ajay Kumara, M.A., Jaidhar, C.D.: Automated multi-level malware detection sys-
tem based on reconstructed semantic view of executables using machine learning
techniques at VMM. Future Gener. Comput. Syst. 79(Part 1), 431–446 (2018)
3. Anju, S.S., Harmya, P., Jagadeesh, N., Darsana, R.: Malware detection using
assembly code and control flow graph optimization. In: Proceedings of the 1st
Amrita ACM-W Celebration of Women in Computing in India, A2CWiC 2010,
Coimbatore (2010)
4. Kang, B., Han, K.S., Kang, B., Im, E.G.: Malware categorization using dynamic
mnemonic frequency analysis with redundancy filtering. Digit. Investig. 11, 323–
335 (2014)
5. Salehi, Z., Sami, A., Ghiasi, M.: Using feature generation from API calls for mal-
ware detection. Comput. Fraud Secur. 2014, 9–18 (2014)
6. Galal, H.S., Mahdy, Y.B., Atiea, M.A.: Behavior-based features model for malware
detection. J. Comput. Virol. Hacking Tech. 12(2), 59–67 (2016)
7. Kolosnjaji, B., Zarras, A., Eraisha, G., Webster, G., Eckert, C.: Empowering con-
volutional networks for malware classification and analysis. In: International Joint
Conference on Neural Networks (IJCNN) (2017)
8. Tobiyama, S., Yamaguchi, Y., Shimada, H., Ikuse, T., Yagi, T.: Malware detec-
tion with deep neural network using process behavior. In: 40th Annual Computer
Software and Applications Conference (COMPSAC) (2016)
9. Athira, V., Geetha, P., Vinayakumar, R., Soman, K.P.: DeepAirNet: applying
recurrent networks for air quality prediction. In: International Conference on Com-
putational Intelligence and Data Science (ICCIDS) (2018)
10. Ki, Y., Kim, E., Kim, H.K.: A novel approach to detect malware based on API
call sequence analysis. Int. J. Distrib. Sens. Netw. 11, 659101 (2015)
11. Tran, T.K., Sato, H.: NLP-based approaches for malware classification from API
sequences. In: 21st Asia Pacific Symposium on Intelligent and Evolutionary Sys-
tems (IES) (2017)
12. Pascanu, R., Stokes, J.W., Sanossian, H., Marinescu, M., Thomas, A.: Malware
classification with recurrent networks. In: IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP) (2015)
13. Rhodes, M., Burnap, P., Jones, K.: Early-stage malware prediction using recurrent
neural networks. Comput. Secur. 77, 578–594 (2018). arXiv:1708.03513 [cs.CR]
14. Wang, X., Yiu, S.M.: A multi-task learning model for malware classification with
useful file access pattern from API call sequence (2016). arXiv:1610.05945 [cs.SD],
Cryptography and Security
15. Xiao, X., Zhang, S., Mercaldo, F., Hu, G., Sangaiah, A.K.: Android malware detec-
tion based on system call sequences and LSTM. Multimed. Tools Appl. (2017)
16. Sugunan, K., Gireesh Kumar, T., Dhanya, K.A.: Static and dynamic analysis for
android malware detection. Advances in Intelligent Systems and Computing, vol.
645, pp. 147–155. Springer, Cham (2018)
17. Kim, C.W.: GitHub repository (2018). https://github.com/codeandproduce/
NtMalDetect