API Call Based Malware Detection Approach Using Recurrent Neural Network-LSTM

J. Mathew(B) and M. A. Ajay Kumara

Department of Computer Science and Engineering, Amrita School of Engineering,
Bengaluru, Amrita Vishwa Vidyapeetham, Bengaluru, India

Abstract. Malware variants keep increasing every year, as most malware
developers tweak existing, easily available malware code to create their own
custom versions. Although the behaviour of these variants stays largely the
same, their signatures change, so static signature-based malware detection
schemes fail to identify them. One promising approach to malware detection is
dynamic analysis, which observes the behaviour of the malware. Malware
execution depends largely on the Application Programming Interface (API) calls
it issues to the operating system to achieve its malicious tasks. Therefore,
behaviour-based detection techniques that keep an eye on such API system calls
can deliver promising results, as they are inherently semantic-aware. In this
paper, we use the capability of Recurrent Neural Networks (RNNs) to capture
long-term features of time-series and sequential data, and study the scope and
effectiveness of RNNs in detecting and analyzing malware and benign programs
based on their behaviour, specifically their system call sequences. We trained
an RNN-Long Short Term Memory (LSTM) model on the most informative sequences in
the API dataset, selected by their relative ranking under Term
Frequency-Inverse Document Frequency (TF-IDF), and achieved an accuracy as high
as 92% in distinguishing malware from benign programs on unknown test API call
sequences.

Keywords: API system call · Malware detection · TF-IDF · RNN-LSTM

1 Introduction
The word ‘malware’ is a blend of two words, ‘malicious’ and ‘software’. It
refers to programs that are deliberately designed to penetrate high-value
systems and gain access through unauthorized means while remaining undetected,
with the sole purpose of stealing data, disrupting business and bringing it to
a standstill, or sometimes even holding systems hostage for ransom. The form
and features of malware evolve continuously, which demands that the technology
used to detect and stop it also up the ante and get ahead of the curve.
Current-generation malware is mostly clickless, lives off the victim's own
system and increasingly includes worming components¹. In [1], a case study was
conducted on the WannaCry ransomware (which in 2017 infected 400,000 machines
in 150 countries in a very short time) to automatically identify its
distinguishing features from system logs, and API calls played a big role in
the analysis.

© Springer Nature Switzerland AG 2020
A. Abraham et al. (Eds.): ISDA 2018, AISC 940, pp. 87–99, 2020.
https://doi.org/10.1007/978-3-030-16657-1_9

Static analysis extracts information without executing the sample, by
analyzing all possible execution paths in the binary code. It is prone to code
obfuscation. Behavior-based dynamic analysis [13] executes the binary sample in
a virtual machine environment. API calls provide the interface between user
programs and the Operating System (OS) by requesting services from the OS
kernel. The API provides a set of well-defined commands that programs have to
invoke and make use of. A truly effective means of defending against rapidly
evolving attacks is to deploy solutions that can learn and recognize common
behaviors and elements that continue to be reused. A well-informed way to track
behavior is to analyze system calls. In this paper, the proposed approach
treats system calls as words and system call sequences as sentences. The idea
is that, just as a meaningful arrangement of words makes a sensible sentence,
API calls and their order give us sensible information to study and predict
malware and benign behaviors.
Conventional statistical Machine Learning (ML) techniques [2,5,6] are used for
most malware detection and analysis. In contrast, the significant contribution
of this paper is that it uses API call sequences, where the particular order of
the data points matters, to train an advanced RNN architecture, the LSTM model,
to detect malware and benign programs. Finally, by bringing together Natural
Language Processing (NLP) algorithms and deep learning models, evaluating on a
recent dataset and achieving fair and acceptable results, this paper seeks to
show that an RNN-LSTM realization of a malware detection model based on API
call sequences is worthy of further study, as it offers a larger scope in terms
of resilience and interpretation.

2 Related Work
Malware classification and detection approaches have seen many research ideas
and inquisitive models over the years. Though many early works such as [2–6]
attempted behavioral analysis, they were unable to self-learn patterns because
all of them used conventional machine learning techniques for model evaluation.
In an attempt to recognize malware functions based on the system calls they
relied on, Ki et al. [10] used sequence alignment algorithms for common call
sequence extraction. However, the computing cost of sequence alignment
algorithms was too high. Bringing developments in NLP into API call sequence
analysis, Tran et al. [11] introduced doc2vec, N-gram and TF-IDF methods along
with conventional ML algorithms. Due to the lack of samples, the model was
evaluated only on classifying types of malware classes. With over 96% accuracy,
NLP algorithms proved their worth. A large share of work on malware detection
targets Android, such as Sugunan et al. [16], but uses conventional ML
techniques.

¹ The 3 Biggest Malware Trends to Watch in 2018. https://www.securityweek.com.

The current trend in malware analysis demands learning models that can find
patterns, sequences and other informative features. RNNs provide that learning
edge. Pascanu et al. [12] pioneered the use of RNNs by creating a hybrid model
that combines Echo State Networks as the RNN model with an ML classifier. The
model was meant to predict the next API call. Though the accuracy of the model
is good, the length of the API call trace was custom-constrained. Kolosnjaji
et al. [7] implemented a combination of Convolutional Neural Network (CNN) and
feed-forward layers for malware classification and detection through static
analysis of portable executable files and series of opcodes, and achieved
recall and F1-scores of 92%. However, the input was not exactly sequential, as
convolutional filters were used. Thus the model was still not sequence-learning
and remained vulnerable to advanced code obfuscation and detection evasion
techniques. Nearly closing the gap, Tobiyama et al. [8] focused on detecting
malware based on process behavior (logged operations) in infected terminals. A
combination of RNN and CNN was used for feature extraction and malware
detection. But here the RNN was used only to extract the features, and the
model that was trained for classification remained a CNN.
Wang et al. [14], however, designed a multi-task model using an RNN that could
identify malware based on API call sequence analysis and also provide more
interpretability of the results. While that work focused heavily on malware
class classification, effective detection of both malware and benign programs
from a mix remains an open opportunity. Based on LSTM networks, Xiao et al.
[15] built a new classifier to detect malware based on Android system call
sequences. The classifier had two LSTMs, which were trained on benign and
malware applications respectively. With a new sequence as input, similarity
scores would be calculated from both models, leading to its classification. The
model achieved good accuracy, but used two engines and was specific to Android
malware detection, not other platforms such as Windows.
Considering the insights obtained from these related works, this paper aims
to study the effectiveness of RNN featured with N-gram and TF-IDF techniques
to classify and detect unknown malware based on its API call sequence.

3 Background

In this section, we provide background on the topics that form the basic
building blocks of the proposed method.

3.1 API Call Sequence

Every single API call is an exact action performed by a malware or benign
program on the running system, e.g. creation, reading, writing and deletion of
files or registry keys. We choose API call sequences as features since the
order of calls shows how a malware or benign program behaves with an operating
system. Malware and benign programs each have their own specific API call
patterns, or unique order of calls.

3.2 Natural Language Processing (NLP)

NLP is an automated way to understand and analyze natural human language and
derive information from it using ML algorithms. NLP finds applications in
automatic summarization, topic segmentation, machine translation, named entity
recognition, etc. Since computers and programs have a fixed vocabulary of
commands whose sequential usage converges on an intended function, NLP can be
applied to the analysis of API call sequences. The work by Tran et al. [11] is
a pioneering work that incorporated NLP algorithms to analyse malware.

3.3 Recurrent Neural Network

An RNN learns sequences as they are fed in because it can remember events
presented to it in the past. RNNs are called ‘recurrent’ because the same task
is performed on every element in the sequence and the output depends on the
present state and the computations of previous states. The hidden state $h_t$
at each time step $t$ is given by

$$h_t = f(h_{t-1}, x_t) \qquad (1)$$

where $f$ is the activation function and $x_t$ is the input at time step $t$.
At each $t$, the RNN performs the same calculations with the same shared
parameters on different inputs in the sequence. Figure 1 shows the unfolded RNN
with U, V and W being the input, output and state weights respectively.

Fig. 1. Unfolded RNN
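
The recurrence in (1) can be illustrated with a minimal sketch. It assumes a scalar hidden state, tanh as the activation $f$, and two scalar weights `u` and `w` standing in for the input and state weights U and W of Fig. 1; the function name and the numeric values are illustrative assumptions, not part of the proposed method.

```python
import math


def rnn_forward(xs, u, w, h0=0.0):
    """Apply h_t = tanh(u * x_t + w * h_{t-1}) over a sequence of scalars."""
    h = h0
    states = []
    for x in xs:
        # The same weights u and w are reused at every time step.
        h = math.tanh(u * x + w * h)
        states.append(h)
    return states


# The same inputs in a different order lead to a different final state,
# which is why the order of calls in a trace carries information.
states_a = rnn_forward([1.0, 0.0], u=0.5, w=0.9)
states_b = rnn_forward([0.0, 1.0], u=0.5, w=0.9)
```

Because each state is computed from the previous one, the final hidden state summarizes the whole sequence seen so far rather than only the last input.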


One of the major advantages of an RNN is that the lengths of the input and
output sequences can be different. As explained in [14], through the whole
process of training an RNN to learn sequences, say to take in a sequence
$X = (x_1, x_2, x_3, \ldots, x_T)$ and output a sequence
$Y = (y_1, y_2, y_3, \ldots, y_T)$, the RNN will effectively learn the
conditional distribution $p(Y \mid X)$, i.e. mathematically,

$$p(y_1, y_2, \ldots, y_T \mid x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(y_t \mid h_{t-1}, y_1, y_2, \ldots, y_{t-1}) \qquad (2)$$

Once the output is generated, it is compared to the true value and an error is
computed. The error is then back-propagated through the network to update the
weights, and the network is thus trained. If we represent a sentence as
$W_1^N$, containing $N$ words $w_i$, then the probability of the RNN generating
the sentence $W_1^N$ can be represented as

$$p(W_1^N) = \prod_{m=2}^{N} p(w_m \mid W_1^{m-1}) \qquad (3)$$
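
To make the factorization in (3) concrete, the following sketch combines per-step conditional probabilities into a sequence probability, working in log space to avoid numerical underflow on long sequences. The probabilities are invented for illustration and the function name is an assumption; in practice they would be produced by a trained model.

```python
import math


def sequence_log_prob(conditionals):
    """Sum log p(w_m | W_1^{m-1}) over the supplied per-step probabilities."""
    return sum(math.log(p) for p in conditionals)


# Hypothetical per-step probabilities for a short sequence.
step_probs = [0.5, 0.8, 0.25]
log_p = sequence_log_prob(step_probs)
prob = math.exp(log_p)  # equals the product of the three probabilities
```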

3.4 Long Short Term Memory

In standard RNNs, the derivatives of the tanh and sigmoid functions approach 0
at both ends of their input range. Saturated units therefore have near-zero
gradients and drive the gradients in previous layers towards 0 during
back-propagation. As a result, the RNN ends up not learning long-range
dependencies. This vanishing gradient problem (and, similarly, the exploding
gradient problem) in RNNs is addressed by an advanced RNN architecture called
the LSTM. The repeating module in an LSTM has four neural layers that interact
in a very special way. Regulated by structures called gates, the LSTM has the
ability to remove or add information (only as required) to the cell state. The
gates are composed of a sigmoid neural net layer and a pointwise multiplication
operation. As the sigmoid layer outputs values between 0 and 1, the gates
decide how much of each component should be let through.
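
The gating described above can be sketched as a single LSTM step. This is a simplified illustration that assumes scalar states and the commonly used forget, input, candidate and output layers; the parameter layout and all names are assumptions made for the example, not the configuration used in this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps a layer name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to admit
    g = math.tanh(pre("candidate"))   # candidate values to add
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # pointwise gating of the cell state
    h = o * math.tanh(c)
    return h, c


params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(1.0, 0.0, 0.0, params)

# With the forget gate saturated open and the input gate closed,
# the cell state passes through unchanged.
keep = {"forget": (0.0, 0.0, 50.0), "input": (0.0, 0.0, -50.0),
        "candidate": (0.0, 0.0, 0.0), "output": (0.0, 0.0, 0.0)}
_, c_kept = lstm_step(0.0, 0.0, 2.0, keep)
```

The second call shows why the cell state helps with long-range dependencies: when the forget gate stays close to 1, information is carried forward through an additive path instead of being repeatedly squashed.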

4 Proposed Method

The main focus of this work is to realize and study the use of an RNN-LSTM to
detect malware by interpreting program behavior from API system call sequences
using the language model. The features extracted from the API call sequences
are used for malware or benign detection. Figure 2 shows the design of our
proposed system. The model is first trained to learn the API call sequence
patterns. It is then tested to evaluate its performance on a test dataset that
is unknown to the model. The steps followed in our method are explained in the
following subsections.

Fig. 2. System design

4.1 Data Preprocessing

To validate the proposed system, API call sequences are considered for
behavioral analysis. Prior to training and classification, the data must be
normalized and preprocessed to improve model convergence speed in training. As
only system calls are considered as features, any redundancies present are
removed. Extra fields are added if required. The resulting dataset is ensured
to contain only unique system call sequences invoked by the malware and benign
programs.

4.2 Feature Extraction

A challenge with API call traces is their variable lengths. The N-gram model is
used for feature extraction. In it, sequences of N contiguous system calls from
the original call trace form the features. These are obtained by sliding a
window of length N over all the API call traces. In this work, N is chosen as
10. It was a conservative choice for accuracy improvement based on previous
work [17], as programs typically require reasonably long sequences of API calls
to fully serve their purpose.
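
The sliding window operation can be sketched as follows. The function name and the sample trace are illustrative assumptions, and a window of 3 is used only to keep the example short; the paper uses N = 10.

```python
def extract_ngrams(trace, n=10):
    """Return every run of n consecutive calls in the trace."""
    # A trace shorter than n yields no N-grams.
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


# Hypothetical trace of native API call names, for illustration only.
trace = ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtClose", "NtOpenKey"]
grams = extract_ngrams(trace, n=3)
```

A trace of length L yields L - N + 1 overlapping N-grams, so traces of different lengths are all reduced to fixed-length features.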

4.3 Feature Selection and Vectorization

Feature selection techniques judiciously trim down the training set to contain
the most meaningful records. The engine is better off handling a smaller volume
of sensible data, resulting in better performance while preserving accuracy.
Feature selection reduces the dimensions of the original feature vector,
allowing ML techniques to function effectively. The TF-IDF technique is
explored in this paper. Accuracy greater than 90% was achieved by using only a
small portion (i.e. the top TF-IDF ranked sequences, as discussed in Sect. 6)
of the total 10-gram sequences generated.

Term Frequency-Inverse Document Frequency. TF is the frequency of a term in a
document, while IDF quantifies the information provided by a term, relative to
that term's presence in the other documents. Here, terms refer to API calls and
documents refer to call sequences. For a particular type of malware, a
particular system call may appear in the sequence corpus a significant number
of times based on its relevance. While most programs tend to use some system
calls in common across all of their sequences, there could be some
characteristic system call, pertaining to malware or benign programs, that
occurs rarely. Such valuable system calls should have a higher rank than
others. The information-relative term weighting scheme TF-IDF provides a
suitable ranking mechanism for this purpose. To express TF-IDF mathematically,
the following terms and variables are used:

$$S = \{s_1, s_2, \ldots, s_n\} \qquad V = \{c_1, c_2, \ldots, c_n\} \qquad (4)$$

S is the sequence corpus and s is a particular sequence. V is the vocabulary of
system calls and c is a particular system call that appears in the corpus. The
frequency of the system call c in a sequence s is given by (5), where $f_s(c)$
is the frequency of the system call $c \in V$ in $s \in S$.

$$tf(c, s) = f_s(c) \qquad (5)$$

IDF is given in (6), where $sf(s, c)$ is the number of sequences in which the
system call c appears. It is the logarithmically scaled ratio of the number of
sequences in the corpus to the number of sequences that contain c, with 1 added
to both counts.

$$idf(c, S) = \log \frac{1 + |S|}{1 + sf(s, c)} \qquad (6)$$

$$tfidf(c, s, S) = tf(c, s) \cdot idf(c, S) \qquad (7)$$

In (7), we see that the TF-IDF value increases in proportion to the frequency
of c in a sequence and decreases with the log of the number of sequences in the
corpus that contain c.
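
Equations (5) to (7) translate directly into a few lines of code. The sketch below uses plain Python rather than a library implementation, so that each term maps onto the formulas; the function names and the three-sequence corpus are assumptions made for illustration.

```python
import math


def tf(c, s):
    """Eq. (5): number of times call c occurs in sequence s."""
    return s.count(c)


def idf(c, corpus):
    """Eq. (6): log((1 + |S|) / (1 + sf)), sf = sequences containing c."""
    sf = sum(1 for s in corpus if c in s)
    return math.log((1 + len(corpus)) / (1 + sf))


def tfidf(c, s, corpus):
    """Eq. (7): product of term frequency and inverse document frequency."""
    return tf(c, s) * idf(c, corpus)


# Hypothetical corpus of three short call sequences.
corpus = [
    ["NtOpenFile", "NtCreateKey", "NtCreateKey", "NtClose"],
    ["NtOpenFile", "NtReadFile", "NtClose"],
    ["NtOpenFile", "NtWriteFile", "NtClose"],
]
common_score = tfidf("NtClose", corpus[0], corpus)
rare_score = tfidf("NtCreateKey", corpus[0], corpus)
```

In this toy corpus a call that appears in every sequence receives a score of zero, while a call confined to one sequence is ranked higher, which is the behaviour the ranking relies on.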

1-Hot Vectorization. After ranking the sequences by TF-IDF, only the top-ranked
sequences proceed further. This reduces the load on the RNN-LSTM engine,
removes redundancies and helps keep the model unbiased. The 1-hot vector
encoding method is used to convert these call grams into a numerical
representation that can be fed into the neural network. The list (or
vocabulary) of unique system calls in the dataset forms a dictionary. The
length of the vector that represents each system call is the size of the
vocabulary. Each vector is a binary series of 1s and 0s in which only the
position corresponding to the position of the call in the dictionary holds a 1,
and the rest of the vector is 0.
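
A minimal sketch of the dictionary construction and 1-hot encoding is given below. Sorting the vocabulary alphabetically is an assumption made here to obtain a deterministic ordering; the function names and sample sequences are likewise illustrative.

```python
def build_vocabulary(sequences):
    """Map each unique call to its position in the dictionary."""
    calls = sorted({call for seq in sequences for call in seq})
    return {call: index for index, call in enumerate(calls)}


def one_hot(call, vocab):
    """Return a vector of vocabulary size with a single 1 at the call's index."""
    vec = [0] * len(vocab)
    vec[vocab[call]] = 1
    return vec


# Hypothetical call sequences.
sequences = [["NtOpenFile", "NtClose"], ["NtClose", "NtCreateKey"]]
vocab = build_vocabulary(sequences)
encoded = one_hot("NtCreateKey", vocab)
```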

4.4 Training the RNN-LSTM

LSTM is a strong choice for a large dataset of sequences. Once the 10-gram
sequences are ranked on the basis of their TF-IDF scores, we discard the least
significant sequences from the training dataset. Those sequences that are above
a TF-IDF score threshold, and so form the cream of informative call sequences,
are considered for model training. As noted in Sect. 4.3, a dictionary of API
calls and sequence labels is made. At each epoch of training, for each element
in a sequence, its numerical index from the dictionary is passed into the LSTM
as input for training the model and its parameters. The RNN-LSTM has a hidden
representation that gets updated with every new input. The output from the RNN
is a vector of probabilities. The value with the highest probability in the
vector is predicted as the output. The error is the cross-entropy between the
predicted label and the actual value. The parameters are then adjusted each
time to minimize this error, and thus the learning happens.
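
The prediction and error computation described above can be sketched as follows, assuming that the probability vector is obtained with a softmax over the network's raw output scores. The scores used here are invented for illustration and the function names are assumptions.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_index])


# Hypothetical raw scores for two output classes.
probs = softmax([2.0, 0.0])
predicted = max(range(len(probs)), key=probs.__getitem__)
loss_if_correct = cross_entropy(probs, true_index=0)
loss_if_wrong = cross_entropy(probs, true_index=1)
```

The loss is small when the model puts most of its probability on the true label and grows as that probability shrinks, which is the signal that training minimizes.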

5 Implementation
5.1 Dataset Preparation
In this work, we used API system call traces sourced from 3000 malware samples
and various benign traces, obtained from the Hacking and Countermeasure
Research Lab² and [17] respectively. While the malware traces were extracted
after executing them in a dynamic environment and made available by Kim [10],
the benign traces were obtained using a native Windows API tracer called
NtTrace³.

5.2 Feature Extraction and Selection

This stage deals with the 10-gram extraction. 10 API calls from the sequences
are bundled together and padded with the correct Malware or Benign label at the
end. Each sequence then essentially has 11 tokens: the 10 API calls followed by
the label. This is required as we implement malware detection using an RNN-LSTM
language classifier model. The TfidfVectorizer method from the scikit-learn
library is used to generate the N-grams and to calculate the TF-IDF scores of
the resulting sequences. This step generates about 250,000 N-gram sequences.
From these, equal shares of malware and benign N-gram sequences were utilized
for model analysis. Once the scores and N-grams are ready, only the top-ranked
sequences are used to train the model.
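
The labelling and balanced selection steps can be sketched as below. The sketch assumes that a score has already been computed for each N-gram (for example from TF-IDF) and shows one possible way of taking an equal share per class; the exact selection procedure is not spelled out in the text, so the function names, the tuple layout and the sample scores are all hypothetical.

```python
def label_ngram(ngram, label):
    """Append the class label as the final token of the sequence."""
    return tuple(ngram) + (label,)


def select_top_balanced(scored, k):
    """scored: list of (score, ngram, label). Keep the k // 2 best per label."""
    selected = []
    for label in ("Malware", "Benign"):
        pool = sorted((row for row in scored if row[2] == label),
                      key=lambda row: row[0], reverse=True)
        selected.extend(label_ngram(ngram, lab)
                        for _, ngram, lab in pool[: k // 2])
    return selected


# Hypothetical scored 2-grams; real sequences would hold 10 calls each.
scored = [
    (0.9, ("NtOpenKey", "NtSetValueKey"), "Malware"),
    (0.1, ("NtOpenFile", "NtClose"), "Malware"),
    (0.7, ("NtReadFile", "NtClose"), "Benign"),
    (0.4, ("NtOpenFile", "NtReadFile"), "Benign"),
]
training = select_top_balanced(scored, k=2)
```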
² http://ocslab.hksecurity.net/apimds-dataset.
³ https://github.com/rogerorr/NtTrace.

5.3 Realizing the RNN-LSTM Learning Engine

The RNN-LSTM malware detection engine in this project was implemented with the
TensorFlow framework from Google. As can be seen in Fig. 3, an LSTM is fed with
sequences of system calls from the training dataset, with the 10-grams as
inputs and the 11th element being the label. After thousands of iterations, the
model eventually learns to predict the next gram correctly. 50,000 iterations
were set as the general norm in this analysis. RMSProp [9] at a learning rate
of 0.001 was used as the optimizer. The model had 2 layers of LSTM with 512
hidden units in each. The accuracy of the model can be improved by optimizing
the hyper-parameters.
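
For readers unfamiliar with the optimizer, the following sketch shows the standard RMSProp update rule applied to a toy one-parameter problem. Only the learning rate of 0.001 comes from the text; the decay rate of 0.9, the epsilon value, the toy loss and the function name are assumptions for illustration, and the actual training relied on the TensorFlow implementation rather than code like this.

```python
import math


def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """Scale the step by a running average of squared gradients."""
    cache = decay * cache + (1.0 - decay) * grad * grad
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache


# Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3).
w, cache = 0.0, 0.0
for _ in range(10000):
    w, cache = rmsprop_step(w, 2.0 * (w - 3.0), cache)
```

Dividing by the running root-mean-square of the gradient keeps the effective step size close to the learning rate regardless of the gradient's scale, which is one reason the method is popular for recurrent networks.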

Fig. 3. RNN-LSTM

6 Evaluation and Results

Preprocessed N-gram feature sequences that are unknown to the trained model are
selected for testing. The same test dataset is used for all the analyses, which
keeps the evaluations comparable. The model was assessed using the top TF-IDF
ranked 10,000, 7,500, 5,000 and 2,500 sequences. The evaluation metrics used
can be seen in Table 1.

Table 1. Evaluation metrics

Metric                        Formula
Accuracy                      (TP + TN)/(TP + FP + TN + FN)
True Positive Rate (Recall)   TP/(TP + FN)
False Positive Rate           FP/(FP + TN)
Precision                     TP/(TP + FP)
F1 Score                      2 * (Recall * Precision)/(Recall + Precision)
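
The formulas in Table 1 can be computed from the four confusion-matrix counts as sketched below. The counts are hypothetical and do not correspond to any result reported in this paper; the function and key names are illustrative.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the Table 1 metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "precision": precision,
        "f1": 2 * (recall * precision) / (recall + precision),
    }


# Hypothetical counts, not results from the paper.
metrics = evaluation_metrics(tp=90, tn=80, fp=20, fn=10)
```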

As can be seen in Fig. 5, the model became adept at detecting malware, with a
top average recall of 97%. Like the recall, the overall average accuracy of the
model increased with the number of top TF-IDF sequences, reaching as high as
92% for the top 10K sequences (Fig. 4). With accuracy and recall improving as
the number of top TF-IDF training sequences increases, it is noted that the
model is not only able to detect both malware and benign programs well, but is
also gaining an upper hand in detecting malware when it is present in the test
sample. With parameter optimization and more training, this could be targeted
to reach above 97%.

Fig. 4. Average accuracy in testing phase (x-axis: top N TF-IDF sequences used
for training the model, 0 to 10,000; y-axis: average accuracy in %, 75 to 100)

While it is always interesting to look at the error rate of a model in any
evaluation, it is less preferred in this paper, as the model needs to be
evaluated on how well it detects both malware and benign programs and not only
one of them. Figure 6 presents the F1 Scores and, most importantly, their
complement (1-F1), which represents the error loss of the LSTM model.
The F1 Score, which is the harmonic mean of Recall and Precision, provides us
with better visibility. The average F1 Score follows the same trend as accuracy
and recall and reaches a high of 90% for the top 10K sequences, while the error
complement (1-F1) falls as the number of top TF-IDF sequences selected grows
from 2,500 to 10,000. That is to say, the model learns the API system call
behavioural patterns of both benign programs and malware better as more
top-ranked sequences are used.
It should be noted that the number of iterations for all the analyses was
50,000. Any model would expect a higher number of training sequences to come
with a higher number of training iterations. Thus, the comparison has not been
entirely fair to the designed RNN-LSTM model, owing to the varying
sequences-per-iteration ratio. Yet the model compensates and provides higher
accuracy and a lower error rate for a higher number of sequences. This suggests
that increasing the iterations further would not only increase true detections,
but also decrease false alarms and misses.

Fig. 5. Average recall (TPR) in testing phase (x-axis: top N TF-IDF sequences
used for training the model, 0 to 10,000; y-axis: average TPR (recall) in %,
75 to 100)

Fig. 6. Average F1 score & 1-F1 score in testing phase

Top N TF-IDF sequences   F1 Score (%)   1-F1 Score (%)
10000                    90.19          9.81
7500                     89.95          10.05
5000                     82.20          17.80
2500                     76.93          23.07

7 Future Work
The current model is evaluated using features recommended by only one
technique, TF-IDF. The results obtained inspire confidence that optimizing the
parameters can raise the accuracy of the proposed model further. The approach
must also be evaluated on large datasets by leveraging high-end systems. The
extended work related to this paper will study the implications and effects of
applying other popular feature selection techniques such as Chi-square,
Fisher's score and combinations of these (hybrid), with different N-gram sizes,
all evaluated over the extensive list of evaluation metrics.

8 Conclusion
In this work, we presented a behaviour-based classification of unknown malware
that leverages the RNN-LSTM technique on API call sequences. We used N-grams
for feature extraction and TF-IDF for feature selection of the sequences. The
novel contribution of this work, incorporating RNN-LSTM for malware detection,
was carried out for effective analysis of unknown malware. The system evaluated
leverages Google's popular and powerful TensorFlow framework. The training
dataset was ensured to contain zero redundancy and an equal share of top
malware and benign sequences. It should be noted that the high-accuracy system
realized in this paper is a learning model that seeks to recognize not only
known patterns but also evolving, unknown patterns of malware. This is what
gives the proposed model a strong footing over existing conventional
statistics-based machine learning techniques. The promising results, with a
highest average accuracy of 92% and an average recall of 97%, encourage further
exploration of malware detection based on a combination of an RNN-LSTM model
and API call sequences.

References
1. Chen, Q., Bridges, R.A.: Automated behavioral analysis of malware: a case
study of WannaCry ransomware (2017). arXiv:1709.08753v1 [cs.CR], Cryptography
and Security
2. Ajay Kumara, M.A., Jaidhar, C.D.: Automated multi-level malware detection
system based on reconstructed semantic view of executables using machine
learning techniques at VMM. Future Gener. Comput. Syst. 79(Part 1), 431–446
(2018)
3. Anju, S.S., Harmya, P., Jagadeesh, N., Darsana, R.: Malware detection using
assembly code and control flow graph optimization. In: Proceedings of the 1st
Amrita ACM-W Celebration of Women in Computing in India, A2CWiC 2010,
Coimbatore (2010)
4. Kang, B., Han, K.S., Kang, B., Im, E.G.: Malware categorization using dynamic
mnemonic frequency analysis with redundancy filtering. Digit. Investig. 11,
323–335 (2014)
5. Salehi, Z., Sami, A., Ghiasi, M.: Using feature generation from API calls for
malware detection. Comput. Fraud Secur. 2014, 9–18 (2014)
6. Galal, H.S., Mahdy, Y.B., Atiea, M.A.: Behavior-based features model for
malware detection. J. Comput. Virol. Hacking Tech. 12(2), 59–67 (2016)
7. Kolosnjaji, B., Zarras, A., Eraisha, G., Webster, G., Eckert, C.: Empowering
convolutional networks for malware classification and analysis. In:
International Joint Conference on Neural Networks (IJCNN) (2017)
8. Tobiyama, S., Yamaguchi, Y., Shimada, H., Ikuse, T., Yagi, T.: Malware
detection with deep neural network using process behavior. In: 40th Annual
Computer Software and Applications Conference (COMPSAC) (2016)
9. Athira, V., Geetha, P., Vinayakumar, R., Soman, K.P.: DeepAirNet: applying
recurrent networks for air quality prediction. In: International Conference on
Computational Intelligence and Data Science (ICCIDS) (2018)
10. Ki, Y., Kim, E., Kim, H.K.: A novel approach to detect malware based on API
call sequence analysis. Int. J. Distrib. Sens. Netw. 11, 659101 (2015)
11. Tran, T.K., Sato, H.: NLP-based approaches for malware classification from
API sequences. In: 21st Asia Pacific Symposium on Intelligent and Evolutionary
Systems (IES) (2017)
12. Pascanu, R., Stokes, J.W., Sanossian, H., Marinescu, M., Thomas, A.: Malware
classification with recurrent networks. In: IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (2015)
13. Rhodes, M., Burnap, P., Jones, K.: Early-stage malware prediction using
recurrent neural networks. Comput. Secur. 77, 578–594 (2018). arXiv:1708.03513
[cs.CR]
14. Wang, X., Yiu, S.M.: A multi-task learning model for malware classification
with useful file access pattern from API call sequence (2016).
arXiv:1610.05945 [cs.SD], Cryptography and Security
15. Xiao, X., Zhang, S., Mercaldo, F., Hu, G., Sangaiah, A.K.: Android malware
detection based on system call sequences and LSTM. Multimed. Tools Appl. (2017)
16. Sugunan, K., Gireesh Kumar, T., Dhanya, K.A.: Static and dynamic analysis for
android malware detection. Advances in Intelligent Systems and Computing,
vol. 645, pp. 147–155. Springer, Cham (2018)
17. Kim, C.W.: GitHub repository (2018).
https://github.com/codeandproduce/NtMalDetect

16 pages
Graph-Oriented Modelling of Process Event Activity For The Detection of Malware
No ratings yet
Graph-Oriented Modelling of Process Event Activity For The Detection of Malware
10 pages
Lightweight and Robust Malware Detection Using Dictionaries of API Calls
No ratings yet
Lightweight and Robust Malware Detection Using Dictionaries of API Calls
12 pages
First Review B19
No ratings yet
First Review B19
24 pages
Effective Malware Detection Based On Behaviour and Data Features
No ratings yet
Effective Malware Detection Based On Behaviour and Data Features
16 pages
Final Research
No ratings yet
Final Research
12 pages
A Framework For Detection of Malicious Code by Exploiting Machine Learning Techniques On Portable Executables
No ratings yet
A Framework For Detection of Malicious Code by Exploiting Machine Learning Techniques On Portable Executables
4 pages
Computers 11 00160 v2
No ratings yet
Computers 11 00160 v2
15 pages
16.experimental Comparison of Features and Classifiers For Android Malware Detection
No ratings yet
16.experimental Comparison of Features and Classifiers For Android Malware Detection
12 pages
Malware Detection Using ANN
No ratings yet
Malware Detection Using ANN
10 pages
Document Malware
No ratings yet
Document Malware
9 pages
Malware Detection Using LSTM and CNN
No ratings yet
Malware Detection Using LSTM and CNN
11 pages
RNN For Malware Detection
No ratings yet
RNN For Malware Detection
18 pages
Radon Transform Based Malware Classification in Cyb 2024 Results in Control
No ratings yet
Radon Transform Based Malware Classification in Cyb 2024 Results in Control
14 pages
2021 - Makhor - Malware Detection Using Fuzzy Similarity of System Call Dependency Sequence
No ratings yet
2021 - Makhor - Malware Detection Using Fuzzy Similarity of System Call Dependency Sequence
10 pages
Malware - Detection - Research - Paper - Updated Soheb6
No ratings yet
Malware - Detection - Research - Paper - Updated Soheb6
8 pages
Synopsis 1
No ratings yet
Synopsis 1
7 pages
5474-Article Text-8699-1-10-20200511
No ratings yet
5474-Article Text-8699-1-10-20200511
8 pages
Dynamic Android Malware Category Classification
No ratings yet
Dynamic Android Malware Category Classification
8 pages
CSS Background
No ratings yet
CSS Background
11 pages
07 Art NLP-based Entity Behavior Analytics For Malware Detection
No ratings yet
07 Art NLP-based Entity Behavior Analytics For Malware Detection
5 pages
Analysis of Cyber Security Threats Using
No ratings yet
Analysis of Cyber Security Threats Using
5 pages
FuzzyRNN NIT SUB 2columns PDF
No ratings yet
FuzzyRNN NIT SUB 2columns PDF
8 pages
Deep Learning For Classification of Malware System Call Sequences
No ratings yet
Deep Learning For Classification of Malware System Call Sequences
12 pages
Deep Learning LSTM Based Ransomware Detection. 2019 Recent Developments in Control, Automation & Power Engineering (RDCAPE) .
No ratings yet
Deep Learning LSTM Based Ransomware Detection. 2019 Recent Developments in Control, Automation & Power Engineering (RDCAPE) .
5 pages
Zhu 2015
No ratings yet
Zhu 2015
4 pages
Malcode Detection
No ratings yet
Malcode Detection
5 pages
The Curious Case of Machine Learning in Malware Detection: Sherif Saad, William Briguglio and Haytham Elmiligi
No ratings yet
The Curious Case of Machine Learning in Malware Detection: Sherif Saad, William Briguglio and Haytham Elmiligi
9 pages
The Curious Case of Machine Learning in Malware Detection: Sherif Saad, William Briguglio and Haytham Elmiligi
No ratings yet
The Curious Case of Machine Learning in Malware Detection: Sherif Saad, William Briguglio and Haytham Elmiligi
8 pages
Classification of Android Apps and Malware Using Deep Neural Networks
No ratings yet
Classification of Android Apps and Malware Using Deep Neural Networks
8 pages
AM 8000 Manu Prog ENG
No ratings yet
AM 8000 Manu Prog ENG
60 pages
Select From Garment Where Size 'XL'
No ratings yet
Select From Garment Where Size 'XL'
10 pages
CDLU BCA Syllabi New1324
No ratings yet
CDLU BCA Syllabi New1324
11 pages
Getchell Et Al 2022 Artificial Intelligence in Business Communication The Changing Landscape of Research and Teaching
No ratings yet
Getchell Et Al 2022 Artificial Intelligence in Business Communication The Changing Landscape of Research and Teaching
27 pages
M.Tech Cyber Security Syllabi Final
No ratings yet
M.Tech Cyber Security Syllabi Final
46 pages
DGUS Development Guide V3.4.0
No ratings yet
DGUS Development Guide V3.4.0
90 pages
Domain and Range Homework
100% (1)
Domain and Range Homework
5 pages
TBC 401 Data Analytics Using Python
No ratings yet
TBC 401 Data Analytics Using Python
2 pages
Signal Fire OTDR ZS1000-A)
No ratings yet
Signal Fire OTDR ZS1000-A)
9 pages
Csi Safe 22.1.0.2728
No ratings yet
Csi Safe 22.1.0.2728
3 pages
Bentley DGNDB Imodel Importer 2.0
No ratings yet
Bentley DGNDB Imodel Importer 2.0
6 pages
7293-Article Text-10523-1-10-20200601
No ratings yet
7293-Article Text-10523-1-10-20200601
12 pages
CH 13: Building Information Systems: LO 13.1 - How Does Building New Systems Produce Organizational Change?
No ratings yet
CH 13: Building Information Systems: LO 13.1 - How Does Building New Systems Produce Organizational Change?
2 pages
AWT and Swing Components
No ratings yet
AWT and Swing Components
1 page
DB - Sherlog CRX - 032018 - Eng - 2
No ratings yet
DB - Sherlog CRX - 032018 - Eng - 2
4 pages
Silo - Tips - Arabic Mathematical Symbol Insertion Application System Using Arabic Pack For Math Type Software
No ratings yet
Silo - Tips - Arabic Mathematical Symbol Insertion Application System Using Arabic Pack For Math Type Software
8 pages
PM Debug Info
No ratings yet
PM Debug Info
15 pages
Lab Experiment
No ratings yet
Lab Experiment
14 pages
Tuan Nguyen: Senior IT Director
No ratings yet
Tuan Nguyen: Senior IT Director
3 pages
FAF233 Lab3 PostoroncaDumitru10
No ratings yet
FAF233 Lab3 PostoroncaDumitru10
5 pages
Interrupts Programming
No ratings yet
Interrupts Programming
7 pages
Basic IT 2
No ratings yet
Basic IT 2
4 pages
Microsoft Fingerprint Reader Solution For Windows 7
No ratings yet
Microsoft Fingerprint Reader Solution For Windows 7
12 pages
RMC No. 25-2024 - Annex B
No ratings yet
RMC No. 25-2024 - Annex B
1 page
Medicine Get Price by Disease
No ratings yet
Medicine Get Price by Disease
2 pages
Penetration Testing Fundamentals-2: Penetration Testing Study Guide To Breaking Into Systems
From Everand
Penetration Testing Fundamentals-2: Penetration Testing Study Guide To Breaking Into Systems
Devi Prasad
No ratings yet
Penetration Testing Fundamentals -1: Penetration Testing Study Guide To Breaking Into Systems
From Everand
Penetration Testing Fundamentals -1: Penetration Testing Study Guide To Breaking Into Systems
Devi Prasad
No ratings yet

Keywords: API system call · Malware detection · TF-IDF ·

In this section, we attempt to provide background notions of topics that form

3.1 API Call Sequence

3.2 Natural Language Processing (NLP)

3.3 Recurrent Neural Network

Fig. 1. Unfolded RNN

3.4 Long Short Term Memory

Fig. 2. System design

4.1 Data Preprocessing

4.2 Feature Extraction

4.3 Feature Selection and Vectorization

Term Frequency-Inverse Document Frequency. TF is the frequency of a

$$S = \{s_1, s_2, \ldots, s_n\}, \qquad V = \{c_1, c_2, \ldots, c_n\} \tag{4}$$

S is the sequence corpus and s is a particular sequence. V is the vocabulary

$$\mathrm{tf}(c, s) = f_s(c) \tag{5}$$
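As a minimal sketch of Eq. (5), the raw term frequency of an API call c in a sequence s can be computed by counting occurrences. The API call names and the two-sequence corpus below are hypothetical and chosen only for illustration; they are not taken from the paper's dataset. The inverse document frequency part is left out because its definition is not shown in this excerpt.

```python
from collections import Counter


def term_frequency(sequence):
    """Return {call: count}, i.e. tf(c, s) = f_s(c) for every call c in s."""
    return dict(Counter(sequence))


# Hypothetical sequence corpus S: each entry is one API call sequence s.
corpus = [
    ["NtOpenFile", "NtReadFile", "NtClose", "NtReadFile"],
    ["NtCreateFile", "NtWriteFile", "NtClose"],
]

# Vocabulary V: the distinct calls seen in the corpus, in a fixed order.
vocabulary = sorted({call for seq in corpus for call in seq})

# One raw-count tf vector per sequence, aligned with the vocabulary order.
tf_vectors = [[Counter(seq)[call] for call in vocabulary] for seq in corpus]
```

Each row of `tf_vectors` has one entry per vocabulary item, so sequences of different lengths map to vectors of the same width.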

4.4 Training the RNN-LSTM

5.2 Feature Extraction and Selection

5.3 Realizing the RNN-LSTM Learning Engine

6 Evaluation and Results

Table 1. Evaluation metrics

Accuracy in Testing Phase

Fig. 4. Average accuracy in testing phase

True Positive Rate (Recall)

Fig. 5. Average recall (TPR) in testing phase

F1 SCORE & 1-F1 SCORE

Fig. 6. Average F1 score & 1-F1 score in testing phase
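The metrics named in the figure captions above (accuracy, recall, and F1 score) can be sketched from confusion-matrix counts. This assumes the standard textbook definitions, since Table 1 itself is not reproduced in this excerpt; the function name and the example counts are hypothetical.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification metrics from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "one_minus_f1": 1.0 - f1,  # complement of F1, as plotted in Fig. 6
    }


# Hypothetical counts: 40 malware detected, 10 missed, 45 benign kept, 5 flagged.
metrics = evaluation_metrics(tp=40, tn=45, fp=5, fn=10)
```

The zero-denominator guards return 0.0 rather than raising, which is one possible convention for degenerate test sets.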

approach followed must be evaluated by leveraging high-end systems on large
