JStrack-Enriching Malicious JavaScript Detection Based on AST Graph Analysis and Attention Mechanism
All content following this page was uploaded by Muhammad Fakhrur Rozi on 06 January 2022.
1 Introduction
JavaScript payload injection into legitimate or fake websites has been one of the
most widespread attacks on the web. The malicious script can exploit vulnerabilities
in web applications to perform a drive-by download attack [2] or cross-site
scripting (XSS) [19]. When the attack is successful, attackers distribute malware
to clients, which can cause damage such as sensitive data leakage, wire transfers,
or enlistment in distributed denial-of-service (DDoS) attacks [3]. For instance,
one of the most famous examples of XSS vulnerability is the Myspace Samy
© Springer Nature Switzerland AG 2021
T. Mantoro et al. (Eds.): ICONIP 2021, LNCS 13109, pp. 669–680, 2021.
https://doi.org/10.1007/978-3-030-92270-2_57
670 M. F. Rozi et al.
3 Proposed Approach
To overcome the challenges posed by malicious JavaScript, we propose a detection
system that predicts whether a given source code is malicious or benign. Our
approach uses the AST as a feature of JavaScript that captures the style and
semantic meaning of the source code. By analyzing this feature, we can capture
malicious intent from the typical structure and attributes of the AST graph. We
use a GCN to learn from the graph so that the model generalizes across malicious
and benign samples.
3.1 Overview
Figure 1 shows the entire detection system framework. It begins with a
JavaScript file whose malicious intent we want to predict. We first parse the
file to obtain its AST representation, which describes how the programmer wrote
the code. The output is a JSON file in which each record is a syntactic unit
object following the ESTree specification [4]. From this JSON file, we construct
a graph object as a simplification of its data structure. The graph generator
turns each syntactic unit into a node labeled with its type, and each hierarchical
connection between nodes becomes an edge of the AST graph. Next, we create two
matrices, the feature matrix X and the adjacency matrix A, representing the
feature value of each node and
Fig. 1. Overview of the proposed approach. (a) The original architecture consists of
three layers of convolution and pooling. (b) The combination of GCN and an
attention mechanism to locate suspicious code in JavaScript. To retain the
information of all nodes, we place the pooling layer after the attention layer and
before the fully-connected layer.
the adjacency matrix A. In some cases, edges also have real-valued features in
addition to discrete edge types.
We can use a graph-based approach to represent the AST feature as a
tree-structured graph. The AST is a top-down parsing structure in which each
syntactic unit has at least one hierarchical connection and the root is always of
the ‘program’ type. Based on that, we treat each syntactic unit as a node and each
hierarchical link as an edge. The graph representation puts the AST feature in a
fixed form, which helps the feature extraction process. It also allows us to
capture the big picture of the source code, which reveals both its complexity and
the programmer’s obfuscation style.
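The node-and-edge construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors’ implementation: the helper names `ast_to_graph` and `build_matrices`, the breadth-first truncation to `max_nodes`, the one-hot encoding of node types in X, and the symmetric adjacency matrix A are all choices made here for illustration. The input is assumed to be an ESTree-style AST already loaded from JSON, in which every syntactic unit is an object with a `type` field and the root has type `Program`.

```python
from collections import deque


def ast_to_graph(ast, max_nodes=50):
    """Walk an ESTree-style AST breadth-first and return (node_types, edges).

    Every dict with a "type" key is treated as one syntactic unit (one node).
    Each parent-child link becomes one edge (parent_index, child_index).
    The max_nodes cap is an assumed truncation strategy.
    """
    node_types, edges = [], []
    queue = deque([(ast, None)])  # (AST object, index of its parent node)
    while queue and len(node_types) < max_nodes:
        node, parent = queue.popleft()
        index = len(node_types)
        node_types.append(node["type"])
        if parent is not None:
            edges.append((parent, index))
        for value in node.values():
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, dict) and "type" in child:
                    queue.append((child, index))
    return node_types, edges


def build_matrices(node_types, edges, vocabulary):
    """Build a one-hot feature matrix X and a symmetric adjacency matrix A."""
    n = len(node_types)
    X = [[1.0 if t == v else 0.0 for v in vocabulary] for t in node_types]
    A = [[0.0] * n for _ in range(n)]
    for parent, child in edges:
        A[parent][child] = 1.0
        A[child][parent] = 1.0  # assumption: edges are treated as undirected
    return X, A


# Hand-written AST for `var x = 1;` (a real parser would emit more fields).
sample_ast = {
    "type": "Program",
    "body": [{
        "type": "VariableDeclaration",
        "kind": "var",
        "declarations": [{
            "type": "VariableDeclarator",
            "id": {"type": "Identifier", "name": "x"},
            "init": {"type": "Literal", "value": 1},
        }],
    }],
}

node_types, edges = ast_to_graph(sample_ast)
vocabulary = sorted(set(node_types))
X, A = build_matrices(node_types, edges, vocabulary)
print(node_types)
print(edges)
```

In practice the vocabulary would be fixed in advance from the set of ESTree node types rather than derived from a single file, so that X has the same width for every graph.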
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$
where Q, K, and V are the query, key, and value matrices, respectively, and d_k is
the dimension of the keys.
In this work, the attention mechanism supports the learning process of the GCN
by assigning attention weights that let the model concentrate selectively on
specific parts of the output of the graph convolutional layer. We use a
self-attention layer because it handles long-range dependencies and has lower
complexity than other layer types (e.g., convolutional or recurrent).
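Equation (2) can be traced step by step with a small dependency-free Python sketch. It assumes the simplest form of self-attention, in which the node embeddings H serve directly as Q, K, and V without learned projection matrices; the layer used in the paper may well include such projections. The function names and the toy embeddings are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Toy node embeddings: three nodes, two features each (made-up values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(H, H, H)  # self-attention: Q = K = V = H
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, so every output row is a weighted average of the rows of V; the columns with the largest weights mark the nodes that a given node attends to most.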
4 Experiments
In this section, we present the experiments that evaluate our proposed approach
for detecting malicious JavaScript samples. We evaluated our framework’s
performance while varying the maximum number of nodes in each graph. Then, we
compared our results with related works that address a similar task. Finally,
we discuss the results and analyze the limitations of our approach.
4.1 Setup
Dataset. We collected malicious and benign JavaScript datasets. The malicious
samples come from two different sources because real-world datasets are difficult
to obtain: we mixed the datasets from Rozi et al. [14] and Ndichu et al. [12],
whose files carry time stamps ranging from 2015 to 2017. We also confirmed with
the VirusTotal scanner [18] that all samples in those datasets are malicious
scripts. For the benign samples, we collected JavaScript code by scraping the
top domain list on the Majestic website [10] and combined it with the benign
dataset from SRILAB [13]. We consider all JavaScript code inside popular websites
to be safe code without any attacking intent.
We split our dataset into two parts: training and testing. We used the training
dataset to train our graph learning model and the testing dataset to evaluate it.
We conducted 10-fold cross-validation to estimate how well our model’s average
performance generalizes to an independent dataset. The proportion between
training and testing is 80% and 20%, respectively. Table 1 summarizes the number
of JavaScript files that we use in our experiments.
Hyper-parameters and Setup. We set the hyper-parameters that control the learning
process as follows. We used the Adam optimizer with a learning rate of 0.01 and
a batch size of 32. In addition, the feature size of the convolutional layers in
the GCN is 32, with the rectified linear unit (ReLU) as the activation function.
For the pooling layer, we used a ratio of 50% to downsample the node matrix.
Table 1. Number of JavaScript files in our dataset used for the training and testing processes.

Label        Training   Testing     Total
Benign         97,361    24,341   121,702
Malicious      31,560     7,890    39,450
Total         128,921    32,231   161,152
Unlike in typical deep learning models, adding more layers to a GNN does not
necessarily improve performance. A GNN significantly loses its ability to learn
when it has too many layers, a problem called over-smoothing [23]. The main idea
of over-smoothing is that all node representations become nearly identical and
uninformative after too many rounds of message passing caused by too many layers.
Zhou et al. [22] recommended using between 2 and 4 layers to achieve an optimal
solution. Therefore, we used the middle of this range, three layers, in our
experiments.
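The effect can be reproduced on a toy graph. The sketch below is an illustration, not the model used in the paper: it keeps only the neighbourhood-averaging step of message passing (a row-normalized adjacency matrix with self-loops) and drops the learned weights and ReLU, and the four-node path graph and the function names are made up for the example. After 3 rounds the nodes are still distinguishable, while after 30 rounds their values have almost collapsed to a single number.

```python
def normalize_with_self_loops(A):
    """Return D^-1 (A + I): each node averages itself and its neighbours."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[value / sum(row) for value in row] for row in A_hat]


def propagate(P, x, rounds):
    """Apply `rounds` message-passing steps to one scalar feature per node."""
    for _ in range(rounds):
        x = [sum(p * v for p, v in zip(row, x)) for row in P]
    return x


def spread(x):
    """Gap between the largest and smallest node value."""
    return max(x) - min(x)


# Path graph 0 - 1 - 2 - 3 (illustrative).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
P = normalize_with_self_loops(A)
x0 = [1.0, 0.0, 0.0, 0.0]  # only node 0 carries a signal at the start

shallow = propagate(P, x0, rounds=3)   # comparable to a three-layer model
deep = propagate(P, x0, rounds=30)     # far too many message-passing rounds
print(spread(shallow), spread(deep))
```

Because every round replaces a node value with an average over its neighbourhood, the gap between nodes can only shrink, which is why a small number of layers is preferable.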
Moreover, we applied a data loader in disjoint mode to create mini-batches of
data for graph learning. It represents a batch of graphs by their disjoint
union, which gives us one big graph [11]. Figure 2 illustrates how the disjoint
loader works.
Fig. 2. The disjoint loader is a method for loading a dataset in the graph learning
process that represents a batch of graphs via their disjoint union. It uses
zero-based indices to keep track of the different graphs.
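A disjoint union of this kind can be sketched without any graph library. The functions below are a minimal illustration and not the loader API cited in [11]: `disjoint_batch` and `sum_pool` are hypothetical names, graphs are given as (feature rows, local edge list) pairs, and sum pooling is just one possible readout. Node indices of each graph are shifted by the number of nodes already placed in the batch, and a zero-based `graph_index` records which graph every node belongs to.

```python
def disjoint_batch(graphs):
    """Merge several graphs into one big graph with no edges between them."""
    features, edges, graph_index = [], [], []
    offset = 0  # number of nodes already placed in the batch
    for graph_id, (X, local_edges) in enumerate(graphs):
        features.extend(X)
        edges.extend((src + offset, dst + offset) for src, dst in local_edges)
        graph_index.extend([graph_id] * len(X))
        offset += len(X)
    return features, edges, graph_index


def sum_pool(features, graph_index):
    """Sum node features per graph, using graph_index to separate the graphs."""
    width = len(features[0])
    pooled = [[0.0] * width for _ in range(max(graph_index) + 1)]
    for row, graph_id in zip(features, graph_index):
        for k, value in enumerate(row):
            pooled[graph_id][k] += value
    return pooled


# Two toy graphs with made-up features: a 3-node path and a 2-node path.
graph_a = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [(0, 1), (1, 2)])
graph_b = ([[2.0, 2.0], [3.0, 0.0]], [(0, 1)])

features, edges, graph_index = disjoint_batch([graph_a, graph_b])
print(edges)        # [(0, 1), (1, 2), (3, 4)]
print(graph_index)  # [0, 0, 0, 1, 1]
print(sum_pool(features, graph_index))
```

Since no edge crosses from one graph to another, message passing on the merged graph never mixes information between samples, and the index vector is enough to recover one prediction per graph.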
with the number of nodes in the AST graph that we can capture. This result
is in accordance with our hypothesis that AST nodes give an abstraction of
the source code in which all nodes carry essential information. However, using
2000 nodes still gives high performance even though we did not include all the
information. This is because the AST has a hierarchical structure in which each
node summarizes its successors.
Table 3 compares previous works with our proposed method. GCN achieves an F1
score of around 98% on our dataset with a maximum of 50 nodes of the AST graph.
Meanwhile, adding attention layers before the fully connected layers improves
the F1 score to around 99%. Our approaches outperform the previous work that
uses the FastText model based on frequency analysis of syntactic AST units. Even
though the difference is relatively small, our proposed method can also indicate
which parts of the source code receive more attention when detecting malicious
intent. This information will be valuable for further analysis of malicious
code. Figure 4 shows one of the malicious samples in our dataset together with
the attention score of each node in its graph. Moreover, the bytecode sequence
feature cannot be applied to every JavaScript sample because we have to declare
all possible DOM objects.
Moreover, we found in our experiments that malicious JavaScript has its own
obfuscation techniques to hide the actual source code. Figure 3(a) shows the graph
visualization of malicious JavaScript code. The AST graph of malicious JavaScript
contains many repetitions of a subgraph, which we rarely find in benign samples.
Some similar styles appear many times within the same time range, indicating that
attackers consistently use their own obfuscation functions, which normal
programmers would not use. On the other hand, most benign samples, as in
Fig. 3(b), have an arbitrary AST structure and inconsistent subgraph patterns.
This result is in line with our hypothesis that benign JavaScript mostly does not
use obfuscation techniques or, if it has obfuscated parts, uses more complicated
methods to protect against reverse engineering.
Table 3. Comparison between previous works and our proposed method.

Model                              Feature             F1
DPCNN [14]                         Bytecode sequence   0.9684
DPCNN+LSTM [14]                    Bytecode sequence   0.9657
DPCNN+BiLSTM [14]                  Bytecode sequence   0.9683
LSTM [12]                          AST                 0.9234
FastText [12]                      AST                 0.9873
GCN (3 layers; max 50 nodes)       AST                 0.9875
GCN (w/ attention; max 50 nodes)   AST                 0.9935
Fig. 3. Samples of AST graphs constructed from a benign (a) and a malicious
(b) JavaScript file.
Fig. 4. (a) A malicious sample in which the highlighted parts are vital for
executing the code. (b) The AST representation of the malicious code, in which the
color of each node represents its attention score. Some nodes have high scores
that correlate with the vital parts of the malicious code.
However, our proposed method has two limitations that we are considering. First,
we lose detailed information about the malicious code because we use the AST
feature to represent JavaScript. In the AST graph, we merely use the syntactic
units and omit the component details of each unit, which may contain essential
information for our detection system. Second, deep/machine learning does not
always account for uncertainty in the prediction task. It relies on statistical
assumptions about the distribution of the dataset used to train the model.
Consequently, adversarial attacks can exploit the machine learning model to
disrupt the analysis process and cause false detections.
References
1. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric
framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7,
2399–2434 (2006)
2. Cova, M., Kruegel, C., Vigna, G.: Detection and analysis of drive-by-download
attacks and malicious JavaScript code. In: Proceedings of the 19th International
Conference on World Wide Web, WWW 2010, pp. 281–290. Association for
Computing Machinery, New York (2010). https://doi.org/10.1145/1772690.1772720
3. Douligeris, C., Mitrokotsa, A.: DDoS attacks and defense mechanisms:
classification and state-of-the-art. Comput. Netw. 44(5), 643–666 (2004)
4. The ESTree spec. https://github.com/estree/estree. Accessed 20 Jan 2021
5. Fang, Y., Huang, C., Liu, L., Xue, M.: Research on malicious JavaScript detection
technology based on LSTM. IEEE Access 6, 59118–59125 (2018)
6. Fass, A., Krawczyk, R.P., Backes, M., Stock, B.: JaSt: fully syntactic detection of
malicious (obfuscated) JavaScript. In: Giuffrida, C., Bardin, S., Blanc, G. (eds.)
DIMVA 2018. LNCS, vol. 10885, pp. 303–325. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-93411-2_14
7. Gupta, S., Gupta, B.: Enhanced XSS defensive framework for web applications
deployed in the virtual machines of cloud computing environment. Procedia
Technol. 24, 1595–1602 (2016). https://doi.org/10.1016/j.protcy.2016.05.152.
https://www.sciencedirect.com/science/article/pii/S2212017316302419.
International Conference on Emerging Trends in Engineering, Science and Technology
(ICETEST - 2015)