
Information and Software Technology 174 (2024) 107517

Contents lists available at ScienceDirect

Information and Software Technology


journal homepage: www.elsevier.com/locate/infsof

A vulnerability detection framework by focusing on critical execution paths


Jianxin Cheng a, Yizhou Chen a, Yongzhi Cao a,b, Hanpin Wang c,a,∗

a Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China
b Zhongguancun Laboratory, Beijing, China
c School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China

ARTICLE INFO

Keywords:
Vulnerability detection
Software security
Code representation
Control flow graph
Deep learning

ABSTRACT

Context: Vulnerability detection is critical to ensure software security, and detecting vulnerabilities in smart contract code is currently gaining massive attention. Existing deep learning-based vulnerability detection methods represent the code as a code structure graph and eliminate vulnerability-irrelevant nodes. Then, they learn vulnerability-related code features from the simplified graph for vulnerability detection. However, this simplified graph struggles to represent relatively complete structural information of code, which may affect the performance of existing vulnerability detection methods.
Objective: In this paper, we present a novel Vulnerability Detection framework based on Critical Execution Paths (VDCEP), which aims to improve smart contract vulnerability detection.
Method: Firstly, given a code structure graph, we deconstruct it into multiple execution paths that reflect rich structural information of code. To reduce irrelevant code information, a path selection strategy is employed to identify critical execution paths that may contain vulnerable code information. Secondly, a feature extraction module is adopted to learn feature representations of critical paths. Finally, we feed all path feature representations into a classifier for vulnerability detection. Also, the feature weights of paths are provided to measure their importance in vulnerability detection.
Results: We evaluate VDCEP on a large dataset with four types of smart contract vulnerabilities. Results show that VDCEP outperforms 14 representative vulnerability detection methods by 5.34%–60.88% in F1-score. The ablation studies analyze the effects of our path selection strategy and feature extraction module on VDCEP. Moreover, VDCEP still outperforms ChatGPT by 34.46% in F1-score.
Conclusion: Compared to existing vulnerability detection methods, VDCEP is more effective in detecting smart contract vulnerabilities by utilizing critical execution paths. Besides, we can provide interpretable details about vulnerability detection by analyzing the path feature weights.

∗ Corresponding author at: Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China.
E-mail addresses: [email protected] (J. Cheng), [email protected] (H. Wang).

https://doi.org/10.1016/j.infsof.2024.107517
Received 2 March 2024; Received in revised form 23 May 2024; Accepted 13 June 2024
Available online 15 June 2024
0950-5849/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

1. Introduction

Software vulnerabilities are weaknesses in the code that can be exploited by hackers to trigger a series of security incidents, which pose a serious threat to software security [1–3]. Smart contracts are program codes, generally written in Solidity, that run automatically on the blockchain system [4]. Like most program codes (e.g., C and Java), smart contract codes are plagued by multiple vulnerabilities and cause huge financial losses. For example, the hackers stole over 60 million dollars by exploiting the reentrancy vulnerability of the DAO contract [5]. In addition, smart contracts are difficult to modify once they are deployed on the blockchain system, which makes them a more desirable target for hackers than other program codes [6,7]. These properties of smart contracts raise huge security concerns about the blockchain system, which can seriously hamper its development. Thus, it is imperative to design vulnerability detection techniques in smart contracts to improve the security of the blockchain system [8–10].

Existing efforts and limitations. As countermeasures, deep learning-based methods are proposed to identify smart contract vulnerabilities. These methods could automatically learn vulnerability-related code features through deep learning techniques. For example, some efforts [11,12] treat the smart contract code as a sequence structure, and apply techniques of natural language processing, e.g., Long Short-Term Memory (LSTM) [13] and Transformer [14], to learn vulnerability features. Regrettably, these sequence-based methods ignore that the source code is more logical and structural than the natural language [5,15]. This requires vulnerability detection methods to represent the source code as the non-sequence structure, and obtain the structural information of the code.
To this end, existing methods apply various transformations to convert the source code into a code structure graph, such as a control flow graph [16], data flow graph [17], or code semantic graph [18,19]. In these graph-based methods, each node in the graph corresponds to a line of code (i.e., a code statement). Subsequently, graph neural networks, such as the graph convolutional network [20] and the graph attention network [21], are utilized to learn code features from all nodes for vulnerability detection. In practice, a code structure graph may contain a large amount of vulnerability-irrelevant node information (i.e., code statement information), and this irrelevant code information interferes with vulnerability feature learning. To eliminate irrelevant code information, existing graph-based methods [17–19] mainly utilize the graph slicing technique to simplify the code structure graph and retain the vulnerability-related nodes. However, this simplified graph only contains partial structural information of code and ignores some potentially valuable code information, which may hinder vulnerability feature learning and further affect the detection performance.
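As an illustration of why slicing can lose structure, the following minimal sketch keeps only the nodes within k hops of a vulnerability-related node. The toy graph, the seed node, and the k-hop criterion are assumptions made for this example; the cited slicing methods may use different criteria.

```python
from collections import deque

# Hypothetical toy graph: node ids stand for code statements,
# edges for control flow. Node 3 is a branch, node 6 the join.
EDGES = {1: [2], 2: [3], 3: [4, 5], 4: [6], 5: [6], 6: [7], 7: []}


def k_hop_slice(edges, seeds, k):
    """Keep only nodes within k hops (ignoring direction) of a seed node."""
    undirected = {node: set() for node in edges}
    for src, dsts in edges.items():
        for dst in dsts:
            undirected[src].add(dst)
            undirected[dst].add(src)

    kept = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue
        for neighbour in undirected[node]:
            if neighbour not in kept:
                kept.add(neighbour)
                frontier.append((neighbour, dist + 1))

    # Drop every edge that points outside the slice.
    return {n: [d for d in edges[n] if d in kept] for n in sorted(kept)}


sliced = k_hop_slice(EDGES, seeds={6}, k=1)
print(sliced)
```

In this toy case the slice keeps nodes 4, 5, 6, and 7 but drops the branch node 3, so the simplified graph no longer shows that statements 4 and 5 are alternative outcomes of the same condition.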
Insights and challenges. To solve this, we need to represent the source code using such a code representation that eliminates as much irrelevant code information as possible, while retaining relatively complete structural information of code. The execution paths can satisfy these requirements [3,22]. Specifically, a code structure graph can be treated as the combination of multiple execution paths from an entry node to its exit node. These execution paths have lower code coverage than the code structure graph, which helps to filter out some irrelevant code information. Furthermore, these execution paths have complete code execution logic, which contains rich structure information of code, i.e., control flow and data flow information. Based on this insight, we adopt such execution paths to represent the source code. Our challenge is how to select some critical execution paths that contain more valuable code information related to the vulnerabilities, instead of retaining all execution paths to represent the source code. This helps to capture more accurate vulnerability features for further enhancing the performance of smart contract vulnerability detection.

Fig. 1. A typical example of the integer overflow vulnerability. The code statements marked red are vulnerable.

Our solution. In the paper, we propose VDCEP, a novel Vulnerability Detection framework with Critical Execution Paths. Firstly, we transform the source code into a code structure graph and deconstruct it into the execution paths. A heuristic-based path selection strategy is then applied to identify a fixed number of critical paths. Among them, we employ some heuristic rules to preserve critical paths that have specific code statements related to the vulnerability. Also, these critical paths are required to have the shortest path lengths to further eliminate irrelevant code information. Secondly, a feature extraction module, leveraging convolutional neural networks and attention mechanisms, is employed to learn feature representations for all critical paths. Finally, we merge these feature representations into an enhanced feature representation, enabling the capture of more accurate vulnerability features. This enhanced representation is then fed into a classifier to predict whether the code snippet is vulnerable. Moreover, we calculate the feature weights of all critical paths to measure their importance in vulnerability detection.

Evaluation. We perform extensive experiments on a benchmark dataset with four kinds of smart contract vulnerabilities, i.e., reentrancy, timestamp dependency, integer overflow/underflow, and delegatecall. We select 14 state-of-the-art vulnerability detection methods as baselines. Results suggest that VDCEP is superior to these baselines by 6.16%–59.80% and 5.34%–60.88% in recall and F1-score, respectively. Compared to ChatGPT, VDCEP still has 41.93% and 34.46% improvement in recall and F1-score. This shows that VDCEP is more effective in smart contract vulnerability detection. Also, it offers interpretable details of vulnerability detection by analyzing path feature weights.

The main contributions of this paper are listed below:

• We devise a heuristic-based path selection strategy to identify critical execution paths, which effectively eliminate irrelevant code information and preserve complete code execution logic.
• We propose VDCEP, a novel vulnerability detection framework that learns more accurate vulnerability features on critical execution paths. Results show the effectiveness of VDCEP in detecting smart contract vulnerabilities.

We organize the remainder of this paper as follows. Section 2 describes the motivating example and preliminary. Section 3 introduces the proposed VDCEP framework and the corresponding implementation details. We present the experimental setup and result analysis in Sections 4 and 5. Section 6 recalls the advantages of VDCEP and gives the implications and threats. Section 7 lists the related work in smart contract vulnerability detection. Finally, Section 8 concludes our work and gives some perspectives.

2. Preliminary and motivating example

In this section, we present the problem definition and some widespread smart contract vulnerabilities. Next, the motivating example from the real world is given to facilitate the understanding of our approach.

2.1. Preliminary

Problem definition. Given a code snippet $X$ of the smart contract, and a deep learning model $M(\cdot)$, we aim to use this model $M(\cdot)$ to predict whether there is a vulnerability in the code snippet $X$. More specifically, we set $\hat{Y} = M(X)$ and $Y$ to be the prediction of the model $M(\cdot)$ and the true label. Among them, $Y = 0$ denotes the given code $X$ is non-vulnerable and safe, and $Y = 1$ means $X$ is vulnerable and contains a certain type of smart contract vulnerability. Our objective is


to utilize a loss function $L(\cdot)$ to update the trained model $M(\cdot)$, and to make the prediction $\hat{Y}$ as close as possible to the true label $Y$.

Smart contract vulnerability. Smart contracts are program codes that run automatically on blockchain systems (i.e., decentralized software systems). Detecting smart contract vulnerabilities has gained a great deal of attention recently as it can be effective in minimizing huge financial losses on blockchain systems [4,23]. Building upon previous vulnerability detection efforts [18,24], we focus on four kinds of smart contract vulnerabilities in Ethereum, one of the popular blockchain platforms. These vulnerabilities possess the typical characteristics of smart contract vulnerabilities and result in over 70% asset loss to Ethereum. They are briefly introduced below.

Reentrancy is a widely recognized vulnerability infamous for its association with the DAO attack [5]. This vulnerability occurs when smart contracts invoke each other, and its implementation principle bears similarities to recursive function calls (i.e., call.value).

Timestamp dependence vulnerability arises when the smart contracts adopt a timestamp (i.e., block.timestamp) as part of a conditional judgment (i.e., block) or as a basis for producing the random numbers.

Integer overflow/underflow happens when the operation result is beyond the maximum value (i.e., cause overflow) or minimum value (i.e., cause underflow) that can be represented by this type [8].

Delegatecall vulnerability occurs when the attackers exploit the argument of the delegatecall method to execute malicious behavior. In other words, the attacker can leverage this vulnerability to cause damage to Ethereum when the delegatecall is not checked.

2.2. Motivating example

In Fig. 1, we give a motivating example from real-world smart contracts. This code example enables users to deposit some tokens into the balances variable by the TransferDepoist function. As described in the above subsection, this example contains the integer overflow vulnerability. In detail, if the amount variable (see line 21) is manipulated to overflow (i.e., over the maximum integer value), an attacker can bypass the code statements in charge of validating the balances variable and transfer a lot of tokens with a minimal cost [8].

Observation. A vulnerable smart contract may contain a large amount of irrelevant code statement information. As shown in Fig. 1, this example contains a total of 51 lines of code statements. We can identify the integer overflow vulnerability simply by analyzing the four code statements marked red, including lines 13, 14, 21, and 32. The remaining code statements do not help in detecting this vulnerability. In other words, not all code statements in a smart contract code are equally important for vulnerability detection. To eliminate irrelevant code information, existing graph-based vulnerability detection methods [16–19] generally use graph slicing techniques to simplify the graph and retain the neighboring nodes of the vulnerability-related nodes. Unfortunately, this simplified graph has difficulty in representing abundant structural information of code. Motivated by [3,22], we use execution paths to represent the smart contract code, instead of the code structure graph. The execution paths have relatively complete code execution logic and lower code coverage. There is often a situation where only a few execution paths are associated with the detected vulnerability, and the remaining paths as noise can interfere with vulnerability feature learning. Therefore, the main challenge of this work is how to identify the critical execution paths that may be vulnerable. In addition, existing graph-based vulnerability detection methods can only accomplish a graph-level classification task without providing more interpretable details. In other words, they can only tell whether a vulnerability exists in a smart contract but cannot explain which code statements are more likely to contain vulnerabilities.

3. The proposed framework

In this section, we propose VDCEP, a novel deep learning-based framework to identify smart contract vulnerabilities, as depicted in Fig. 2. The overall preview of the proposed VDCEP is first described. Thereafter, we introduce the three phases of VDCEP.

3.1. Overview

Existing vulnerability detection methods struggle to strike a balance between eliminating irrelevant code information and retaining relatively complete structural information of code. To solve this issue, VDCEP deconstructs the code structure graph into the execution paths that encompass various structural information of the code. Then, by focusing on the critical execution paths, VDCEP can give higher attention to the vulnerability-related code information and learn more accurate vulnerability features. Moreover, VDCEP provides the feature weights of all critical paths to offer more interpretable details of vulnerability detection.

In detail, VDCEP consists of the following three phases as illustrated in Fig. 2.

• Data processing. Given the smart contract code snippet, VDCEP converts it into a Control Flow Graph (CFG) and decomposes it into the execution paths. Then, a heuristic-based path selection strategy is used to pick a fixed number of critical execution paths.
• Feature extraction. VDCEP utilizes a feature extraction module consisting of the semantic compressor and Transformer encoder to learn the feature representations of all paths. The semantic compressor uses convolutional neural networks to filter out non-vulnerable code statements and the Transformer builds the dependencies between critical code statements through the attention mechanism.
• Vulnerability detection. VDCEP fuses all feature representations of paths and feeds an enhanced feature representation into the multi-layer perceptron network to generate the prediction (i.e., vulnerable or benign). Besides, VDCEP outputs the interpretable feature weights.

3.2. Data processing

Constructing the CFG. The CFG, as a type of code structure graph, is the dominant choice for representing structural information in the smart contract code [16,24]. Therefore, we convert the source code into a CFG that contains the control flow semantics among the code statements, as shown in Fig. 3. More specifically, the source code is cleaned through some normalization operations, such as eliminating all blank lines and comments. Following this, we apply a public compiler [25] to construct the abstract syntax tree of the code, and then traverse it to generate the CFG that starts from the entry node and ends at one of the exit nodes.

Each node, as a basic block, corresponds to a line of code statements. We treat the first line of the code statement as the entry node and three types of exit statements ("return", "assert", and "throw") as the exit node. As a complement, we add a dummy exit node at the end of the CFG when the last code statement is not included in the above exit statements. Each edge represents the control flow information between two adjacent nodes, which reveals the possible program execution order. There are three main kinds of labels designed for the edges of the CFG. Among them, an edge labeled "Next" indicates an unconditional jump between code statements. For example, we connect node 5 and node 6 in Fig. 3 through the "Next" edge. The edge labeled "True" (i.e., nodes 6→8) or "False" (i.e., nodes 6→7) reflects the true or false condition, which results in a different program execution order.

Generating critical execution paths. As presented in the right of Fig. 3, the CFG of code can be viewed as a combination of multiple execution paths where the execution order of each code statement is linear, such as the path 1→2→3→4→5→6→7→11→12. Obviously, the execution path with the linear sequence structure has lower code coverage than the entire CFG. It is thus reasonable to believe that using execution paths to represent the code snippet can effectively eliminate non-vulnerable code information.
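The decomposition of a CFG into linear execution paths described above can be sketched with a depth-first traversal. The graph below is a hypothetical reconstruction: only the edges 5→6 ("Next"), 6→8 ("True"), 6→7 ("False") and the path 1→2→3→4→5→6→7→11→12 come from the text, while the remaining edges are assumed for illustration. Not revisiting a node within one path is also an assumption of this sketch, since the handling of loops is not specified here.

```python
# Hypothetical CFG: node -> list of (successor, edge label).
CFG = {
    1: [(2, "Next")], 2: [(3, "Next")], 3: [(4, "Next")], 4: [(5, "Next")],
    5: [(6, "Next")],
    6: [(8, "True"), (7, "False")],
    7: [(11, "Next")],
    8: [(9, "Next")], 9: [(10, "Next")], 10: [(11, "Next")],
    11: [(12, "Next")],
    12: [],  # exit node
}


def execution_paths(cfg, entry, exits):
    """Enumerate entry-to-exit paths with an iterative depth-first search.

    A node is never revisited within one path, so back edges of loops are
    not followed and the number of paths stays finite.
    """
    paths = []
    stack = [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        if node in exits:
            paths.append(path)
            continue
        for successor, _label in cfg[node]:
            if successor not in path:
                stack.append((successor, path + [successor]))
    return paths


def code_coverage(path, cfg):
    """Ratio of the nodes on the path to all nodes of the CFG."""
    return len(path) / len(cfg)


paths = execution_paths(CFG, entry=1, exits={12})
for path in sorted(paths, key=len):
    print("→".join(map(str, path)), f"coverage={code_coverage(path, CFG):.2f}")
```

On this toy graph the traversal yields two paths, and the path through the "False" branch covers 9 of the 12 nodes, which illustrates why a single path has lower code coverage than the whole CFG.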


Besides, the structure of the execution path is simpler than the CFG, which makes it easier for deep learning models to understand the code semantic information and capture vulnerability features.

Fig. 2. The overall structure of VDCEP. It consists of the data processing, feature extraction, and vulnerability detection phases. Conv1d and MLP denote the one-dimensional convolutional layer and multi-layer perceptron network, respectively.

Utilizing only one execution path to represent the smart contract code may ignore many vulnerable code statements, which hinders the extraction of vulnerability features. Furthermore, it is impractical to extract the vulnerability features from all execution paths. On the one hand, only a few critical paths that cover vulnerability-related nodes are helpful for vulnerability detection. The remaining execution paths, which contain irrelevant code information, introduce a lot of noise and hinder the detection performance. On the other hand, a CFG may comprise a vast number of execution paths since it has the "loops" and "if/else" statements. Allowing a deep learning model to model numerous execution paths and learn feature representations is computationally intensive. With these considerations in mind, we mainly use a fixed number of critical execution paths to represent the code, rather than the entire CFG.

To achieve this, we devise a heuristic-based path selection strategy, which consists of the following two stages.

(1) Identifying all critical paths by heuristic rules. Some studies [8,24] indicate that smart contract vulnerabilities have more distinct and obvious code logic than other software vulnerabilities (e.g., C and Java). As mentioned in Section 2.1, the reentrancy vulnerability is related to incorrect calls (e.g., call.value) among smart contracts. Therefore, it is reasonable for us to adopt defined heuristic rules to identify all critical execution paths from the CFG.

Firstly, following prior works [17,18], we summarize vulnerability-related code statements for four types of smart contract vulnerabilities as heuristic rules, as listed in Table 1. Among them, for all call functions, block.number, block.gaslimit, and msg.sender, we check whether there is an invocation of the corresponding statements in the smart contract code. For user balance, we check if it is sufficient before transferring money to other users. Besides, for block.timestamp, we verify whether its value is assigned to another variable, i.e., whether this statement is actually used. Secondly, we scan all code execution paths several times and check if they fulfill the above heuristic rules. For most rules, we directly use keyword matching to determine if an execution path has corresponding vulnerability-related statements. In particular, the syntactic and semantic analysis techniques are used to determine if complex rules (i.e., user balance and block.timestamp) exist in an execution path. For any code execution path, we consider it critical as long as it matches any of the heuristic rules.

Table 1
Specific statements associated with four categories of smart contract vulnerabilities.

Vulnerability                  Specific statement
Reentrancy                     call function: call.value(), deposit.value(), transfer();
                               the variable: related to user balance
Timestamp dependency           block information: block.timestamp, block.number, block.gaslimit
Integer overflow/underflow     the variable: related to user balance;
                               global variable: msg.sender
Delegatecall                   call function: delegatecall()

(2) Selecting a fixed number of critical paths based on path priority. For all critical execution paths, we preserve those that match more heuristic rules to speed up model training. To this end, we select a fixed number ($n-1$) of critical paths based on path priority. In detail, for any type of vulnerability, each of its designed heuristic rules is equally important for vulnerability detection. Hence, a critical execution path is deemed to be of high priority when it matches a greater number of heuristic rules. Also, it should comprise as few irrelevant code statements as possible (i.e., low code coverage). Code coverage ($r$) is the ratio of the number of nodes in the execution path to the number of nodes in the entire CFG. If multiple critical execution paths match the same number of rules, we only pick some of them at random. This operation ensures that the number of retained critical execution paths does not exceed $n-1$.

The complete selection process for critical execution paths is shown in Algorithm 1. Specifically, we first traverse the CFG in a depth-first search manner and decompose it into the execution paths.
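A minimal sketch of the two-stage selection, written from the prose description above, is given below. Several points are assumptions of this example: rules are plain substring keywords (the syntactic and semantic checks for user balance and block.timestamp are not modelled), a path is a list of statement strings, ties are broken by lower coverage before random choice, and the final step that adds one remaining path with the highest coverage follows the last stage of Algorithm 1. The function and variable names are hypothetical.

```python
import random

# Keyword rules for one vulnerability type, copied from Table 1 (reentrancy).
REENTRANCY_RULES = ["call.value(", "deposit.value(", "transfer(", "balance"]


def matched_rules(path, rules):
    """Count how many rules occur somewhere in the statements of a path."""
    text = "\n".join(path)
    return sum(1 for keyword in rules if keyword in text)


def select_paths(paths, rules, total_nodes, n, seed=0):
    """Return up to n-1 critical paths plus one long path."""
    rng = random.Random(seed)
    scored = [(matched_rules(p, rules), len(p) / total_nodes, p) for p in paths]

    # Stage 1: a path is critical if it matches at least one heuristic rule.
    critical = [item for item in scored if item[0] > 0]

    # Stage 2: prefer more matched rules, then lower coverage. Shuffling
    # first lets the stable sort break the remaining ties at random.
    rng.shuffle(critical)
    critical.sort(key=lambda item: (-item[0], item[1]))
    selected = [p for _, _, p in critical[: n - 1]]

    # Last stage: add the unselected path with the highest code coverage.
    rest = [item for item in scored if item[2] not in selected]
    if rest:
        selected.append(max(rest, key=lambda item: item[1])[2])
    return selected


# Toy statements of a hypothetical withdraw function (7 CFG nodes).
S = [
    "function withdraw(uint amount) public {",
    "if (amount > limit) {",
    "throw;",
    "require(balances[msg.sender] >= amount);",
    "msg.sender.call.value(amount)();",
    "balances[msg.sender] -= amount;",
    "return;",
]
path_a = [S[0], S[1], S[2]]                    # matches no rule
path_b = [S[0], S[1], S[3], S[4], S[5], S[6]]  # matches call.value( and balance
path_c = [S[0], S[1], S[3], S[6]]              # matches balance only

chosen = select_paths([path_a, path_b, path_c], REENTRANCY_RULES,
                      total_nodes=len(S), n=2)
print(chosen == [path_b, path_c])
```

With n = 2, the single critical slot goes to the path that matches two rules, and the extra long path is the unselected path with the highest coverage.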


Subsequently, we use the heuristic-based path selection strategy to retain $n-1$ critical execution paths. Finally, we traverse the remaining execution paths to find a long execution path with the highest code coverage. To record more unseen code information, this long execution path $p_n$ covers most nodes that do not appear in the critical paths $\{p_i\}_{i=1}^{n-1}$.

Algorithm 1 The selection for critical execution paths
Require: $n$: number of critical paths, $v_s$: entry node of CFG, $V_e$: exit nodes of CFG, $r$: code coverage, $h$: number of matched heuristic rules, $K$: collection of heuristic rules, $P$: collection of critical execution paths.
Ensure: The selected execution paths $P = \{p_i\}_{i=1}^{n}$.
// identify all critical execution paths
set collections $\hat{P}$, $\hat{H}$, $\hat{R}$;
for $i = 0$ to $|V_e| - 1$ do
    get an execution path $p = \mathrm{OnePath}(v_s, V_e^{i})$;
    set $h = 0$;
    for $j = 1$ to $|K|$ do
        if $p$ matches the $j$-th heuristic rule in $K$ then
            $h$ += 1;
        end if
    end for
    if $h > 0$ then
        mark the path $p$ as critical and compute $r$;
        $\hat{P} = \hat{P} \cup p$, $\hat{H} = \hat{H} \cup h$, $\hat{R} = \hat{R} \cup r$;
    end if
end for
// select a fixed number of critical execution paths
for $i = 1$ to $n - 1$ do
    $p_i, r_i, h_i = \emptyset, 1, 0$;
    for $j = 0$ to $|\hat{P}| - 1$ do
        if $\hat{P}_j$ not in $P$ and $\hat{R}_j < r_i$ and $\hat{H}_j > h_i$ then
            $p_i, r_i, h_i = \hat{P}_j, \hat{R}_j, \hat{H}_j$;
        end if
    end for
    retain the critical path by $P = P \cup p_i$;
end for
// get a long path with high code coverage
$p_n, r_n = \emptyset, 0$;
for $i = 0$ to $|V_e| - 1$ do
    get a path $p = \mathrm{OnePath}(v_s, V_e^{i})$ and count $r$;
    if $p$ not in $P$ and $r > r_n$ then
        $p_n, r_n = p, r$;
    end if
end for
retain a long path by $P = P \cup p_n$;
return critical execution paths $P$;

In practice, the selection of critical execution paths makes a tradeoff between two considerations. First, to facilitate the extraction of vulnerability features, we should pick the critical paths that cover more vulnerability-related nodes. Second, to prevent the loss of potentially important information in the code, we expect to select long paths that retain as many nodes as possible. By doing so, we obtain a total of $n$ execution paths $\{p_i\}_{i=1}^{n}$ that are helpful for vulnerability detection.

3.3. Feature extraction

In this phase, we design a feature extraction module to learn the feature representations (i.e., vectors) of the selected execution paths. This module consists of a semantic compressor and Transformer encoder, as described in Fig. 2. Among them, the Transformer, based on the multi-head attention mechanism [13], has been widely applied to vulnerability detection since it can better learn the contextual semantic information of the code snippet [26,27]. Unfortunately, it is difficult for the Transformer to handle the long code snippet due to the limitation of the computation resources. In detail, existing Transformer-based deep learning models generally limit the length of the input code snippet. For example, they only preserve the first 512 lines of the code snippet. If the input execution path is quite long, these models may miss a lot of vulnerability-related code information, which further hinders their performance.

Some studies [3,5] have shown that convolutional neural networks are good at extracting invariant features from input data and can capture critical code statements that are more vulnerable. Motivated by this, we use a semantic compressor composed of convolutional neural networks to process the input code before feeding it into the Transformer encoder. The semantic compressor leverages the convolutional neural network to automatically filter irrelevant code information, which effectively reduces the length of the input code and decreases the training burden of the Transformer. The calculation process of our feature extraction model is illustrated as follows.

First, given an execution path $p_i$, we consider it as a token sequence with the length of $t_i$. To allow the deep learning model to understand this token sequence from a natural language perspective, the word2vec embedding algorithm [28] is applied to vectorize each token. By this, we obtain the vectorized path $p_i \in \mathbb{R}^{t_i \times k}$, where the dimension of each token vector is $k$. Second, the semantic compressor, which contains a total of $m$ convolution modules, takes all execution paths as the input. In each module, a one-dimensional convolutional filter $Conv(\cdot)$ is followed by a ReLU activation function $ReLU(\cdot)$ and a maxpooling layer $Maxpooling(\cdot)$. Among them, each convolutional filter has $k$ convolution kernels to learn comprehensive features within the input code snippet. The maxpooling layer is used to select the most important feature from a feature map generated by a convolution kernel.

Each path $p_i$ passes through the $j$th convolution module and generates the feature matrix $\hat{p}_i \in \mathbb{R}^{m \times k}$ according to:

$$\hat{p}_i = Maxpooling(ReLU(Conv_j(p_i))), \tag{1}$$

$$Conv_j(p_i) = w_j \cdot p_i + b_j, \tag{2}$$

where $Conv_j(\cdot)$ means the $j$th convolutional filter with the learned weight and bias of $w_j$ and $b_j$, respectively. In this way, all execution paths with different lengths are processed into the feature matrices with fixed dimensions, as described in Fig. 4. Meanwhile, sensitive code information that is more vulnerable is preserved as much as possible.

Third, we employ the Transformer model to handle the feature matrix $\hat{p}_i$ and build dependencies among sensitive code information. The Transformer contains $h$ encoders with three operations, i.e., a multi-head attention layer $MHA(\cdot)$, a feed-forward network $FFN(\cdot)$, and a layer normalization $LN(\cdot)$. In the $j$th encoder, for the input feature matrix $\hat{p}_i^{\,j-1}$, we update it by:

$$p_i^{\,j} = MHA(\hat{p}_i^{\,j-1}), \tag{3}$$

$$\hat{p}_i^{\,j} = LN(p_i^{\,j} + FFN(p_i^{\,j})), \tag{4}$$

where a residual connection operation is added after the multi-head attention layer. Finally, we generate the feature representation $\tilde{p}_i = \hat{p}_i^{\,h+1}$ of the execution path by repeating $h$ encoders.

In addition, we add some discussion about the two components of the feature extraction module. First, the semantic compressor utilizes convolutional neural networks to retain vulnerability-related critical code statements in the execution paths [3,5]. In fact, this semantic compressor leverages $m$ convolution modules, containing $m$ convolutional operations, to scan each code token in the execution path. Every code token undergoes the convolutional operations, and the activation function then scores its importance in vulnerability detection. By doing so, it is easier for us to ignore vulnerability-irrelevant code information and preserve critical code statements.
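The semantic compressor of Eqs. (1) and (2) can be illustrated with a small, dependency-free sketch. It is only a shape-level illustration under stated assumptions: the weights are random stand-ins for learned parameters, the kernel width of 2 is chosen arbitrarily, max pooling is taken over all positions of a feature map, and all function names are hypothetical.

```python
import random


def conv_module(path, kernels, biases):
    """One convolution module: Conv1d, then ReLU, then global max pooling.

    path:    t x d list of token vectors (t may differ between paths)
    kernels: k kernels, each a width x d weight matrix
    biases:  k bias values
    returns: k pooled features, independent of the path length t
    """
    pooled = []
    for weights, bias in zip(kernels, biases):
        width = len(weights)
        best = 0.0  # ReLU outputs are never below zero
        for start in range(len(path) - width + 1):
            z = bias + sum(
                weights[r][c] * path[start + r][c]
                for r in range(width)
                for c in range(len(weights[r]))
            )
            best = max(best, z)  # max over positions of ReLU(z)
        pooled.append(best)
    return pooled


def semantic_compressor(path, modules):
    """Stack the outputs of m convolution modules into an m x k matrix."""
    return [conv_module(path, kernels, biases) for kernels, biases in modules]


def random_modules(m, k, d, width, seed=0):
    """Random stand-ins for the learned weights w_j and biases b_j."""
    rng = random.Random(seed)
    return [
        (
            [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(width)]
             for _ in range(k)],
            [rng.uniform(-1, 1) for _ in range(k)],
        )
        for _ in range(m)
    ]


rng = random.Random(1)
modules = random_modules(m=3, k=4, d=4, width=2)
short_path = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
long_path = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(40)]
for path in (short_path, long_path):
    features = semantic_compressor(path, modules)
    print(len(features), len(features[0]))  # 3 4 for both path lengths
```

Paths of 5 and 40 tokens are both reduced to a 3 x 4 matrix, which mirrors the statement that execution paths of different lengths are processed into feature matrices with fixed dimensions.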


Fig. 3. The example of constructing the CFG of the source code and the corresponding execution paths. The execution order of code statements in each execution path is linear.

Fig. 4. The process of the semantic compressor. The pink and blue circles indicate the vulnerability-related and irrelevant code token vectors, respectively.

pretraining models, e.g., CodeBERT [3] and GraphCodeBERT [17]. The reason is that in smart contract vulnerability detection (i.e., a classification task), we still need to fine-tune these large-scale models to get the desired detection performance [28]. Moreover, these large-scale models require high training costs. More details are discussed in Section 6.2.

3.4. Vulnerability detection

As shown in Fig. 2, after obtaining all feature representations $\tilde{p}_i \in \mathbb{R}^k$ of the selected paths, we fuse them through a concatenation operation $\mathrm{Concat}(\cdot)$ [18] and feed the result into a multi-layer perceptron network to generate an enhanced feature representation $\tilde{p} \in \mathbb{R}^k$. This feature fusion strategy allows us to extract more vulnerability features from this enhanced feature representation. The calculation process is as follows:

$p = \mathrm{Concat}(\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_n)$, (5)

$\tilde{p} = \mathrm{MLP}(p)$. (6)

Following this, the enhanced feature representation $\tilde{p}$ is taken as the input of the sigmoid activation function $\mathrm{Sigmoid}(\cdot)$, which outputs the predicted probability $\hat{Y} = \mathrm{Sigmoid}(\tilde{p})$. The closer this probability $\hat{Y}$ is to 1, the more likely the input smart contract code is to be vulnerable. Otherwise, this code snippet is benign. Subsequently, we use a cross-entropy loss function $L(\cdot)$ to optimize the model by:

$L(\hat{Y}, Y) = -\sum_{i=1}^{S} Y_i \cdot \log(\hat{Y}_i)$, (7)

where $Y$ and $S$ indicate the ground-truth label and the size of the training data, respectively.

Remarkably, to measure the importance of different execution paths in vulnerability detection, we compute the feature weight $w_{\tilde{p}_i}$ of each execution path $\tilde{p}_i$ according to:

$w_{\tilde{p}_i} = \frac{\exp(\tilde{p}_i \odot \tilde{p})}{\sum_{i=1}^{n} \exp(\tilde{p}_i \odot \tilde{p})}$, (8)

where $\odot$ and $\exp(\cdot)$ refer to the inner product operator and the exponential function, respectively. In fact, this feature weight provides interpretable details about which execution paths are more relevant to the detected vulnerability.

4. Experimental setup

In this section, we perform a large number of experiments, and present the experimental setup as follows.


4.1. Research questions

The objective of our study is to answer the following three Research Questions (RQs):

RQ1: How does VDCEP perform in detecting smart contract vulnerabilities compared to existing vulnerability detection methods?

This RQ is designed to measure the ability of our approach to identify smart contract vulnerabilities. To answer this RQ, we adopt several metrics to evaluate the performance of VDCEP in detecting four categories of smart contract vulnerabilities, and compare it to 14 state-of-the-art vulnerability detection methods.

RQ2: What is the impact of the path selection strategy on the performance of VDCEP?

In practice, this RQ aims to explore the impact of different design choices for the heuristic-based path selection strategy on our approach. To answer this RQ, we focus on the impact of two design choices, namely different path selection strategies and the number of all selected paths.

RQ3: How does the feature extraction module contribute to the performance of VDCEP?

This RQ aims to explore the capability of the feature extraction module in learning the feature representation of the execution path. To answer this RQ, we first introduce six representative sequence learning models within the field of natural language processing as the comparison. Then, we investigate the effect of different hyperparameters of the feature extraction module, i.e., the number of encoder layers and the dimension of word embedding.

4.2. Datasets

In our evaluation, we adopt the large and widely-used smart contract dataset¹ with four kinds of vulnerabilities, i.e., reentrancy (RE), timestamp dependency (TD), integer overflow/underflow (OF, labeled IO in the result tables), and delegatecall (DE). This benchmark smart contract dataset is collected and labeled by Qian et al. [24]. The smart contract code in the benchmark dataset is generally written in the Solidity programming language and collected from the Ethereum platform [30]. In fact, this benchmark dataset contains more than 40K smart contracts that are labeled with many kinds of vulnerability. Our work focuses on the RE, TD, OF, and DE vulnerabilities, since they possess the typical characteristics of Ethereum smart contract vulnerabilities. The corresponding four vulnerability datasets are obtained from the benchmark smart contract dataset.

After obtaining the above vulnerability datasets, we manually check the correctness of the labeling results of these smart contracts based on the labeling strategy in [24]. Given a smart contract code snippet, we analyze it using the labeling strategy and obtain a predicted label (i.e., vulnerable or benign). If this predicted label is different from the original label, we consider this given smart contract snippet to be unreliable and discard it. After checking all smart contracts in the vulnerability datasets, we count the data distribution information in all vulnerability datasets. Among them, 680 of 2385, 2242 of 4490, 1368 of 7183, and 136 of 414 smart contracts are vulnerable in the RE, TD, OF, and DE vulnerability datasets, respectively. It is evident that each vulnerability dataset has a different number of smart contracts, and some datasets even have only a few hundred smart contract snippets. This does not affect the performance of the trained deep learning model due to its outstanding feature learning ability [19,28]. Following previous efforts [18], we use the same data split setting for the vulnerability dataset with the proportion of the training set and test set being 8:2.

¹ https://github.com/Messi-Q/Smart-Contract-Dataset

4.3. Baselines

To investigate the effectiveness of VDCEP, we first compare it with thirteen state-of-the-art methods in the field of smart contract vulnerability detection, i.e., six rule-based methods [25,31–35] and seven deep learning-based methods [12,16–19,24].

Smartcheck [33] transforms the smart contract code into an XML-based parse tree and then checks the vulnerabilities by utilizing XPath schema queries on the XML-based parse tree.

Securify [35] infers the semantic information of the smart contract code and checks it against the predefined security property rules to analyze the vulnerabilities.

Slither [25] performs vulnerability detection by adopting lexical and syntactic analysis in the intermediate language of the smart contract called SlitherIR.

Osiris [34] converts the smart contract code into a CFG and then uses path constraints for vulnerability detection.

Mythril [31] is a symbolic execution engine. It mainly combines taint analysis and control flow inspection to analyze smart contract vulnerabilities.

sFuzz [32] uses a branch distance-driven fuzz testing technique to detect smart contract vulnerabilities.

DeHunter [12] treats the source code as a sequence structure and introduces the LSTM model to capture the vulnerability features.

GCN [16] represents the smart contract code as a CFG and then utilizes a graph convolutional network to learn the graph feature for vulnerability detection.

TMP [19] transforms the smart contract into a code semantic graph that contains both the control flow and data flow information. It then uses a temporal message propagation network to extract the vulnerability features.

AME [18] learns the vulnerability features from both the code semantic graph and expert patterns to identify smart contract vulnerabilities.

Peculiar [17] converts the smart contract code into a critical data flow graph and introduces the pretraining model (i.e., GraphCodeBERT) as the detection model.

SMS [24] and DMT [24] introduce the teacher–student network to perform vulnerability detection. Among them, SMS and DMT denote the single-modality student network and dual-modality teacher network, respectively.

Furthermore, we introduce EPVD [3], a representative vulnerability detection method for C/C++ programs, as an additional baseline. It constructs a CFG of the source code and deconstructs it into some execution paths with the shortest length. Next, EPVD introduces the pretraining model (i.e., CodeBERT) to learn vulnerability features from the selected paths.

4.4. Evaluation metrics

Here, we employ the following four classification metrics that are widely used in software engineering for evaluation, including accuracy, recall, precision, and F1-score.

Accuracy is the proportion of true-positive and true-negative instances to the total instances detected. It is calculated by: $accuracy\ (A) = \frac{TP + TN}{TP + TN + FP + FN}$.

Recall measures the ratio of true-positive instances to the total instances that are actually vulnerable. It is calculated by: $recall\ (R) = \frac{TP}{TP + FN}$.

Precision means the proportion of true-positive instances among the total instances that are predicted as vulnerable. We define it as: $precision\ (P) = \frac{TP}{TP + FP}$.

F1-score indicates the overall effectiveness by considering both precision and recall. It is defined as: $F1\text{-}score\ (F) = 2 \times \frac{R \times P}{R + P}$.
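
As a small worked example of the four metrics defined above, the following sketch computes them from confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper's experiments. Here TP and FN count vulnerable contracts that are detected and missed, while TN and FP count benign contracts that are correctly passed and wrongly flagged.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, recall, precision, and F1-score from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Recall: share of actually vulnerable contracts that are detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of contracts flagged as vulnerable that really are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1-score: harmonic mean of recall and precision.
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical test set: 50 vulnerable and 50 benign contracts.
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these illustrative counts, accuracy is 0.85 and recall is 0.80, while precision (40/45) is higher than recall, so the F1-score falls between the two. The zero-denominator guards are a convenience of this sketch and are not part of the definitions.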


In these formulas, TP and FN indicate the number of vulnerable smart contracts that are correctly predicted and wrongly predicted as non-vulnerable contracts, respectively. TN and FP mean the number of non-vulnerable smart contracts that are correctly predicted and wrongly predicted to be vulnerable, respectively.

Following prior vulnerability detection efforts [16,19], we focus on the above four evaluation metrics, although the false positive rate and false negative rate metrics can also be used to evaluate the performance of vulnerability detection methods [36]. In fact, the four metrics we use can already evaluate all vulnerability detection methods from multiple perspectives. To ensure the rigor of the experimental results, we do not use fewer metrics, as this may lead to an incorrect evaluation. A method with high accuracy, recall, precision, and F1-score is regarded as effective in identifying smart contract vulnerabilities. Ideally, all four metrics are equal to one, indicating that the vulnerability detection method neither misses a vulnerability nor triggers a false alarm [36].

In addition, we perform statistical analyses to examine whether the performance differences between our VDCEP and other vulnerability detection methods (i.e., paired samples) are statistically significant [37]. Specifically, we conduct ten experiments for each vulnerability detection method. The Wilcoxon signed-rank test [5], a classical statistical method, is utilized to compute the 𝑝-values of these ten experimental outcomes, scrutinizing the performance differences (i.e., multiple comparisons) with a significance level of 0.05. If the 𝑝-value falls below 0.05, we reject the null hypothesis, indicating a significant difference between the paired samples. Otherwise, we cannot reject the null hypothesis that there is no significant difference.

4.5. Implementation details

We perform the experiments on a computer equipped with an Intel Core i7 CPU at 3.7 GHz, an NVIDIA GeForce RTX 3090 GPU, and 80 GB of memory. The designed path selection strategy is implemented with Python and the feature extraction module is implemented with PyTorch. To ensure a fair comparison, we reproduce all baseline methods and adopt the same hyperparameter settings as described in their papers. These baselines are compared with VDCEP on the same vulnerability dataset. We shuffle each vulnerability dataset and randomly divide it into the training set and test set before each experiment. The vulnerable ratios (i.e., the ratios between vulnerable and non-vulnerable data) of the training set and test set are consistent. After repeating ten experiments, we take the average value of all experimental results as the final outcome. Moreover, overfitting is a common issue in deep learning model training that can affect the model's generalizability. To avoid this, we adopt the early stopping operation [4] to monitor the model performance in the training process. Specifically, we decide when to stop model training by adjusting the patience parameter, and set the patience to five during the implementation. This means that if the model performance does not improve for more than five consecutive epochs, the training process stops. It is worth noting that we determine the value of the patience parameter through empirical experiments and cross-validation techniques. During the empirical experiments, we select different values of patience based on their impact on the generalization ability of the trained model. The k-fold cross-validation is employed to validate the robustness of the selected patience. By constantly adjusting the value of patience, we hope to strike a balance between preventing overfitting and allowing the model to fully learn from the training data.

In the data processing phase, the CFG of the code is first constructed based on the parse results of a public compiler [25]. Then, we deconstruct the CFG into 𝑛 execution paths that consist of 𝑛−1 critical execution paths and one long execution path. By default, 𝑛 is set to 4. To avoid infinitely long paths when traversing the CFG with depth-first search, we maintain a stack of visited nodes and examine whether a given CFG node exists in the stack [38]. If the node is already in the stack, it is bypassed to prevent redundant traversals. Conversely, if the stack does not contain this node, we add it to the stack and continue our exploration by traversing its neighboring nodes. In the feature extraction phase, we set the dimension (𝑘) of each token vector in the execution path to 400. The number (𝑚) of one-dimensional convolutional kernels in the semantic compressor is 200. The number (ℎ) of encoders in the Transformer is set to 7. As for model hyperparameters, the learning rate is initialized to 0.001, and the batch size and maximum training epoch are fixed at 8 and 100, respectively. We update VDCEP with the AdamW optimizer.

5. Experimental results

In this section, we answer all the research questions via extensive experimental results to reveal more details of our approach.

5.1. RQ1: Effectiveness of VDCEP

In this subsection, we choose 14 state-of-the-art vulnerability detection methods as the baselines. Here, we do not choose other representative efforts, i.e., Effuzz [9] and GraBit [28], as baselines because the source code of these papers is not publicly available. In Table 2, we compare VDCEP and these baselines in the four evaluation metrics, and calculate the corresponding 𝑝-values by the Wilcoxon signed-rank test in terms of the F1-score metric [5]. The gray-shaded cells are used to indicate the significant differences in our VDCEP over other baseline methods. Also, we count the average performance of all methods as the overall performance, as shown in Fig. 6, and analyze these results from the following perspectives.

Performance comparison with existing vulnerability detection methods. As seen in Fig. 6, the overall performance of VDCEP beats all the baselines. More specifically, VDCEP achieves an average accuracy, recall, precision, and F1-score of 91.33%, 88.59%, 90.83%, and 89.82%, respectively. The rule-based vulnerability detection methods obtain relatively poor results. sFuzz trails VDCEP by 41.58%, 59.80%, 61.20%, and 60.88% in the accuracy, recall, precision, and F1-score, respectively. This may be attributed to the fact that it is difficult for rule-based methods to design rules that cover all vulnerability scenarios, which further limits their performance.

When compared to current deep learning-based methods, our VDCEP still outperforms them across the board. For instance, DeHunter witnesses a decline of 22.31%, 18.49%, 20.69%, and 21.41% in the accuracy, recall, precision, and F1-score, respectively. The reason is that DeHunter directly treats the source code as a sequence structure, which ignores the program structural information of the code and limits the ability of code representation. Our VDCEP deconstructs the code structure graph and preserves the critical execution paths, which retains the important execution logic of the code associated with the vulnerabilities. When it comes to DMT, the top-performing baseline method, it still trails VDCEP by 3.23%, 6.16%, 4.01%, and 5.34% in the accuracy, recall, precision, and F1-score, respectively. We argue that DMT extracts the vulnerability features from the CFG, which contains much irrelevant node information. Indeed, this complex CFG generates a lot of noise, which is not beneficial for capturing vulnerability features. In contrast, VDCEP decomposes the complex CFG into a few critical execution paths with the linear sequence structure, which effectively reduces the interference of the massive irrelevant noise. This operation promotes the performance enhancement of VDCEP.

Furthermore, the code execution paths we used are derived from the CFG, as mentioned in Section 3.2. We are interested in further analyzing the performance of our VDCEP and vulnerability detection methods that use the CFG directly. In fact, the GCN, SMS, and DMT methods leverage the CFG to learn the vulnerability features. Obviously, these vulnerability detection methods based on the CFG do not achieve more prominent performance compared to our VDCEP. This may stem from the following reasons. First, these methods use various


Table 2
Performance comparison, i.e., 𝐴 (%), 𝑅 (%), 𝑃 (%), and 𝐹 (%), between VDCEP and 14 baselines on four kinds of smart contract vulnerabilities. The best results are marked in bold. "n/a" denotes that the baseline cannot identify this vulnerability. Smartcheck is abbreviated as Scheck.
Method RE TD IO DE
𝐴 (%) 𝑅 (%) 𝑃 (%) 𝐹 (%) 𝐴 (%) 𝑅 (%) 𝑃 (%) 𝐹 (%) 𝐴 (%) 𝑅 (%) 𝑃 (%) 𝐹 (%) 𝐴 (%) 𝑅 (%) 𝑃 (%) 𝐹 (%)
Scheck 54.65 16.34 45.71 24.07 47.73 79.34 47.89 59.73 53.91 68.54 42.81 52.70 62.41 56.21 45.56 50.33
Securify 72.89 73.06 68.40 70.41 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a
Slither 74.02 73.50 74.44 73.97 68.52 67.17 69.27 68.20 n/a n/a n/a n/a 68.97 52.27 70.12 59.89
Osiris 56.73 63.88 40.94 49.90 66.83 55.42 59.26 57.28 68.41 34.18 60.83 43.77 n/a n/a n/a n/a
Mythril 64.27 75.51 42.86 54.68 62.40 49.80 57.50 53.37 n/a n/a n/a n/a 75.06 62.07 72.30 66.80
sFuzz 55.69 14.95 10.88 12.59 33.41 27.01 23.15 24.93 45.50 25.97 25.88 25.92 64.37 47.22 58.62 52.31
DeHunter 70.95 72.92 70.15 71.51 66.65 65.22 73.39 62.58 70.49 71.59 70.56 71.07 67.99 70.66 66.47 68.50
GCN 73.21 73.18 74.47 73.82 75.91 77.55 74.93 76.22 67.53 70.93 69.52 70.22 65.76 69.74 69.01 69.37
TMP 76.45 75.30 76.04 75.67 78.84 76.09 78.68 77.36 70.85 69.47 70.26 69.86 69.11 70.37 68.18 69.26
AME 81.06 78.45 79.62 79.03 82.25 80.26 81.42 80.84 73.24 71.59 71.36 71.47 72.85 69.40 70.25 69.82
Peculiar 81.91 82.88 80.53 81.55 77.02 74.77 80.22 82.61 80.53 81.67 80.64 80.07 77.89 80.06 76.48 78.21
SMS 83.85 77.48 79.46 78.46 89.77 91.09 89.15 90.11 79.36 72.98 78.14 75.47 78.82 73.69 76.97 75.29
DMT 89.42 81.06 83.62 82.32 94.58 96.39 93.60 94.97 85.64 74.32 85.44 79.49 82.76 77.93 84.61 81.13
EPVD 80.52 74.64 84.92 79.34 79.32 70.43 88.76 78.42 73.64 74.27 74.68 74.22 67.84 70.49 70.15 70.88
VDCEP 93.12 90.58 92.96 90.77 91.57 92.44 90.36 92.48 91.92 90.43 90.77 89.56 88.69 80.89 89.22 86.47

Fig. 5. The interpretable feature weights of the selected execution paths. High feature weight indicates that the corresponding execution path is important for detecting smart
contract vulnerabilities.
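
The feature weights shown in Fig. 5 follow Eq. (8) in Section 3.4, which is a softmax over the inner products between each path representation and the fused representation. The following is a minimal sketch of that computation in plain Python, not the authors' code: the function name `feature_weights` and the two-dimensional example vectors are hypothetical, and subtracting the maximum score is a standard numerical-stability step that leaves the resulting weights unchanged.

```python
import math
from typing import List, Sequence


def feature_weights(path_features: Sequence[Sequence[float]],
                    fused_feature: Sequence[float]) -> List[float]:
    """Softmax over inner products of each path vector with the fused vector."""
    # Inner product between every path representation and the fused one.
    scores = [sum(a * b for a, b in zip(path, fused_feature))
              for path in path_features]
    # Shift by the maximum score for numerical stability (same softmax result).
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical representations of three selected paths and the fused vector.
path_vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.2]]
fused_vector = [2.0, 0.5]
weights = feature_weights(path_vectors, fused_vector)
```

The weights are positive and sum to one, and the path whose representation is most aligned with the fused representation (the first one in this toy example) receives the largest weight.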

graph neural networks, such as graph convolutional network [20] and graph attention network [21], to handle the CFG. The increase in hidden layers of these graph neural networks can cause the overfitting issue, which makes the graph features learned from different smart contract codes similar [2]. Second, to minimize the inclusion of irrelevant code information, some vulnerability detection methods (e.g., TMP and AME) simplify the code structure graph. This can undermine the integrity of the code logic and leave out some potentially important code information, which ultimately affects the detection performance [3,22]. Even if the CFG is replaced with a data flow graph (see Peculiar), the above limitations are encountered. These findings suggest that directly utilizing the CFG or other code structure graphs to detect smart contract vulnerabilities may be less effective than using the code execution paths.

Performance comparison with vulnerability detection methods applied for other software. We make an additional comparison with EPVD, which is a representative vulnerability detection method applied to C/C++ programs. Regrettably, it fails to achieve the expected results, with an accuracy, recall, precision, and F1-score of 75.33%, 72.46%, 79.63%, and 75.72%, respectively. Even compared with Peculiar, SMS, and DMT, EPVD still has no advantage. The decline in performance of EPVD can be attributed to the following reasons. First, EPVD employs a greedy-based path selection strategy to retain shorter execution paths from the CFG. However, it is difficult to ensure that such short execution paths contain sensitive code information associated with smart contract vulnerabilities. Second, EPVD introduces CodeBERT, a representative pretraining model, to handle the execution paths and extract the vulnerability features. Similarly, Peculiar applies GraphCodeBERT, another pretraining model, to handle the data flow graph. Regrettably, these methods do not get satisfactory detection performance. This suggests that the pretraining model is not always an optimal solution for smart contract vulnerability detection, although it achieves outstanding performance in the code generation task.

Performance comparison in detecting different types of vulnerabilities. We consider four evaluation metrics regarding four vulnerability scenarios in Table 2 and set up a total of 16 combination cases. We have the following findings. First, many rule-based methods cannot support the detection of all vulnerabilities. For example, neither Securify nor Osiris can identify the delegatecall vulnerability (see lines 4 and 6). This shows that rule-based methods are hard to apply for detecting multiple types of smart contract vulnerabilities if predefined vulnerability rules are missing.

Second, existing vulnerability detection methods do not consistently yield desirable results in all vulnerability scenarios. For example, AME, SMS, and DMT achieve higher F1-scores in the reentrancy and timestamp dependency vulnerabilities than in the other two vulnerabilities (see lines 12, 14, and 15). In comparison, our VDCEP performs better than the deep learning-based methods in the majority of combination cases. Even compared to DMT, VDCEP achieves the best result in 12 of the 16 combination cases. These findings further highlight the effectiveness of VDCEP in detecting multiple types of smart contract vulnerabilities.

Analysis of the interpretable feature weights. Existing graph-based vulnerability detection methods, such as AME, Peculiar, and DMT, represent code as a code structure graph and predict vulnerabilities by completing a graph-level classification task. They have difficulty providing more interpretable details about label predictions. Another advantage of VDCEP is that it can provide feature weights of the selected execution paths, as mentioned in Section 3.4. These feature weights can reflect the importance of the corresponding path in smart contract vulnerability detection. In Fig. 5, we give the feature weights of execution paths on detecting all smart contract vulnerabilities. We


Fig. 6. Average performance of VDCEP and the other ten baselines on all vulnerability scenarios. We only count the average performance of methods that can identify all smart contract vulnerabilities.
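
The averages summarized in Fig. 6 can be reproduced from the per-vulnerability results in Table 2. The sketch below does this for VDCEP and DMT only. The dictionary layout and the helper `average_metrics` are illustrative choices, while the numbers are copied from the VDCEP and DMT rows of Table 2 in the order (𝐴, 𝑅, 𝑃, 𝐹).

```python
from typing import Dict, Tuple

Metrics = Tuple[float, float, float, float]  # (A, R, P, F) in percent

# Rows copied from Table 2 for the four vulnerability types.
TABLE2: Dict[str, Dict[str, Metrics]] = {
    "VDCEP": {
        "RE": (93.12, 90.58, 92.96, 90.77),
        "TD": (91.57, 92.44, 90.36, 92.48),
        "IO": (91.92, 90.43, 90.77, 89.56),
        "DE": (88.69, 80.89, 89.22, 86.47),
    },
    "DMT": {
        "RE": (89.42, 81.06, 83.62, 82.32),
        "TD": (94.58, 96.39, 93.60, 94.97),
        "IO": (85.64, 74.32, 85.44, 79.49),
        "DE": (82.76, 77.93, 84.61, 81.13),
    },
}


def average_metrics(rows: Dict[str, Metrics]) -> Metrics:
    """Average each metric over the four vulnerability types."""
    count = len(rows)
    return tuple(sum(row[i] for row in rows.values()) / count for i in range(4))


vdcep_avg = average_metrics(TABLE2["VDCEP"])
dmt_avg = average_metrics(TABLE2["DMT"])
# Gap between VDCEP and DMT for each metric, in percentage points.
gaps = tuple(v - d for v, d in zip(vdcep_avg, dmt_avg))
```

Up to rounding to two decimals, `vdcep_avg` matches the reported averages of 91.33%, 88.59%, 90.83%, and 89.82%, and `gaps` matches the reported margins of 3.23%, 6.16%, 4.01%, and 5.34% over DMT. These margins are therefore absolute differences in percentage points rather than relative improvements.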

Table 3
Performance comparison of VDCEP and four variants with different path selection strategies.
Method RE TD IO DE Average
𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%)
VDCEP-epvd 85.08 83.07 82.72 79.81 79.12 79.02 78.02 77.33 81.24(↓10.09%) 79.81(↓10.01%)
VDCEP-c1 89.92 86.92 89.52 87.23 88.66 86.03 81.20 81.36 87.33(↓4.00%) 85.39(↓4.43%)
VDCEP-c2 91.42 88.22 91.77 88.14 90.42 88.23 83.79 83.67 89.35(↓1.97%) 87.07(↓2.75%)
VDCEP-c4 93.47 89.89 91.10 87.93 90.41 88.01 83.95 84.11 89.73(↓1.60%) 87.49(↓2.33%)
VDCEP 93.12 90.77 91.57 92.48 91.92 89.56 88.69 86.47 91.33 89.82

find that the critical execution paths have higher feature weights compared to the long execution path, which shows that the critical paths contribute more to the performance of VDCEP. With these interpretable feature weights, we can quickly locate the critical execution path associated with the vulnerability and further analyze the details of the vulnerability.

Answer to RQ1: Our VDCEP outperforms the existing state-of-the-art vulnerability detection methods. In detail, VDCEP achieves a performance improvement of 3.23%–41.58%, 6.16%–59.80%, 4.01%–61.20%, and 5.34%–60.88% in accuracy, recall, precision, and F1-score, respectively.

5.2. RQ2: Effect of the path selection strategy

In this subsection, we explore the effect of different design choices of the path selection strategy on the performance of VDCEP, including different path selection strategies and the number of all selected paths.

The effect of different path selection strategies. The key insight of VDCEP is to decompose the complex CFG into multiple execution paths with rich structural information of code, and then use a heuristic-based path selection strategy to pick a fixed number of critical execution paths. By this, VDCEP can effectively eliminate massive irrelevant code information and learn more accurate vulnerability features. By default, our VDCEP sets the number of the selected execution paths to four. Among them, the number of critical execution paths and long execution paths are three and one, respectively. To explore the effect of different path selection strategies on VDCEP, we create four variants. Among them, VDCEP-epvd adopts the greedy-based path selection strategy of EPVD, which preserves two short paths with low code coverage and one long path. VDCEP-c1, VDCEP-c2, and VDCEP-c4 select one, two, and four critical paths, respectively. Also, VDCEP-c1 and VDCEP-c2 contain three and two long paths, respectively.

From Table 3, we find that the overall performance of the default VDCEP is superior to all variants. VDCEP-epvd obtains the worst results in all vulnerability scenarios. As discussed in the previous subsection, VDCEP-epvd struggles to ensure that the selected short execution paths contain sensitive code information associated with the detected vulnerabilities. In contrast, our default VDCEP uses critical paths to represent the smart contract code and captures more accurate vulnerability features from these critical paths. For the VDCEP-c1 and VDCEP-c2 variants, the performance declines because the reduced number of critical paths prevents us from extracting enough vulnerability-related sensitive information. However, a higher number of critical paths does not necessarily help in vulnerability detection. For example, VDCEP-c4 witnesses a decline of 1.60% and 2.33% in the average accuracy and F1-score, respectively. As the vulnerability logic becomes more complex, it is difficult to guarantee that the critical paths selected by heuristic rules contain sensitive information about all possible vulnerabilities. Thus, we let the long execution path be the supplemental information of the critical paths. Our heuristic-based path selection strategy achieves a tradeoff between the number of critical paths and long paths.

The effect of the number of all selected paths. As mentioned above, VDCEP decomposes the CFG into a large number of execution


Table 4
Performance comparison of VDCEP and four variants with different numbers of selected paths. The number of our default selected paths is four.
Method RE TD IO DE Average
𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%)
VDCEP-p1 80.22 78.46 93.83 90.11 75.72 75.47 76.43 75.29 81.55(↓9.78%) 79.83(↓9.99%)
VDCEP-p2 92.02 89.01 89.55 86.06 86.17 84.68 84.66 84.25 88.10(↓3.23%) 86.00(↓3.82%)
VDCEP-p3 91.76 89.24 92.45 91.14 90.69 87.32 83.88 84.33 89.70(↓1.63%) 88.01(↓1.81%)
VDCEP-all 90.17 86.87 90.51 88.24 90.77 85.84 81.67 83.75 88.28(↓3.05%) 86.18(↓3.64%)
VDCEP 93.12 90.77 91.57 92.48 91.92 89.56 88.69 86.47 91.33 89.82

Table 5
Performance comparison of VDCEP and six variants with different sequence learning models.
Method RE TD IO DE Average
𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%) 𝐴 (%) 𝐹 (%)
VDCEP-RNN 80.89 80.23 77.56 69.84 77.65 77.43 78.56 73.18 78.67(↓12.66%) 75.17(↓14.65%)
VDCEP-GRU 82.38 83.52 76.92 75.68 80.86 81.37 86.42 74.82 81.65(↓9.68%) 78.85(↓10.97%)
VDCEP-LSTM 84.60 82.53 80.16 81.75 82.48 84.16 87.63 79.14 83.72(↓7.61%) 81.90(↓7.93%)
VDCEP-BiLSTM 87.49 85.75 81.56 86.74 85.17 85.82 87.65 81.77 85.47(↓5.86%) 85.02(↓4.80%)
VDCEP-TextCNN 89.37 89.86 83.08 82.06 83.54 83.28 87.82 83.83 85.95(↓5.38%) 84.76(↓5.06%)
VDCEP-Transformer 91.80 88.36 86.31 85.59 88.37 86.86 89.43 84.96 88.98(↓2.35%) 86.44(↓3.38%)
VDCEP 93.12 90.77 91.57 92.48 91.92 89.56 88.69 86.47 91.33 89.82

paths and only selects four execution paths to represent the source code. The reason behind this is that it is impractical for us to learn vulnerability features from all execution paths due to the cost constraints of model training. Therefore, it is necessary to investigate the effect of the number of different execution paths on the VDCEP. We create four variants with different numbers of the selected execution paths. Among them, VDCEP-p1, VDCEP-p2, and VDCEP-p3 have one long execution path. VDCEP-p2 and VDCEP-p3 also contain one and two critical execution paths, respectively. VDCEP-all preserves all execution paths in the CFG without any path selection measures. Extensive experimental results are recorded in Table 4.

Obviously, all variants still do not perform as well as the default VDCEP, showing a decrease of 1.63%–9.78% in accuracy and 1.81%–9.99% in F1-score. Among them, the lack of critical execution paths hinders the performance of the VDCEP-p1, VDCEP-p2, and VDCEP-p3 variants. Besides, the performance degradation of VDCEP-all suggests that a large number of execution paths may limit the detection performance. The reason is that we fuse the feature representations of all selected paths and generate an enhanced feature representation for vulnerability detection. As the number of the selected paths increases, the weight of the execution paths that are most associated with the detected vulnerability decreases, which further hinders the detection performance. Therefore, it is reasonable for us to set the number of selected execution paths to four.

Answer to RQ2: The heuristic-based path selection strategy allows VDCEP to eliminate massive irrelevant code information. The default design of our path selection strategy is the most beneficial decision.

5.3. RQ3: Effect of the feature extraction module

In this subsection, we would like to explore the effect of the feature extraction module on the VDCEP, including the capability of handling

module with six classic sequence learning models, including Recurrent Neural Network (RNN) [39], Gated Recurrent Unit (GRU) [40], LSTM, Bi-directional Long Short-Term Memory (BiLSTM) [13], TextCNN [41], and Transformer [14].

In Table 5, all variants do not perform as well as the default VDCEP. For example, VDCEP-RNN even witnesses a decline of 12.66% and 14.65% in the average accuracy and F1-score, respectively. This highlights the capability of our feature extraction module in handling sequence data. In detail, we first apply the Transformer model to capture long-range dependencies in the code snippet, which provides a better representation of the context code information (i.e., program structural information) in the execution path. For example, both the default VDCEP and VDCEP-Transformer outperform the remaining variants. Second, due to the limitation of computational resources, it is difficult for the Transformer model to handle long sequence data. This shortcoming makes VDCEP-Transformer miss some important code information related to the vulnerability. As a countermeasure, we employ the convolutional neural network to eliminate much irrelevant code information and effectively reduce the length of the input code.

The effect of the model hyperparameters. To reveal more details about our feature extraction module, we focus on two important hyperparameters, i.e., the dimension (𝑘) of the token vector and the number (ℎ) of encoder layers in the Transformer, and analyze how they affect the performance of VDCEP. In this study, we choose eight encoder configurations with 2, 3, 4, 5, 6, 7, 8, and 9 layers, respectively. Besides, the dimensions of the token vector are 100, 200, 400, 600, and 1000, respectively. Fig. 7 gives the F1-score of VDCEP with different hyperparameter combinations on the reentrancy and timestamp dependency vulnerabilities.

VDCEP achieves optimal performance on both vulnerability scenarios when the dimension of the token vector and the number of encoder layers are set to 400 and 7, respectively. This is our default model hyperparameter setting. Specifically, high-dimensional token vectors can more accurately distinguish code elements from each other, but they also cause a huge memory overhead and dilute the relationships between code elements. Moreover, when the number of encoder layers is small, VDCEP fails to capture enough contextual code information from the execution paths, resulting in a low F1-score.
the sequence data and the effect of the model hyperparameters. However, as the number of encoder layers increases, i.e., more than
The capability in handling the sequence data. An execution path 7, the performance of our method does not improve due to the model
is a sequence structure in which the execution order of each code state- overfitting issue.
ment is linear. To learn the feature representation of the execution path,
we use a feature extraction module based on the convolutional neural Answer to RQ3: Our feature extraction module is more ef-
networks and the Transformer to handle it. These two networks are fective in handling the execution paths with the sequence
representative models within the field of natural language processing. structure. Also, our default model hyperparameter settings
To evaluate the ability of our feature extraction model in processing obtain satisfactory results.
sequence data, we create six variants by replacing the feature extraction
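The dilution argument made in the path-count analysis for RQ2 (more selected paths lower the weight of the most relevant path) can be illustrated with a small sketch. It assumes, purely for illustration, that path weights are obtained by a softmax over per-path relevance scores; this section does not specify the actual fusion mechanism, and the function names `softmax` and `top_path_weight` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw per-path scores into weights that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_path_weight(scores):
    """Weight assigned to the highest-scoring (most relevant) path."""
    return max(softmax(scores))


# One strongly relevant path plus a growing number of weakly relevant ones
# (the scores are illustrative values, not measurements from the paper).
few_paths = [2.0, 1.0, 1.0, 1.0]
many_paths = [2.0] + [1.0] * 9

print(top_path_weight(few_paths))   # larger share for the relevant path
print(top_path_weight(many_paths))  # smaller share once more paths are fused
```

Under this assumption, every additional path enlarges the normalizing sum, so the share of the most relevant path can only shrink, which is consistent with the degradation reported for VDCEP-all.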


Fig. 7. The F1-score of VDCEP with different dimensions of the token vector and different numbers of the encoder layer. The bluer the color, the better the detection performance.
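The hyperparameter sweep summarized in Fig. 7 covers five token-vector dimensions and eight encoder depths. A minimal sketch of enumerating that grid and picking the best setting is given below; the helper names `grid` and `best_setting` are hypothetical, and the F1-scores must come from actual training runs (none are reproduced here).

```python
from itertools import product

# Values reported in the text for the two hyperparameters.
TOKEN_DIMS = [100, 200, 400, 600, 1000]      # k: dimension of the token vector
ENCODER_LAYERS = [2, 3, 4, 5, 6, 7, 8, 9]    # h: number of Transformer encoder layers


def grid():
    """Enumerate every (k, h) combination of the sweep."""
    return list(product(TOKEN_DIMS, ENCODER_LAYERS))


def best_setting(f1_by_setting):
    """Return the (k, h) pair with the highest F1-score.

    `f1_by_setting` maps (k, h) -> F1-score measured by the caller.
    """
    return max(f1_by_setting, key=f1_by_setting.get)
```

The grid has 5 x 8 = 40 combinations per vulnerability scenario; the text reports (k, h) = (400, 7) as the best setting on both reentrancy and timestamp dependency.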

Fig. 8. The details of how our approach works. There is an integer overflow vulnerability among the red-shaded code statements. The code information of the execution path is
marked with a red serial number.

Fig. 9. Performance of VDCEP and ChatGPT on two smart contract vulnerability scenarios, including reentrancy and timestamp dependency.

6. Discussion

In this section, we explain why our method works. We also compare the detection performance of our method with that of ChatGPT. Finally, some threats and implications are given.

6.1. Why does our VDCEP work?

Here, we recall the advantages of VDCEP, which clarify its contribution to the field of smart contract vulnerability detection. First, compared to existing graph-based vulnerability detection methods (e.g., AME, SMS, and DMT), VDCEP can capture more vulnerability-related code features from critical execution paths. This improves detection performance. The evidence is that VDCEP achieves the best results in most vulnerability scenarios (12 out of 16 combination cases) in Table 2. VDCEP outperforms six graph-based methods by 3.23%–20.72% and 5.34%–17.41% in accuracy and F1-score, respectively. Second, VDCEP provides more interpretable details of vulnerability detection by analyzing the feature weights of critical execution paths. In Fig. 5, we give the path feature weights on four vulnerability scenarios. The execution path with the highest feature weight is considered to be the most relevant to the vulnerability. By analyzing this, we can get more accurate details about the vulnerability.

Specifically, VDCEP uses a heuristic-based path selection strategy to identify a fixed number of execution paths from the code structure graph, including the critical paths and long paths. These two types of execution paths can complement each other in smart contract vulnerability detection, where the critical paths contribute more vulnerability-related sensitive information and the long paths provide some potentially important code information. Deconstructing the code structure graph into these execution paths can effectively reduce the interference of irrelevant code information. Following this, a feature extraction module is adopted to learn more accurate vulnerability features on all selected paths and calculate the path feature weights.

As mentioned in Section 2.2, this example contains the integer overflow vulnerability since it has the vulnerability-related code statements (see lines 13, 14, 21, and 32) shaded in red. As shown in the middle of Fig. 8, the first critical path gets the highest feature weight,
which suggests this path deserves more attention when detecting the integer overflow vulnerability. On the right of Fig. 8, we mark the code information of the first execution path with a red serial number. It is evident that this path contains less irrelevant code information than the original code snippet, and thus more attention is assigned to the vulnerability-related code information. This strategy allows VDCEP to pay more attention to the vulnerable code information and establish strong semantic dependencies for achieving accurate detection.

6.2. How does ChatGPT perform in detecting smart contract vulnerabilities?

Recently, large language models, represented by ChatGPT, have received a lot of attention due to their demonstrated power in code analysis and code generation tasks [29]. Therefore, we investigate the performance of ChatGPT in detecting smart contract vulnerabilities. In detail, we send requests to the ChatGPT model directly using the API provided by OpenAI (https://platform.openai.com/docs/models/gpt-4). For example, we first input the prompt as: "please analyze the following code snippet for <type> vulnerabilities, the code snippet starts with @. @ <code>". Here, <code> is the entire source code of the smart contract and <type> indicates the type of smart contract vulnerability. Following this, the ChatGPT model returns a response determining whether the input code snippet is vulnerable or not.

Fig. 9 illustrates the performance of ChatGPT and VDCEP in identifying reentrancy and timestamp dependency vulnerabilities. In both vulnerability scenarios, VDCEP obtains better performance than ChatGPT. The relatively poor performance of ChatGPT can be attributed to several reasons. First, the ChatGPT model, like other language models based on the GPT architecture, is primarily applied to unsupervised learning tasks. These tasks involve training the model on a vast amount of code snippets without explicit human-labeled supervision. On the other hand, vulnerability detection is a supervised learning task that requires models to be trained on specific source code datasets with labeled vulnerabilities. Unfortunately, ChatGPT has not been trained on such datasets with specific vulnerability labels, making it challenging for the model to comprehend the requirements of the vulnerability detection task. Consequently, its performance in identifying smart contract vulnerabilities is hindered.

6.3. Threats to validity

The first threat to validity is the limited number of smart contract vulnerability scenarios. We evaluate VDCEP on a benchmark dataset with four types of smart contract vulnerabilities: reentrancy, timestamp dependency, integer overflow/underflow, and delegatecall. These vulnerabilities have typical characteristics of smart contract vulnerabilities, such as the misuse of on-chain information (timestamp dependency), and are widely used in many vulnerability detection efforts [17,24]. However, they cannot cover all kinds of vulnerabilities in smart contracts. We will extend VDCEP to more categories of smart contract vulnerabilities, such as state-reverting vulnerability [6] and cross-contract vulnerability [7]. Briefly, we first collect training samples with specific vulnerabilities from open-source smart contract datasets [6,7]. We then use the same model training operation to update VDCEP on these new vulnerability datasets. In this way, VDCEP can cover more types of vulnerabilities in smart contracts, which contributes to the security of the blockchain system.

The second threat is related to the generalizability of the datasets. The smart contract code within the datasets is collected from Ethereum, which may not represent all smart contract vulnerabilities in blockchain systems. In detail, Ethereum is one of the most popular blockchain systems, and there are other popular blockchain systems such as the VNT chain platform [18,19]. These platforms also host a large number of smart contracts. In the future, we plan to collect smart contract code from various blockchain systems and improve the generalizability of VDCEP. In addition, although VDCEP focuses on smart contract vulnerability detection, it is general enough to be applied to other software vulnerability detection. There are distinct differences in program syntax among different software, but certain types of vulnerabilities (e.g., integer overflow/underflow) we are concerned about can exist in different software. For these vulnerabilities, VDCEP can get ideal performance on the corresponding software dataset after completing model training.

The final threat comes from the execution paths. VDCEP only deconstructs the CFG into execution paths and does not directly utilize other code structural information, i.e., data flow and value flow information. Some potential smart contract vulnerabilities may be more associated with this code structure information. In fact, the execution path from the CFG implicitly contains this structural information, which can be well captured by the Transformer model. Our future research is to allow VDCEP to generate execution paths from multiple types of code structure graphs and learn the vulnerability features. Besides, VDCEP primarily extracts vulnerability features from the critical execution paths identified by the heuristic rules. There is a special case where we cannot capture enough critical paths if there is a lack of specific statements in the code snippet. To mitigate this, we supplement the remaining critical execution paths with long execution paths.

6.4. Limitation

There is a potential limitation regarding the accuracy of data labeling. The four vulnerability datasets utilized in our study are derived from a widely recognized smart contract dataset. This benchmark dataset comprises smart contract samples collected and labeled by Qian et al. [24]. However, potential human errors can lead to some inaccurate labels, thus constituting a dataset containing noisy labels. This can have significant impacts on the generalizability of the experimental results. In detail, VDCEP is a deep learning-based vulnerability detection method, and deep learning techniques rely on high-quality (or clean) labeled datasets for model training. If the datasets contain many noisy labels (or inaccurate labels), this can lead the trained model to learn low-quality feature representations of code samples. Subsequently, the learning of vulnerability features is hindered, and the vulnerability detection methods are misled to obtain incorrect prediction results [42,43]. Ultimately, the generalizability of our method is restricted, and it is not effective enough in real-world smart contract scenarios. In fact, the noisy label issue is widespread in software vulnerability detection [44]. For instance, some deep learning-based methods (e.g., VulDeePecker [36]) applied in C/C++ projects are still unable to learn accurate vulnerability features from noisy labels. Specifically, a recent study [45] indicates that on the clean CWE190 vulnerability dataset, VulDeePecker can achieve an F1-score of 85.0%. However, it only gets an F1-score of 66.0% if the dataset contains 30% noisy labels (i.e., flip label noise). This finding further highlights the damage of inaccurate labels on the generalizability of our research.

To mitigate this risk, we implement the original labeling strategy [24] to meticulously review and verify each smart contract within the vulnerability datasets. Any data samples found to have inaccurate labels are subsequently removed from the corresponding datasets. These measures are undertaken to ensure the integrity and accuracy of the labeling results. Moreover, weakly supervised learning, including semi-supervised and self-supervised learning, is a promising strategy to reduce the impact of inaccurate labels [44]. This approach combines a small set of accurately labeled data with a large amount of unlabeled data for model training, thereby avoiding the need for human review.
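To make the notion of flip label noise in Section 6.4 concrete, the sketch below injects a chosen fraction of flipped labels into a binary-labeled dataset. It assumes that flip noise means switching the vulnerable/non-vulnerable label of randomly chosen samples; the exact protocol of the cited study [45] is not described in this paper, and `flip_labels` is a hypothetical helper.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary `labels` (0 or 1) with a fraction flipped.

    `noise_rate` is the fraction of samples whose label is switched,
    e.g. 0.3 for 30% noisy labels. The input list is not modified.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)                       # fixed seed for repeatability
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


clean = [0, 1] * 50                 # 100 toy labels, not data from the paper
noisy = flip_labels(clean, 0.3)
print(sum(a != b for a, b in zip(clean, noisy)))    # number of flipped labels
```

Training the same detector on `clean` and on `noisy` labels is one way to measure how sensitive a method is to labeling errors.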

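The prompt format quoted in Section 6.2 can be assembled programmatically. The sketch below only builds the prompt string from the template given in the text; how the request is sent and how the reply is mapped to a vulnerable/non-vulnerable decision are not specified in the paper, so both are left out. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names.

```python
# Template taken from the prompt quoted in Section 6.2, with <type> and <code>
# turned into named placeholders.
PROMPT_TEMPLATE = (
    "please analyze the following code snippet for {vuln_type} vulnerabilities, "
    "the code snippet starts with @. @ {code}"
)


def build_prompt(vuln_type, code):
    """Fill the template with a vulnerability type and the full contract source."""
    # str.format only parses the template, so braces inside `code`
    # (common in Solidity) are inserted verbatim.
    return PROMPT_TEMPLATE.format(vuln_type=vuln_type, code=code)


print(build_prompt("reentrancy", "contract Wallet { }"))
```

Keeping the template in one place makes it easy to issue the same query for each vulnerability type under study.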

6.5. Implications

VDCEP achieves state-of-the-art performance in the field of smart contract vulnerability detection, offering profound implications for the security of blockchain systems. Specifically, we devise a new code representation method centered on critical execution paths, different from the existing graph structures [16–18] and sequence structures [11,12] of code. First, the use of execution paths addresses the critical challenge of representing source code in a manner that balances the elimination of irrelevant information with the retention of essential structural details [3,22]. Then, instead of using the execution paths directly, we employ heuristic rules to select critical execution paths that contain more vulnerability-related code information. This strategic path selection process not only optimizes code representation but also enhances vulnerability feature learning, thus elevating the performance of vulnerability detection.

Moreover, VDCEP provides interpretable details by analyzing the path feature weights, contributing to a deeper understanding of the vulnerability detection process. This enables researchers to gain insight into the intrinsic mechanisms of vulnerability detection. In summary, the introduction of VDCEP enables blockchain systems to effectively address emerging vulnerability threats, which further ensures the security of blockchain systems.

7. Related work

Here, we first list and describe representative efforts in smart contract vulnerability detection. Then, vulnerability detection efforts applied to other software are presented.

7.1. Vulnerability detection in smart contracts

Detecting smart contract vulnerabilities is important to ensure the security of the blockchain system [8–10]. In response, many smart contract vulnerability detection methods have been developed, which include the following rule-based and deep learning-based methods.

Rule-based methods mainly contain static analysis [25,33,35] and dynamic analysis [9,23,31,32]. For instance, Tikhomirov et al. [33] converted the smart contract into an XML-based parse tree as an intermediate representation and used XPath schema queries to check for potential vulnerabilities. Tsankov et al. [35] obtained the semantic information of the smart contract in Datalog syntax and then checked it against the predefined security property rules by inferring the semantic information. Feist et al. [25] used abstract syntax trees to represent the smart contract code and generated the corresponding intermediate representation for vulnerability detection. Static analysis tools examine the source code of a smart contract without executing it. However, they rely solely on the structure and syntax of code, without considering runtime behavior or external inputs. Conversely, dynamic analysis tools analyze the source code during runtime, by executing it or simulating various runtime scenarios. For example, Mythril [31] was presented as a symbolic execution engine, which performs vulnerability detection through taint analysis and control flow inspection techniques. Nguyen et al. [32] adopted a branch distance-driven fuzz testing technique to generate smart contract execution traces and then analyzed the potential vulnerabilities. Torres et al. [23] combined evolutionary fuzz testing and constraint-solving techniques, and then developed a hybrid fuzzer to identify potential vulnerabilities. Ji et al. [9] designed a greybox fuzzing technique based on input parameter analysis and accelerated multi-objective search strategies for vulnerability detection. These dynamic analyzers may miss certain vulnerabilities since they do not cover all possible program paths or inputs.

In fact, these rule-based methods typically require predefined vulnerability rules provided by human experts to conduct code analysis, which is quite time-consuming and not automated [15]. Additionally, when predefined rules contain identical syntax elements found in different smart contract codes, it can result in high rates of false positives.

Deep learning-based methods have recently gained attention due to their ability to automatically learn vulnerability features. They are divided into sequence-based and graph-based methods according to the code representations.

(1) Sequence-based methods treat the smart contract code as a sequence structure. For example, Tann et al. [11] and Yu et al. [12] converted the source code into the opcode sequence and source code sequence, respectively. Then, they utilized LSTM-based deep learning models to learn the vulnerability features. Zhu et al. [28] adopted a sequential model based on the BiLSTM model and attention mechanism to handle the source code, and then extracted the contextual semantic information for vulnerability detection.

(2) Graph-based methods use various graph structures to preserve the program structure information. For instance, some efforts [18,19] converted source code into a code semantic graph that contains control flow and data flow information. Zhuang et al. [19] employed a temporal message propagation network to learn vulnerability features. Liu et al. [18] combined the code semantic graph and expert patterns for vulnerability detection. Wu et al. [17] extracted the critical parts from the data flow graph and used a pretraining model to identify vulnerabilities. Jie et al. [16] represented the code as a CFG and adopted the graph convolutional network to learn the feature representation for vulnerability detection. Qian et al. [24] utilized the teacher–student network, including single-modality student and dual-modality teacher networks, to capture vulnerability features from the CFG. To reduce irrelevant code information, these graph-based methods generally utilize graph slicing techniques, such as graph sampling or graph pooling, to simplify the code graph.

In practice, it is hard for this simplified code graph to preserve relatively complete structural information of code. Motivated by [3,22], we deconstruct the code structure graph into the critical execution paths with complete code execution logic and concentrate on learning more accurate vulnerability features from these critical paths. In comparison, sequence-based methods have faster detection efficiency since they are not required to deal with complex code structures [3]. This means that they ignore the structural information of the code, which hinders the capture of the vulnerability features. Graph-based methods have better code representation capability by representing the code as a non-sequence structure (i.e., simplified code graph). This simplified graph may ignore potentially valuable code information associated with the vulnerabilities, and fails to offer interpretable details of vulnerability detection [18].

7.2. Vulnerability detection in other software

It is worth noting that we also pay attention to vulnerability detection methods [1,3,26,46] applied to other software programs, i.e., C/C++ programs. Chakraborty et al. [46] presented ReVeal, which employed a gated graph neural network to extract the vulnerability features from the property graph that combines the control flow and data dependency information. Dong et al. [1] considered both semantic and syntactic information in the code property graph, and utilized the relational graph convolutional network for vulnerability detection. LineVul [26] was proposed to use a pretraining model (i.e., CodeBERT) as a detection model and handle the sequence structure of the source code. Zhang et al. [3] proposed EPVD, which converted the source code into a CFG and then deconstructed it into some execution paths with the shortest lengths. Subsequently, EPVD introduced CodeBERT to extract vulnerability features from the selected execution paths.

Compared to our VDCEP, these methods do not focus on detecting vulnerabilities in smart contracts. Although EPVD also uses code execution paths for vulnerability detection, it still fails to obtain satisfactory performance in this study. These methods ignore that smart contract vulnerabilities have more distinct code logic than other kinds of software vulnerabilities [8]. The huge gap in vulnerability types and languages makes it hard for the above methods to learn valuable code patterns related to smart contract vulnerabilities.
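The difference between selecting the shortest execution paths (as described for EPVD) and selecting critical plus long paths (as described for VDCEP) can be sketched on a toy CFG. Everything below is an illustrative assumption: the CFG is a plain adjacency dictionary, `is_critical` is a hypothetical predicate standing in for the paper's heuristic rules (which are not given in this section), and the default split of three critical paths and one long path is inferred from the four-path setting and the variant descriptions rather than stated explicitly.

```python
def enumerate_paths(cfg, entry, exit_node):
    """All simple entry-to-exit paths of a CFG given as {node: [successors]}."""
    paths, stack = [], [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        if node == exit_node:
            paths.append(path)
            continue
        for nxt in cfg.get(node, []):
            if nxt not in path:          # never revisit a node, so loops are not followed
                stack.append((nxt, path + [nxt]))
    return paths


def shortest_paths(paths, k):
    """Shortest-path selection: keep the k paths with the fewest nodes."""
    return sorted(paths, key=len)[:k]


def select_paths(paths, is_critical, n_critical=3, n_long=1):
    """Keep critical paths first, then long paths.

    If fewer than `n_critical` critical paths exist, the shortfall is
    filled with additional long paths, mirroring the mitigation in Section 6.3.
    """
    by_length = sorted(paths, key=len, reverse=True)
    critical = [p for p in by_length if any(is_critical(n) for n in p)][:n_critical]
    rest = [p for p in by_length if p not in critical]
    shortfall = n_critical - len(critical)
    return critical + rest[: n_long + shortfall]


# Toy CFG; "call" stands in for a statement flagged by a heuristic rule.
cfg = {"entry": ["check", "call"], "check": ["update"], "update": ["exit"], "call": ["exit"]}
paths = enumerate_paths(cfg, "entry", "exit")
print(shortest_paths(paths, 1))
print(select_paths(paths, lambda n: n == "call", n_critical=1, n_long=1))
```

On this toy graph, shortest-path selection happens to keep only the path through `call`, whereas the second strategy keeps that path because it is flagged as critical and then adds the longest remaining path as context.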


8. Conclusions and future work

In this paper, we proposed VDCEP, a novel deep learning-based framework to identify smart contract vulnerabilities. VDCEP adopts a path selection strategy to construct critical execution paths, ensuring complete code execution logic while eliminating the interference of irrelevant code information. It also uses a feature extraction module to capture more accurate vulnerability features from the selected paths. Results showed that VDCEP achieves state-of-the-art performance, surpassing baseline methods by 6.16% to 59.80% in recall and 5.34% to 60.88% in F1-score. This underscores the effectiveness of VDCEP in smart contract vulnerability detection. Moreover, VDCEP allows for the interpretation of path feature weights, enabling researchers to comprehensively analyze detected vulnerabilities. This provides valuable insights for guiding security enhancements within blockchain systems.

Overall, VDCEP not only advances the state of the art in smart contract vulnerability detection but also helps ensure the security of blockchain systems. Its introduction enables blockchain systems to effectively address emerging vulnerability threats and promotes the healthy development of blockchain technology. In the future, we plan to explore other deep learning-based interpretable technologies to analyze the vulnerabilities more accurately. Also, we will extend VDCEP to identify more kinds of smart contract vulnerabilities, e.g., state-reverting vulnerability and cross-contract vulnerability.

CRediT authorship contribution statement

Jianxin Cheng: Writing – original draft, Visualization, Methodology, Investigation, Data curation, Conceptualization. Yizhou Chen: Writing – review & editing, Validation, Resources, Investigation, Formal analysis. Yongzhi Cao: Writing – review & editing, Validation, Supervision, Project administration, Funding acquisition. Hanpin Wang: Validation, Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported by the National Key R&D Program of China under Grant 2021YFF1201102 and the National Natural Science Foundation of China under Grants 62172016 and 61932001.

References

[1] Y. Dong, Y. Tang, X. Cheng, et al., SedSVD: Statement-level software vulnerability detection based on relational graph convolutional network with subgraph embedding, Inf. Softw. Technol. 158 (2023) 107168.
[2] X. Wen, Y. Chen, C. Gao, et al., Vulnerability detection with graph simplification and enhanced graph representation learning, in: Proceedings of the ACM/IEEE 45th International Conference on Software Engineering, 2023, pp. 2275–2286.
[3] J. Zhang, Z. Liu, X. Hu, et al., Vulnerability detection by learning from syntax-based execution paths of code, IEEE Trans. Softw. Eng. 49 (2023) 4196–4212.
[4] Z. Yang, J. Keung, X. Yu, et al., A multi-modal transformer-based code summarization approach for smart contracts, in: 2021 IEEE/ACM 29th International Conference on Program Comprehension, 2021, pp. 1–12.
[5] S. Cao, X. Sun, L. Bo, et al., Bgnn4vd: Constructing bidirectional graph neural-network for vulnerability detection, Inf. Softw. Technol. 136 (2021) 106576.
[6] Z. Liao, S. Hao, Y. Nan, et al., SmartState: Detecting state-reverting vulnerabilities in smart contracts via fine-grained state-dependency analysis, in: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2023, pp. 980–991.
[7] Z. Liao, Z. Zheng, X. Chen, et al., SmartDagger: a bytecode-based static analysis approach for detecting cross-contract vulnerability, in: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 752–764.
[8] H. Chu, P. Zhang, H. Dong, et al., A survey on smart contract vulnerabilities: Data sources, detection and repair, Inf. Softw. Technol. 159 (2023) 107221.
[9] S. Ji, J. Wu, J. Qiu, et al., Effuzz: Efficient fuzzing by directed search for smart contracts, Inf. Softw. Technol. 159 (2023) 107213.
[10] C. Shi, Y. Xiang, J. Yu, et al., Machine translation-based fine-grained comments generation for solidity smart contracts, Inf. Softw. Technol. 153 (2023) 107065.
[11] W.J.-W. Tann, X.J. Han, S.S. Gupta, et al., Towards safer smart contracts: A sequence learning approach to detecting security threats, 2018, arXiv preprint arXiv:1811.06632.
[12] X. Yu, H. Zhao, B. Hou, et al., DeeSCVHunter: A deep learning-based framework for smart contract vulnerability detection, in: 2021 International Joint Conference on Neural Networks, 2021, pp. 1–8.
[13] Z. Yang, J.W. Keung, X. Yu, et al., On the significance of category prediction for code-comment synchronization, ACM Trans. Softw. Eng. Methodol. 32 (2) (2023) 1–41.
[14] C. Mamede, E. Pinconschi, R. Abreu, A transformer-based IDE plugin for vulnerability detection, in: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–4.
[15] J. Huang, S. Han, W. You, et al., Hunting vulnerable smart contracts via graph embedding based bytecode matching, IEEE Trans. Inf. Forensics Secur. 16 (2021) 2144–2156.
[16] W. Jie, Q. Chen, J. Wang, et al., A novel extended multimodal AI framework towards vulnerability detection in smart contracts, Inform. Sci. 636 (2023) 118907.
[17] H. Wu, Z. Zhang, S. Wang, et al., Peculiar: Smart contract vulnerability detection based on crucial data flow graph and pre-training techniques, in: 2021 IEEE 32nd International Symposium on Software Reliability Engineering, 2021, pp. 378–389.
[18] Z. Liu, P. Qian, X. Wang, et al., Smart contract vulnerability detection: From pure neural network to interpretable graph feature and expert pattern fusion, in: Proceedings of the 30th International Joint Conference on Artificial Intelligence, 2021, pp. 2751–2759.
[19] Y. Zhuang, Z. Liu, P. Qian, et al., Smart contract vulnerability detection using graph neural network, in: Proceedings of the 29th International Joint Conference on Artificial Intelligence, 2020, pp. 3283–3290.
[20] M. Chen, Z. Wei, Z. Huang, et al., Simple and deep graph convolutional networks, in: International Conference on Machine Learning, PMLR, 2020, pp. 1725–1735.
[21] P. Veličković, G. Cucurull, A. Casanova, et al., Graph attention networks, in: International Conference on Learning Representations, 2018, pp. 1–12.
[22] M. Fu, L. Wu, Z. Hong, et al., A critical-path-coverage-based vulnerability detection method for smart contracts, IEEE Access 7 (2019) 147327–147344.
[23] C.F. Torres, A.K. Iannillo, A. Gervais, et al., Confuzzius: A data dependency-aware hybrid fuzzer for smart contracts, in: 2021 IEEE European Symposium on Security and Privacy, 2021, pp. 103–119.
[24] P. Qian, Z. Liu, Y. Yin, et al., Cross-modality mutual learning for enhancing smart contract vulnerability detection on bytecode, in: Proceedings of the ACM Web Conference, 2023, pp. 2220–2229.
[25] J. Feist, G. Grieco, A. Groce, Slither: A static analysis framework for smart contracts, in: 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain, 2019, pp. 8–15.
[26] M. Fu, C. Tantithamthavorn, Linevul: A transformer-based line-level vulnerability prediction, in: Proceedings of the 19th International Conference on Mining Software Repositories, 2022, pp. 608–620.
[27] F. Zhang, X. Yu, J. Keung, et al., Improving stack overflow question title generation with copying enhanced CodeBERT model and bi-modal information, Inf. Softw. Technol. 148 (2022) 106922.
[28] H. Zhu, K. Yang, L. Wang, et al., GraBit: A sequential model-based framework for smart contract vulnerability detection, in: 2023 IEEE 34th International Symposium on Software Reliability Engineering, 2023, pp. 568–577.
[29] Y. Chang, X. Wang, J. Wang, et al., A survey on evaluation of large language models, ACM Trans. Intell. Syst. Technol. 15 (3) (2024) 1–45.
[30] T. Durieux, J.F. Ferreira, R. Abreu, et al., Empirical review of automated analysis tools on 47,587 ethereum smart contracts, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 530–541.
[31] M. Breidenbach, A framework for bug hunting on the ethereum blockchain, 2017, https://github.com/ConsenSys/mythril/, Accessed 20 February 2024.
[32] T.D. Nguyen, L.H. Pham, J. Sun, et al., Sfuzz: An efficient adaptive fuzzer for solidity smart contracts, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 778–788.
[33] S. Tikhomirov, E. Voskresenskaya, I. Ivanitskiy, et al., Smartcheck: Static analysis of ethereum smart contracts, in: Proceedings of the 1st International Workshop on Emerging Trends in Software Engineering for Blockchain, 2018, pp. 9–16.


[34] C.F. Torres, J. Schütte, R. State, Osiris: Hunting for integer bugs in ethereum smart contracts, in: Proceedings of the 34th Annual Computer Security Applications Conference, 2018, pp. 664–676.
[35] P. Tsankov, A. Dan, D. Drachsler-Cohen, et al., Securify: Practical security analysis of smart contracts, in: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 67–82.
[36] D. Zou, S. Wang, S. Xu, et al., μVulDeePecker: A deep learning-based system for multiclass vulnerability detection, IEEE Trans. Dependable Secure Comput. 18 (5) (2021) 2224–2236.
[37] F. Lomio, E. Iannone, A. De Lucia, et al., Just-in-time software vulnerability detection: Are we there yet? J. Syst. Softw. 188 (2022) 111283.
[38] M. Wang, C. Tao, H. Guo, LCVD: Loop-oriented code vulnerability detection via graph neural network, J. Syst. Softw. 202 (2023) 111706.
[39] J. Zhao, F. Huang, J. Lv, et al., Do RNN and LSTM have long memory? in: International Conference on Machine Learning, PMLR, 2020, pp. 11365–11375.
[40] M. Xia, H. Shao, X. Ma, et al., A stacked GRU-RNN-based approach for predicting renewable energy and electricity load for smart grid operation, IEEE Trans. Ind. Inform. 17 (10) (2021) 7050–7059.
[41] L. Yu, L. Chen, J. Dong, et al., Detecting malicious web requests using an enhanced textcnn, in: 2020 IEEE 44th Annual Computers, Software, and Applications Conference, 2020, pp. 768–777.
[42] Y. Shen, K. Li, L. Mao, et al., IntelliCon: Confidence-based approach for fine-grained vulnerability analysis in smart contracts, in: International Conference on Blockchain and Trustworthy Systems, 2023, pp. 45–59.
[43] J. Zhang, L. Tu, J. Cai, et al., Vulnerability detection for smart contract via backward bayesian active learning, in: International Conference on Applied Cryptography and Network Security, 2022, pp. 66–83.
[44] X. Wen, X. Wang, C. Gao, et al., When less is enough: Positive and unlabeled learning model for vulnerability detection, in: Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering, 2023, pp. 345–357.
[45] X. Nie, N. Li, K. Wang, et al., Understanding and tackling label errors in deep learning-based vulnerability detection (experience paper), in: Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2023, pp. 52–63.
[46] S. Chakraborty, R. Krishna, Y. Ding, et al., Deep learning based vulnerability detection: Are we there yet, IEEE Trans. Softw. Eng. 48 (2022) 3280–3296.
