An Integrated Smart Contract Vulnerability Detection Tool Using Multi-Layer Perceptron on Real-Time Solidity Smart Contracts
Received 8 January 2024, accepted 30 January 2024, date of publication 8 February 2024, date of current version 16 February 2024.
Digital Object Identifier 10.1109/ACCESS.2024.3364351
ABSTRACT Smart contract vulnerabilities have led to substantial disruptions, from the DAO attack to the recent Poolz Finance exploit. Although smart contract vulnerability definitions initially lacked standardization, and even with the advancements in Solidity, the potential for deploying malicious contracts to exploit legitimate ones persists. The abstract syntax tree (AST), opcodes, and control flow graph (CFG) are the intermediate representations for Solidity contracts. In this paper, we propose an integrated and efficient smart contract vulnerability detection algorithm based on the Multi-layer perceptron (MLP). We use feature vectors derived from the opcodes and CFG to train the machine learning (ML) models. Existing ML-based approaches for analyzing smart contract code are constrained by a limited vulnerability detection space, significantly varying Solidity versions, and the lack of a unified approach to verify against the ground truth. The primary contributions of this paper are 1) a standardized pre-processing method for smart contract training data, 2) introducing bugs to create a balanced dataset of flawed files across Solidity versions using the AST, and 3) standardizing vulnerability identification using the Smart Contract Weakness Classification (SWC) registry. The ML models employed for benchmarking the proposed MLP and a multi-input model combining MLP and Long short-term memory (LSTM) are Random forest (RF), XGBoost (XGB), and Support vector machine (SVM). The performance evaluation on real-time smart contracts deployed on the Ethereum blockchain shows an accuracy of up to 91% using MLP, with the lowest average False Positive Rate (FPR) among all tools and models, measuring at 0.0125.
INDEX TERMS Blockchain, ethereum, machine learning, multi-layer perceptron, real-time smart contracts,
solidity smart contracts, vulnerability analysis and detection, code analysis, software testing.
in sharing data between heterogeneous devices using smart contracts. A framework by [3] suggests using smart contracts for recording data in the cloud, for data security, and for the accountability of Internet of Things (IoT) devices. Another use case for IoT smart contracts is unmanned aerial vehicles (UAVs) [4], transforming a centralized trusted authority into a secure decentralized network. However, these works use smart contracts as authentication mechanisms for IoT and assume the inherent security of smart contracts.

Smart contracts have not been immune to hacks since early 2016, when major hacks started to surface and gain media attention. These hacks are not only harmful to the industry but also cast a large shadow over the longevity of blockchain applications. In 2022, approximately 1.9 billion USD was lost to various hacks exploiting vulnerabilities in smart contract logic and manipulating human errors in smart contracts [5]. Specifically, Poolz Finance suffered a major arithmetic overflow hack, where one of its methods for pool creation contained a manual summing of token counts, which resulted in a loss of USD 6,650,000 [6], [7].

There have been attempts to prevent such hacks by creating contract standards and defining vulnerabilities. OpenZeppelin [8], ConsenSys [9], and Trail of Bits [10] are some of the front runners in the quest to support developers and overcome potential security vulnerabilities.

In terms of standardization, smart contract vulnerabilities have been loosely defined since their emergence. There have been different flavors, namely the Smart Contract Weakness Classification (SWC) registry [11], DASP [12], and Crytic's static analyzer, i.e., Slither's detector documentation [13]. These definitions cover most of the notable vulnerabilities; however, there is an existing gap in terms of compliance with any security standard and coverage of the entire smart contract vulnerability space, constrained by the varying Solidity versions and ever-emerging smart contract logic bugs. These vulnerabilities, often resulting from human errors and version changes, underscore the need for more robust and reliable security measures in smart contracts. While such vulnerabilities can be detected by software verification and validation tools using static [13], [14], dynamic [15], and formal verification [14], [16], each method has its own limitations. In the work done by [17], the authors conducted an extensive test of software verification and validation tools and concluded that there is no single analysis tool that can detect all smart contract vulnerabilities. Not only can new vulnerabilities not be detected if they were not predefined within the tool, but existing vulnerability detection also had significant false positive rates. These findings therefore advocate the use of a Machine Learning (ML) approach that can be dynamically trained on newer smart contract bugs and Solidity versions while reducing the false positive rate.

For any ML model, the key factors that decide its reliability are (i) the origin of the dataset used for training the model, (ii) the quality and correctness of the training dataset, (iii) the scalability and run-time of the model, and (iv) standardized verification and validation of the model. Recent works that have claimed to have sourced data from reliable sources [18], [19], [20] are still based on the validation of software verification and validation tools such as Mythril [14], Slither [13], and Oyente [16] to determine the ground truth about the smart contracts used for training. Firstly, the main concern with such reliance on third-party software verification tools for the training dataset is that it is prone to the inaccuracies that are inherent to these tools. Secondly, to introduce vulnerabilities into the training dataset, a synthetic data generation method using the Synthetic Minority Oversampling Technique (SMOTE) is being used [19], [20]. However, synthetic data can never be a true representation of a truly vulnerable dataset. Not only does it not keep up with the significantly varying Solidity versions, but it also leads to an imbalance in the training dataset due to the way it is implemented. Thirdly, some existing works simply skip the pre-processing steps that ensure the quality and correctness of the training dataset [18], which makes the solution practically not useful, especially when the training dataset spans different Solidity versions and contains comments with code-like syntax. This becomes a prime concern since the solc compiler cannot differentiate code-like commented syntax, which is crucial for the bug-insertion algorithm. Also, the existing bug-injection tools [21] suffer from practical issues: no Solidity version check, no syntax check per Solidity version, no exception handling, and no control over the bug injection logic (i.e., the existing tools dump all bugs into a single smart contract). Motivated by these practical concerns, our proposed solution injects known vulnerability patterns into clean contracts, ensuring a clear indication of the vulnerability bug type in each contract, while developing a standardized pre-processing method to generate a balanced, good-quality training dataset and validating against well-defined vulnerability standards.

In this paper, we introduce an ML approach that uses a runtime opcode extraction algorithm for feature extraction and a trigram-based method for vectorization. We reference the SWC registry [11] for smart contract vulnerability categorization, as it has been one of the most well-defined mappings while being loosely coupled with the Common Weakness Enumeration (CWE) [22]. To test the efficacy of our solutions and perform benchmarking, we employ Mythril, Slither, and an integrated tool known as MythSlith. We have developed MythSlith, a tool that integrates Mythril and Slither to increase the coverage of the smart contract vulnerability detection space. These tools are used for comparison against the ML model trained with the bug-injected dataset. Our contributions in this paper are summarized as follows:
• We developed a standardized pre-processing algorithm for cleaning smart contract training data to address
This sequence of events might seem normal for traditional software code; however, in the case of Solidity, a fallback function of the external account can re-trigger the same sending function again before an update can happen within the original contract. This could result in an indirect recursive function call.

Arithmetic vulnerabilities are more commonly known as Overflow-Underflow (Fig 2). Overflow occurs when an operation tries to add to a variable that is already at its maximum possible value. Without any sort of guard, this variable will overflow and wrap back to 0 or the minimum possible value. As opposed to Overflow, Underflow happens when an operation tries to subtract from a variable that is at the minimum possible value, resulting in the value jumping to the maximum possible value. This vulnerability can lead to severe security issues, as hackers can use this behavior to alter account balances or change the ownership of a contract.
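To make the wraparound concrete, the following minimal Python sketch (ours, not from the paper) mimics unchecked 256-bit EVM arithmetic, the semantics of Solidity before version 0.8.0:

# Illustrative only: unchecked EVM-style uint256 arithmetic wraps
# modulo 2**256, which is what Overflow-Underflow exploits rely on.
UINT256_MAX = 2**256 - 1

def add_unchecked(a: int, b: int) -> int:
    """Addition without guards: wraps past the maximum (overflow)."""
    return (a + b) & UINT256_MAX

def sub_unchecked(a: int, b: int) -> int:
    """Subtraction without guards: wraps past zero (underflow)."""
    return (a - b) & UINT256_MAX

assert add_unchecked(UINT256_MAX, 1) == 0    # overflows back to 0
assert sub_unchecked(0, 1) == UINT256_MAX    # underflows to the maximum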
Unauthorized send arises when there is no access control for a function that requires an access check (Fig 3). Such a function may contain withdrawal or reward disbursement functionality.

Transaction origin, also known as tx.origin, arises from the misuse of the global variable 'tx.origin' in Solidity (Fig 4). In all transactions there is an origin and a sender. Origin is the address that started the chain of calls, while the sender is the address of the immediate caller.

III. LITERATURE REVIEW
In this section, we discuss the currently available software verification and validation tools for smart contract detection, as well as works that use a machine learning approach. Our review covers the data sampling techniques, feature extraction methods, and smart contract vulnerability standards. An overview of the comparison can be found in Table 1.

A. SOFTWARE VERIFICATION AND VALIDATION TOOLS
In recent years, a number of analysis tools have been introduced; they can be categorized into three types: static, dynamic, and formal verification. Static tools rely on the static information of the code to derive a prediction without executing the program [27]. Some of the features extracted are the abstract syntax tree (AST), compiled bytecode, and opcodes. Dynamic tools analyze a running program; one such example is the fuzzer [28]. Formal verification tools rely on a mathematical definition and use a solver such as Z3 to resolve the derived formula [29].

Among the software verification and validation tools for smart contracts are Slither [13], Mythril [14], and DefectChecker [30] for static verification; Manticore [15] for dynamic analysis; and Oyente [16], Mythril, and DefectChecker for formal verification. Although these tools
have been instrumental in numerous audits, they are constrained by the predefined patterns of each bug. Should new bugs emerge, an expert update would be necessary.

B. MACHINE LEARNING BASED VULNERABILITY DETECTION
As the name suggests, the ML method is used to construct a set of decisions from the features of the data to make a logical conclusion about a trend or classification. Such a set of decisions is known as a model, and it can be built with supervised or unsupervised learning. Supervised learning is an algorithm that learns with labeled data, while unsupervised learning suggests clusters of possible outcomes, and deriving the outcome will depend largely on the feature set.

1) DATA SAMPLING
As discussed in Section I, data integrity is an integral part of ML model training, but existing works from [18], [19], and [20] did not have a cleaning or pre-processing step to ensure contracts are clean. While [26] performed checks using three software validation tools, namely Slither, Oyente, and DefectChecker, to ensure contracts are properly labelled (removing contracts if any error was present in a tool's output, and additionally removing smart contracts without version numbers), the labeling of data still relies on software validation tools, which face the same issue as previous works, i.e., the training dataset is prone to inaccuracies inherent to these tools.

In our approach, bug injection ensures that the injected bugs are precisely the identified ones. However, bug injection from SolidiFI [21] lacks post-injection error checking, leading to the creation of an erroneous dataset. Moreover, the code from [21] did not anticipate a situation where the bug count is less than the number of available injection locations. To accommodate such scenarios, a recursive function was incorporated to check the bug count against the number of available locations, as sketched below. In addition, we have also incorporated a validation and error-handling mechanism into the proposed bug injection algorithm.
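A minimal sketch of this guard (our own naming; the paper's exact logic lives in its bug injection algorithm):

def fit_bug_count(bug_count: int, locations: list) -> int:
    """Recursively shrink the requested bug count until it is
    compatible with the available injection locations."""
    if bug_count <= len(locations):
        return bug_count
    return fit_bug_count(bug_count - 1, locations)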
2) FEATURE EXTRACTION
Using opcodes for feature extraction is one of the more popular methods, as it is agnostic to expert patterns and presents a clear path to identifying any malicious act. The Abstract Syntax Tree (AST) is also one of the common methods, as it contains useful semantic information. Reference [19] uses simplified opcodes followed by Bigram for vectorization, while [18] uses features from the AST, and [20] and [26] use a mixture of AST and simplified opcode features with Bigram.

While traditional models typically process data in a single common format, this does not preclude the combination of vectorized data from various features. The works of [20] and [26] have both mixed their feature embeddings and can produce a good amount of variance for classification. However, in our proposed work, aside from the classical models, we implement a multi-model approach that allows features of different shapes to collaborate effectively.
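As a sketch of what such a multi-input combination can look like (our own Keras formulation; the layer sizes are invented, not the paper's), a fixed-size opcode vector can feed an MLP branch while a variable-shape CFG sequence feeds an LSTM branch, with the two branches merged for classification:

from tensorflow import keras
from tensorflow.keras import layers

def build_multi_input(n_features: int, seq_len: int, emb_dim: int, n_classes: int):
    # MLP branch: fixed-length opcode trigram vector.
    vec_in = keras.Input(shape=(n_features,))
    x = layers.Dense(256, activation="relu")(vec_in)
    # LSTM branch: sequence of CFG walk embeddings.
    seq_in = keras.Input(shape=(seq_len, emb_dim))
    y = layers.LSTM(64)(seq_in)
    # Merge the two feature shapes and classify.
    merged = layers.concatenate([x, y])
    out = layers.Dense(n_classes, activation="softmax")(merged)
    return keras.Model(inputs=[vec_in, seq_in], outputs=out)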
C. VULNERABILITY STANDARDS
Vulnerability standards in the smart contract field are not yet prevalent, resulting in a spew of different definitions by different organizations [11], [12]. This is not healthy for the industry and will hinder further development. It is also clear from existing works [18], [19], [20], [26] that no vulnerability standards were taken up. Consequently, in our proposed work, we have selected the SWC Registry [11] as the vulnerability standard to provide a clear definition.

In the proposed solution, opcodes will be used alongside CFG features. Trigram is used instead, as it captures more context than Bigram or unigram. Rather than the AST, the CFG is chosen because it contains flow information which the AST does not. Simplification of opcodes is done, but differently from previous works: rather than replacing the entire set of PUSH, DUP, and SWAP opcodes with a constant, we leave the first five numbered variants untouched, allowing more variance to be captured. In addition, we have also removed all hexadecimal values to prevent any contract-specific values from being learned by the models.

This review has highlighted the need for expert knowledge to define new bugs for current software verification and validation tools, the significant variations in data sampling and feature extraction methods, and the absence of vulnerability standards. While each work does give clear definitions of its selected vulnerabilities, there is no compliance with a common standard. The data sampling in [26] implemented a robust method using validation tools for labeling; others have not, raising concerns about data integrity. Opcode is a popular representation for feature extraction, and mixing of features is a popular method, as seen in the works of [20] and [26], but a multi-model approach
has yet to be attempted. The disarray in vulnerability standards indicates the need for a single, universally adopted standard to streamline the development process. The insights gained from this review provide a good foundation for developing a robust and reliable smart contract vulnerability detection method.

The research gap is summarized in Table 1, highlighting the novelty of this paper. In this paper, we perform standardized pre-processing steps by introducing both the exclusion of erroneous Solidity versions and the removal of code-like comments, which could otherwise generate an incorrect AST. For data source labeling, we use bug injection in our work, as it provides a reliable ground truth. This was not done in the existing works referenced in Table 1. While the work in [21] uses bug injection, it does not validate the bug injection nor include any pre-processing steps. The most recent work that performs pre-processing by removing code-like comments is found in [26]. However, it does not verify incorrect Solidity versions (during pre-processing), while relying on third-party tools (such as Mythril, Slither, Oyente, DefectChecker, etc.) to label their datasets.

IV. METHODOLOGY
The approach to this research is detailed in this section in the following sequence: IV-A Preparation of dataset, IV-B Feature extraction, IV-D Machine Learning Models for Classification, IV-E Multi-Model Approach, IV-F Design of MythSlith, IV-G Challenges. A flow diagram of the end-to-end pre-processing steps that lead to the ML model training is illustrated in Figure 8.

A. PREPARE DATASET FOR ML TRAINING
1) DATASET
For any ML model to be efficient and usable, generating an error-free and practical dataset is important to achieve a good accuracy and false positive rate. As a first step, the clean dataset was initially sourced from the smart contract sanctuary [31], an open-sourced repository, and we validated the ground truth by compiling each Solidity file using the Solidity compiler (solc) to ensure no errors were present before any bug injection was performed. The version of the compiler used for each file is determined with a JSON file provided by [31], which consists of the specific version used when the smart contract was actually deployed to the mainnet.

The second step is to use the validated clean dataset and inject it with bugs defined by SolidiFI [21], an automated bug injection tool that checks for potential locations using the AST to inject a set of predefined bugs. However, SolidiFI cannot be directly used to inject bugs since it suffers from the following drawbacks: (i) no Solidity version check, (ii) no syntax check, (iii) no exception handling for bugs, and, more importantly, (iv) it injects all the predefined bugs into a single Solidity smart contract, which is not a practical scenario for training the ML model. We hence propose two pre-processing algorithms and a bug injection algorithm below to mitigate the drawbacks of SolidiFI.

TABLE 2. Variables used in the bug injection framework.

2) BUG INJECTION
The goal of employing the bug injection technique is to imitate the introduction of bugs by developers [21] in the smart contract logic. The number of bug snippets is determined by a predefined bug density. The bug density is defined as the number of vulnerable lines of code per clean smart contract. Refer to Table 2 for the variables used in the bug injection algorithm. For example, if we insert 1 line of bug for every 100 lines of code, the bug density for that smart contract will be 1% (see the sketch below). By setting the bug density, we are able to have an even spread of bugs within the entire dataset used for training the ML model. To be uniform across the smart contract vulnerability space, we represent each bug snippet as a function.
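A minimal sketch (our own helper, not the paper's code) of how a bug density target translates into a bug-snippet count per contract:

def num_bug_snippets(total_lines: int, bug_density_pct: float) -> int:
    """Vulnerable lines to inject for a clean contract of total_lines,
    e.g. a 1% density over 100 lines yields a single bug line."""
    return round(total_lines * bug_density_pct / 100)

assert num_bug_snippets(100, 1.0) == 1
assert num_bug_snippets(450, 2.0) == 9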
The process of injecting bugs has the following steps:
Step 1: Pre-process the source file using Algorithm (1)
Step 2: Obtain source attribute information by generating the abstract syntax tree (AST)
Algorithm 4 Extract Opcodes From Bugged Files
1: procedure SimplifyOpcode(fc, B∗used)
2:   Initialize Lopcodes ← ∅
3:   C ← SOLC(fc)
4:   for (ABI, O) ∈ C do
5:     found_match ← False
6:     for bug ∈ B∗used do
7:       for G_i ∈ ABI do
8:         if name_i = bug.name then
9:           found_match ← True
10:          O∗ ← Simplify(O)
11:          Lopcodes ← Lopcodes ∪ {O∗}
12:        end if
13:      end for
14:      if found_match then
15:        break
16:      end if
17:    end for
18:  end for
19:  return Lopcodes
20: end procedure

dependency, Transaction Order Dependency, and Unhandled Exceptions.

1) OPCODES
Opcodes are obtained from solc using the standard input and output JSON method, which is also the recommended approach, as it offers a consistent interface across all compiler versions. As not all contracts within each Solidity file will have a bug injected, a cross-check is needed to take in only the opcodes from injected contracts. Hence, given the file name and B∗used, we sift out contracts by comparing the function names with the Application Binary Interface (ABI). The full algorithm of the opcode extraction process can be found in Algorithm (4).
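For illustration, the following sketch uses py-solc-x (our choice of wrapper; the paper only specifies solc's standard JSON interface) to retrieve each contract's ABI and opcode list:

import solcx

def compile_abi_and_opcodes(source: str, version: str):
    """Yield (contract_name, abi, opcode_list) via solc standard JSON."""
    solcx.install_solc(version)
    out = solcx.compile_standard(
        {
            "language": "Solidity",
            "sources": {"input.sol": {"content": source}},
            "settings": {"outputSelection": {
                "*": {"*": ["abi", "evm.bytecode.opcodes"]}}},
        },
        solc_version=version,
    )
    for name, data in out["contracts"]["input.sol"].items():
        yield name, data["abi"], data["evm"]["bytecode"]["opcodes"].split()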
As a next step, to describe the sifting process, let C be the set of m contracts returned by solc. Each contract in C contains both an Application Binary Interface (ABI) and an ordered list of opcodes. Formally, C is defined as:

C = {(ABI_1, O_1), (ABI_2, O_2), ..., (ABI_m, O_m)}

where ABI_i represents the Application Binary Interface for the ith contract, and O_i represents the ordered list of opcodes for the ith contract. Also, ABI_i = {G_1, G_2, ..., G_r}, where each G_i is a function descriptor, with a maximum of r function descriptors per ABI. Each function descriptor G_i can be represented as a tuple:

G_i = (constant_i, inputs_i, name_i, outputs_i, payable_i, stateMutability_i, type_i)

Each element in G_i maps to the corresponding property in the ABI JSON.

Following this, the opcodes are simplified. Simplification is done because opcodes such as PUSH have 32 variations, while SWAP and DUP have 16, where each variation represents the number of bytes involved on the stack, i.e., PUSH1 - 1 byte, PUSH4 - 4 bytes. The simplification rules enforced are similar to [19] and can be found in Table 3. With this simplification, the number of remaining opcodes is only 77, thus reducing the dimension of the feature vector.
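A short sketch of one reading of this rule (ours; the authoritative rules are in Table 3): PUSH1-PUSH5, DUP1-DUP5, and SWAP1-SWAP5 keep their numbers, higher-numbered variants collapse to the bare mnemonic, and hexadecimal operands are dropped:

import re

def simplify(opcodes: list) -> list:
    """Simplify an opcode sequence per the rule described above."""
    simplified = []
    for op in opcodes:
        if re.fullmatch(r"0x[0-9a-fA-F]+", op):
            continue                     # drop contract-specific hex values
        m = re.fullmatch(r"(PUSH|DUP|SWAP)([0-9]+)", op)
        if m and int(m.group(2)) > 5:
            op = m.group(1)              # e.g. PUSH17 -> PUSH
        simplified.append(op)
    return simplified

assert simplify(["PUSH1", "0x60", "PUSH17", "SWAP9"]) == ["PUSH1", "PUSH", "SWAP"]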
We then employ the n-gram algorithm [32] for feature extraction. In natural language processing and computational linguistics, the n-gram is widely used to compute the frequency distribution over a selected number of tokens, where n refers to the number of adjacent elements to consider from a string of tokens. Unigrams, bigrams, and trigrams are examples of n-grams where n is 1, 2, or 3, respectively [19]. For this paper, we have chosen the trigram approach for feature extraction. This choice is motivated by the need to capture more information while maintaining scalability, particularly as additional bugs are incorporated, allowing more syntax to be captured for precise classification. Additionally, all hexadecimal values are removed, as they are unique to each smart contract and do not provide any flow information relevant to bug identification.

The vectorization of text features is done by Term Frequency-Inverse Document Frequency (Tfidf). Tfidf can be broken down into two parts: 1) Term Frequency and 2) Inverse Document Frequency. Term frequency reflects frequently occurring sections of a document using a weighting factor, in which the weight increases proportionally to the number of times a word appears in the document. This is offset by Inverse Document Frequency, which buries commonly appearing words and highlights rarely occurring words. This allows unique features to surface.

Post extraction, we are left with a sparse matrix. To further reduce the dimension for model training, SelectKBest based on Chi-square and TruncatedSVD from the scikit-learn API [33] are both employed. SelectKBest selects the best 2000 features from the sparse matrix, followed by TruncatedSVD. Unlike the more commonly known dimension reduction method, Principal Component Analysis (PCA), which does not work on a sparse matrix, TruncatedSVD is able to work on it efficiently because it does not center the data before computing the singular value decomposition. The number of components is based on a cumulative variance of 95% from the selected 2000 features. With this method, we obtain the features that have a cumulative variance of at least 95%, thereby reducing the number of features.
Upon completion of these steps, the opcode trigram feature vector is primed and ready for the machine learning model training.
in [24]. Furthermore, specific tools have been designated for addressing vulnerabilities like Reentrancy, Unauthorized send, Tx origin, and Arithmetic. This allocation stems from the outcomes detailed in Table 5. Additionally, although the severity level was initially absent in Slither, it has been incorporated, referencing the SWC registry and aligning with the identified vulnerabilities. The algorithm design can be found in Algorithm 6. The variables used in the MythSlith algorithms are listed in Table 4.

G. CHALLENGES
1) MULTIPROCESSING
Multiprocessing is implemented to reduce the time taken for testing; however, this spun off a new problem when changing the solc version. Therefore, in order to prevent compilation errors, all processes are dockerized and solc-select is included inside both the Mythril and Slither dockerfiles. Any action that deals with solc spins up a docker container to prevent any compiler mismatch issue.

2) UNIFIED VULNERABILITY CLASSIFICATION
Combining Mythril and Slither poses another challenge, which is the definition of vulnerabilities. Though both tools are capable of detecting some common bugs, they do not use the same standard. While Mythril uses the more recognized SWC, Slither uses its own definitions. To resolve this, we have added the SWC definitions into the Slither code and incorporated the SWC ID into its output. This allows both tools to communicate with reference to the same vulnerability definition standard. In this paper, we have mapped the seven different vulnerabilities that can be found in Slither to SWC IDs. The mapping can be found in Figure 10.
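An illustrative version of such a mapping (ours, assembled from the public SWC registry; the detector keys are hypothetical and the paper's authoritative mapping is in Figure 10):

# Hypothetical detector keys; the real Slither detector names differ.
SLITHER_TO_SWC = {
    "reentrancy":          "SWC-107",  # Reentrancy
    "arithmetic":          "SWC-101",  # Integer Overflow and Underflow
    "unauthorized-send":   "SWC-105",  # Unprotected Ether Withdrawal
    "tx-origin":           "SWC-115",  # Authorization through tx.origin
    "timestamp":           "SWC-116",  # Block values as a proxy for time
    "tod":                 "SWC-114",  # Transaction Order Dependence
    "unhandled-exception": "SWC-104",  # Unchecked Call Return Value
}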
V. RESULTS AND PERFORMANCE ANALYSIS
In this section, the experimental results from both the analysis tools and our ML-based approach are put against each other for comparison. The effectiveness of each tool is based on four parameters, namely (i) accuracy, (ii) precision, (iii) recall, and (iv) F1 score. These parameters are derived from the confusion matrix for better visualization. All experiments were conducted on a machine with the following specifications: R162-ZA1-00 with 16 CPUs x AMD EPYC 7282 16-Core Processor, 64 GB of RAM, and 1.9 TB of SSD.

In this paper, a total of 4335 manually verified Solidity files were sourced from the smart contract sanctuary [31], a repository that tracks verified smart contracts from the Ethereum mainnet and testnets (such as Rinkeby, Ropsten, Kovan, etc.), as well as from Binance Chain, Polygon/Matic, and Tron. In the mainnet repository, contract versions have a wide spread from 0.4.1 to 0.8.7 as of the latest update. For this experiment, we initially consider contracts from the mainnet with Solidity versions from 0.4.11 to 0.4.26 to compare against the known ground truths. Any contracts below 0.4.11 are not considered due to a known compiler bug where source indexes could be inconsistent between different Solidity compiler versions. Inconsistency between the source indexes would affect the bug injection process, which relies on them to insert bugs.

The unseen test set is prepared by a random selection of 100 Solidity files from the Smart Contract Sanctuary. These files are in addition to the existing 4335 contracts. These contracts then go through the same bug injection and feature extraction procedures as in Algorithm 3, Algorithm 4, and Algorithm 5.

Complete results can be found in both Table 6 and Table 5. Table 6 illustrates the overall comparison between the analysis tools and the ML models.

A. EVALUATION METRICS
To enable the comparison, the SWC standard is utilized as the benchmark; the mapping can be found in Fig 10. The results of the analysis tools and ML models are obtained by executing the same set of data with seven different classifications of vulnerabilities. The Clean category is also added for clarity. To ensure a balanced and unbiased result from the analysis tools, each tool is first run on the clean dataset, and that result is used as the baseline, ensuring a reference ground truth. This is because we can never assume each smart contract is free of bugs.

Given the results from the analysis, a confusion matrix is constructed for the evaluation. The confusion matrix comprises the values listed, with descriptions, in Table 7.

With these values, we are able to construct the following metrics (a short code sketch follows the list):
• Accuracy: represents the ratio of correctly predicted values to the total dataset. It is a measure of overall correctness.

  Accuracy = (TP + TN) / (TP + TN + FP + FN)  (1)

• Precision: can also be referred to as the positive predictive value; represents the number of correctly classified positive values over the total predicted positives.

  Precision = TP / (TP + FP)  (2)

• Recall: also known as the true positive rate, hit rate, or sensitivity; represents the ratio of correctly classified positive instances to all actual positive instances.

  Recall = TP / (TP + FN)  (3)

• F1 Score: more useful for an uneven dataset or class distribution. This metric is the harmonic mean of Precision and Recall.

  F1 Score = 2 x (Precision x Recall) / (Precision + Recall)  (4)

• False Positive Rate: the proportion of negative instances incorrectly classified as positive.

  FPR = FP / (FP + TN)  (5)
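The following minimal sketch (our helper names) computes the metrics of Eqs. (1)-(5) from the confusion-matrix counts described in Table 7:

def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Metrics (1)-(5) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                            # true positive rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # Eq. (1)
        "precision": precision,                        # Eq. (2)
        "recall": recall,                              # Eq. (3)
        "f1": 2 * precision * recall / (precision + recall),  # Eq. (4)
        "fpr": fp / (fp + tn),                         # Eq. (5)
    }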
In this study, we look at recall first, followed by precision and F1 score, then accuracy. This is because missing an FN has much higher consequences than an increase in FP, as FNs can lead to potential security breaches and financial losses. However, it is advised to take both the Precision and F1 scores into consideration to ensure a balanced model.

FIGURE 11. Learning rate with increasing training size (a) and increasing epochs (b).

TABLE 7. Variables used in evaluation metrics calculation.

B. PERFORMANCE OF MACHINE LEARNING MODELS
In this section, we illustrate the performance comparison between our ML models, Mythril, Slither, and MythSlith on the 100 bugged Solidity files. The overall performance analysis of the tools, as summarized in Table 6, shows that the MLP and SVM models exhibited the highest levels of accuracy, precision, recall, and F1 score, with MLP achieving an accuracy of 0.9129 and an F1 score of 0.9127, and SVM showing a slightly lower accuracy of 0.8954 and an F1 score of 0.8979. The FPR of the models sheds further light on the performance: the MLP has the lowest FPR among all models at 0.0125, which indicates that it is unlikely to flag false positives.

FIGURE 12. ROC Curves for various models. From left to right, top to bottom: Random Forest, XGBoost, SVM, MLP, MLP+LSTM.

The detailed results of the individual vulnerability analysis are presented in Table 5. This analysis revealed that MLP consistently demonstrated superior performance across most of the vulnerability detection categories, particularly for the detection of Clean and Unauthorized send, with F1 scores of 0.8723 and 0.9153, respectively. SVM did well across most of the categories; however, it had a little trouble with Reentrancy: while having a high recall of 0.9518, its precision took a toll at 0.7940. The XGB model displayed consistent performance across various categories, with a notable F1 score of 0.7716 in the Clean category, though it did not outperform all other tools in this category. Surprisingly, a bagging ensemble learning technique like RF did not do very well; its measures across the board are not more than 0.67. The multi-model of MLP and LSTM produced weak results: while doing well in Transaction order with an F1 score of 0.7176, it is less effective in other categories. One notable FPR measure for the multi-model is 0.4545 for Unhandled Exceptions, which clearly indicates the low effectiveness of the features used for this model. While MLP excels in
Timestamp Dependency, with the lowest measure at 0.0037, indicating high confidence in the detection.

C. PERFORMANCE COMPARISON WITH SOFTWARE VERIFICATION AND VALIDATION TOOLS
The results of the tools can be found in Table 5. Right off the bat, it is clear that the current analysis tools are unable to identify some vulnerabilities. Transaction order seems to be a challenge for all the analysis tools, while Arithmetic proves too hard for Slither to handle. In addition, while the three tools look great from the precision point of view, their recall and F1 scores do not do very well. This yet again emphasizes the findings from [17]. The current tools also have difficulty identifying Clean contracts, with Slither leading the pack at only 0.375 in precision while the rest are well below 0.2. Another vulnerability to highlight is Timestamp Dependency; surprisingly, all three tools did not fare well, with recall lower than 0.2. The FPR of clean contracts for Mythril is particularly high at 0.2590, and this behaviour continues in Reentrancy and Arithmetic. However, it did really well in Unhandled exceptions, where no FPs were raised. Slither did really well in Tx origin, with just 0.0007 for its FPR. However, such performance is not backed by its recall and F1 score, which clearly indicate poor TPs. MythSlith's results were not spectacular; it just hovers between the Mythril and Slither measures. One example is Timestamp Dependency, where MythSlith has an FPR of 0.0428, while Mythril and Slither have 0.0505 and 0.0179, respectively. Due to its design, it can never be better than the highest measure of either tool.

These findings highlight the varying strengths and limitations of each model or tool, underscoring the influence of vulnerability type on the effectiveness of detection methods.

VI. ANALYSIS AND INFERENCES
A. PRE-PROCESSING AND FEATURE EXTRACTION
In our methodology, opcode sequences are utilized with a simplification technique aimed at enhancing variance. We then employ the Tfidf technique with trigram-based feature extraction. While this approach is efficient and straightforward, it suffers from a lack of context awareness. This limitation stems from its focus on only three consecutive opcodes at a time, leading to a sparsity issue. The sparsity results in the model's limited exposure to examples, potentially compromising its accuracy and robustness.

Furthermore, the sparse nature of the Tfidf representation presents computational challenges, particularly as data volumes escalate. In the context of our opcode dataset, this method generates 4,824 feature columns via the Tfidf vectorizer. Additionally, the process of opcode simplification, while conserving computational resources, introduces significant drawbacks. A primary concern is the discarding of hexadecimal values, resulting in the loss of vital address information crucial for source location tracing.

Looking ahead, the adoption of an alternative vectorizer to Tfidf could potentially yield improved contextual understanding, enhancing the models' learning capabilities.

B. COMPARISON OF MYTHSLITH AND MLP MODEL
From the tests conducted, it is clear that relying on one tool for smart contract analysis is not ideal. Current software verification tools such as Mythril [14] and Slither [13] tend to be on the safe side, because their results have high precision with low recall and F1 score. In our effort to combine them and weave through the cracks by constructing MythSlith, such behavior still exists. No significant detection progress was
made, which again is expected, as no improvement was made to either individual tool. However, MythSlith did cover Slither's inability to detect Arithmetic vulnerabilities. This can be observed in Table 5.

In our proposed method, we show that detection via features extracted from smart contracts presents viable patterns for machine learning models to learn and classify. However, the suitability of different models varies for this specific application; this can be clearly seen from the results of RF. Unlike the gradient correction method employed by XGB, the ensemble bagging approach of RF did not yield equally effective results. In terms of neural network methods, MLP did well for all categories. In contrast, the multi-model of MLP + LSTM did not. This is primarily due to the CFG features derived from EtherSolve [34]. These CFG features did not present a strong pattern for our model to learn effectively. Furthermore, the generation of features using PecanPy [35] takes up a substantial amount of time due to random walk generation. However, the limited effectiveness of the EtherSolve features in this context does not inherently diminish their value; the challenge may lie more with PecanPy's processing demands.

C. RUNNING TIME

TABLE 8. Average time taken for each smart contract vulnerability detection tool to analyze a smart contract.

In Table 8, we analyze and compare the running times of the various software validation tools. The running time for MythSlith is comparable to that of Mythril (or sometimes less). The average running time for MythSlith is 1102.14 seconds, whereas that of Mythril is 1270.33 seconds. Note that unlike Mythril, the average running time of Slither is only 5.36 seconds, since it does not involve deep symbolic execution. The average running time for MythSlith is slightly less than that of Mythril because the depth of the symbolic execution is only increased for certain types of vulnerabilities, whereas for all other cases, where Slither has better detection accuracy, MythSlith chooses Slither.

As observed in Table 8, the running time to analyze a smart contract using the ML models is much lower (on the order of < 7 seconds) compared to the software verification tools, except for the multi-model MLP+LSTM, since it involves two ML models. It is evident that the time taken to analyze the smart contract features and predict using the ML model is real-time for smart contract analysis, whereas the pre-processing and training time for the ML model is an offline process. The pre-processing time for opcode extraction with respect to the number of lines in the smart contract, the number of AST nodes, and the number of CFG opcodes is depicted in Figure 13, from left to right, respectively. The maximum pre-processing time observed, for a smart contract with 3500 lines, is 26 seconds, while most practical smart contracts, with less than 1000 lines of code, take < 10 seconds for pre-processing and opcode extraction.

FIGURE 13. Pre-processing time (in ms) for opcode extraction with respect to (from left to right): Line Count, AST Node Count, and Opcode Count, respectively.

VII. CONCLUSION AND FUTURE WORKS
Securing smart contracts is no easy feat, as they are unprotected and visible to everyone. Current software verification tools tend to take a defensive stance, only flagging bugs when they are fully sure about them. This, however, lets false negatives slip through. Therefore, we proposed a machine learning approach to effectively and efficiently detect seven types of vulnerabilities while also identifying clean contracts. This was done by employing models such as Random Forest, XGBoost, Support Vector Machine, Multi-Layer Perceptron, and Long Short-Term Memory. To insert practical bugs and increase the vulnerability space, we proposed a practical bug injection technique that injects bugs into verified smart contracts that were cleaned using our proposed pre-processing algorithm. This helps to scale up the smart contracts' vulnerability space using features such as opcodes and CFG that were extracted for the model training.
Prior to vectorization, simplification was done on the opcodes in order to reduce the dimension and remove contract-specific hexadecimal values. Tfidf was utilized with trigrams for the vectorization, and the CFG data was processed by PecanPy, where random walks are generated and vectorized with the Word2Vec model.

The results of the models were then benchmarked against software verification tools such as Mythril, Slither, and an experimental tool proposed in our work, MythSlith. From the results, the machine learning models have shown superior performance in vulnerability detection compared to the existing software verification tools. When testing with real-time smart contracts, the MLP model performs the best, with 91% accuracy along with higher recall and F1 scores. The FPR measures show that MLP achieved the best performance with the lowest average among all other tools, at 0.0125, while MLP+LSTM achieved the lowest FPR of 0.0015 for Unauthorized send.

While the current model improves the accuracy with a significantly lower FPR when detecting contract-wide bugs, it can be further improved by pinpointing the exact source location. One possible solution is to use a combination of fuzzing with formal verification to extend our current model with this feature. This solution, however, faces the constraint that model patterns cannot be traced back to the source location. We can explore viable solutions for this constraint by tagging the source index (which may not be used for model training) to a specific feature, thereby allowing traceback to the source. The bug injection method ensures that newer forms of features and inherent patterns are added. The current bug injection method leverages function descriptors; however, not all bugs are in the form of functions. In future works, we will explore more generic bug injection methods.

REFERENCES
[1] PR Newswire. (2023). Global Smart Contracts Market to Reach USD 9850 Million by 2030 With 24% CAGR | Revolutionizing Contract Management, Exploring the Opportunities and Trends Report By Zion Market Research. Accessed: Sep. 3, 2023. [Online]. Available: https://finance.yahoo.com/news/global-smart-contracts-market-reach-160000824.html
[2] P. Praitheeshan, L. Pan, J. Yu, J. Liu, and R. Doss, "Security analysis methods on Ethereum smart contract vulnerabilities: A survey," 2019, arXiv:1908.08605.
[3] K. Ramana, R. M. Mohana, C. K. Kumar Reddy, G. Srivastava, and T. R. Gadekallu, "A blockchain-based data-sharing framework for cloud based Internet of Things systems with efficient smart contracts," in Proc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), May 2023, pp. 452-457.
[4] W. Wang, Z. Han, T. R. Gadekallu, S. Raza, J. Tanveer, and C. Su, "Lightweight blockchain-enhanced mutual authentication protocol for UAVs," IEEE Internet Things J., early access, Oct. 13, 2023, doi: 10.1109/JIOT.2023.3324543.
[5] J. Korn. (Aug. 2022). Report: $1.9 Billion Stolen in Crypto Hacks So Far This Year. Accessed: Aug. 16, 2022. [Online]. Available: https://edition.cnn.com/2022/08/16/tech/crypto-hack-rise-2022/index.html
[6] Binance. (2023). Poolz Finance Hacked, Token Price Drops 93%. Accessed: May 23, 2023. [Online]. Available: https://www.binance.com/en/feed/post/309330
[7] SolidityScan. (2023). Poolz Finance Hack Analysis: Still Experiencing Overflow. SolidityScan Blog. Accessed: May 23, 2023. [Online]. Available: https://blog.solidityscan.com/poolz-finance-hack-analysis-still-experiencing-overflow-fcf35ab8a6c5
[8] OpenZeppelin. (2022). Developing Smart Contracts. Accessed: Dec. 26, 2022. [Online]. Available: https://docs.openzeppelin.com/learn/developing-smart-contracts
[9] ConsenSys. (2022). Smart Contract Best Practices. Accessed: Dec. 26, 2022. [Online]. Available: https://consensys.github.io/smart-contract-best-practices/
[10] Crytic. (2022). Detector Documentation. Accessed: Dec. 26, 2022. [Online]. Available: https://github.com/crytic/slither/wiki/Detector-Documentation
[11] SmartContractSecurity. (2022). Smart Contract Weakness Classification Registry. Accessed: Dec. 26, 2022. [Online]. Available: https://swcregistry.io/
[12] NCC Group. (2024). Top 10 Decentralized Application Security Risks. Accessed: Dec. 26, 2022. [Online]. Available: https://dasp.co/
[13] J. Feist, G. Grieco, and A. Groce, "Slither: A static analysis framework for smart contracts," in Proc. IEEE/ACM 2nd Int. Workshop Emerg. Trends Softw. Eng. Blockchain (WETSEB), May 2019, pp. 8-15.
[14] B. Mueller, "Smashing Ethereum smart contracts for fun and real profit," HITB SECCONF Amsterdam, vol. 9, p. 54, Apr. 2018.
[15] M. Mossberg, F. Manzano, E. Hennenfent, A. Groce, G. Grieco, J. Feist, T. Brunson, and A. Dinaburg, "Manticore: A user-friendly symbolic execution framework for binaries and smart contracts," in Proc. 34th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Nov. 2019, pp. 1186-1189.
[16] L. Luu, D.-H. Chu, H. Olickel, P. Saxena, and A. Hobor, "Making smart contracts smarter," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 254-269.
[17] X. Tang, K. Zhou, J. Cheng, H. Li, and Y. Yuan, "The vulnerabilities in smart contracts: A survey," in Proc. 7th Int. Conf. Adv. Artif. Intell. Secur. (ICAIS) 2021, Dublin, Ireland. Cham, Switzerland: Springer, Jul. 2021, pp. 177-190.
[18] P. Momeni, Y. Wang, and R. Samavi, "Machine learning model for smart contracts security analysis," in Proc. 17th Int. Conf. Privacy, Secur. Trust (PST), Aug. 2019, pp. 1-6.
[19] W. Wang, J. Song, G. Xu, Y. Li, H. Wang, and C. Su, "ContractWard: Automated vulnerability detection models for Ethereum smart contracts," IEEE Trans. Netw. Sci. Eng., vol. 8, no. 2, pp. 1133-1144, Apr. 2021.
[20] S. Shakya, A. Mukherjee, R. Halder, A. Maiti, and A. Chaturvedi, "SmartMixModel: Machine learning-based vulnerability detection of Solidity smart contracts," in Proc. IEEE Int. Conf. Blockchain (Blockchain), Aug. 2022, pp. 37-44.
[21] A. Ghaleb and K. Pattabiraman, "How effective are smart contract analysis tools? Evaluating smart contract static analysis tools using bug injection," in Proc. 29th ACM SIGSOFT Int. Symp. Softw. Test. Anal., Jul. 2020, pp. 415-427.
[22] The MITRE Corporation. Common Weakness Enumeration. Accessed: Dec. 28, 2022. [Online]. Available: https://cwe.mitre.org/
[23] G. Wood, "Ethereum: A secure decentralised generalised transaction ledger," Ethereum Project Yellow Paper, vol. 151, pp. 1-32, Apr. 2014.
[24] T. Durieux, J. F. Ferreira, R. Abreu, and P. Cruz, "Empirical review of automated analysis tools on 47,587 Ethereum smart contracts," in Proc. IEEE/ACM 42nd Int. Conf. Softw. Eng. (ICSE), Montreal, QC, Canada, Oct. 2020, pp. 530-541.
[25] S. S. Kushwaha, S. Joshi, D. Singh, M. Kaur, and H. Lee, "Systematic review of security vulnerabilities in Ethereum blockchain smart contract," IEEE Access, vol. 10, pp. 6605-6621, 2022.
[26] L. Duan, L. Yang, C. Liu, W. Ni, and W. Wang, "A new smart contract anomaly detection method by fusing opcode and source code features for blockchain services," IEEE Trans. Netw. Service Manage., vol. 20, no. 4, pp. 4354-4368, Dec. 2023.
[27] J. Zheng, L. Williams, N. Nagappan, W. Snipes, J. P. Hudepohl, and M. A. Vouk, "On the value of static analysis for fault detection in software," IEEE Trans. Softw. Eng., vol. 32, no. 4, pp. 240-253, Apr. 2006.
[28] T. Ball, "The concept of dynamic analysis," ACM SIGSOFT Softw. Eng. Notes, vol. 24, no. 6, pp. 216-234, Nov. 1999.
[29] R. Calinescu, C. Ghezzi, K. Johnson, M. Pezzé, Y. Rafiq, and G. Tamburrelli, "Formal verification with confidence intervals to establish quality of service properties of software systems," IEEE Trans. Rel., vol. 65, no. 1, pp. 107-125, Mar. 2016.
[30] J. Chen, X. Xia, D. Lo, J. Grundy, X. Luo, and T. Chen, "DefectChecker: Automated smart contract defect detection by analyzing EVM bytecode," IEEE Trans. Softw. Eng., vol. 48, no. 7, pp. 2189-2207, Jul. 2022.
[31] M. Ortner and S. Eskandari. Smart Contract Sanctuary. [Online]. Available: https://github.com/tintinweb/smart-contract-sanctuary
[32] W. B. Cavnar and J. M. Trenkle, "N-gram-based text categorization," in Proc. 3rd Annu. Symp. Document Anal. Inf. Retr. (SDAIR), vol. 161175, Las Vegas, NV, USA, 1994, pp. 1-14.
[33] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, "API design for machine learning software: Experiences from the scikit-learn project," in Proc. ECML PKDD Workshop Lang. Data Mining Mach. Learn., 2013, pp. 108-122.
[34] F. Contro, M. Crosara, M. Ceccato, and M. D. Preda, "EtherSolve: Computing an accurate control-flow graph from Ethereum bytecode," in Proc. 29th IEEE/ACM Int. Conf. Program Comprehension, May 2021, pp. 127-137.
[35] R. Liu and A. Krishnan, "PecanPy: A fast, efficient and parallelized Python implementation of node2vec," Bioinformatics, vol. 37, no. 19, pp. 3377-3379, Oct. 2021.
[36] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016.
[37] P. Qian, Z. Liu, Q. He, B. Huang, D. Tian, and X. Wang, "Smart contract vulnerability detection technique: A survey," 2022, arXiv:2209.05872.

PURNIMA MURALI MOHAN (Member, IEEE) received the M.S. and Ph.D. degrees in electrical and computer engineering from the National University of Singapore, in 2014 and 2018, respectively. She held a postdoctoral researcher position with the National University of Singapore, until 2018. She is currently an Assistant Professor with the Information and Communications Technology Cluster, Singapore Institute of Technology. She has expertise in Layer 2 and Layer 3 network protocols while working with the industry. Her current research interests include blockchain and AI, security in next-generation networks, optimization, and heuristics algorithm design.

JONATHAN PAN (Member, IEEE) received the Ph.D. degree in information technology and cyber security from Murdoch University, Australia. He is currently the Chief of the Disruptive Technologies Office and the Director of Cybersecurity of the Home Team Science and Technology Agency, which is a statutory board formed under Singapore's Ministry of Home Affairs to develop science and technology capabilities for the Home Team. He is also an Adjunct Associate Professor with Nanyang Technological University, Singapore. His research interests include cybersecurity, AI, and blockchain.