I202021074 - Report - Automatically Detect Software Security Vulnerabilities
Based on Natural Language Processing Techniques and
Machine Learning Algorithms
Nowadays, software vulnerabilities pose a serious problem, because cyber-attackers often find ways to attack a system by exploiting
them. Detecting software vulnerabilities can be done using two main methods:
i) signature-based detection, i.e. methods that use a list of known security vulnerabilities as a basis for contrasting and
comparing;
ii) behavior-based detection, i.e. methods that analyze the behavior of programs or functions to find abnormal patterns indicating
vulnerabilities.
This study proposes a new approach based on a technique for analyzing and normalizing software code, combined with the random forest (RF)
classification algorithm. The novelty and advantage of this method is that, to determine abnormal behavior of functions in the
software, instead of trying to define the behaviors of functions explicitly, this study uses the Word2vec natural language processing model to
normalize the functions and extract their features. Finally, to detect security vulnerabilities in the functions, this study uses a
popular and effective supervised machine learning algorithm.
USENIX Security Symposium, 2022
Background
Challenges
While software brings convenience to people, it also brings many safety problems: criminals exploiting
software vulnerabilities can cause significant threats, steal private personal information, and even launch attacks that
threaten national security. According to statistics from Common Vulnerabilities and Exposures (CVE), in 2020 and the
first six months of 2021, the world saw a record number of exploited software security vulnerabilities.
Existing approaches to classifying security vulnerabilities are detection based on known CVEs and behavior
analysis techniques.
Behavior analysis, though efficient, presents a main difficulty: no abnormal behavior is the same for all vulnerabilities.
In the real world, it is difficult to calculate, synthesize, and extract abnormal behaviors indicating
vulnerabilities based on a single definition, because software is written in different programming languages
and because the characteristics of the vulnerabilities differ.
Background
The Word2vec algorithm is used for data normalization. As described above, the program or software is
usually preprocessed to look for abnormal signs and behaviors indicating software vulnerabilities. The originality
of this proposal is that, instead of trying to extract abnormal behaviors, an embedding technique is used
to aggregate and normalize the data. This is a new approach that has only been applied and evaluated
by a small number of studies in different contexts.
System
Split functions. In this step, the software's source code is parsed to separate each function of
the software.
Normalize functions. In this step, after the functions have been successfully split, the proposed method
analyzes and normalizes them to homogenize the length of each function.
Evaluate functions. This is the process of evaluating and concluding security vulnerabilities for each
function.
System
Splitting Functions
Use of the SySeVR framework to parse C/C++ programs into individual functions so that the data can later be
represented in a numeric vector format accepted by machine learning models.
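SySeVR performs this splitting with proper program analysis; as a rough, assumption-laden sketch of the idea only (not SySeVR's actual API or method), function extraction from C source by brace matching could look like the following:

```python
import re

# Very rough illustration of the "split functions" step: find something that
# looks like a C function signature followed by '{', then match braces to
# locate the end of the body.  Toy heuristic only.
SIG = re.compile(r'[A-Za-z_][\w\s\*,]*\([^;{]*\)\s*\{')

def split_functions(c_source: str):
    functions, cursor = [], 0
    for match in SIG.finditer(c_source):
        if match.start() < cursor:        # skip matches inside a previous body
            continue
        depth, pos = 0, match.end() - 1   # position of the opening '{'
        while pos < len(c_source):
            if c_source[pos] == '{':
                depth += 1
            elif c_source[pos] == '}':
                depth -= 1
                if depth == 0:            # braces balanced: end of function
                    functions.append(c_source[match.start():pos + 1])
                    cursor = pos + 1
                    break
            pos += 1
    return functions

src = 'int copy(char *dst, const char *src) {\n  while ((*dst++ = *src++)) { }\n  return 0;\n}\n'
print(split_functions(src))
```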
USENIX Security Symposium.2022
System
Data Normalization
Use of the Word2vec model to normalize the previously split functions so that they can be fed into the classification
model.
Use the skip-gram model to calculate the conditional probability of generating a context word for a
given target word.
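In the skip-gram model, the conditional probability of generating a context word $w_o$ given a target word $w_c$ is the softmax of the dot product of their vectors,

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_{w_o}^{\top}\mathbf{v}_{w_c})}{\sum_{i \in V}\exp(\mathbf{u}_{i}^{\top}\mathbf{v}_{w_c})},$$

where $\mathbf{v}$ denotes target-word vectors, $\mathbf{u}$ denotes context-word vectors, and $V$ is the vocabulary. A minimal sketch of how this embedding step might look with the gensim library follows; the tokenization, vector size, and the averaging of token vectors into one function vector are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from gensim.models import Word2Vec

# Each "sentence" is the token sequence of one split function (toy example).
function_tokens = [
    ["int", "copy", "(", "char", "*", "dst", ",", "const", "char", "*", "src", ")",
     "{", "while", "(", "(", "*", "dst", "++", "=", "*", "src", "++", ")", ")",
     "{", "}", "return", "0", ";", "}"],
]

# sg=1 selects the skip-gram architecture; vector_size and window are assumed values.
model = Word2Vec(sentences=function_tokens, vector_size=64, window=5,
                 min_count=1, sg=1, epochs=50)

def embed_function(tokens, model):
    """Average the token vectors of a function to get one fixed-length
    feature vector per function (an assumed aggregation scheme)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

x = embed_function(function_tokens[0], model)
print(x.shape)   # (64,)
```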
Table: number of samples per vulnerability type (samples containing vulnerabilities, normal samples, total).
Evaluation Criteria
Accuracy: the ratio between the number of samples classified correctly and the total
number of samples.
Precision: the ratio between the true positive value and the total number of samples
classified as positive. The higher the precision, the more accurate
the software vulnerability detection.
Recall: the ratio between the true positive value and the total number of real software
security vulnerabilities. The higher the recall, the lower the rate of
missed positive samples.
F1-score: the harmonic mean of precision and recall. The higher the F1-score, the
better the model.
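In terms of the confusion-matrix counts (TP, TN, FP, FN), these criteria correspond to the standard formulas:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$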
Evaluation
Experimental Scenario
Scenario 1
Q: How effective is the skip-gram model compared to the BOW model?
Method: evaluate the effectiveness of the skip-gram model by replacing it with the BOW model and comparing results.
The table below shows the results of processing and normalizing functions with the BOW model and
classification with the RF algorithm.
Evaluation
Vulnerability type          Accuracy (%)  Precision (%)  Recall (%)  F1-score (%)
Pointer vulnerability            84            69            84           72
Array vulnerability              81            76            75           75
Arithmetic vulnerability         89            81            80           80
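A minimal sketch of how this Scenario 1 baseline (BOW features classified by RF) could be reproduced with scikit-learn; the toy data, vectorizer settings, and split are assumptions for illustration, not the paper's configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy stand-in data: each sample is the token string of one function,
# labels are 1 = contains a vulnerability, 0 = normal.
functions = ["strcpy ( dst , src )", "memcpy ( dst , src , n )",
             "if ( n < MAX ) memcpy ( dst , src , n )", "return x + 1"]
labels = [1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    functions, labels, test_size=0.5, random_state=0, stratify=labels)

# Bag-of-words features (the Scenario 1 baseline) ...
bow = CountVectorizer(token_pattern=r"\S+")
X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

# ... classified with random forest, as in the proposed pipeline.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train_bow, y_train)
print(classification_report(y_test, clf.predict(X_test_bow), zero_division=0))
```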
Evaluation
Experimental Scenario
Scenario 2
Q: How effective is the RF algorithm compared to other classification algorithms?
Method: evaluate the effectiveness of the RF machine learning algorithm by replacing RF with other
classification algorithms.
Table: experimental results using the Naive Bayes algorithm.
Table: experimental results using the Perceptron algorithm.
Figure: confusion matrices where (a) is the pointer vulnerability, (b) the array vulnerability, and (c) the
arithmetic vulnerability.
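A small sketch of the Scenario 2 setup, with assumed random data standing in for the embedded function vectors: the features stay fixed while only the classifier is swapped, and a confusion matrix is computed for each candidate.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the embedded function vectors (64-dim) and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# Keep the features fixed and swap only the classifier, as in Scenario 2.
for clf in (RandomForestClassifier(random_state=0), GaussianNB(), Perceptron()):
    clf.fit(X_train, y_train)
    cm = confusion_matrix(y_test, clf.predict(X_test))
    print(type(clf).__name__, cm.tolist())
```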
Evaluation
Experimental Scenario
Scenario 4
Q: How does this proposed model compare to a model proposed in other research?
Method: conduct experiments using the CNN model and compare the results to the proposed model.
1. With a dataset of 47,205 different samples containing pointer vulnerabilities, the CNN model gave the
following classification results:
   a. 39,614 correctly labeled samples: 35,154 samples with label 1; 4,460 samples with label 0.
   b. 7,591 incorrectly labeled samples: 6,704 samples were mislabeled from 0 to 1; 887 samples were
   mislabeled from 1 to 0.
2. With a dataset of 6,637 different samples containing array vulnerabilities, there were:
   a. 5,300 correctly labeled samples: 4,048 normal samples; 1,252 attack samples.
   b. 1,337 incorrectly labeled samples: 835 samples were mislabeled from normal to attack; 502 samples
   were mislabeled from attack to normal.
3. With a dataset of 4,218 different samples containing arithmetic vulnerabilities, the confusion matrix
results were:
   a. 3,787 correctly labeled samples: 3,437 normal samples; 350 attack samples.
   b. 431 incorrectly labeled samples: 91 samples were mislabeled from normal to attack; 340 samples were
   mislabeled from attack to normal.
Figure: confusion matrices where (a) is the pointer vulnerability, (b) the array vulnerability, and (c) the
arithmetic vulnerability.
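As a quick cross-check of these counts, the CNN accuracy on each dataset can be recomputed directly from the correctly labeled totals quoted above (a small illustrative snippet, not taken from the paper):

```python
# Correctly labeled samples and dataset sizes quoted above for the CNN model.
datasets = {
    "pointer":    (39_614, 47_205),
    "array":      (5_300, 6_637),
    "arithmetic": (3_787, 4_218),
}
for name, (correct, total) in datasets.items():
    print(f"{name}: accuracy = {correct / total:.2%}")
# pointer: 83.92%, array: 79.86%, arithmetic: 89.78% (approximately)
```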
Evaluation
Based on the experimental results in Scenarios 1, 2, 3 and 4, it can be seen that the combination of the skip-
gram model with the RF algorithm gave better results than the other classification algorithms in the task of
detecting software security vulnerabilities. For array usage vulnerabilities, the RF algorithm had an
Accuracy of 84.38%, which is 2.38% higher than the MLP model and 4.38% higher than the CNN model.
Likewise, the Recall when using the RF algorithm was also higher than
with the other algorithms (4% higher than MLP and 3% higher than CNN). For vulnerabilities
related to arithmetic expressions, RF, CNN and MLP all had roughly the same efficiency, with only a 1% to 2%
difference. Finally, for the pointer usage vulnerability type, the RF algorithm continued to show its
superiority, yielding results 10% to 12% higher than the other algorithms. However, when it came to accurately
detecting pointer vulnerabilities, the RF algorithm gave results 0.1% worse than the CNN model and 1.9%
worse than the MLP model.
Conclusion
This paper presented an approach to software security vulnerability detection based on embedding techniques and the RF
classification algorithm. This proposal yielded good experimental results in the task of classifying known software
vulnerabilities. The RF algorithm gave better performance than the other classification algorithms. One reason for this is that
the experimental dataset was relatively small and there was not too much difference between the amount of
vulnerability data and normal data, a setting in which classical machine learning algorithms often perform better. Besides, based on
the experimental results for the four scenarios, there was a large difference in the classification results. We
think that the differences between the security vulnerabilities lead to different embedding processes, so there are
differences in the characteristics and features of the embedding vectors. Therefore, the BOW model did not work as well as
the skip-gram model.
Thanks!