
Automatically Detect Software Security Vulnerabilities
Based on Natural Language Processing Techniques and
Machine Learning Algorithms

Student name : AGOSSOU Kodjo Edem


Student ID : I202021074
USENIX Security Symposium, 2022
Abstract

Nowadays, software vulnerabilities pose a serious problem, because cyber-attackers often find ways to attack a system by exploiting them. Software vulnerabilities can be detected using two main methods:

i) signature-based detection, i.e. methods that use a list of known security vulnerabilities as a basis for contrasting and comparing;

ii) behavior-analysis-based detection using classification algorithms.

This study proposes a new approach based on a technique for analyzing and normalizing software code, combined with the random forest (RF) classification algorithm. The novelty of this method is that, to determine the abnormal behavior of functions in the software, it uses the Word2vec natural language processing model to normalize functions and extract their features instead of trying to define the functions' behaviors. Finally, to detect security vulnerabilities in the functions, the study uses a popular and effective supervised machine learning algorithm.
Background

Challenges
 While software brings convenience to people, it also introduces many safety problems. For example, criminals exploiting software vulnerabilities can pose significant threats, steal private personal information, and even launch attacks that threaten national security. According to statistics from Common Vulnerabilities and Exposures (CVE), in 2020 and the first six months of 2021 the world saw a record number of exploited software security vulnerabilities.
 Existing approaches classify security vulnerabilities either by detection based on known CVEs or by using behavior-analysis techniques.
 No single abnormal behavior is shared by all vulnerabilities. Behavior analysis, though efficient, has one main difficulty: in the real world it is hard to calculate, synthesize, and extract abnormal behaviors indicating vulnerabilities based on a single definition, because software is written in different programming languages and the characteristics of the vulnerabilities differ.
Background

Two specific contributions


 A novel security vulnerability detection model based on embedding techniques and the RF machine
learning algorithm. Specifically, instead of trying to extract anomalous behavior indicating software
vulnerabilities, this study developed a way to analyze and normalize a program or software and then
use a classification algorithm to determine whether the program or software is safe or contains
vulnerabilities.

 The Word2vec algorithm is used for data normalization. Conventionally, a program or software would be preprocessed to look for abnormal signs and behaviors indicating software vulnerabilities. The originality of this proposal is that, instead of trying to extract abnormal behaviors, an embedding technique is used to aggregate and normalize the data. This approach has so far been applied and evaluated only by a small number of studies, in different contexts.
System

The proposed model is divided into three parts

 Split functions. In this step, the software source code is processed to separate out each of the software's functions.

 Normalize functions. In this step, after the functions have been successfully split, the proposed method
analyzes and normalizes them to homogenize the length of each function.

 Evaluate functions. This is the process of evaluating and concluding security vulnerabilities for each
function.
System

The overall architecture of the proposed model


System

Splitting Functions
 The SySeVR framework is used to parse C/C++ programs into individual functions so that the data can be represented in the numeric vector format accepted by machine learning models.
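The splitting step can be illustrated with a toy sketch (this is not the SySeVR implementation, which uses real program analysis): scan for a function-definition header, then take the brace-balanced body that follows.

```python
import re

def split_functions(source: str) -> list[str]:
    """Naively split a C source string into function definitions by
    matching a definition header and balancing braces.
    (Toy illustration only; SySeVR does real parsing.)"""
    header = re.compile(r'\w[\w\s\*]*\s+\w+\s*\([^;{)]*\)\s*\{')
    functions = []
    for m in header.finditer(source):
        depth, i = 0, m.end() - 1   # index of the opening '{'
        while i < len(source):
            if source[i] == '{':
                depth += 1
            elif source[i] == '}':
                depth -= 1
                if depth == 0:      # body closed: emit the function
                    functions.append(source[m.start():i + 1])
                    break
            i += 1
    return functions

code = "int add(int a, int b) { return a + b; }\nvoid f(void) { if (1) { g(); } }"
print(len(split_functions(code)))  # → 2
```

A production splitter must also handle comments, string literals, and preprocessor macros, which is exactly what delegating this step to SySeVR avoids.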
System
Data Normalization
 The Word2vec model is used to normalize the previously split functions before they are fed into the classification model.
 The skip-gram model is used to calculate the conditional probability of generating a context word for a given target word.
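To make the skip-gram objective concrete, here is a minimal sketch (the token names are hypothetical, not drawn from the paper's dataset) of the (target, context) pairs over which the conditional probability P(context | target) is trained:

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate the (target, context) training pairs of the skip-gram
    model: each token predicts its neighbors within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical normalized token stream for one C function
tokens = ["VAR0", "=", "strcpy", "(", "VAR1", ")"]
pairs = skipgram_pairs(tokens, window=1)
print(len(pairs))  # → 10: every adjacent pair, in both directions
```

In practice a library such as gensim trains the embeddings over exactly these pairs; its `Word2Vec(..., sg=1)` flag selects the skip-gram variant.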

Security Vulnerability Detection


 The extracted and normalized functions are fed into the classification model to identify vulnerabilities in each function. The random forest (RF) machine learning algorithm is used for this purpose.
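A hedged sketch of this classification step, using synthetic vectors in place of the real function embeddings (scikit-learn's RandomForestClassifier is assumed; the slides do not name an implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for embedded functions: 200 samples of
# 20-dimensional vectors; label 1 marks a "vulnerable" function.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labeling rule

# n_estimators=1000 mirrors the best setting reported later in the slides
clf = RandomForestClassifier(n_estimators=1000, random_state=0)
clf.fit(X[:150], y[:150])
print(f"held-out accuracy: {clf.score(X[150:], y[150:]):.2f}")
```

On real data the feature matrix would instead hold the length-normalized Word2vec representations produced in the previous step.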
Evaluation
 Dataset
 For the experimental dataset, the SARD dataset was used, which consists of 15,591 C/C++
programs.
 After splitting the C/C++ programs in the dataset into functions, the dataset consisted of 267,227
files containing function data.
 The SySeVR toolkit was used to label the vulnerabilities appearing in the dataset, dividing it into three main vulnerability types, as presented in the following table.

Vulnerability type       Contain vulnerabilities   Normal     Total
Array Usage              31,303                    10,926     42,229
Pointer Usage            28,391                    263,400    291,791
Arithmetic Expression    3,475                     18,679     22,154
Total                    64,169                    293,005    356,174
Evaluation

 Evaluation Criteria

Accuracy: the ratio between the number of correctly classified samples and the total number of samples.

Precision: the ratio between the true positives and the total number of samples classified as positive. The higher the precision, the more accurate the software vulnerability detection.

Recall: the ratio between the true positives and the total number of real software security vulnerabilities. The higher the recall, the lower the rate of missed positive samples.

F1 score: the harmonic mean of precision and recall. The higher the F1 score, the better the model.
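These four criteria follow directly from confusion-matrix counts; a small sketch with made-up counts:

```python
def metrics(tp, fp, fn, tn):
    """Compute the four evaluation criteria from confusion-matrix
    counts: true/false positives (tp/fp), false/true negatives (fn/tn)."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics(tp=80, fp=20, fn=10, tn=90)
print(f"{acc:.2f} {p:.2f} {r:.3f} {f1:.3f}")  # → 0.85 0.80 0.889 0.842
```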
Evaluation
 Experimental Scenario
 Scenario 1
 Q : How effective is the skip-gram model compared to the BOW model ?
 Method : Evaluate the effectiveness of the skip-gram model by running the BOW model as a baseline.
The table below shows the results of normalizing functions with the BOW model and classifying with the RF algorithm.

 
Vulnerability type          Accuracy   Precision   Recall   F1_score
Pointer vulnerability       84         69          84       72
Array vulnerability         81         76          75       75
Arithmetic vulnerability    89         81          80       80
Evaluation
 Experimental Scenario
 Scenario 2
 Q : How effective is the RF algorithm compared to other classification algorithms ?
 Method : Evaluate the effectiveness of the RF machine learning algorithm by replacing RF with other classification algorithms.
Experimental results using the Naive Bayes algorithm.
Experimental results using the Perceptron algorithm.
Experimental results using the MLP model.
Evaluation
 Experimental Scenario
 Scenario 3
 Q : How effective is the proposed model ?
 Method : Evaluate the effectiveness of the skip-gram model combined with the RF algorithm.

The RF algorithm gave classification results at an acceptable accuracy level. For the pointer vulnerability type, the RF algorithm achieved relatively good results, which differed greatly when the decision-tree parameters of the algorithm were changed: increasing the number of decision trees also increased the accuracy of the classification. Similarly, for array usage vulnerabilities, the RF algorithm was relatively efficient, and there was likewise a large gap between the best and the worst classification model.
Evaluation
 Proposed model confusion matrix

For the pointer vulnerability type, the RF algorithm correctly predicted 3,927 functions containing vulnerabilities, incorrectly predicted 2,131 normal functions as containing vulnerabilities, and missed 1,420 functions containing vulnerabilities. Similarly, for array usage vulnerabilities, the RF algorithm correctly predicted 1,748 functions containing vulnerabilities, incorrectly flagged 650 normal functions, and missed 904 functions containing vulnerabilities. With the test dataset of arithmetic expression vulnerabilities, the RF algorithm correctly predicted 616 functions containing vulnerabilities, incorrectly flagged 175 normal functions, and missed 419 functions containing vulnerabilities. The algorithm worked best with the parameter n_estimators set to 1000.

Confusion Matrix where a is the pointer vulnerability, b is the array vulnerability, c is the
arithmetic vulnerability.
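As a quick cross-check (my own arithmetic, not a figure from the slides), plugging the reported pointer-type counts into the precision and recall definitions, and reading the 1,420 as vulnerable functions predicted as normal:

```python
# Counts read off the reported pointer-vulnerability confusion matrix
tp = 3927   # vulnerable functions correctly flagged
fp = 2131   # normal functions wrongly flagged as vulnerable
fn = 1420   # vulnerable functions missed (predicted normal)

precision = tp / (tp + fp)   # 3,927 / 6,058
recall = tp / (tp + fn)      # 3,927 / 5,347
print(f"precision={precision:.3f} recall={recall:.3f}")  # → precision=0.648 recall=0.734
```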
Evaluation
 Experimental Scenario
 Scenario 4
 Q : How does the proposed model compare to a model proposed in other research ?
 Method : Conduct experiments using a CNN model and compare the results to the proposed model.

1. With a dataset of 47,205 different samples containing pointer vulnerabilities, the CNN model gave the following classification results:
   a. 39,614 correctly labeled samples: 35,154 samples with label 1; 4,460 samples with label 0.
   b. 7,591 incorrectly labeled samples: 6,704 samples were mislabeled from 0 to 1; 887 samples were mislabeled from 1 to 0.
2. With a dataset of 6,637 different samples containing array vulnerabilities, there were:
   a. 5,300 correctly labeled samples: 4,048 normal samples; 1,252 attack samples.
   b. 1,337 incorrectly labeled samples: 835 samples were mislabeled from normal to attack; 502 samples were mislabeled from attack to normal.
3. With a dataset of 4,218 different samples containing arithmetic vulnerabilities, the confusion matrix results were:
   a. 3,787 correctly labeled samples: 3,437 normal samples; 350 attack samples.
   b. 431 incorrectly labeled samples: 91 samples were mislabeled from normal to attack; 340 samples were mislabeled from attack to normal.

Confusion Matrix where (a) is the pointer vulnerability, (b) is the array vulnerability, (c) is the arithmetic vulnerability.
Evaluation

 Comments on the result

Based on the experimental results in Scenarios 1, 2, 3, and 4, the combination of the skip-gram model with the RF algorithm gave better results than the other classification algorithms in the task of detecting software security vulnerabilities. For array usage vulnerabilities, the RF algorithm had an accuracy of 84.38%, which is 2.38% higher than the MLP model and 4.38% higher than the CNN model. Likewise, recall with the RF algorithm was also higher than with the other algorithms (4% higher than MLP and 3% higher than CNN). For vulnerabilities related to arithmetic expressions, RF, CNN, and MLP all had roughly the same efficiency, with only a 1% to 2% difference. Finally, for the pointer usage vulnerability type, the RF algorithm again showed its superiority, yielding results 10% to 12% higher than the other algorithms. However, for accurately detecting pointer vulnerabilities, the RF algorithm gave results 0.1% worse than the CNN model and 1.9% worse than the MLP model.
Conclusion

This paper presented an approach to software security vulnerability detection based on embedding techniques and the RF classification algorithm. The proposal yielded good experimental results in the task of classifying known software vulnerabilities. The RF algorithm performed better than the other classification algorithms; a likely reason is that the experimental dataset was relatively small and the numbers of vulnerable and normal samples did not differ too greatly, conditions under which classical machine learning algorithms often do well. In addition, the experimental results across the four scenarios differed considerably. We think that differences between the security vulnerabilities lead to different embedding processes, and therefore to differences in the characteristics and features of the embedding vectors; this is why the BOW model did not work as well as the skip-gram model.
Thanks!
