AI Augmented Vulnerability Patching
by
Poornaditya Mishra
Thesis Committee:
Associate Professor Birhanu Eshete, Chair
Assistant Professor Zheng Song
Professor Bruce Maxim
© Poornaditya Mishra 2024
All Rights Reserved
Dedication
ACKNOWLEDGEMENTS
This thesis would not have been possible without the support and guidance of
many individuals. First and foremost, I would like to express my deepest gratitude
to my advisor, Dr. Birhanu Eshete, for his unwavering support, encouragement,
and mentorship throughout my thesis journey. His invaluable guidance, insightful
feedback, and ability to push me beyond my perceived limits were instrumental in
shaping this work and allowing me to grow as a researcher. I am sincerely grateful to
Dr. Jin Lu for igniting the spark of this research. His initial guidance and support
were invaluable in helping me take the first steps on this path. My sincere thanks go
to Dr. Probir Roy, who opened my eyes to the fascinating world of Abstract Syntax
Trees and Intermediate Representations. His insights and enthusiasm in this area were
truly inspiring. I would also like to thank my friends and family for their constant love
and support, providing a much-needed source of strength and motivation throughout
the challenging times. To everyone who contributed to this thesis in ways big and
small, thank you from the bottom of my heart.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . iii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
CHAPTER
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Evaluation Overview . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . 7
II. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2.2 Graph Neural Networks for Code Analysis . . . . . 21
3.3 Leveraging LLMs for Code Remediation . . . . . . . . . . . . 22
3.3.1 Machine Learning and LLM-based Approaches for
Code Patching . . . . . . . . . . . . . . . . . . . . . 23
3.4 Bridging the Gap: The Need for Our Approach . . . . . . . . 24
V. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2.2 Results and Analysis . . . . . . . . . . . . . . . . . 68
6.3 LLM-Generated Patch Evaluation: A Multi-Perspective As-
sessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.3.1 Evaluation Methodology . . . . . . . . . . . . . . . 69
6.3.2 Results and Analysis . . . . . . . . . . . . . . . . . 70
6.4 Sample Result: End-to-End Vulnerability Detection and Patch-
ing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4.1 Input Code Snippet . . . . . . . . . . . . . . . . . . 71
6.4.2 GAT Model Prediction and Localization . . . . . . 72
6.4.3 Generated LLM Prompt . . . . . . . . . . . . . . . 72
6.4.4 LLM-Generated Patch . . . . . . . . . . . . . . . . 74
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
Software vulnerabilities remain a persistent threat, and the increasing use of AI-
generated code introduces new security challenges. While Large Language Models
(LLMs) excel at code generation, they often struggle to consistently produce secure
code or apply targeted vulnerability fixes. This work proposes a novel system that
bridges this gap by combining the strengths of graph-based deep learning and LLMs
for automated vulnerability detection and patching. We first model vulnerability
detection as graph representation learning via Graph Attention Network (GAT) to
accurately identify vulnerabilities in code, leveraging the rich structural information
encoded in Code Property Graphs (CPGs) and Abstract Syntax Trees (ASTs). Our
system then leverages the GAT’s predictions to guide an LLM, providing both the
vulnerability type and the precise location within the code requiring a patch. This
targeted guidance enables the LLM to generate more secure and contextually appro-
priate code modifications. Through experiments on a dataset of real-world vulnerable
code, we demonstrate the effectiveness of our approach in detecting critical vulnera-
bilities like SQL injection and session hijacking. We further evaluate the quality of the
LLM-generated patches, showing a significant improvement in security when guided
by our system. This research paves the way for more secure and reliable AI-assisted
software development by integrating deep learning-based vulnerability analysis with
the generative capabilities of LLMs.
Keywords: Generative AI, Cybersecurity, Large Language Model
CHAPTER I
Introduction
1.1 Motivation
1.2 Prior Work
and their dependencies.
Graph Neural Networks (GNNs) have emerged as a promising solution to address
this challenge. Designed to learn from data represented as graphs, GNNs are par-
ticularly well-suited for code analysis as they can effectively capture the complex
dependencies within code [10, 11]. Specifically, Graph Attention Networks (GATs)
[12], a type of GNN with an attention mechanism, have shown remarkable capabili-
ties in learning from graph-structured data, selectively focusing on the most relevant
connections within the code. Research has demonstrated the effectiveness of GNNs
in learning from code representations like Code Property Graphs (CPGs) [13], which
encode not only syntactic information but also data flow, control flow, and other
important semantic relationships.
Large Language Models (LLMs), with their impressive code generation capa-
bilities, offer a potential avenue for automated code repair [14]. However, simply
prompting an LLM to “fix vulnerabilities” without informed guidance often yields
sub-optimal results. LLMs, while adept at generating syntactically correct code, of-
ten lack the ability to understand the security implications of the code they generate.
Therefore, they require careful context and explicit instructions regarding security
considerations to generate patches that are both functionally correct and secure.
This thesis proposes a novel framework that seamlessly integrates the power of
GATs for precise vulnerability detection with the generative capabilities of LLMs for
automated vulnerability patching. Recognizing the limitations of relying solely on
LLMs for security-critical tasks, our approach leverages GATs to not only identify
vulnerabilities but also to provide targeted guidance to an LLM, thereby ensuring the
generation of secure and effective patches.
The core insight of our approach lies in representing code as Abstract Syntax
Trees (ASTs) and Code Property Graphs (CPGs), capturing the hierarchical struc-
ture and syntactic relationships in code. We train GAT models on AST and CPG
representations, enabling them to learn and recognize complex patterns indicative
of vulnerabilities. The GAT’s attention mechanism plays a crucial role in precisely
pinpointing vulnerable code sections by highlighting the nodes most relevant to its
vulnerability prediction.
This localized vulnerability information, enriched with relevant contextual data
extracted from the AST, forms the basis for crafting carefully engineered prompts for
a powerful LLM (Google Gemini Pro). These prompts, unlike generic instructions,
provide the LLM with specific guidance on the vulnerability’s nature and location
within the code, guiding it towards generating effective patches that address the
identified issue while preserving the original code’s functionality.
While LLMs exhibit remarkable code generation capabilities, relying solely on
them for security-critical tasks presents significant challenges. Their training on vast
codebases, while advantageous for code fluency, inadvertently exposes them to both
secure and insecure coding practices. This can lead to the unintentional generation of
code susceptible to common vulnerabilities, even from seemingly innocuous prompts.
For instance, an LLM tasked with creating a simple web application might produce
functional code that lacks essential security measures like input sanitization or secure
session management. This vulnerability could stem from the model learning prevalent,
but insecure, coding patterns from its training data. While the generated code might
function as intended on the surface, it would remain vulnerable to common exploits
like cross-site scripting (XSS) or SQL injection.
Furthermore, simply instructing an LLM to “improve security” without specific
guidance often results in superficial fixes or overly broad security measures. This lack
of targeted guidance stems from the LLM’s inability to independently analyze the
code for specific vulnerabilities and devise appropriate remediation strategies.
This highlights a critical need to guide LLMs toward secure code generation
through a more structured and informative approach. Our proposed framework ad-
dresses this challenge by pairing GAT-based vulnerability detection and localization
with targeted prompts that guide the LLM toward secure, context-aware patches.
Our analysis demonstrates an 87% accuracy in identifying the vulnerable code
span, highlighting the efficacy of this approach in guiding the subsequent patch
generation process.
The findings from our evaluation underscore the significance of this thesis in
demonstrating the viability of a hybrid approach, combining graph-based deep learn-
ing and LLMs, for automated vulnerability detection and patching. Our work high-
lights the importance of choosing a suitable code representation, the effectiveness of
attention-based localization for targeted patch generation, and the promising capabil-
ities of LLMs in generating secure code fixes when provided with accurate contextual
information. These results pave the way for further research and development in
automating code security and improving the reliability of software systems.
1.5 Contributions
This thesis makes the following key contributions to the field of AI-assisted soft-
ware security:
• A comprehensive evaluation demonstrating the effectiveness of our approach in
accurately detecting vulnerabilities, localizing them within code, and generating
promising patches.
CHAPTER II
Background
This chapter lays the groundwork for understanding the key concepts and tech-
niques employed throughout this thesis. We begin by delving into the nature of
software vulnerabilities and their systematic categorization using frameworks like the
Common Weakness Enumeration (CWE). We then explore the importance of repre-
senting code as graphs, focusing on Code Property Graphs (CPGs) as a powerful tool
for capturing the intricate relationships between code elements. Finally, we introduce
Graph Neural Networks (GNNs), specifically Graph Attention Networks (GATs), as a
sophisticated deep learning technique well-suited for analyzing graph-structured data
like CPGs, paving the way for more accurate and efficient vulnerability detection.
Software vulnerabilities are weaknesses or flaws in code that deviate from secure
coding practices, potentially enabling attackers to compromise system security. These
flaws can manifest in various forms, often arising from complexities in software design,
implementation errors, or a lack of awareness regarding secure coding principles.
Understanding the nature and characteristics of these vulnerabilities is crucial for
developing effective detection and remediation techniques.
Common Weakness Enumeration (CWE). To effectively address the diverse
landscape of software vulnerabilities, systematic categorization is essential. The Com-
mon Weakness Enumeration (CWE) [16], maintained by MITRE, provides a widely-
used framework for classifying software weaknesses based on their underlying nature
and potential impact. CWE acts as a common language for security professionals,
researchers, and developers, facilitating communication and collaboration in vulner-
ability analysis, mitigation, and prevention.
The CWE framework categorizes weaknesses based on several factors, including
the affected software development lifecycle phase, the exploited weakness type (e.g.,
input validation, access control), and the potential impact of successful exploitation.
This structured approach enables developers to identify common security flaws, prior-
itize remediation efforts based on risk assessments, and adopt secure coding practices
that minimize the introduction of vulnerabilities.
This thesis focuses on five prevalent and impactful vulnerability types:
• CWE-89: SQL Injection (SQLi): SQLi vulnerabilities occur when user in-
put used in constructing SQL queries is not properly sanitized [18]. This can
allow attackers to manipulate the structure of the SQL query, potentially by-
passing authentication mechanisms, accessing sensitive data, or even executing
arbitrary commands on the database server.
• Session Hijacking: Weaknesses in session management, such as predictable
or exposed session identifiers, can allow an attacker to take over an authenticated
session and gain access to the user's account [19].
This graph-based approach provides a holistic view of the code, enabling more accu-
rate and comprehensive vulnerability analysis.
CPGs: Combining Structure and Semantics. CPGs [13] build upon the
foundation of graph-based code representation by integrating information from vari-
ous program analyses, including Abstract Syntax Trees (ASTs), Control Flow Graphs
(CFGs), and Data Flow Graphs (DFGs). This integration of syntactic and semantic
information provides a rich and comprehensive representation of the code, enabling
deeper insights into code behavior and vulnerability detection.
Leveraging ASTs for Syntactic Structure. ASTs capture the grammatical
structure of the code, representing the hierarchical relationship between different
code constructs. In the context of CPGs, AST information helps define the basic
building blocks of the graph. Nodes representing variables, functions, and classes are
derived from the AST, as are edges representing parent-child relationships between
code blocks (e.g., a function definition containing multiple statements).
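As a small illustration, Python's built-in ast module exposes exactly this hierarchical
structure; the sketch below (independent of the CPG tooling discussed later) parses
one statement and lists the syntactic category of each node:

import ast

# Parse a single assignment into its AST and walk the resulting tree.
tree = ast.parse("filepath = os.path.join(base_dir, filename)")
for node in ast.walk(tree):
    print(type(node).__name__)
# Prints categories such as Module, Assign, Name, Call, Attribute, ...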
Incorporating CFGs for Control Flow Analysis. CFGs depict the possible
execution paths within a program, showing how the program’s control flow can branch
based on conditional statements and loops. Integrating CFG information into the
CPG allows for analyzing how data flows through different execution paths. This
is particularly useful for detecting vulnerabilities like Path Traversal (CWE-22),
where an attacker manipulates the control flow to access files outside the intended
directory, and Insufficient Control Flow Management (CWE-691), where vulnerabilities arise
from unexpected or manipulated program execution order.
Enhancing with DFGs for Data Flow Analysis. DFGs track how data prop-
agates through the program, showing which variables influence other variables and
how user input ultimately affects sensitive operations. Integrating DFG information
into the CPG is crucial for identifying vulnerabilities related to data handling, such as
Cross-Site Scripting (XSS - CWE-79), where unsanitized user input flows into
web page outputs, potentially enabling script injection attacks, and SQL Injection
(SQLi - CWE-89), where unsanitized user input used in SQL queries can allow
attackers to manipulate database operations.
Properties: Adding Depth and Context. CPGs go beyond merely represent-
ing code elements and their relationships. They also incorporate properties associated
with nodes and edges, providing additional context and information crucial for vul-
nerability analysis. These properties can include data types, indicating the type of
data a variable can hold (e.g., integer, string, object); variable scopes, defining where
a variable can be accessed within the code; function signatures, specifying the pa-
rameters a function accepts and the value it returns; and access modifiers, defining
the accessibility of classes, methods, and variables (e.g., public, private, protected).
These properties enhance the CPG’s analytical capabilities by providing detailed in-
formation about each code element and relationship.
Advantages of CPGs for Vulnerability Detection. The comprehensive and
unified nature of CPGs offers significant advantages for vulnerability detection, sur-
passing the limitations of traditional code analysis methods:
• Language Agnosticism: CPG construction tooling exists for a wide
range of languages, facilitating cross-language vulnerability analysis.
Traditional code analysis techniques often struggle to capture the complex, non-
linear relationships present within software. This limitation has led to increasing
interest in graph-based representations of code, enabling the application of power-
ful machine learning techniques like Graph Neural Networks (GNNs) for enhanced
vulnerability detection.
Graph Neural Networks (GNNs) [22] excel at learning from graph-structured data.
They operate through a process called message passing, where nodes iteratively ex-
change information with their neighbors. This allows each node to aggregate informa-
tion from its surroundings, learning a vector representation (embedding) that encodes
its local graph structure and features. These learned embeddings can then be used for
various downstream tasks like node classification to identify vulnerable code snippets,
graph classification to predict the presence of vulnerabilities within a program, and
edge prediction to anticipate relationships between code elements.
Focus on Graph Attention Networks (GATs). While standard GNNs treat
all neighboring nodes equally during message passing, Graph Attention Networks
(GATs) [12] introduce a crucial advancement: an attention mechanism. This allows
GATs to differentiate the importance of each neighboring node when aggregating
information, similar to how humans focus on specific details when understanding a
complex system.
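Concretely, in the original GAT formulation [12], the attention that node $i$ pays to a
neighbor $j$ is computed from their transformed feature vectors and then normalized over
the neighborhood:

$$e_{ij} = \mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})},$$

after which the updated representation $\mathbf{h}_i' = \sigma\big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\mathbf{W}\mathbf{h}_j\big)$ aggregates neighbors in proportion to their learned relevance.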
This attention mechanism brings distinct advantages to code analysis. GATs
excel at focusing on the most relevant connections in the code graph, such as a data
flow edge connecting user input to a SQL query (critical for SQLi detection), while
downplaying less informative connections, like a syntactic edge between unrelated
variables. They also adapt well to the variable size and structure of code graphs
by selectively attending to significant connections, potentially outperforming regular
GNNs, which may struggle with the noise of less informative relationships in larger
graphs.
GNNs for Vulnerability Detection. Recent research, such as the work by
Zhou et al. [10], highlights the effectiveness of GNNs, particularly GATs, in achiev-
ing state-of-the-art accuracy in detecting code vulnerabilities. They have demonstra-
bly outperformed traditional methods that rely on handcrafted features or shallower
learning architectures.
2.4 Summary
This chapter established the foundational concepts and motivations for this the-
sis. We began by highlighting the critical need to address software vulnerabilities,
particularly those identifiable through code structure analysis. We then introduced
CWE classifications as a framework for categorizing these vulnerabilities and focusing
our research on specific types.
CPGs emerged as a powerful approach for representing code in a manner that
facilitates comprehensive analysis, capturing the intricate relationships between code
elements often overlooked by traditional methods. We discussed the advantages of
CPGs and the process of constructing them from source code.
Finally, we introduced GNNs, specifically highlighting the strengths of GATs, as a potent
deep learning technique for learning from graph-structured data. The combination
of CPGs, with their rich representation of code, and GNNs, with their ability to
learn complex patterns from graph data, sets the stage for developing more accurate
and efficient automated vulnerability detection systems. The following chapters delve
into related work, our proposed methodology, and the experimental evaluation of our
approach, showcasing the potential of this synergy for enhancing software security.
CHAPTER III
Related Work
This chapter delves into the existing landscape of code vulnerability detection,
surveying a range of approaches from traditional methods to cutting-edge applications
of artificial intelligence (AI). We analyze their strengths, limitations, and how they
relate to our proposed method of combining CPGs, GATs, and LLMs for automated
vulnerability detection and patching.
Before the advent of AI, security researchers and practitioners primarily relied on
manual code review and various automated but less sophisticated techniques. These
methods, while still relevant in certain contexts, often face challenges in terms of
scalability, accuracy, and their ability to cope with the ever-increasing complexity of
modern software systems.
3.1.1 Manual Code Review
Manual code review involves the meticulous inspection of source code by secu-
rity experts, who leverage their knowledge and experience to identify potential vul-
nerabilities. This process typically entails scrutinizing the code for insecure coding
practices, logical flaws, potential security loopholes, and violations of established se-
curity guidelines. While manual code review remains a valuable approach for critical
software systems, where accuracy and thoroughness are paramount, it suffers from
inherent limitations that hinder its applicability to large-scale software projects.
The primary challenge lies in scalability. Reviewing large codebases manually is
an incredibly time-consuming and resource-intensive endeavor. As software projects
grow in size and complexity, the time required for comprehensive manual review
becomes impractical, especially under tight development timelines [1]. Additionally,
manual review is inherently subjective, as vulnerability assessments can vary between
reviewers based on their experience, expertise, and understanding of the codebase
[23]. This subjectivity can lead to inconsistencies in the identification and reporting
of vulnerabilities, making it difficult to ensure a consistent level of security across
a project. Furthermore, certain vulnerabilities stem from the complex interaction
of multiple code components, making them exceptionally difficult to detect through
manual inspection alone. A human reviewer might not readily grasp the intricate
interplay of different code sections and their combined security implications [24]. This
limitation becomes particularly pronounced in modern software architectures, which
often involve distributed systems, microservices, and complex data flows, making it
difficult for a single reviewer to track all potential vulnerabilities arising from the
interaction of multiple components.
3.1.2 Static Analysis
Static analysis tools aim to automate the code review process by examining source
code without executing it. These tools utilize various techniques to identify poten-
tial vulnerabilities, ranging from simple pattern matching to more sophisticated data
flow and control flow analyses. Simple static analysis tools rely on pattern match-
ing, searching for specific code structures or keywords known to be associated with
vulnerabilities. For example, a tool might flag the use of dangerous functions like
"strcpy" in C, which is known to be susceptible to buffer overflow vulnerabilities.
More sophisticated static analysis tools employ data flow analysis to track how data
moves through the program and identify potential security issues. This involves trac-
ing the flow of data from its source, such as user input, to sensitive operations, such
as database queries or file system interactions, to detect vulnerabilities like SQL in-
jection or cross-site scripting (XSS) [25].
Control flow analysis examines the possible execution paths within a program to
identify potential security flaws. This analysis can detect vulnerabilities like path
traversal or denial-of-service attacks, where an attacker can manipulate the control
flow to gain unauthorized access or disrupt the application’s functionality. Static anal-
ysis offers advantages in terms of efficiency and broad coverage compared to manual
code review. These tools can analyze large codebases relatively quickly, making them
suitable for integration into the software development lifecycle to provide continuous
feedback to developers. However, static analysis tools also face limitations. They
are often plagued by false positives, flagging code that is not actually vulnerable,
which can lead to wasted effort in manual verification and potentially erode trust in
the tool’s results [3]. Many static analysis tools operate primarily on syntactic rules
and predefined patterns, making them less effective at identifying vulnerabilities that
arise from subtle interactions between code components or those requiring a deeper
semantic understanding of the code’s functionality [2]. They also struggle to handle
code exhibiting dynamic behavior, relying heavily on external libraries, or using re-
flection, as it becomes challenging to reason about all possible execution paths and
their potential security implications statically [26]. Despite these limitations, static
analysis remains a widely used technique for vulnerability detection, with popular
tools like SonarQube, Coverity, and Checkmarx offering varying degrees of sophisti-
cation and coverage, catering to different programming languages and development
environments.
3.1.3 Fuzzing
3.2 AI-Powered Vulnerability Detection
3.2.1 Deep Learning-Based Approaches
Deep learning models, with their ability to learn complex patterns and represen-
tations from large datasets, have shown promise in identifying vulnerabilities in code.
These models can analyze code and learn to differentiate between secure and insecure
coding patterns, potentially discovering vulnerabilities that might be missed by tra-
ditional rule-based methods. One approach treats code as natural language, applying
Natural Language Processing (NLP) techniques to learn vulnerability patterns from
code syntax and semantics. This approach leverages the success of NLP techniques in
processing and understanding natural language text and applies them to the domain
of code analysis. Models like Recurrent Neural Networks (RNNs) [7] and Transform-
ers [29] have been used to analyze code as a sequence of tokens, similar to sentences,
and learn to identify patterns and anomalies that might suggest vulnerabilities.
Another approach focuses on extracting handcrafted features from code, such as
software metrics, control flow patterns, or data flow characteristics, and using these
features to train classifiers like Support Vector Machines (SVMs) or deep neural net-
works. This approach relies on domain expertise to define features that capture
relevant aspects of the code’s structure and behavior for vulnerability detection. For
example, researchers have used software metrics like cyclomatic complexity, which
measures the number of independent paths through a program, to predict the likeli-
hood of vulnerabilities [30]. Others have focused on extracting control flow patterns,
such as the presence of loops or conditional statements, to identify potential vulner-
abilities related to program logic [31]. Data flow analysis techniques have also been
used to extract features related to how data flows through the program, such as the
sources of data, the operations performed on the data, and the sinks where the data
is ultimately used, to detect vulnerabilities like SQL injection or cross-site scripting
[32].
Deep learning offers several potential advantages for vulnerability detection, in-
cluding scalability, accuracy, and generalizability. These models can learn complex
relationships and patterns from data, potentially identifying vulnerabilities that might
be missed by rule-based approaches. Deep learning models can be trained on massive
datasets of code, allowing them to learn from a wide range of coding practices and
potentially identify vulnerabilities across diverse programming languages and soft-
ware domains. Well-trained deep learning models can potentially generalize to new,
unseen code, allowing them to identify vulnerabilities in code that was not part of the
training data. However, they also face challenges related to their dependence on high-
quality data and their inherent black-box nature. The performance of deep learning
models heavily relies on the quality and diversity of the training data. Obtaining
large, well-labeled datasets for security-related tasks can be challenging, as manually
labeling vulnerabilities is time-consuming, requires specialized expertise, and might
not always be feasible for certain types of vulnerabilities [33]. The decision-making
process of deep learning models is often opaque, making it difficult to understand
why a particular code snippet is flagged as vulnerable. This lack of interpretability
can hinder trust in the model’s predictions and make it challenging for developers to
understand and fix the identified vulnerabilities [34].
3.2.2 Graph Neural Networks for Code Analysis
GNNs offer several advantages for code analysis: the ability to model relationships,
contextual awareness, and potential for interpretability. GNNs
excel at capturing complex relationships within code, making them well-suited for
identifying vulnerabilities that arise from the interaction of multiple code elements.
They can learn representations of code that incorporate contextual information from
the surrounding code, enabling them to identify vulnerabilities that might be missed
by methods relying solely on local code patterns. While GNNs can be complex, they
offer better interpretability compared to some other deep learning approaches, as it
is often possible to analyze the attention weights or message passing mechanisms
to understand which parts of the graph were most influential in the model’s deci-
sion. However, GNNs also face challenges. Building and analyzing large CPGs can
be computationally expensive, posing scalability challenges for analyzing very large
codebases [37]. The complexity of GNN models themselves can also contribute to
computational challenges, especially when dealing with deep GNN architectures or
large graphs. While GNNs offer better interpretability compared to some other deep
learning approaches, explaining their predictions can still be challenging, especially
when dealing with complex graph structures and numerous features [38].
3.3 Leveraging LLMs for Code Remediation
Large Language Models (LLMs) like Codex [39] and GPT-3 [40] have gained
significant attention for their impressive code generation capabilities. These mod-
els, trained on massive code datasets, can generate code in various programming
languages, complete code snippets, translate natural language descriptions into func-
tional code, and even refactor or optimize existing code. Recent research has begun
exploring the potential of LLMs for automated code repair, including fixing security
vulnerabilities [14, 41]. These approaches leverage the LLM’s vast knowledge of cod-
ing practices and security vulnerabilities, acquired during training, to generate code
patches that aim to address identified security issues.
However, relying solely on LLMs for security-critical tasks can be risky. Despite
their impressive capabilities, LLMs face challenges in guaranteeing the security and
contextual appropriateness of the generated code. LLMs are primarily trained to gen-
erate code that is syntactically correct and consistent with common coding patterns
observed in the training data. However, this does not guarantee that the generated
code is inherently secure. The LLM’s training data might contain insecure examples,
potentially leading to the propagation of vulnerabilities in the generated code [42].
Without sufficient context, LLMs might misinterpret the intent of the code or apply
overly general fixes that are not appropriate for the specific situation. This can lead
to the introduction of new vulnerabilities or the breaking of existing functionality [43].
Providing LLMs with the necessary context to understand the specific vulnerability
and the surrounding code is crucial for generating secure and effective code patches.
3.3.1 Machine Learning and LLM-based Approaches for Code Patching
While LLMs alone have shown promise in generating code, leveraging machine
learning and combining it with the capabilities of LLMs has yielded further advance-
ments in code patching. One such approach involves using machine learning to guide
the LLM’s patch generation process. For instance, Tufano et al. [44] developed a
technique that utilizes a sequence-to-sequence model to predict the location of a bug
and then uses this information to guide an LLM in generating a patch. Another study
by Dinella et al. [45] employed machine learning to rank candidate patches generated
by an LLM, using features derived from the code and the vulnerability description.
This ranking helps prioritize the most likely correct patches, improving the efficiency
of the repair process.
Another direction involves training specialized machine learning models to gener-
ate patches directly. For example, Lutellier et al. [46] proposed using a transformer-
based model to learn patch generation from a dataset of bug fixes, allowing the model
to generate patches without relying on an LLM. While this approach offers potential
advantages in terms of efficiency and control, it requires large, high-quality datasets
of code patches for effective training.
Combining both machine learning and LLMs allows for leveraging the strengths
of both approaches. Machine learning can provide guidance, ranking, or even direct
patch generation, while LLMs can contribute their vast code knowledge and ability
to generate syntactically and semantically coherent code.
CHAPTER IV
Proposed Methodology
This chapter details our novel methodology for automatically detecting and recti-
fying code vulnerabilities, leveraging a powerful synergy of graph-based deep learning
and advanced language models. Recognizing the limitations of traditional vulnera-
bility detection techniques, which often stop at identifying potential flaws, we aim
to bridge the gap towards automated code remediation. Our approach harnesses the
combined strengths of ASTs, CPGs, GATs, and prompt engineering to create a sys-
tem that not only pinpoints security flaws but also guides the generation of targeted
and context-aware code fixes.
The following example highlights the limitations of relying solely on LLMs for
secure code generation and motivates the need for our proposed framework.
Consider the prompt:
Create a Python Flask app that allows users to register, log in, and view their profile.
This seemingly innocuous prompt, devoid of any explicit security directives, led
ChatGPT (as of the current model version) to generate the following Python code:
from flask import Flask, render_template, request, redirect, url_for, session, g
import sqlite3

app = Flask(__name__)
app.secret_key = 'your_secret_key'
DATABASE = 'database.db'

def get_db():
    db = getattr(g, '_database', None)
    if db is None:
        db = g._database = sqlite3.connect(DATABASE)
    return db

@app.teardown_appcontext
def close_connection(exception):
    db = getattr(g, '_database', None)
    if db is not None:
        db.close()

def init_db():
    with app.app_context():
        db = get_db()
        cursor = db.cursor()
        cursor.execute('''CREATE TABLE IF NOT EXISTS users (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            username TEXT UNIQUE NOT NULL,
            password TEXT NOT NULL
        )''')
        db.commit()

@app.route('/register', methods=['GET', 'POST'])
def register():
    if request.method == 'POST':
        username = request.form['username']
        password = request.form['password']
        db = get_db()
        cursor = db.cursor()
        try:
            # Password stored in plain text -- no hashing.
            cursor.execute("INSERT INTO users (username, password) VALUES (?, ?)",
                           (username, password))
            db.commit()
            return redirect(url_for('login'))
        except sqlite3.IntegrityError:
            return render_template('register.html')
    return render_template('register.html')

@app.route('/login', methods=['GET', 'POST'])
def login():
    if request.method == 'POST':
        username = request.form['username']
        password = request.form['password']
        db = get_db()
        cursor = db.cursor()
        cursor.execute("SELECT * FROM users WHERE username = ?",
                       (username,))
        user = cursor.fetchone()
        if user and user[2] == password:
            session['user_id'] = user[0]
            return redirect(url_for('profile'))
        return render_template('login.html')
    else:
        return render_template('login.html')

@app.route('/profile')
def profile():
    if 'user_id' not in session:
        return redirect(url_for('login'))
    db = get_db()
    cursor = db.cursor()
    cursor.execute("SELECT * FROM users WHERE id = ?",
                   (session['user_id'],))
    user = cursor.fetchone()
    return render_template('profile.html', user=user)

@app.route('/logout')
def logout():
    session.pop('user_id', None)
    return redirect(url_for('login'))

if __name__ == '__main__':
    init_db()
    app.run(debug=True)
Inspecting the generated application reveals, among other weaknesses:

• Cross-Site Scripting (XSS): The code does not explicitly sanitize user input
(username, password) before displaying it.
This experiment highlights the inherent security risks lurking within AI-generated
code when security is not explicitly prioritized. As the adoption of AI-powered code
generation proliferates, the urgency to address this security gap intensifies. The
generated code lacks input sanitization, leaving it susceptible to Cross-Site Scripting
(XSS) attacks, where malicious users could inject and execute arbitrary JavaScript
code within the browsers of unsuspecting users.
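By contrast, a patch that addresses this flaw directly routes user-supplied values
through an HTML-escaping function before rendering; a minimal sketch using
markupsafe (which Flask already depends on) is shown below:

from flask import Flask, request
from markupsafe import escape

app = Flask(__name__)

@app.route('/hello')
def hello():
    name = request.args.get('name', '')
    # escape() HTML-encodes the value, neutralizing injected <script> tags.
    return f"Hello, {escape(name)}!"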
Attempting to leverage the LLM’s knowledge for remediation, we can provide the
generated Flask app code back to ChatGPT with the following prompt:
I have created a simple user registration and login system using Flask and
SQLite. Please review the code and improve its security.
[The full generated application code shown above was provided to ChatGPT together
with the prompt.]

ChatGPT responded with the following revised version:
from flask import Flask, render_template, request, redirect, url_for, session, g
import sqlite3
import os
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
from werkzeug.security import generate_password_hash, check_password_hash

app = Flask(__name__)
app.secret_key = os.urandom(24)
DATABASE = 'database.db'

limiter = Limiter(
    get_remote_address,
    app=app,
)

def get_db():
    db = getattr(g, '_database', None)
    if db is None:
        db = g._database = sqlite3.connect(DATABASE)
    return db

@app.teardown_appcontext
def close_connection(exception):
    db = getattr(g, '_database', None)
    if db is not None:
        db.close()

def init_db():
    with app.app_context():
        db = get_db()
        cursor = db.cursor()
        cursor.execute('''CREATE TABLE IF NOT EXISTS users (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            username TEXT UNIQUE NOT NULL,
            password TEXT NOT NULL
        )''')
        db.commit()

@app.route('/register', methods=['GET', 'POST'])
def register():
    if request.method == 'POST':
        username = request.form['username']
        password = request.form['password']
        # Input validation
        if not username or len(password) < 8:
            return render_template('register.html')
        hashed_password = generate_password_hash(password,
            method='pbkdf2:sha256', salt_length=16)
        db = get_db()
        cursor = db.cursor()
        try:
            cursor.execute("INSERT INTO users (username, password) VALUES (?, ?)",
                           (username, hashed_password))
            db.commit()
            return redirect(url_for('login'))
        except sqlite3.IntegrityError:
            return render_template('register.html')
    return render_template('register.html')

@app.route('/login', methods=['GET', 'POST'])
def login():
    if request.method == 'POST':
        username = request.form['username']
        password = request.form['password']
        db = get_db()
        cursor = db.cursor()
        cursor.execute("SELECT * FROM users WHERE username = ?",
                       (username,))
        user = cursor.fetchone()
        if user and check_password_hash(user[2], password):
            session['user_id'] = user[0]
            return redirect(url_for('profile'))
        return render_template('login.html')
    else:
        return render_template('login.html')

@app.route('/profile')
def profile():
    if 'user_id' not in session:
        return redirect(url_for('login'))
    db = get_db()
    cursor = db.cursor()
    cursor.execute("SELECT * FROM users WHERE id = ?",
                   (session['user_id'],))
    user = cursor.fetchone()
    return render_template('profile.html', user=user)

@app.route('/logout')
def logout():
    session.pop('user_id', None)
    return redirect(url_for('login'))

if __name__ == '__main__':
    init_db()
    app.run(debug=True, ssl_context='adhoc')
While ChatGPT responds with some generic security recommendations and im-
plements minor changes, it fails to address the core XSS vulnerability. Following are
the major takeaways from the generated response:
• Superficial Fixes:
– Using ssl_context='adhoc' for HTTPS is only suitable for development and
is not recommended for production.
• Missed the Mark (XSS): The most glaring omission is that ChatGPT com-
pletely failed to address the Cross-Site Scripting (XSS) vulnerability. It added
input validation for length but didn’t implement any output encoding/sanitiza-
tion, leaving the application still vulnerable.
This example underscores a critical limitation of relying solely on LLMs for security-
critical tasks. Without specific guidance on the nature and location of vulnerabilities
within the code, LLMs tend to apply superficial fixes or suggest overly broad secu-
rity measures, often failing to effectively mitigate the underlying risks. Our proposed
framework addresses this limitation by combining the strengths of a vulnerability de-
tection model (based on GATs) with the generative capabilities of LLMs, providing
targeted guidance to ensure the generation of secure and effective patches.
Our methodology unfolds in two distinct phases, as depicted in Figures 4.1 and
4.2. The training phase focuses on developing a robust vulnerability detection model
using GATs, while the inference phase leverages this trained model for real-time
vulnerability identification and guides an LLM for code patching.
Figure 4.1: Training Pipeline: An illustration of the process of training our GAT
model for vulnerability detection.
Figure 4.2: Inference Pipeline: Details of the steps involved in using our trained
GAT model for real-time vulnerability identification and leveraging an LLM for code
patching.
and security ensures that our dataset captures a realistic range of coding practices
and potential vulnerabilities.
Systematic Vulnerability Labeling. Accurate vulnerability labeling is essen-
tial for training a model that can effectively distinguish between secure and vulnerable
code. Our labeling process combines three approaches:
• Manual Code Review: Security experts manually inspect a subset of the col-
lected code snippets to identify and label vulnerabilities. This time-consuming
but highly accurate approach provides a gold standard for evaluating the per-
formance of our automated labeling methods.
Each data point consists of:
• Code Snippet: The raw Python code (e.g., a function or a block of code)
being analyzed.
Code Representation: ASTs and CPGs. We explore two primary graph rep-
resentations of code to evaluate their impact on vulnerability detection performance:
• Abstract Syntax Trees (ASTs): ASTs [49] represent the grammatical struc-
ture of code, showing how different language constructs are nested within each
other. In an AST, nodes typically represent language keywords, identifiers (e.g.,
variable names), and literals, while edges represent syntactic relationships be-
tween them.
Training Methodology. We utilize a multi-layer GAT architecture for vul-
nerability detection, as illustrated in Figure 4.1. The chosen graph representation
(AST or CPG) of each code snippet is provided as input to the GAT. Each node
in the graph is initialized with a feature vector encoding relevant information about
the corresponding code element, such as its type, data type (for variables), and any
associated literals.
The GAT’s attention mechanism [51] allows it to learn which nodes and edges in
the graph are most relevant for identifying vulnerabilities. During message passing,
nodes selectively attend to their neighbors, assigning higher weights to connections
that carry more information about the potential vulnerability.
We train the GAT to minimize the binary cross-entropy loss between its predicted
vulnerability probabilities and the true labels from our dataset. This encourages the
model to accurately distinguish between vulnerable and non-vulnerable code snippets.
We use the Adam optimizer [52] to update the model’s parameters during training,
an effective optimization algorithm commonly used in deep learning due to its ability
to handle sparse gradients and converge efficiently.
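For $N$ training snippets with labels $y_i \in \{0, 1\}$ (vulnerable or not) and predicted
vulnerability probabilities $\hat{y}_i$, this objective is the standard

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right].$$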
The second phase of our methodology focuses on using the trained GAT model
for real-time vulnerability detection and leveraging an LLM for generating potential
code fixes. This phase involves accurately localizing the vulnerability within the code,
extracting relevant contextual information, and crafting carefully engineered prompts
to guide the LLM in generating appropriate patches.
Vulnerability Localization. The attention mechanism within GATs provides
valuable insights into the model’s decision-making process, allowing us to pinpoint
the vulnerable code section. After obtaining a vulnerability prediction from the GAT,
we analyze the attention weights assigned to each node in the input graph during
the model’s forward pass. Nodes with higher attention weights are considered more
influential in the model’s decision, suggesting a higher likelihood of their involvement
in the vulnerability. We identify a contiguous span of highly attentive nodes as the
most likely location of the vulnerability. This span represents the section of code that
the model focused on most when making its prediction.
Contextual Information Extraction. To provide the LLM with a compre-
hensive understanding of the identified vulnerability and the surrounding code, we
extract relevant contextual information:
• Variable Information: We extract details about the variables used within the
vulnerable span, including data types, data sources (e.g., user input, database
queries, function calls), and data usage within the vulnerable span (e.g., in
calculations, string concatenation, conditional statements).
• Function Call Analysis: We analyze the functions called within the vulner-
able span, extracting information such as function names, arguments passed to
the functions, and the data types and potential values returned by the functions.
• Control Flow Paths: If using CPGs, we leverage the control flow information
to reconstruct the possible execution paths leading to the vulnerable code. This
analysis can reveal potential entry points for malicious input or unexpected
program states that might trigger the vulnerability.
Prompt Engineering. We use this contextual information to construct carefully
crafted prompts to elicit effective and contextually relevant patches. The structure
and content of the prompts are crucial for guiding the LLM’s code generation process.
Our prompts typically include:
• Original Code Snippet: The original code snippet provided to the system,
including the vulnerable portion.
We utilize a powerful pre-trained LLM, such as Codex [39] or GPT-3 [40], for code
generation. The LLM, drawing upon its extensive knowledge of coding practices,
security best practices, and the context provided in the prompt, generates candidate
code patches to address the identified vulnerability. These generated patches are
then subject to automated testing, static analysis, and potentially manual review
to evaluate their quality and security, ensuring that they effectively remediate the
vulnerability without introducing new issues.
4.5 Summary
CHAPTER V
Implementation
This chapter details the implementation of our proposed framework for automated
vulnerability detection and patching. We present a detailed account of the techniques,
tools, and resources used to construct our dataset, train our GAT model, and inte-
grate an LLM for generating code fixes guided by the insights from our vulnerability
analysis.
• GitHub Code Collection: We collected real-world Python files from public
GitHub repositories, targeting searches with keywords such as
"web development," "flask," "django," and "security." This strategic selection
ensured the inclusion of code likely to contain common web application vulner-
abilities.
This combined approach yielded an initial dataset of 4000 real-world Python files,
which served as the foundation for vulnerability labeling, snippet extraction, and
subsequent data augmentation.
Vulnerability Labeling. Accurate and comprehensive vulnerability labeling is
paramount for training a model that can effectively distinguish between secure and
insecure code. We adopted a multi-pronged strategy to achieve this:
• Static Analysis Tool Assistance: We leveraged the Bandit static analy-
sis tool [15], specifically designed for Python code, to assist in vulnerability
identification. Bandit’s findings were manually verified to ensure accuracy and
relevance, minimizing the inclusion of false positives in our dataset.
Code Snippet Extraction. To facilitate efficient processing and focus our mod-
els on relevant code segments, we extracted self-contained code snippets from both
the labeled GitHub-sourced files and the synthetically generated vulnerable exam-
ples. Each snippet, typically representing a function, a class, or a cohesive block of
related code, was treated as an individual data point for subsequent analysis and
model training.
Dataset Structure and Statistics. The final dataset consists of a diverse
collection of 16,000 Python code snippets, meticulously labeled and categorized.
To address the potential for model bias towards vulnerable data, we included
a substantial number of benign (non-vulnerable) code snippets. The final dataset
composition is detailed in Table 5.1.
Table 5.1: Dataset Composition
The following snippet, used as a running example in this chapter, contains a potential
path traversal vulnerability in which a user-controlled filename influences a file
operation:

import os

def handle_file_upload(filename):
    base_dir = "/var/www/uploads/"
    filepath = os.path.join(base_dir, filename)
    if os.path.exists(filepath):
        return open(filepath).read()
    pass
5.2.1 Abstract Syntax Trees (ASTs)
• Control Flow Analysis: Determines the possible execution paths within the
code, capturing how control is transferred between different statements and
functions.
• Data Flow Analysis: Tracks the flow of data through the program, identifying
how variables are defined, used, and modified, revealing potential paths for data
manipulation and vulnerabilities.
A comparison of Figures 5.1 and 5.2 highlights the key distinctions between ASTs
and CPGs and underscores the advantages of CPGs for vulnerability detection.
The AST, in Figure 5.1, primarily focuses on the syntactic structure of the code.
It accurately represents the function definition, variable assignments, conditional
statement, and function calls. However, it lacks the semantic context to recognize
the potential flow of user input (filename) into the sensitive file operation (open),
which constitutes the path traversal vulnerability. The AST, by itself, cannot discern
that the filename variable, potentially controlled by a malicious user, influences the
filepath variable used in the open function, leaving the vulnerability undetected.
The CPG, depicted in Figure 5.2, offers a richer and more revealing representation.
In addition to the syntactic structure captured by the AST, it includes data flow edges
that explicitly show the movement of data through the code. These edges visually
demonstrate how the filename variable, passed as input to the handle file upload
function, flows through the os.path.join function and ultimately influences the
filepath variable used in the open function. This visual representation clearly
highlights how a malicious filename could potentially be used to access files outside
the intended directory.
This additional information embedded within the CPG is crucial for vulnerability
detection. It provides the necessary context for understanding how user input might
be manipulated to exploit vulnerabilities, enabling machine learning models to learn
more nuanced patterns and dependencies within the code. The richer semantic in-
formation captured by CPGs makes them a more suitable representation for training
effective vulnerability detection models, enabling them to identify a wider range of
vulnerabilities, including those that rely on understanding data flow and control flow
relationships.
Figure 5.2: CPG Representation for the Same Code Snippet
In our framework, we initially experimented with both AST and CPG represen-
tations. However, our empirical evaluation confirmed that the GAT model trained
on CPGs consistently outperformed the model trained on ASTs, showcasing the sig-
nificance of incorporating semantic information for effective vulnerability detection.
Therefore, we selected the CPG representation for all subsequent experiments and
evaluations.
To enable our machine learning models to effectively learn from the AST and CPG
representations, we encoded relevant information about each node in the graph as a
feature vector. The specific features used for both representations are detailed below.
• Node Type: The grammatical category of the node (e.g., FunctionDef,
Assign, Call, Name, Constant). This feature captures the syntactic role of
the node within the code.
• Data Type: For variables, the inferred data type (e.g., str, int, float).
This feature provides information about the kind of data stored in the variable.
• Literal Value: For constant nodes, the actual literal value (e.g., "user input",
10, 3.14). This feature captures the value associated with the constant.
For each node in the CPG, we extracted a richer set of features, leveraging the
additional semantic information available in the CPG:
• Node Type: The type of the node (e.g., "Identifier," "Call," "Literal"). This
feature categorizes the node based on its role in the code.
• Code: The actual code associated with the node (e.g., "os.path.join," "open").
This feature provides a more specific representation of the code element.
• Data Type: The data type associated with the node (e.g., "String," "Integer").
This feature provides information about the kind of data the node represents.
• User Input Flag: A boolean flag indicating whether the node is part of a
function handling user input. This feature helps the model identify potential
entry points for malicious data.
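To make this concrete, the sketch below shows one plausible way to turn these per-node
features into fixed-size vectors; the vocabularies and function name are illustrative
rather than our exact encoding, and the node's raw code text (which would typically be
embedded separately) is omitted:

# Illustrative category vocabularies; real vocabularies are larger.
NODE_TYPES = ["Identifier", "Call", "Literal", "ControlStructure", "Method"]
DATA_TYPES = ["String", "Integer", "Float", "Boolean", "Unknown"]

def encode_cpg_node(node_type, data_type, is_user_input):
    """One-hot encode the categorical features and append the user-input flag."""
    vec = [1.0 if t == node_type else 0.0 for t in NODE_TYPES]
    vec += [1.0 if t == data_type else 0.0 for t in DATA_TYPES]
    vec.append(1.0 if is_user_input else 0.0)
    return vec

# e.g., an identifier holding a string read from request.form:
features = encode_cpg_node("Identifier", "String", is_user_input=True)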
5.4 Model Architecture and Training
The training process involved splitting the dataset into training (70%), validation
(15%), and testing sets (15%) using stratified sampling to ensure an even distribution
of vulnerability types across the splits. This stratification ensures that the model
is exposed to a representative sample of each vulnerability type during training and
evaluation.
We utilized the Adam optimizer [52] for training our GAT model. Adam is a
popular optimization algorithm known for its effectiveness in training deep learning
models. We used a learning rate of 1e-4 and a weight decay of 1e-3 to regularize the
model parameters and prevent overfitting.
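A minimal sketch of such a model and optimizer setup in PyTorch Geometric is shown
below; the layer sizes, head counts, and feature dimension are illustrative rather than
our exact configuration:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class VulnGAT(torch.nn.Module):
    """Two-layer GAT producing a single vulnerability logit per code graph."""

    def __init__(self, in_dim, hidden_dim=64, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.gat2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)
        self.out = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x, edge_index, batch):
        x = F.elu(self.gat1(x, edge_index))
        x = F.elu(self.gat2(x, edge_index))
        x = global_mean_pool(x, batch)   # pool node embeddings per graph
        return self.out(x)               # raw logit; sigmoid applied in the loss

model = VulnGAT(in_dim=32)               # 32 = illustrative feature-vector size
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)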
The model was trained for a maximum of 50 epochs with a batch size of 16. We
incorporated early stopping based on the validation loss to prevent overfitting. If
the validation loss did not improve for a predefined number of epochs (patience), the
training process was stopped to prevent the model from memorizing the training data
and losing its ability to generalize to unseen examples.
We observed that our initial models exhibited a bias towards predicting vulnera-
bilities due to the class imbalance in the dataset, with a higher number of vulnerable
samples compared to benign ones. To address this, we implemented a weighted cross-
entropy loss function. Weighted cross entropy assigns higher weights to the minority
classes, ensuring that the model is penalized more for misclassifying vulnerable code
snippets. This weighting strategy helps the model learn to focus on detecting vulner-
abilities more effectively, even when they are less frequent in the training data.
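A sketch of this weighting using PyTorch's built-in loss is shown below; the class
counts are illustrative placeholders rather than our dataset's actual composition, and
the logits and labels stand in for GAT outputs and data-loader batches:

import torch
import torch.nn as nn

n_benign, n_vulnerable = 4000, 12000           # illustrative counts only

# With vulnerable code as the (majority) positive class, a pos_weight
# below 1 down-weights it, boosting the penalty on the benign minority.
pos_weight = torch.tensor([n_benign / n_vulnerable])
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(16, 1)                    # stand-in for GAT outputs
labels = torch.randint(0, 2, (16, 1)).float()  # 1 = vulnerable, 0 = benign
loss = loss_fn(logits, labels)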
We utilize the attention weights learned by our GAT model during training to
guide the process of vulnerability localization. The attention mechanism within GATs
assigns weights to different nodes and edges in the graph, reflecting their importance
in the model’s prediction. After obtaining a vulnerability prediction from the GAT
model, we extract the attention weights assigned to each node in the input AST or
CPG during the model’s forward pass. Nodes with higher attention weights indicate
greater influence on the model’s prediction, suggesting a higher likelihood of their
involvement in the vulnerability.
Rather than relying solely on the top-ranked node, we identify a contiguous span of
highly attentive nodes as the most likely location of the vulnerability. This reflects the
understanding that vulnerabilities often involve interactions between multiple code
elements rather than a single isolated node. We empirically determined a threshold,
selecting the top 5% of nodes with the highest attention weights to form the vulnerable
span. This span represents the section of code that the model focused on most when
making its vulnerability prediction.
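A simplified sketch of this selection step, assuming per-node attention scores have
already been aggregated across heads and layers and that each node records the source
line it came from (function and variable names hypothetical):

import torch

def locate_vulnerable_span(node_scores, node_lines, top_fraction=0.05):
    """Return the source-line range covered by the most-attended nodes.

    node_scores: 1-D tensor of aggregated attention weights, one per node.
    node_lines: list mapping each node index to its source line number.
    """
    k = max(1, int(len(node_lines) * top_fraction))
    top_nodes = torch.topk(node_scores, k).indices.tolist()
    lines = sorted(node_lines[i] for i in top_nodes)
    return lines[0], lines[-1]  # reported as the vulnerable span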
– Variable names
– Data flow relationships, if available in the CPG, showing how the variable’s
value propagates through the code
– Conditions that might lead to the execution of the vulnerable code, pro-
viding insights into the circumstances under which the vulnerability might
manifest.
5.6 AI-Powered Code Patching with LLMs
We leverage the extracted contextual information, along with the GAT model’s
vulnerability prediction and localization, to guide a Large Language Model (LLM)
in generating potential code fixes. We employed Google Gemini Pro [62], a powerful
LLM accessible through the Google AI Platform, for our code generation tasks.
The design and structure of the prompts provided to the LLM are crucial for
eliciting effective and contextually relevant code patches. Our prompts are carefully
structured to provide the LLM with a comprehensive understanding of the vulner-
ability and the surrounding code, enabling it to generate targeted and appropriate
fixes. The general structure of our LLM prompts is as follows:
The following code snippet contains a **[Vulnerability Type]** vulnerability.

```python
[Original Code Snippet]
```

The vulnerability has been localized to the following span:

```python
[Vulnerable Span]
```

Please provide a corrected version of the entire code that addresses this
vulnerability.

**Important:**
- Focus on fixing the specific vulnerability identified.
- Preserve the original functionality of the code.
- If the vulnerability cannot be fixed without altering the original functionality,
explain why.
We replace the placeholders in the prompt template with the following informa-
tion:
• [Vulnerability Type]: The vulnerability class predicted by the GAT model
(e.g., Cross-Site Scripting).
• [Original Code Snippet]: The entire code snippet submitted for analysis.
• [Vulnerable Span]: The code span identified by our attention-based localiza-
tion mechanism.
We provide the generated prompts to the Google Gemini Pro model, which then
produces candidate code patches. The LLM, drawing upon its vast knowledge of
coding practices and security best practices acquired during training, attempts to
generate patches that address the identified vulnerability while preserving the func-
tionality of the original code.
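A minimal sketch of this step, assuming the google-generativeai client library; the helper name and template placeholders are ours:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")     # credential handling elided
llm = genai.GenerativeModel("gemini-pro")

def generate_patch(template, vuln_type, code, span):
    """Fill the prompt template's placeholders and request a candidate patch."""
    prompt = (template
              .replace("[Vulnerability Type]", vuln_type)
              .replace("[Original Code Snippet]", code)
              .replace("[Vulnerable Span]", span))
    return llm.generate_content(prompt).text
```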
The generated patches are then evaluated for both correctness and security using
a multi-faceted approach:
• Static Analysis: We re-run the Bandit static analysis tool [15] on the patched
code to check for the following:
– The presence of the original vulnerability: This verifies whether the gen-
erated patch effectively addresses the identified security issue.
– The introduction of new vulnerabilities: This ensures that the patch itself
does not inadvertently introduce new security risks into the code.
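This re-check can be scripted by invoking Bandit's CLI with JSON output; a sketch (the helper name is ours):

```python
import json
import subprocess
import tempfile

def bandit_findings(code):
    """Run Bandit on a code snippet and return its reported issues."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # Bandit exits non-zero when issues are found, so we avoid check=True.
    result = subprocess.run(["bandit", "-f", "json", path],
                            capture_output=True, text=True)
    return json.loads(result.stdout).get("results", [])
```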
5.7 Summary
This chapter detailed the implementation of our framework for automated vul-
nerability detection and patching. We described our methodology for constructing a
comprehensive dataset, training a GAT model to identify vulnerabilities, localizing
vulnerabilities within the code, extracting contextual information, and leveraging an
LLM to generate potential code fixes. The following chapter will evaluate the per-
formance of our framework, demonstrating its effectiveness in accurately identifying
and addressing various vulnerability types.
CHAPTER VI
Evaluation
6.1.1 Evaluation Metrics
We assess the GAT model's classification performance using four standard metrics:
• Accuracy: The proportion of all code snippets, vulnerable and benign, that
the model classifies correctly.
• Precision: The proportion of snippets flagged as vulnerable that are actually
vulnerable. A high precision signifies a low rate of false positives.
• Recall: Measures the model's ability to identify all actual vulnerabilities, rep-
resenting the proportion of correctly identified vulnerable snippets out of all
actual vulnerable snippets in the dataset. A high recall signifies a low rate of
false negatives.
• F1-Score: The harmonic mean of precision and recall, balancing the two in a
single measure.
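These metrics can be computed directly from the predictions on a held-out split; a minimal sketch with scikit-learn (array names are ours):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred are 0 (benign) / 1 (vulnerable) arrays for the evaluation split.
report = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
}
```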
Tables 6.1 and 6.2 summarize the performance of our GAT model on the valida-
tion set, trained separately on AST and CPG representations. As hypothesized, the
GAT model trained on the CPG representation consistently outperformed the model
trained on the AST representation across all evaluation metrics.
Table 6.1: GAT Model Performance on the Validation Set (AST Representation)

Metric      AST Representation
Accuracy    0.74
Precision   0.74
Recall      0.76
F1-Score    0.77

Table 6.2: GAT Model Performance on the Validation Set (CPG Representation)

Metric      CPG Representation
Accuracy    0.86
Precision   0.81
Recall      0.85
F1-Score    0.86
Figure 6.1: Confusion Matrix for GAT Model Trained on AST Representation
Figure 6.2: Confusion Matrix for GAT Model Trained on CPG Representation
Figures 6.3 and 6.4 illustrate the training loss and accuracy, respectively, over
epochs for the AST-based model. These graphs provide insights into the training
process and highlight the convergence behavior of the model. Similarly, Figures 6.5
and 6.6 depict the training loss and accuracy over epochs for the CPG-based model.
Figure 6.3: Training Loss for GAT Model Trained on AST Representation
Figure 6.4: Training Accuracy for GAT Model Trained on AST Representation
Figure 6.5: Training Loss for GAT Model Trained on CPG Representation
Figure 6.6: Training Accuracy for GAT Model Trained on CPG Representation
Based on the superior performance of the GAT model trained on the CPG rep-
resentation, we selected the CPG representation for all subsequent experiments and
evaluations. The richer semantic information embedded within CPGs proved crucial
for achieving higher accuracy and recall in vulnerability detection.
6.2 Vulnerability Localization Evaluation
We next evaluated how accurately our attention-based method pinpoints
the vulnerable code sections within the snippets correctly classified as vulnerable by
our GAT model. Accurate localization is crucial for guiding the LLM in generating
targeted and effective code fixes.
We manually examined a subset of 200 code snippets randomly selected from the
test set where the GAT model, trained on the CPG representation, correctly predicted
the presence of a vulnerability. For each snippet, we compared the vulnerable span
identified by our attention-based method to the ground truth vulnerable lines of code,
which were determined during the manual labeling process. This involved visually
inspecting the highlighted code sections and comparing them to the actual lines of
code known to contain the vulnerability.
6.3 LLM-Generated Patch Evaluation: A Multi-Perspective Assessment
We evaluated the quality of the patches generated by Google Gemini Pro using
three distinct evaluation methods: human evaluation, static analysis with Bandit,
and re-evaluation using our trained GAT model. This multi-perspective assessment
provides a comprehensive understanding of the effectiveness and reliability of the
generated patches.
We selected 100 code snippets from the test set where the GAT model (trained
on CPGs) correctly predicted a vulnerability, and our attention-based localization
accurately identified the vulnerable span. For each of these snippets, we generated
prompts following the structure described in the previous chapter. These prompts
were provided to the Gemini Pro LLM to generate candidate patches.
Each generated patch was then evaluated using the following methods:
1. Human Evaluation: We manually reviewed each patch, assessing:
• Correctness: Whether the patch effectively addressed the identified vul-
nerability while preserving the original functionality of the code.
• Code Quality: Whether the patch adhered to good coding practices and
did not introduce any new issues, such as syntax errors or logical flaws.
2. Static Analysis with Bandit: We re-ran Bandit [15] on the patched code
snippets to automatically check for the presence of the original vulnerability
and to identify any new vulnerabilities that might have been introduced by the
patch.
3. GAT Model Re-Evaluation: We fed each patched snippet back through our
trained GAT model to verify that it is no longer classified as vulnerable.
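The two automated checks can be folded into a single verdict per patch. The sketch below reuses the bandit_findings helper from the previous chapter; gat_predict is a hypothetical wrapper around our trained model:

```python
def evaluate_patch(original_code, patched_code):
    """Compare Bandit findings before and after patching, and re-classify
    the patched snippet with the trained GAT model."""
    before = {f["test_id"] for f in bandit_findings(original_code)}
    after = {f["test_id"] for f in bandit_findings(patched_code)}
    return {
        "original_issue_fixed": before.isdisjoint(after),
        "no_new_issues": after <= before,
        "gat_says_clean": gat_predict(patched_code) == 0,  # 0 = non-vulnerable
    }
```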
Table 6.3 summarizes the results of our LLM-generated patch evaluation. The
human evaluation revealed that 78% of the patches were deemed correct, effectively
addressing the identified vulnerabilities without introducing new issues or altering
the original functionality. Bandit analysis confirmed these findings, with 75% of
the patched snippets no longer triggering the original vulnerability warnings. Re-
evaluation with our trained GAT model further supported these results, with 79% of
the patched snippets now classified as non-vulnerable.

Table 6.3: LLM-Generated Patch Evaluation Results

Evaluation Method            Success Rate
Human Evaluation             78%
Static Analysis (Bandit)     75%
GAT Model Re-Evaluation      79%
However, it is important to acknowledge that not all generated patches were suc-
cessful. The remaining patches (around 20-25%) either failed to fully address the
vulnerability, introduced new issues, or altered the original functionality of the code.
This highlights the inherent limitations of current LLMs in fully understanding the
nuances of security vulnerabilities and the complexities of code repair. Further re-
search is needed to improve the accuracy and reliability of LLM-generated patches,
potentially by incorporating more sophisticated reasoning capabilities or by providing
even richer contextual information to the LLM.
6.4 Sample Result: End-to-End Vulnerability Detection and Patching
6.4.1 Input Code Snippet
Consider the following Python code snippet, which contains a Cross-Site Scripting
(XSS) vulnerability:
def display_user_profile(username):
    profile_html = f"""
    <h1>Welcome, {username}!</h1>
    """
    return profile_html
This function interpolates the user-supplied username directly into an HTML string
without any escaping. The vulnerability arises because a malicious user could provide
a username containing JavaScript code, which would then be executed in the browser
of other users viewing the generated HTML.
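For instance, a crafted username (hypothetical payload) is embedded verbatim in the returned markup:

```python
payload = "<script>alert('XSS')</script>"   # hypothetical malicious username
print(display_user_profile(payload))
# The raw <script> tag appears in the HTML and would execute in a victim's browser.
```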
Our trained GAT model, using the CPG representation of this code, correctly
predicts the presence of an XSS vulnerability. The attention-based localization mech-
anism highlights the following line as the most likely location of the vulnerability:
def display_user_profile(username):
    profile_html = f"""
    <h1>Welcome, {username}!</h1>    <-- highlighted vulnerable line
    """
    return profile_html
This accurate localization is crucial because it directs the LLM’s attention to the
specific line of code that needs to be modified.
Based on the GAT model’s prediction and localization, the following prompt is
generated for the Google Gemini Pro LLM:
The following code snippet contains a **Cross-Site Scripting (XSS)** vulnerability.

```python
def display_user_profile(username):
    profile_html = f"""
    <h1>Welcome, {username}!</h1>
    """
    return profile_html
```

The vulnerability has been localized to the following span:

```python
    <h1>Welcome, {username}!</h1>
```

Please provide a corrected version of the entire code that addresses this
vulnerability.

**Important:**
- Focus on fixing the specific vulnerability identified.
- Preserve the original functionality of the code.
- If the vulnerability cannot be fixed without altering the original functionality,
explain why.
The prompt clearly instructs the LLM to fix the XSS vulnerability, provides the
original code, highlights the vulnerable line, and emphasizes the need to maintain the
original functionality.
6.4.4 LLM-Generated Patch
The Google Gemini Pro model, in response to the prompt, generated a compre-
hensive analysis of the vulnerability and a suggested patch:
LLM Response
import html

def display_user_profile(username):
    profile_html = f"""
    <h1>Welcome, {html.escape(username)}!</h1>
    """
    return profile_html
While this code addresses the specific XSS vulnerability in the provided exam-
ple, it’s essential to consider other potential XSS attack vectors and implement
broader security measures:
- **Input Validation:** Always validate user input to prevent unexpected
or malicious data from being processed.
- **Output Encoding:** Consistently encode output to prevent XSS, SQL
injection, and other injection attacks.
- **Content Security Policy (CSP):** Implement a CSP to further restrict the
execution of scripts and other resources.
- **Regular Security Audits:** Conduct regular security assessments to identify
and address vulnerabilities.
By following these guidelines, you can significantly enhance the security of
your web application.
**Note:** This solution assumes that the `username` is directly obtained from
user input. If the `username` is retrieved from a trusted source (e.g., a database),
the risk of XSS might be mitigated, but it's still recommended to escape the
output for best practices.
**Would you like to explore additional security measures or discuss a specific
use case?**
The LLM not only provided a corrected code snippet that uses the html.escape
function to sanitize the user input but also included a detailed explanation of the
vulnerability, the fix, and additional security considerations. This demonstrates a
deeper level of understanding and response from the LLM, making the patch more
informative and actionable for developers.
6.5 Discussion
Our evaluation highlights several key findings:
• CPG Representation Superiority: The GAT model trained on CPGs con-
sistently outperformed the model trained on ASTs, highlighting the importance
of leveraging richer semantic information for vulnerability detection.
These results suggest that our hybrid approach, integrating deep learning-based
vulnerability detection with LLM-powered code generation, offers a promising av-
enue for enhancing code security and improving the reliability of software systems.
However, it is crucial to acknowledge the limitations of current LLMs in fully under-
standing the intricacies of code vulnerabilities and repair. A larger and more diverse
dataset could also further enhance the GAT model, improving the overall effectiveness
of the proposed system. Further research is needed to enhance the accuracy and
reliability of LLM-generated patches and to address the remaining challenges in
automating code security.
CHAPTER VII
Conclusion and Future Work
Our experiments demonstrated that using Code Property Graphs (CPGs) as the code representation signifi-
cantly enhances vulnerability detection accuracy compared to using Abstract Syntax
Trees (ASTs). Furthermore, our attention-based localization technique, leveraging
the attention weights learned by our GAT model, exhibited remarkable accuracy in
pinpointing the vulnerable code sections within snippets correctly classified as vulner-
able. Our evaluation of LLM-generated patches, guided by our framework, revealed
another promising finding. We observed a high success rate, ranging from 75% to
80%, in generating correct and effective fixes for the identified vulnerabilities. This
result underscores the significant potential of LLMs for automated code repair when
provided with accurate contextual information about the vulnerability and its precise
location within the code.
Our work makes several significant contributions to the field of automated vul-
nerability detection and rectification. We provide compelling empirical evidence of
the effectiveness of CPGs as a code representation for vulnerability detection. We
introduce and validate an attention-based localization technique that effectively pin-
points vulnerable code sections. Most importantly, we demonstrate the viability of a
hybrid approach that combines the strengths of graph-based deep learning and LLMs,
offering a promising direction for automating code security.
While our framework has shown promising results, there are limitations and areas
for future research that can further advance the field of automated code security.
One limitation lies in the size and diversity of our dataset. Although our dataset
is already substantial, expanding it to incorporate code from more programming
languages, software domains, and vulnerability types would significantly enhance the generaliz-
ability and robustness of our model. A larger and more diverse dataset would enable
the model to learn from a wider range of coding practices, vulnerability patterns, and
semantic contexts, making it more adaptable to real-world codebases.
Another area for improvement is the scalability of CPG construction. Building
CPGs for large codebases can be computationally expensive, potentially limiting the
applicability of our approach to massive software projects. Investigating more effi-
cient methods for CPG construction or exploring techniques for selectively generating
CPGs for specific code sections, particularly those flagged as potentially vulnerable,
could enhance the scalability of our framework.
There is also room for improvement in the quality and reliability of LLM-generated
patches. While our framework provides targeted guidance to LLMs, achieving con-
sistently accurate and secure code fixes remains a challenge. Incorporating more
sophisticated reasoning capabilities into LLMs, such as symbolic execution or formal
verification, could lead to more robust and trustworthy code repairs. Furthermore,
exploring alternative methods for encoding contextual information or experimenting
with different prompt engineering techniques might also enhance the LLM’s under-
standing of the vulnerability and its impact on the code.
The current scope of our framework is limited to addressing individual vulnerabil-
ities. Future research could explore extending it to handle more complex scenarios,
such as vulnerabilities that involve the interaction of multiple code components or
those requiring a sequence of patches to be fully addressed. Finally, while we con-
ducted human evaluation of the generated patches, a more extensive and systematic
human-in-the-loop evaluation would be highly beneficial. Additionally, integrating
our framework into real-world development workflows and conducting user studies to
assess its usability and effectiveness in practice would provide valuable insights into
its real-world impact and guide further improvements.
By addressing the identified limitations and pursuing the proposed directions for
future work, we can strive towards more comprehensive and reliable automated code
security solutions, contributing to the development of safer and more resilient software
systems that are essential for the functioning of our increasingly digital world.
Bibliography
[1] J. Viega and G. McGraw, Building Secure Software: How to Avoid Security
Problems the Right Way. Addison-Wesley, 2003.
[4] M. Sutton, A. Greene, and P. Amini, Fuzzing: Brute Force Vulnerability Discov-
ery. Indianapolis, IN: Addison-Wesley Professional, 2007.
[7] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, "Vuldeep-
ecker: A deep learning-based system for vulnerability detection," in Proceedings
of the 25th Network and Distributed System Security Symposium (NDSS), 2018.
[9] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu,
D. Jiang, et al., "Codebert: A pre-trained model for programming and natural
languages," arXiv preprint arXiv:2002.08155, 2020.
[10] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, "Devign: Effective vulnerability
identification by learning comprehensive program semantics via graph neural
networks," in Advances in Neural Information Processing Systems 32, 2019.
[11] Y. Li, S. Zheng, Y. Pei, and S. He, “Graph representation learning for code
analysis tasks: A survey,” ACM Computing Surveys (CSUR), vol. 56, no. 4,
pp. 1–56, 2023.
[13] D. Yan, A. Rountev, and S. Malik, “Code property graphs: Towards a unified
system for program analysis,” in Proceedings of the 38th International Conference
on Software Engineering, pp. 931–942, ACM, 2016.
[22] J. Zhou, G. Cui, Z. Zhang, C. Y. Yang, Z. Liu, L. Wang, C. Li, and M. Sun,
“Graph neural networks: A review of methods and applications,” AI Open, vol. 1,
pp. 57–81, 2020.
[23] C. Sadowski, K. T. Stolee, and S. Elbaum, “Lessons from building static analysis
tools at google,” in Proceedings of the 40th International Conference on Software
Engineering: Software Engineering in Practice, pp. 1–10, 2018.
[24] A. Shostack, Threat modeling: designing for security. John Wiley & Sons, 2014.
[25] B. Chess and J. West, Secure Programming with Static Analysis. Addison-Wesley
Professional, 2007.
[28] Y. Chen, Z. Cui, X. Wang, J. Xue, and X. Cai, “A survey of fuzzing techniques,”
ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1–35, 2018.
[32] Q. Feng, Y. Zhou, C. Xu, Y. Yu, X. Lu, and B. Xu, “Automated vulnerability
detection for web applications based on deep learning,” in Proceedings of the
27th International Conference on World Wide Web, pp. 845–854, 2018.
[34] W. Samek, T. Wiegand, and K.-R. Müller, “Explainable ai: interpreting, ex-
plaining and visualizing deep learning,” arXiv preprint arXiv:1708.08296, 2017.
[35] L. Tang, Z. Xu, Z. Yan, Y. Zhou, and Y. Liu, “Cpgsec: A cpg-based approach
for identifying security-related code smells,” in Proceedings of the 29th ACM
SIGSOFT International Symposium on Software Testing and Analysis, pp. 301–
312, 2020.
[36] Y. Sun, Z. Liu, W. Li, and Y. Sun, “Cpg-vul: Cpg-based vulnerability detection
for smart contracts,” in Proceedings of the 2022 International Conference on
Software Engineering (ICSE), pp. 1541–1552, 2022.
[37] T. Nguyen, A. Nguyen, T. Tran, H. A. Rajan, and T. N. Nguyen, “Graph neu-
ral networks for software vulnerability detection: A survey,” ACM Computing
Surveys (CSUR), vol. 55, no. 8, pp. 1–35, 2022.
[42] J. Pearl and D. Mackenzie, “The seven deadly sins of ai predictions,” ACM
Computing Surveys (CSUR), vol. 55, no. 4, pp. 1–36, 2022.
[45] E. Dinella, C. Henning, J. Si, and Z. Su, “Learning to rank code-change sugges-
tions for automated program repair,” arXiv preprint arXiv:2004.05827, 2020.
[46] T. Lutellier, M. Tan, Y. Qi, S.-W. Zhou, and D. Poshyvanyk, “Coconut: Combin-
ing context-aware neural translation models using ensemble for program repair,”
in Proceedings of the 28th ACM Joint Meeting on European Software Engineer-
ing Conference and Symposium on the Foundations of Software Engineering,
pp. 1021–1033, 2020.
[48] r2c, “Semgrep: A fast, open-source, static analysis tool for finding and preventing
security vulnerabilities.” https://semgrep.dev/.
[50] F. E. Allen, “Control flow analysis,” SIGPLAN Notices, vol. 5, no. 7, pp. 1–19,
1970.
[52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[54] W. G. Halfond, J. Viegas, and A. Orso, “Sql injection attacks and defense tech-
niques,” in Proceedings of the 2006 international workshop on dynamic analysis,
pp. 1–7, 2006.
[55] M. Zalewski, Cross site scripting attacks: Cross-site scripting vulnerabilities and
techniques to prevent them. No Starch Press, 2002.
[60] M. Fey and J. E. Lenssen, “Fast graph representation learning with pytorch
geometric,” arXiv preprint arXiv:1903.02428, 2019.