PENTESTGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing
Gelei Deng1,§, Yi Liu1,§, Víctor Mayoral-Vilches2,3, Peng Liu4, Yuekang Li5,∗, Yuan Xu1,
Tianwei Zhang1, Yang Liu1, Martin Pinzger3, Stefan Rass6
1 Nanyang Technological University, 2 Alias Robotics, 3 Alpen-Adria-Universität Klagenfurt,
4 Institute for Infocomm Research (I2R), A*STAR, Singapore, 5 University of New South Wales,
6 Johannes Kepler University Linz
∗ Corresponding author. § Equal contribution.
Abstract

Penetration testing, a crucial industrial practice for ensuring system security, has traditionally resisted automation due to the extensive expertise required by human professionals. Large Language Models (LLMs) have shown significant advancements in various domains, and their emergent abilities suggest their potential to revolutionize industries. In this work, we establish a comprehensive benchmark using real-world penetration testing targets and further use it to explore the capabilities of LLMs in this domain. Our findings reveal that while LLMs demonstrate proficiency in specific sub-tasks within the penetration testing process, such as using testing tools, interpreting outputs, and proposing subsequent actions, they also encounter difficulties in maintaining the whole context of the overall testing scenario.

Based on these insights, we introduce PENTESTGPT, an LLM-empowered automated penetration testing framework that leverages the abundant domain knowledge inherent in LLMs. PENTESTGPT is meticulously designed with three self-interacting modules, each addressing individual sub-tasks of penetration testing, to mitigate the challenges related to context loss. Our evaluation shows that PENTESTGPT not only outperforms standalone LLMs, with a task-completion increase of 228.6% compared to the GPT-3.5 model on the benchmark targets, but also proves effective in tackling real-world penetration testing targets and CTF challenges. Having been open-sourced on GitHub, PENTESTGPT has garnered over 6,500 stars in 12 months and fostered active community engagement, attesting to its value and impact in both the academic and industrial spheres.

1 Introduction

Securing a system presents a formidable challenge. Offensive security methods like penetration testing (pen-testing) and red teaming are now essential in the security lifecycle. As explained by Applebaum [1], these approaches involve security teams attempting breaches to reveal vulnerabilities, providing advantages over traditional defenses, which rely on incomplete system knowledge and modeling. This study, guided by the principle "the best defense is a good offense", focuses on offensive strategies, specifically penetration testing.

Penetration testing is a proactive offensive technique for identifying, assessing, and mitigating security vulnerabilities [2]. It involves targeted attacks to confirm flaws, yielding a comprehensive inventory of vulnerabilities with actionable recommendations. This widely used practice empowers organizations to detect and neutralize network and system vulnerabilities before malicious exploitation. However, it typically relies on manual effort and specialized knowledge [3], resulting in a labor-intensive process and creating a gap in meeting the growing demand for efficient security evaluations.

Large Language Models (LLMs) have demonstrated profound capabilities, showcasing intricate comprehension of human-like text and achieving remarkable results across a multitude of tasks [4, 5]. An outstanding characteristic of LLMs is their emergent abilities [6], cultivated during training, which empower them to undertake intricate tasks such as reasoning, summarization, and domain-specific problem-solving without task-specific fine-tuning. This versatility posits LLMs as potential game-changers in various fields, notably cybersecurity. Although recent works [7–9] posit the potential of LLMs to reshape cybersecurity practices, including in the context of penetration testing, there is an absence of a systematic, quantitative assessment of their aptitude in this regard. Consequently, an imperative question presents itself: to what extent can LLMs automate penetration testing?

Motivated by this question, we set out to explore the capability boundary of LLMs on real-world penetration testing tasks. Unfortunately, the current benchmarks for penetration testing [10, 11] are not comprehensive and fail to assess progressive accomplishments fairly during the process. To address this limitation, we construct a robust benchmark that includes test machines from HackTheBox [12] and VulnHub [13].
Following the criteria outlined previously, we develop a comprehensive benchmark that closely reflects real-world penetration testing tasks. The design process progresses through several stages.

Task Selection. We begin by selecting tasks from HackTheBox [12] and VulnHub [13], two leading penetration testing training platforms. Our selection criteria are designed to ensure that our benchmark accurately reflects the challenges encountered in practical penetration testing environments. We meticulously review the latest machines available on both platforms, aiming to identify and select a subset that comprehensively covers all vulnerabilities listed in the OWASP [14] Top 10 Project. Additionally, we choose machines that represent a mix of difficulties, classified according to traditional standards in the penetration testing domain into easy, medium, and hard categories. This process guarantees that our benchmark spans the full spectrum of vulnerabilities and difficulties. Note that our benchmark does not include benign targets for assessing false positives: although benign targets are sometimes explored in penetration testing, our main objective remains identifying true vulnerabilities.

Task Decomposition. We further parse the testing process of each target into a series of sub-tasks, following the standard solution commonly referred to as the "walkthrough" in penetration testing. Each sub-task corresponds to a unique step in the overall process. We decompose sub-tasks following NIST 800-115 [29], the Technical Guide to Security Testing. Each sub-task is either one step declared in the Guide (e.g., network discovery, password cracking) or an operation that exploits a unique vulnerability categorized in the Common Weakness Enumeration (CWE) [15] (e.g., exploiting SQL injection, CWE-89 [30]). In the end, we formulate an exhaustive list of sub-tasks for every benchmark target.

Benchmark Validation. The final stage of our benchmark development involves rigorous validation, which ensures the reproducibility of these benchmark machines. To do this, three certified penetration testers independently attempt the penetration testing targets and write their walkthroughs. We then adjust our task decomposition accordingly, because some targets may have multiple valid solutions.

Ultimately, we have compiled a benchmark that effectively encompasses all types of vulnerabilities listed in the OWASP [14] Top 10 Project. It comprises 13 penetration testing targets at varying levels of difficulty, broken down into 182 sub-tasks across 26 categories covering 18 distinct CWE items. This number of targets is deemed sufficient to represent a broad spectrum of vulnerabilities, difficulty levels, and varieties essential for comprehensive penetration testing training. To foster community development, we have made this benchmark publicly available online [22].
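To make the decomposition concrete, the sketch below shows one plausible way a benchmark entry could be encoded. The class and field names are hypothetical illustrations, not taken from the released benchmark.

from dataclasses import dataclass, field

@dataclass
class SubTask:
    # One decomposed step: either a NIST 800-115 activity or a CWE exploitation.
    name: str          # e.g., "Port Scanning"
    category: str      # e.g., "network discovery" or "CWE-89"
    completed: bool = False

@dataclass
class BenchmarkTarget:
    name: str          # e.g., a HackTheBox or VulnHub machine
    difficulty: str    # "easy", "medium", or "hard"
    sub_tasks: list = field(default_factory=list)

target = BenchmarkTarget(
    name="Example-Machine",
    difficulty="easy",
    sub_tasks=[
        SubTask("Port Scanning", "network discovery"),
        SubTask("Exploit SQL injection", "CWE-89"),
    ],
)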
4 Exploratory Study

We conduct an exploratory study to assess the capabilities of LLMs in penetration testing, with the primary objective of determining how well LLMs can adapt to the real-world complexities and challenges of this task. Specifically, we aim to address the following two research questions:

RQ1 (Capability): To what extent can LLMs perform penetration testing tasks?

RQ2 (Comparative Analysis): How do the problem-solving strategies of human penetration testers and LLMs differ?

We utilize the benchmark described in Section 3 to evaluate the performance of LLMs on penetration testing tasks. In the following, we first delineate our testing strategy for this study. Subsequently, we present the testing results and an analytical discussion to address the above research questions.

4.1 Testing Strategy

LLMs are text-based and cannot independently perform penetration testing operations. To address this, we develop a human-in-the-loop testing strategy, serving as an intermediary method to accurately assess LLMs' capabilities. This strategy features an interactive loop in which a human expert executes the LLM's penetration testing directives. Importantly, the human expert functions purely as an executor, strictly following the LLM's instructions without adding any expert insights or making independent decisions.

Figure 1 depicts the testing strategy with the following steps: ❶ We initiate the looped testing procedure by presenting the target specifics to the LLM, seeking its guidance on potential penetration testing steps. ❷ The human expert strictly follows the LLM's recommendations and conducts the suggested actions in the penetration testing environment. ❸ Outcomes of the testing actions are collected and summarized: direct text outputs such as terminal outputs or source code are documented; non-textual results, such as graphical representations, are translated by the human expert into succinct textual summaries. The data is then fed back to the LLM, setting the stage for its subsequent recommendations. ❹ This iterative process persists until either a conclusive solution is identified or a deadlock is reached. We then compile a record of the testing procedures, encompassing successful sub-tasks, ineffective actions, and any reasons for failure, if applicable. For a more tangible grasp of this strategy, we offer illustrative examples of prompts and corresponding outputs from GPT-4 related to one of our benchmark targets in Appendix Section A.

To ensure the evaluation's fairness and accuracy, we employ several strategies. First, we involve expert-level penetration testers¹ as the human testers. With their deep pentesting knowledge, these testers can precisely comprehend and execute LLM-generated operations, thus accurately assessing the models' true capabilities.

¹ We selected Offensive Security Certified Professionals (OSCP) testers.
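For concreteness, the looped strategy can be sketched as follows, assuming a generic llm.chat(text) interface. The helper names are illustrative only; in the study itself, this loop was driven manually by the human testers.

def evaluate_llm_on_target(llm, human_execute, target, max_rounds=50):
    record = []  # (instruction, outcome) pairs for later sub-task scoring
    instruction = llm.chat(
        f"Target: {target}. Suggest the next penetration testing step."
    )
    for _ in range(max_rounds):
        # Step 2: the human executes the directive verbatim, adding no insight.
        # Step 3: non-textual results are summarized into text by the human.
        outcome = human_execute(instruction)
        record.append((instruction, outcome))
        # Step 4: loop until the target is solved or a deadlock is reached.
        if outcome in {"SOLVED", "DEADLOCK"}:
            break
        instruction = llm.chat(f"Result:\n{outcome}\nWhat is the next step?")
    return record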
Table 2: Top 10 Types of Sub-tasks completed by each tool (WT denotes the number of occurrences in the standard walkthroughs).

Sub-Tasks                   WT   GPT-3.5     GPT-4       Bard
Web Enumeration             18   4 (22.2%)   8 (44.4%)   4 (22.2%)
Code Analysis               18   4 (22.2%)   5 (27.8%)   4 (22.2%)
Port Scanning               12   9 (75.0%)   9 (75.0%)   9 (75.0%)
Shell Construction          11   3 (27.3%)   8 (72.7%)   4 (36.4%)
File Enumeration            11   1 (9.1%)    7 (63.6%)   1 (9.1%)
Configuration Enumeration    8   2 (25.0%)   4 (50.0%)   3 (37.5%)
Cryptanalysis                8   2 (25.0%)   3 (37.5%)   1 (12.5%)
Network Enumeration          7   1 (14.3%)   3 (42.9%)   2 (28.6%)
Command Injection            6   1 (16.7%)   4 (66.7%)   2 (33.3%)
Known Exploits               6   2 (33.3%)   3 (50.0%)   1 (16.7%)

Table 4: Top causes for failed penetration testing trials.

Failure Reasons                        GPT-3.5   GPT-4   Bard   Total
Session context lost                        25      18     31      74
False Command Generation                    23      12     20      55
Deadlock operations                         19      10     16      45
False Scanning Output Interpretation        13       9     18      40
False Source Code Interpretation            16      11     10      37
Cannot craft valid exploit                  11      15      8      34
Table 3: Top Unnecessary Operations Prompted by LLMs on the Benchmark Targets.

Unnecessary Operations                 GPT-3.5   GPT-4   Bard   Total
Brute-Force                                 75      92     68     235
Exploit Known Vulnerabilities (CVEs)        29      24     28      81
SQL Injection                               14      21     16      51
Command Injection                           18       7     12      37

This often culminates in identifying potential vulnerabilities from code snippets and crafting the corresponding exploits. Notably, GPT-4 outperforms the other two models regarding code interpretation and generation, making it the most suitable candidate for penetration testing tasks.

Finding 2: LLMs can efficiently use penetration testing tools, identify common vulnerabilities, and interpret source code to identify vulnerabilities.

We employ the same method to formulate benchmark sub-tasks as Section 3 outlines. By comparing these to a standard walkthrough, we identify the sub-task trials that fall outside the standard walkthrough and are thus irrelevant to the penetration testing process. The results are summarized in Table 3. We find that the most prevalent unnecessary operation prompted by LLMs is brute force: for all services requiring password authentication, LLMs typically advise brute-forcing them, an ineffective strategy in penetration testing. We surmise that many enterprise hacking incidents involve password cracking and brute force; LLMs learn about these attacks from incident reports and consequently treat them as viable solutions. Besides brute force, LLMs suggest that testers engage in CVE studies, SQL injections, and command injections. These recommendations are common, as real-world penetration testers often prioritize these techniques, even though they may not always provide the exact solution.
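A minimal sketch of this comparison, assuming trial sub-tasks and walkthrough steps have already been normalized to a shared category vocabulary:

from collections import Counter

def unnecessary_operations(trials, walkthrough):
    # Count attempted sub-tasks that fall outside the standard walkthrough.
    return Counter(t for t in trials if t not in walkthrough)

walkthrough = {"Port Scanning", "Web Enumeration", "Command Injection"}
trials = ["Port Scanning", "Brute-Force", "Brute-Force", "SQL Injection"]
print(unnecessary_operations(trials, walkthrough))
# Counter({'Brute-Force': 2, 'SQL Injection': 1})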
[Figure: Overview of the PENTESTGPT architecture. The Parsing Module, Reasoning Module, and Generation Module interact with one another and with the testing environment (testing targets, testing tools, and operations), with optional verification by the user.]
The Reasoning Module plays a pivotal role in our system, analogous to a team lead overseeing the penetration testing task from a macro perspective. It obtains testing results or intentions from the user and prepares the testing strategy for the next step. This testing strategy is passed to the Generation Module for further planning.

To effectively supervise the penetration testing process and provide precise guidance, it is crucial to translate the testing procedures and outcomes into a natural language format. Drawing inspiration from the concept of an attack tree [45], which is often used to outline penetration testing procedures, we introduce the notion of a pentesting task tree (PTT).

Definition 2 (Pentesting Task Tree) A PTT T is a pair (N, A), where: (1) N is a set of nodes organized in a tree structure, each node having a unique parent except the root; and (2) A is a function that assigns to each node in N a list of attributes recording its testing status and outcomes.

[Figure 2: (a) PTT representation; (b) PTT representation in natural language:]
Task Tree:
1. Perform port scanning (completed)
   - Port 21, 22 and 80 are open.
   - Services are FTP, SSH, and Web Service.
2. Perform the testing
   2.1 Test FTP Service
      2.1.1 Test Anonymous Login (success)
         2.1.1.1 Test Anonymous Upload (success)
   2.2 Test SSH Service
      2.2.1 Brute-force (failed)
   2.3 Test Web Service (ongoing)
      2.3.1 Directory Enumeration
         2.3.1.1 Find hidden admin (to-do)
      2.3.2 Injection Identification (to-do)

As outlined in Figure 2, the Reasoning Module's operation unfolds over four key steps operating over the PTT.
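The sketch below shows one plausible encoding of a PTT and its serialization into the numbered natural-language form of Figure 2(b). The class and field names are illustrative, not PENTESTGPT's actual implementation.

from dataclasses import dataclass, field

@dataclass
class PTTNode:
    task: str                     # a node in N
    status: str = "to-do"         # an attribute assigned by A
    children: list = field(default_factory=list)

def render(node, prefix=""):
    # Serialize the tree into the numbered form of Figure 2(b).
    lines = []
    for i, child in enumerate(node.children, 1):
        number = f"{prefix}{i}"
        lines.append(f"{number}. {child.task} ({child.status})")
        sub = render(child, f"{number}.")
        if sub:
            lines.append(sub)
    return "\n".join(lines)

root = PTTNode("root", children=[
    PTTNode("Perform port scanning", "completed"),
    PTTNode("Perform the testing", "ongoing", children=[
        PTTNode("Test FTP Service", "ongoing", children=[
            PTTNode("Test Anonymous Login", "success"),
        ]),
    ]),
])
print(render(root))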
[Figure 4: A demonstration of the task-tree update process on the testing target HTB-Carrier.]
promising and chooses to investigate the web service, often seen as more vulnerable. This task is passed to the Generation Module, which turns this general task into a detailed process employing nikto [49], a commonly used web scanning script. The iterative process continues until the tester completes the penetration testing task.

5.5 Parsing Module

The Parsing Module operates as a supportive interface, enabling effective processing of the natural language information exchanged between the user and the other two core modules. Two needs primarily justify the existence of this module. First, security testing tool outputs are typically verbose and laden with extraneous details, making it computationally expensive and unnecessarily redundant to feed these extended outputs directly into the LLMs. Second, users without specialized knowledge in the security domain may struggle to extract key insights from security testing outputs, presenting challenges in summarizing crucial testing information. Consequently, the Parsing Module is essential in streamlining and condensing this information.

In PENTESTGPT, the Parsing Module is devised to handle four distinct types of information: (1) user intentions, which are directives provided by the user to dictate the next course of action; (2) security testing tool outputs, which represent the raw outputs generated by an array of security testing tools; (3) raw HTTP web information, which encompasses all raw information derived from HTTP web interfaces; and (4) source code extracted during the penetration testing process. Users must specify the category of the information they provide, and each category is paired with a set of carefully designed prompts. For source code analysis, we integrate the GPT-4 code interpreter [50] to execute the task.
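A minimal sketch of this category dispatch, with illustrative prompt wording (the actual PENTESTGPT prompts are more elaborate) and a generic summarize LLM call:

PARSING_PROMPTS = {
    "user_intention": "Rephrase the user's directive as a concise testing intention:",
    "tool_output": "Extract only the security-relevant findings from this tool output:",
    "web_http": "Summarize the key attack surface in this raw HTTP data:",
    "source_code": "Identify potentially vulnerable constructs in this source code:",
}

def parse(category, text, summarize, chunk_size=3000):
    # The user declares the category; each category has its own prompt.
    prompt = PARSING_PROMPTS[category]
    # Chunk verbose inputs so each piece fits within the model's context window.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return "\n".join(summarize(f"{prompt}\n{chunk}") for chunk in chunks)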
5.6 Active Feedback

While LLMs can produce insightful outputs, their outcomes sometimes require revisions. To facilitate this, we introduce an interactive handle in PENTESTGPT, known as active feedback, which allows the user to interact directly with the Reasoning Module. A vital feature of this process is that it does not alter the context within the Reasoning Module unless the user explicitly desires to update some information. The reasoning context, including the PTT, is stored as a fixed chunk of tokens. This chunk is provided to a new LLM session during an active feedback interaction, and users can pose questions about it. This ensures that the original session remains unaffected, and users can always query the reasoning context without making unnecessary changes. If the user believes it necessary to update the PTT, they can explicitly instruct the model to update the reasoning context history accordingly. This provides a robust and flexible framework for the user to participate actively in the decision-making process.
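A minimal sketch of the session isolation this implies, assuming a hypothetical new_session factory that creates an independent chat session:

def active_feedback(reasoning_context, question, new_session):
    # The stored PTT context is copied into a throwaway session, so the
    # original Reasoning Module session is never mutated by the question.
    session = new_session()
    return session.chat(f"{reasoning_context}\n\nUser question: {question}")

def apply_user_update(reasoning_module, new_fact):
    # Only an explicit user instruction updates the persistent PTT context.
    reasoning_module.update_context(new_fact)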
5.7 Discussion

We explore various design alternatives for PENTESTGPT to tackle the challenges identified in the Exploratory Study. We have experimented with different designs, and here we discuss some key decisions.

Addressing Context Loss with Token Size. A straightforward solution to alleviate context loss is the employment of LLMs with an extended token size. For instance, GPT-4 provides versions with 8k and 32k token size limits. This approach, however, confronts two substantial challenges. First, even a 32k token size might be inadequate for penetration testing scenarios, as the output of a single testing tool like dirbuster [51] may comprise thousands of tokens.
[Figure 5: The performance of GPT-3.5, GPT-4, PENTESTGPT-GPT-3.5, and PENTESTGPT-GPT-4 on easy, medium, and hard targets: (a) overall completion status; (b) sub-task completion status.]

[Figure 6: Penetration testing strategy comparison between GPT-3.5 and PENTESTGPT on VulnHub-Hackable II. In one strategy, the arbitrary file upload and reverse shell flows (Flow 1 and Flow 2) remain independent; in the other, the two flows are interrelated.]

similarly to human experts and prioritizes effectively. Rather than just addressing the latest identified task, PENTESTGPT identifies key sub-tasks that can result in success.

Figure 6 contrasts the strategies of GPT-4 and PENTESTGPT on the VulnHub machine Hackable II [58]. This machine features two vulnerabilities: an FTP service for file uploads and a web service to view FTP files. A valid exploit requires both services. The figure shows GPT-4 starting with the FTP service and identifying the upload vulnerability (❶-❸). Yet, it does not link this to the web service, resulting in an incomplete exploit. In contrast, PENTESTGPT shifts between the FTP and web services. It first explores both services (❶-❷), then focuses on the FTP service (❸-❹), realizing that the FTP and web files are identical. With this insight, PENTESTGPT instructs the tester to upload a shell (❺), achieving a successful reverse shell (❻). This matches the solution guide and underscores PENTESTGPT's adeptness at integrating various testing aspects.

Our second observation is that although PENTESTGPT behaves more similarly to human experts, it still exhibits some strategies that human experts would not apply. For instance, PENTESTGPT still prioritizes brute-force attacks before vulnerability scanning, as is evident in cases where it always tries to brute-force the SSH service on target machines.

We analyze cases where penetration testing with PENTESTGPT failed, identifying three primary limitations. First, PENTESTGPT struggles with image interpretation: LLMs are unable to process images, which are crucial in certain penetration testing scenarios. Addressing this limitation may require the development of advanced multimodal models that can interpret both text and visual data. Second, PENTESTGPT lacks the ability to employ certain social engineering techniques and to detect subtle cues. For example, while a human tester might generate a brute-force wordlist from information extracted from a target service, PENTESTGPT can retrieve names from a web service but fails to guide the usage of tools needed to create a wordlist from these names. Third, the models struggle with accurate exploitation code construction within a limited number of trials. Despite some proficiency in code comprehension and generation, the LLM falls short in producing detailed exploitation scripts, particularly for low-level bytecode operations. These limitations underline the necessity for improvement in areas where human insight and intricate reasoning remain more proficient than automated solutions.
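As an illustration of the second limitation, the wordlist-generation step that PENTESTGPT fails to guide might look like this rough sketch; the mangling rules below are arbitrary examples of what a human tester would apply:

def wordlist_from_names(names):
    # Derive brute-force candidates from names scraped off the target.
    candidates = set()
    for name in names:
        base = name.lower().replace(" ", "")
        candidates.update({base, base + "123", base + "2024", base.capitalize() + "!"})
    return sorted(candidates)

print(wordlist_from_names(["Alice Smith", "bob"]))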
6.4 Ablation Study (RQ5)

We perform an ablation study on how the three modules (Reasoning Module, Generation Module, and Parsing Module) contribute to the performance of PENTESTGPT. We implement three variants:

1. PENTESTGPT-NO-PARSING: the Parsing Module is deactivated, causing all data to be fed directly into the system.

2. PENTESTGPT-NO-GENERATION: the Generation Module is deactivated, so task generation is completed within the Reasoning Module itself. The prompts for task generation remain consistent.

3. PENTESTGPT-NO-REASONING: the Reasoning Module is disabled. Instead of the PTT, this variant adopts the same methodology utilized with LLMs for penetration testing, as delineated in the Exploratory Study.
Appendix A: Example prompts and corresponding outputs from GPT-4 on a benchmark target (excerpt).

GPT-4: Yes, I understand. Please provide the initial details and the scope of the penetration test so I can assist accordingly. Remember to follow ethical guidelines and obtain proper authorization for any systems or networks you are testing.

Human Tester: nmap -sV -sT 192.168.1.5

GPT-4: This command will provide a list of open ports and the services (with versions) running on those ports. Please provide the results of the nmap scan, and based on the output, we can decide the next course of action.
[Appendix figure: walkthrough task graph for the benchmark target, reconstructed from the original flowchart.]

Gaining web user access:
3. Vulnerable File Enumeration (not vulnerable)
8. Examine uploaded file (file uploaded to web service)
9. Reverse Shell Construction and Upload
10. Trigger Reverse Shell → web user (www-data) access

Privilege escalation to normal user:
11. System Configuration Enumeration
12. Enumerate "shrek" files (no access)
13. Enumerate Apache Service (not vulnerable)
14. cron enumeration (not useful)
15. Local File Enumeration → an interesting "runme.sh"; Vulnerable Service Enumeration → a user named "shrek" is presented; the user controls the Apache service
16. Crack the hash in the file → get password; use "shrek" as username
17. Privilege Escalation to user "shrek" → user "shrek" access obtained

Privilege escalation to root:
18. System Configuration, cron, Local File, and Vulnerable Service Enumeration → "shrek" can run Python with sudo access
19. Privilege Escalation to root