PENTESTGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing
Gelei Deng1,§, Yi Liu1,§, Víctor Mayoral-Vilches2,3, Peng Liu4, Yuekang Li5,∗, Yuan Xu1,
Tianwei Zhang1, Yang Liu1, Martin Pinzger3, Stefan Rass6
1 Nanyang Technological University, 2 Alias Robotics, 3 Alpen-Adria-Universität Klagenfurt,
4 Institute for Infocomm Research (I2R), A*STAR, Singapore, 5 University of New South Wales,
6 Johannes Kepler University Linz
∗ Corresponding author. § Equal contribution.
Abstract

Penetration testing, a crucial industrial practice for ensuring system security, has traditionally resisted automation due to the extensive expertise required by human professionals. Large Language Models (LLMs) have shown significant advancements in various domains, and their emergent abilities suggest their potential to revolutionize industries. In this work, we establish a comprehensive benchmark using real-world penetration testing targets and further use it to explore the capabilities of LLMs in this domain. Our findings reveal that while LLMs demonstrate proficiency in specific sub-tasks within the penetration testing process, such as using testing tools, interpreting outputs, and proposing subsequent actions, they also encounter difficulties in maintaining the whole context of the overall testing scenario.

Based on these insights, we introduce PENTESTGPT, an LLM-empowered automated penetration testing framework that leverages the abundant domain knowledge inherent in LLMs. PENTESTGPT is meticulously designed with three self-interacting modules, each addressing individual sub-tasks of penetration testing, to mitigate the challenges related to context loss. Our evaluation shows that PENTESTGPT not only outperforms standalone LLMs, with a task-completion increase of 228.6% compared to the GPT-3.5 model on the benchmark targets, but also proves effective in tackling real-world penetration testing targets and CTF challenges. Having been open-sourced on GitHub, PENTESTGPT has garnered over 6,500 stars in 12 months and fostered active community engagement, attesting to its value and impact in both the academic and industrial spheres.

1 Introduction

Securing a system presents a formidable challenge. Offensive security methods like penetration testing (pen-testing) and red teaming are now essential in the security lifecycle. As explained by Applebaum [1], these approaches involve security teams attempting breaches to reveal vulnerabilities, providing advantages over traditional defenses, which rely on incomplete system knowledge and modeling. This study, guided by the principle "the best defense is a good offense", focuses on offensive strategies, specifically penetration testing.

Penetration testing is a proactive offensive technique for identifying, assessing, and mitigating security vulnerabilities [2]. It involves targeted attacks to confirm flaws, yielding a comprehensive inventory of vulnerabilities with actionable recommendations. This widely used practice empowers organizations to detect and neutralize network and system vulnerabilities before malicious exploitation. However, it typically relies on manual effort and specialized knowledge [3], resulting in a labor-intensive process and creating a gap in meeting the growing demand for efficient security evaluations.

Large Language Models (LLMs) have demonstrated profound capabilities, showcasing intricate comprehension of human-like text and achieving remarkable results across a multitude of tasks [4, 5]. An outstanding characteristic of LLMs is their emergent abilities [6], cultivated during training, which empower them to undertake intricate tasks such as reasoning, summarization, and domain-specific problem-solving without task-specific fine-tuning. This versatility posits LLMs as potential game-changers in various fields, notably cybersecurity. Although recent works [7–9] posit the potential of LLMs to reshape cybersecurity practices, including in the context of penetration testing, there is an absence of a systematic, quantitative assessment of their aptitude in this regard. Consequently, an imperative question presents itself: to what extent can LLMs automate penetration testing?

Motivated by this question, we set out to explore the capability boundary of LLMs on real-world penetration testing tasks. Unfortunately, the current benchmarks for penetration testing [10, 11] are not comprehensive and fail to assess progressive accomplishments fairly during the process. To address this limitation, we construct a robust benchmark that includes test machines from HackTheBox [12] and VulnHub [13].
Following the criteria outlined previously, we develop a comprehensive benchmark that closely reflects real-world penetration testing tasks. The design process progresses through several stages.

Task Selection. We begin by selecting tasks from HackTheBox [12] and VulnHub [13], two leading penetration testing training platforms. Our selection criteria are designed to ensure that our benchmark accurately reflects the challenges encountered in practical penetration testing environments. We meticulously review the latest machines available on both platforms, aiming to identify and select a subset that comprehensively covers all vulnerabilities listed in the OWASP [14] Top 10 Project. Additionally, we choose machines that represent a mix of difficulties, classified according to traditional standards in the penetration testing domain into easy, medium, and hard categories. This process guarantees that our benchmark spans the full spectrum of vulnerabilities and difficulties. Note that our benchmark does not include benign targets for assessing false positives: although benign targets are sometimes explored in penetration testing, our main objective remains identifying true vulnerabilities.

Task Decomposition. We further parse the testing process of each target into a series of sub-tasks, following the standard solution commonly referred to as the "walkthrough" in penetration testing. Each sub-task corresponds to a unique step in the overall process. We decompose sub-tasks following NIST 800-115 [29], the Technical Guide to Security Testing. Each sub-task is either one step declared in the Guide (e.g., network discovery, password cracking) or an operation that exploits a unique vulnerability categorized in the Common Weakness Enumeration (CWE) [15] (e.g., exploiting SQL injection, CWE-89 [30]). In the end, we formulate an exhaustive list of sub-tasks for every benchmark target.

Benchmark Validation. The final stage of our benchmark development involves rigorous validation, which ensures the reproducibility of these benchmark machines. To do this, three certified penetration testers independently attempt the penetration testing targets and write their walkthroughs. We then adjust our task decomposition accordingly, because some targets may have multiple valid solutions.

Ultimately, we have compiled a benchmark that effectively encompasses all types of vulnerabilities listed in the OWASP [14] Top 10 Project. It comprises 13 penetration testing targets at varying levels of difficulty, broken down into 182 sub-tasks across 26 categories covering 18 distinct CWE items. This number of targets is deemed sufficient to represent a broad spectrum of vulnerabilities, difficulty levels, and varieties essential for comprehensive penetration testing training. To foster community development, we have made this benchmark publicly available online [22].
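To make the decomposition concrete, the sketch below shows one plausible way a benchmark entry could be encoded. The class and field names are hypothetical illustrations, not taken from the released benchmark.

from dataclasses import dataclass, field

@dataclass
class SubTask:
    # One decomposed step: either a NIST 800-115 activity or a CWE exploitation.
    name: str          # e.g., "Port Scanning"
    category: str      # e.g., "network discovery" or "CWE-89"
    completed: bool = False

@dataclass
class BenchmarkTarget:
    name: str          # e.g., a HackTheBox or VulnHub machine
    difficulty: str    # "easy", "medium", or "hard"
    sub_tasks: list = field(default_factory=list)

target = BenchmarkTarget(
    name="Example-Machine",
    difficulty="easy",
    sub_tasks=[
        SubTask("Port Scanning", "network discovery"),
        SubTask("Exploit SQL injection", "CWE-89"),
    ],
)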
4 Exploratory Study

We conduct an exploratory study to assess the capabilities of LLMs in penetration testing, with the primary objective of determining how well LLMs can adapt to the real-world complexities and challenges of this task. Specifically, we aim to address the following two research questions:

RQ1 (Capability): To what extent can LLMs perform penetration testing tasks?

RQ2 (Comparative Analysis): How do the problem-solving strategies of human penetration testers and LLMs differ?

We utilize the benchmark described in Section 3 to evaluate the performance of LLMs on penetration testing tasks. In the following, we first delineate our testing strategy for this study. Subsequently, we present the testing results and an analytical discussion to address the above research questions.

4.1 Testing Strategy

LLMs are text-based and cannot independently perform penetration testing operations. To address this, we develop a human-in-the-loop testing strategy, serving as an intermediary method to accurately assess LLMs' capabilities. This strategy features an interactive loop in which a human expert executes the LLM's penetration testing directives. Importantly, the human expert functions purely as an executor, strictly following the LLM's instructions without adding any expert insights or making independent decisions.

Figure 1 depicts the testing strategy with the following steps: ❶ We initiate the looped testing procedure by presenting the target specifics to the LLM, seeking its guidance on potential penetration testing steps. ❷ The human expert strictly follows the LLM's recommendations and conducts the suggested actions in the penetration testing environment. ❸ Outcomes of the testing actions are collected and summarized: direct text outputs such as terminal outputs or source code are documented; non-textual results, such as graphical representations, are translated by the human expert into succinct textual summaries. The data is then fed back to the LLM, setting the stage for its subsequent recommendations. ❹ This iterative process persists until either a conclusive solution is identified or a deadlock is reached. We then compile a record of the testing procedures, encompassing successful sub-tasks, ineffective actions, and any reasons for failure, if applicable. For a more tangible grasp of this strategy, we offer illustrative examples of prompts and corresponding outputs from GPT-4 related to one of our benchmark targets in Appendix Section A.

To ensure the evaluation's fairness and accuracy, we employ several strategies. First, we involve expert-level penetration testers¹ as the human testers. With their deep pentesting knowledge, these testers can precisely comprehend and execute LLM-generated operations, thus accurately assessing the models' true capabilities.

¹ We selected Offensive Security Certified Professionals (OSCP) testers.
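For concreteness, the looped strategy can be sketched as follows, assuming a generic llm.chat(text) interface. The helper names are illustrative only; in the study itself, this loop was driven manually by the human testers.

def evaluate_llm_on_target(llm, human_execute, target, max_rounds=50):
    record = []  # (instruction, outcome) pairs for later sub-task scoring
    instruction = llm.chat(
        f"Target: {target}. Suggest the next penetration testing step."
    )
    for _ in range(max_rounds):
        # Step 2: the human executes the directive verbatim, adding no insight.
        # Step 3: non-textual results are summarized into text by the human.
        outcome = human_execute(instruction)
        record.append((instruction, outcome))
        # Step 4: loop until the target is solved or a deadlock is reached.
        if outcome in {"SOLVED", "DEADLOCK"}:
            break
        instruction = llm.chat(f"Result:\n{outcome}\nWhat is the next step?")
    return record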
Table 2: Top 10 Types of Sub-tasks completed by each tool (WT denotes the number of occurrences in the standard walkthroughs).

Sub-Tasks                   WT   GPT-3.5     GPT-4       Bard
Web Enumeration             18   4 (22.2%)   8 (44.4%)   4 (22.2%)
Code Analysis               18   4 (22.2%)   5 (27.8%)   4 (22.2%)
Port Scanning               12   9 (75.0%)   9 (75.0%)   9 (75.0%)
Shell Construction          11   3 (27.3%)   8 (72.7%)   4 (36.4%)
File Enumeration            11   1 (9.1%)    7 (63.6%)   1 (9.1%)
Configuration Enumeration    8   2 (25.0%)   4 (50.0%)   3 (37.5%)
Cryptanalysis                8   2 (25.0%)   3 (37.5%)   1 (12.5%)
Network Enumeration          7   1 (14.3%)   3 (42.9%)   2 (28.6%)
Command Injection            6   1 (16.7%)   4 (66.7%)   2 (33.3%)
Known Exploits               6   2 (33.3%)   3 (50.0%)   1 (16.7%)

Table 4: Top causes for failed penetration testing trials.

Failure Reasons                        GPT-3.5   GPT-4   Bard   Total
Session context lost                        25      18     31      74
False Command Generation                    23      12     20      55
Deadlock operations                         19      10     16      45
False Scanning Output Interpretation        13       9     18      40
False Source Code Interpretation            16      11     10      37
Cannot craft valid exploit                  11      15      8      34
Table 3: Top Unnecessary Operations Prompted by LLMs on the Benchmark Targets.

Unnecessary Operations                 GPT-3.5   GPT-4   Bard   Total
Brute-Force                                 75      92     68     235
Exploit Known Vulnerabilities (CVEs)        29      24     28      81
SQL Injection                               14      21     16      51
Command Injection                           18       7     12      37

This often culminates in identifying potential vulnerabilities from code snippets and crafting the corresponding exploits. Notably, GPT-4 outperforms the other two models regarding code interpretation and generation, making it the most suitable candidate for penetration testing tasks.

Finding 2: LLMs can efficiently use penetration testing tools, identify common vulnerabilities, and interpret source code to identify vulnerabilities.

We employ the same method to formulate benchmark sub-tasks as Section 3 outlines. By comparing these to a standard walkthrough, we identify the sub-task trials that fall outside the standard walkthrough and are thus irrelevant to the penetration testing process. The results are summarized in Table 3. We find that the most prevalent unnecessary operation prompted by LLMs is brute force: for all services requiring password authentication, LLMs typically advise brute-forcing them, an ineffective strategy in penetration testing. We surmise that many enterprise hacking incidents involve password cracking and brute force; LLMs learn about these attacks from incident reports and consequently treat them as viable solutions. Besides brute force, LLMs suggest that testers engage in CVE studies, SQL injections, and command injections. These recommendations are common, as real-world penetration testers often prioritize these techniques, even though they may not always provide the exact solution.
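A minimal sketch of this comparison, assuming trial sub-tasks and walkthrough steps have already been normalized to a shared category vocabulary:

from collections import Counter

def unnecessary_operations(trials, walkthrough):
    # Count attempted sub-tasks that fall outside the standard walkthrough.
    return Counter(t for t in trials if t not in walkthrough)

walkthrough = {"Port Scanning", "Web Enumeration", "Command Injection"}
trials = ["Port Scanning", "Brute-Force", "Brute-Force", "SQL Injection"]
print(unnecessary_operations(trials, walkthrough))
# Counter({'Brute-Force': 2, 'SQL Injection': 1})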
[Figure: Overview of the PENTESTGPT architecture. The Parsing Module, Reasoning Module, and Generation Module interact with one another and with the testing environment (testing targets, testing tools, and operations), with optional verification by the user.]
The Reasoning Module plays a pivotal role in our system, analogous to a team lead overseeing the penetration testing task from a macro perspective. It obtains testing results or intentions from the user and prepares the testing strategy for the next step. This testing strategy is passed to the Generation Module for further planning.

To effectively supervise the penetration testing process and provide precise guidance, it is crucial to translate the testing procedures and outcomes into a natural language format. Drawing inspiration from the concept of an attack tree [45], which is often used to outline penetration testing procedures, we introduce the notion of a pentesting task tree (PTT).

Definition 2 (Pentesting Task Tree) A PTT T is a pair (N, A), where: (1) N is a set of nodes organized in a tree structure, each node having a unique parent except the root; and (2) A is a function that assigns to each node in N a list of attributes recording its testing status and outcomes.

[Figure 2: (a) PTT representation; (b) PTT representation in natural language:]
Task Tree:
1. Perform port scanning (completed)
   - Port 21, 22 and 80 are open.
   - Services are FTP, SSH, and Web Service.
2. Perform the testing
   2.1 Test FTP Service
      2.1.1 Test Anonymous Login (success)
         2.1.1.1 Test Anonymous Upload (success)
   2.2 Test SSH Service
      2.2.1 Brute-force (failed)
   2.3 Test Web Service (ongoing)
      2.3.1 Directory Enumeration
         2.3.1.1 Find hidden admin (to-do)
      2.3.2 Injection Identification (to-do)

As outlined in Figure 2, the Reasoning Module's operation unfolds over four key steps operating over the PTT.
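The sketch below shows one plausible encoding of a PTT and its serialization into the numbered natural-language form of Figure 2(b). The class and field names are illustrative, not PENTESTGPT's actual implementation.

from dataclasses import dataclass, field

@dataclass
class PTTNode:
    task: str                     # a node in N
    status: str = "to-do"         # an attribute assigned by A
    children: list = field(default_factory=list)

def render(node, prefix=""):
    # Serialize the tree into the numbered form of Figure 2(b).
    lines = []
    for i, child in enumerate(node.children, 1):
        number = f"{prefix}{i}"
        lines.append(f"{number}. {child.task} ({child.status})")
        sub = render(child, f"{number}.")
        if sub:
            lines.append(sub)
    return "\n".join(lines)

root = PTTNode("root", children=[
    PTTNode("Perform port scanning", "completed"),
    PTTNode("Perform the testing", "ongoing", children=[
        PTTNode("Test FTP Service", "ongoing", children=[
            PTTNode("Test Anonymous Login", "success"),
        ]),
    ]),
])
print(render(root))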
[Figure 4: A demonstration of the task-tree update process on the testing target HTB-Carrier.]
promising and chooses to investigate the web service, often seen as more vulnerable. This task is passed to the Generation Module, which turns this general task into a detailed process employing nikto [49], a commonly used web scanning script. The iterative process continues until the tester completes the penetration testing task.

5.5 Parsing Module

The Parsing Module operates as a supportive interface, enabling effective processing of the natural language information exchanged between the user and the other two core modules. Two needs primarily justify the existence of this module. First, security testing tool outputs are typically verbose and laden with extraneous details, making it computationally expensive and unnecessarily redundant to feed these extended outputs directly into the LLMs. Second, users without specialized knowledge in the security domain may struggle to extract key insights from security testing outputs, presenting challenges in summarizing crucial testing information. Consequently, the Parsing Module is essential in streamlining and condensing this information.

In PENTESTGPT, the Parsing Module is devised to handle four distinct types of information: (1) user intentions, which are directives provided by the user to dictate the next course of action; (2) security testing tool outputs, which represent the raw outputs generated by an array of security testing tools; (3) raw HTTP web information, which encompasses all raw information derived from HTTP web interfaces; and (4) source code extracted during the penetration testing process. Users must specify the category of the information they provide, and each category is paired with a set of carefully designed prompts. For source code analysis, we integrate the GPT-4 code interpreter [50] to execute the task.
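A minimal sketch of this category dispatch, with illustrative prompt wording (the actual PENTESTGPT prompts are more elaborate) and a generic summarize LLM call:

PARSING_PROMPTS = {
    "user_intention": "Rephrase the user's directive as a concise testing intention:",
    "tool_output": "Extract only the security-relevant findings from this tool output:",
    "web_http": "Summarize the key attack surface in this raw HTTP data:",
    "source_code": "Identify potentially vulnerable constructs in this source code:",
}

def parse(category, text, summarize, chunk_size=3000):
    # The user declares the category; each category has its own prompt.
    prompt = PARSING_PROMPTS[category]
    # Chunk verbose inputs so each piece fits within the model's context window.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return "\n".join(summarize(f"{prompt}\n{chunk}") for chunk in chunks)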
5.6 Active Feedback

While LLMs can produce insightful outputs, their outcomes sometimes require revisions. To facilitate this, we introduce an interactive handle in PENTESTGPT, known as active feedback, which allows the user to interact directly with the Reasoning Module. A vital feature of this process is that it does not alter the context within the Reasoning Module unless the user explicitly desires to update some information. The reasoning context, including the PTT, is stored as a fixed chunk of tokens. This chunk is provided to a new LLM session during an active feedback interaction, and users can pose questions about it. This ensures that the original session remains unaffected, and users can always query the reasoning context without making unnecessary changes. If the user believes it necessary to update the PTT, they can explicitly instruct the model to update the reasoning context history accordingly. This provides a robust and flexible framework for the user to participate actively in the decision-making process.
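A minimal sketch of the session isolation this implies, assuming a hypothetical new_session factory that creates an independent chat session:

def active_feedback(reasoning_context, question, new_session):
    # The stored PTT context is copied into a throwaway session, so the
    # original Reasoning Module session is never mutated by the question.
    session = new_session()
    return session.chat(f"{reasoning_context}\n\nUser question: {question}")

def apply_user_update(reasoning_module, new_fact):
    # Only an explicit user instruction updates the persistent PTT context.
    reasoning_module.update_context(new_fact)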
5.7 Discussion

We explore various design alternatives for PENTESTGPT to tackle the challenges identified in the Exploratory Study. We have experimented with different designs, and here we discuss some key decisions.

Addressing Context Loss with Token Size. A straightforward solution to alleviate context loss is the employment of LLMs with an extended token size. For instance, GPT-4 provides versions with 8k and 32k token size limits. This approach, however, confronts two substantial challenges. First, even a 32k token size might be inadequate for penetration testing scenarios, as the output of a single testing tool like dirbuster [51] may comprise thousands of tokens.
[Figure 5: The performance of GPT-3.5, GPT-4, PENTESTGPT-GPT-3.5, and PENTESTGPT-GPT-4 on easy, medium, and hard targets: (a) overall completion status; (b) sub-task completion status.]

[Figure 6: Penetration testing strategy comparison between GPT-3.5 and PENTESTGPT on VulnHub-Hackable II. In one strategy, the arbitrary file upload and reverse shell flows (Flow 1 and Flow 2) remain independent; in the other, the two flows are interrelated.]

similarly to human experts and prioritizes effectively. Rather than just addressing the latest identified task, PENTESTGPT identifies key sub-tasks that can result in success.

Figure 6 contrasts the strategies of GPT-4 and PENTESTGPT on the VulnHub machine Hackable II [58]. This machine features two vulnerabilities: an FTP service for file uploads and a web service to view FTP files. A valid exploit requires both services. The figure shows GPT-4 starting with the FTP service and identifying the upload vulnerability (❶-❸). Yet, it does not link this to the web service, resulting in an incomplete exploit. In contrast, PENTESTGPT shifts between the FTP and web services. It first explores both services (❶-❷), then focuses on the FTP service (❸-❹), realizing that the FTP and web files are identical. With this insight, PENTESTGPT instructs the tester to upload a shell (❺), achieving a successful reverse shell (❻). This matches the solution guide and underscores PENTESTGPT's adeptness at integrating various testing aspects.

Our second observation is that although PENTESTGPT behaves more similarly to human experts, it still exhibits some strategies that human experts would not apply. For instance, PENTESTGPT still prioritizes brute-force attacks before vulnerability scanning, as is evident in cases where it always tries to brute-force the SSH service on target machines.

We analyze cases where penetration testing with PENTESTGPT failed, identifying three primary limitations. First, PENTESTGPT struggles with image interpretation: LLMs are unable to process images, which are crucial in certain penetration testing scenarios. Addressing this limitation may require the development of advanced multimodal models that can interpret both text and visual data. Second, PENTESTGPT lacks the ability to employ certain social engineering techniques and to detect subtle cues. For example, while a human tester might generate a brute-force wordlist from information extracted from a target service, PENTESTGPT can retrieve names from a web service but fails to guide the usage of tools needed to create a wordlist from these names. Third, the models struggle with accurate exploitation code construction within a limited number of trials. Despite some proficiency in code comprehension and generation, the LLM falls short in producing detailed exploitation scripts, particularly for low-level bytecode operations. These limitations underline the necessity for improvement in areas where human insight and intricate reasoning remain more proficient than automated solutions.
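As an illustration of the second limitation, the wordlist-generation step that PENTESTGPT fails to guide might look like this rough sketch; the mangling rules below are arbitrary examples of what a human tester would apply:

def wordlist_from_names(names):
    # Derive brute-force candidates from names scraped off the target.
    candidates = set()
    for name in names:
        base = name.lower().replace(" ", "")
        candidates.update({base, base + "123", base + "2024", base.capitalize() + "!"})
    return sorted(candidates)

print(wordlist_from_names(["Alice Smith", "bob"]))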
6.4 Ablation Study (RQ5)

We perform an ablation study on how the three modules (Reasoning Module, Generation Module, and Parsing Module) contribute to the performance of PENTESTGPT. We implement three variants:

1. PENTESTGPT-NO-PARSING: the Parsing Module is deactivated, causing all data to be fed directly into the system.

2. PENTESTGPT-NO-GENERATION: the Generation Module is deactivated, so task generation is completed within the Reasoning Module itself. The prompts for task generation remain consistent.

3. PENTESTGPT-NO-REASONING: the Reasoning Module is disabled. Instead of the PTT, this variant adopts the same methodology utilized with LLMs for penetration testing, as delineated in the Exploratory Study.
Appendix A: Example prompts and corresponding outputs from GPT-4 on a benchmark target (excerpt).

GPT-4: Yes, I understand. Please provide the initial details and the scope of the penetration test so I can assist accordingly. Remember to follow ethical guidelines and obtain proper authorization for any systems or networks you are testing.

Human Tester: nmap -sV -sT 192.168.1.5

GPT-4: This command will provide a list of open ports and the services (with versions) running on those ports. Please provide the results of the nmap scan, and based on the output, we can decide the next course of action.
[Appendix figure: walkthrough task graph for the benchmark target, reconstructed from the original flowchart.]

Gaining web user access:
3. Vulnerable File Enumeration (not vulnerable)
8. Examine uploaded file (file uploaded to web service)
9. Reverse Shell Construction and Upload
10. Trigger Reverse Shell → web user (www-data) access

Privilege escalation to normal user:
11. System Configuration Enumeration
12. Enumerate "shrek" files (no access)
13. Enumerate Apache Service (not vulnerable)
14. cron enumeration (not useful)
15. Local File Enumeration → an interesting "runme.sh"; Vulnerable Service Enumeration → a user named "shrek" is presented; the user controls the Apache service
16. Crack the hash in the file → get password; use "shrek" as username
17. Privilege Escalation to user "shrek" → user "shrek" access obtained

Privilege escalation to root:
18. System Configuration, cron, Local File, and Vulnerable Service Enumeration → "shrek" can run Python with sudo access
19. Privilege Escalation to root