


GOLLUM: Guiding cOnfiguration of firewaLL Through aUgmented
Large Language Models

Roberto Lorusso (a), Antonio Maci (b) and Antonio Coscia (c)
Cybersecurity Laboratory, BV TECH S.p.A., Milan, Italy
{roberto.lorusso, antonio.maci, antonio.coscia}@bvtech.com
(a) https://orcid.org/0009-0007-8640-7109 (b) https://orcid.org/0000-0002-6526-554X (c) https://orcid.org/0000-0002-7263-4999

Keywords: Computer Networks, Conversational Agent, Cybersecurity, Firewall Configuration, Large Language Model,
RAGAS, Retrieval Augmented Generation.

Abstract: Artificial intelligence (AI) tools offer significant potential in network security, particularly for addressing issues like firewall misconfiguration, which can lead to security flaws. Configuration support services can help prevent errors by providing clear general-purpose language instructions, thus minimizing the need for manual references. Large language models (LLMs) are AI-based agents that use deep neural networks to understand and generate human language. However, LLMs are generalists by construction and may lack the knowledge needed in specific fields, thereby requiring links to external sources to perform highly specialized tasks. To meet these needs, this paper proposes GOLLUM, a conversational agent designed to guide firewall configurations using augmented LLMs. GOLLUM integrates the pfSense firewall documentation via a retrieval augmented generation approach, providing an example of actual use. The generative models used in GOLLUM were selected based on their performance on the state-of-the-art NetConfEval and CyberMetric datasets. Additionally, to assess the effectiveness of the proposed application, an automated evaluation pipeline, involving RAGAS as a test dataset generator and a panel of LLMs as judges, was implemented. The experimental results indicate that GOLLUM, powered by Llama3-8B, provides accurate and faithful support in three out of four cases, while achieving > 80% answer correctness on configuration-related queries.

1 INTRODUCTION

Modern technologies that leverage artificial intelligence (AI) can positively contribute to the implementation of tactical cybersecurity services. Accordingly, several AI-based methods are being used to tackle the challenges posed by such a domain, including deep learning (DL) and natural language processing (NLP). Bridging them resulted in breakthrough advances in the performance achieved by computer-based agents in terms of their efficiency in handling challenging tasks related to human language. Currently, a clear example is given by large language models (LLMs), representing the most widely distributed tools capable of understanding and generating general-purpose languages. The underlying design of such models is based on the use of transformers, i.e., artificial neural networks that implement the self-attention mechanism to extract linguistic content and to infer the relationships between the words and sentences contained within. As it learns the patterns by which words are sequenced, the model can make predictions about how sentences should probably be structured (Min et al., 2023). In the cybersecurity domain, LLMs can be mainly adopted for (Sarker, 2024): threat analysis, incident response, training and awareness, phishing detection, penetration testing, and the implementation of conversational agents that provide real-time assistance to users. For the last use case, it has been observed that interaction with conversational agents can provide concrete support to humans in relation to the task at hand (Ross et al., 2023). Mitigating human error through assistance tools is crucial, especially when these errors occur in large and interconnected contexts, such as in the case of misconfigurations generated by network administrators on tactical devices, like firewalls (Alicea and Alsmadi, 2021). This represents a primary concern, as it can result in serious performance and security problems, motivating the need for anomaly resolution and optimization algorithms that act on firewall security policies (Coscia et al., 2023; Coscia et al., 2025).


In any case, it is paramount to avoid in advance the existence of these abnormal setups, as they would hinder the device from being used in safety-critical contexts governed by strict standards (Anwar et al., 2021). In this setting, LLMs can be placed as powerful engines to assist operators with network configurations (Huang et al., 2024). Guiding the setup of specific firewall functionalities can be accomplished by employing the LLM as an agent that replies to specific questions posed by the operator, i.e., performing a question answering (QA) task (Ouyang et al., 2022). As a general rule, LLMs embody the knowledge acquired during training, thus limiting their reliability when employed in unknown contexts and potentially leading to a phenomenon called hallucination (Ji et al., 2023). This problem can be mitigated by expanding the model knowledge through fine-tuning, i.e., updating its weights using task-specific data. However, fine-tuning requires more time and computational resources when scaled up to larger models. A viable alternative is represented by the so-called retrieval augmented generation (RAG) technique (Lewis et al., 2020), which consists of using external knowledge, dynamically retrieved from custom documents, in response to an input query. This approach mitigates hallucinations by providing more accurate models, which are also tolerant of fluctuations in context-specific information, ensuring consistency. In response to the challenge posed, leveraging the capabilities of the increasingly disruptive AI paradigm, this article proposes an application to assist network administrators in configuring firewall functionalities, namely, guiding configuration of firewall using augmented large language models (GOLLUM). To achieve this, the prior knowledge of small-sized LLMs, in their corresponding instruct versions, was evaluated to assess their expertise in suggesting network configurations and answering cybersecurity questions, using the NetConfEval and CyberMetric datasets, respectively (Wang et al., 2024; Tihanyi et al., 2024). Through an ad hoc RAG pipeline, the most accurate and fastest models on the two datasets were equipped with external knowledge provided by the pfSense documentation. In summary, the main contributions of this paper are:

• The proposal of a conversational agent, namely GOLLUM, which can assist network administrators in firewall configurations.

• The implementation of an automated evaluation pipeline for the RAG chain that exploits LLMs in both the test case generation and the judgment phases.

2 LITERATURE REVIEW

Due to the radical diffusion of LLMs, recent studies have evaluated their deployment in emerging telecommunication technologies, such as 6G (Xu et al., 2024). In such a scenario, several collaborative LLMs, each with different scopes, are distributed among the network infrastructure to perform user-agent interactive tasks. In (Wu et al., 2024), a low-rank data-driven network context adaptation method was proposed to significantly minimize the fine-tuning effort of LLMs employed in complex network-related tasks. This framework, called NetLLM, turned out to be useful in three specific networking use cases. Likewise, a study by (Chen et al., 2024) proposes the application of language models to the management of distributed models in cloud edge computing. This proposal, called NetGPT (after the LLM leveraged), has the objective of releasing a collaborative framework for effective and adaptive network resource management and orchestration. The comprehensive investigations conducted by (Huang et al., 2024) discuss how LLMs impact and can enhance networking tasks, such as computer network diagnosis, design and configuration, and security. In (Ferrag et al., 2024), a Bidirectional Encoder Representations from Transformers (BERT)-based architecture is leveraged as a threat detector. It was fine-tuned on Internet-of-Things (IoT) data generated through a combination of novel encoding and tokenization strategies. The agent presented in (Loevenich et al., 2024) is equipped with an AI-based chatbot that leverages an LLM and a knowledge graph-based RAG technique to provide a human-machine interface capable of performing the QA task, according to the findings of the autonomous agent. In (Paduraru et al., 2024), an augmented LLama2 model provides a chatbot to support security analysts with the latest trends in information and network security. This was achieved by combining RAG and safeguarding with the LLM capabilities. According to the review we conducted, there appears to be a strong need to focus research activities on testing LLMs across various networking contexts. In addition, the same tools can greatly stimulate support activities, thereby fostering the achievement of a better cyber posture by users. To the best of our knowledge, no previous study has investigated the use of LLMs for assisting in the configuration of critical network security devices, such as firewalls.


In such a scenario, a user could ask an AI-based agent for support in: (i) configuring network policies; (ii) indicating the steps to follow to configure a specific functionality according to the manual of a specific product (e.g., pfSense); (iii) understanding the attack vectors related to a line of defense (e.g., what SQL injection (SQLi) attacks are and how to prevent them by enabling a web application firewall (WAF) proxy-based plugin like (Coscia et al., 2024)). Depending on the operative context and the proposed purpose, it is essential to develop safety-oriented agents, preferably through local LLMs, since these can be deployed in closed private scenarios.

3 GOLLUM DESIGN

GOLLUM consists of a RAG pipeline equipped with a parent document retriever that augments the context understanding capabilities of a language model. Regarding the language model adopted in GOLLUM, a deeper discussion on the selection criterion is provided in Section 4.1.2.

3.1 Knowledge Base

The knowledge base is the main source of information during retrieval. It complements the implicit knowledge encoded by the model parameters with structured or unstructured information, such as textual data or even images. To provide a real-world use case of GOLLUM, we refer to the pfSense firewall. In particular, we retained the instruct-oriented text of interest based on its clarity, simplicity, and conciseness, deprived of any reference not relevant to pure firewall topics. Moreover, the book content has been pre-processed so that only chapter content is retained, thereby removing prefaces, frontispieces, and any outlines. As a consequence of the overall content analysis and to avoid redundancies, the book (Zientara, 2018) was chosen as the final knowledge base in view of its substantial textual content and the limited presence of figures and tables. A descriptive analysis of the topics covered in this book is presented in Table 1.

Table 1: Descriptive analysis of the knowledge base.

Topic                  No. pages  No. tokens
Captive Portal         35         49536
Configuration          38         50648
DNS                    31         48465
Firewall/NAT           51         72184
Installation           40         62118
Multiple WANs          29         45267
NTP/SNMP               12         16578
Routing and Bridging   43         62523
Traffic Shaping        27         53073
Troubleshooting        74         101840
VPN/IPsec              49         68859

3.2 Document Chunker

Chunking documents is a standard practice that aims to improve the accuracy of information retrieval (IR) systems and ensures that the augmentation process does not saturate the length of the context window of LLMs. However, fine-grained chunks may lose important contextual information; therefore, it is important to balance the trade-off in chunk length. A common approach involves linking smaller fragments to the original full documents or larger chunks, with the goal of maximizing the accuracy of the retrieval process while preserving the broader context to be used in the generation process. According to this procedure, we employ a recursive character splitter with a chunk size of 256 for child nodes and 1700 for parent nodes, with an overlap of 100 characters to preserve continuity between adjacent chunks. The length of the parent nodes was chosen to be close to the mean value of the document lengths. In addition, the chunk length was set so as to take advantage of as much as possible of the context window of the adopted language model, while avoiding saturating it with the number of chunks that can be retrieved.
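As a rough illustration of the parent/child splitting described above, the following minimal sketch uses LangChain's RecursiveCharacterTextSplitter; only the chunk sizes and the overlap come from the paper, while the package, class, and the sample text are assumptions, since the paper does not name its implementation:

```python
# Hypothetical sketch of the parent/child chunking of Section 3.2, assuming
# a LangChain-style recursive character splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder for the pre-processed chapter text of the knowledge base.
book_text = "Firewall rules in pfSense are evaluated on a first-match basis..."

# Parent chunks: 1700 characters, close to the mean document length.
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1700, chunk_overlap=100)

# Child chunks: 256 characters, small enough for precise retrieval.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=100)

parent_chunks = parent_splitter.split_text(book_text)
# Each child is derived from (and stays linked back to) its parent chunk.
child_chunks = [(i, child)
                for i, parent in enumerate(parent_chunks)
                for child in child_splitter.split_text(parent)]
```

Keeping the parent index next to each child is what later allows a match on a small chunk to be expanded back to its 1700-character parent during augmentation.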
3.3 Embedding Model

In a RAG pipeline, the embedding model plays a crucial role, as it transforms text into a searchable format that allows efficient and relevant information retrieval. Such an encoded structure is a vector (a high-dimensional numerical representation) that captures the semantic meaning of a sequence of text (e.g., a query), ensuring that similar concepts are close to each other in the vector space, even if they are phrased differently. The embedding model selected for GOLLUM was the so-called mxbai-embed-large-335M (d = 1024), as it stably appears among the top lightweight performers on the massive text embedding benchmark (MTEB) leaderboard (Muennighoff et al., 2023). In addition, as stated by Mixedbread, such an embedding model is appropriate for RAG-related use cases. As for user queries, the embedding model encodes the knowledge base (such a phase happens offline); thus, all external knowledge documents are transformed into vectors that are stored in a vector database for fast lookup.

3.4 Vector Store

Vector stores represent a fundamental component in RAG applications, as they are needed for storing, indexing, and managing any type of data in the form of high-dimensional, dense numerical vectors, commonly produced by an embedding model.


These vectors convey a semantic representation of the data, allowing for similarity-based retrieval through metrics such as cosine similarity or maximum marginal relevance. Embeddings are stored and retrieved using Chroma, an open-source, lightweight vector database with integrated quantization, compression, and the ability to dynamically adjust the database's size in response to changing needs.
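A minimal sketch of the offline encoding and online lookup described in Sections 3.3 and 3.4 is given below. Serving mxbai-embed-large through Ollama is an assumption (the paper does not state its embedding runtime), as are the collection name and the sample chunks:

```python
# Hypothetical sketch: encode the knowledge base with mxbai-embed-large and
# store the vectors in Chroma for cosine-similarity lookup.
import chromadb
import ollama

def embed(text: str) -> list[float]:
    # Returns the 1024-dimensional vector produced by mxbai-embed-large
    # (assumes a local Ollama server exposing this model).
    return ollama.embeddings(model="mxbai-embed-large", prompt=text)["embedding"]

client = chromadb.PersistentClient(path="./gollum_db")
collection = client.get_or_create_collection(
    name="pfsense_chunks",
    metadata={"hnsw:space": "cosine"},  # similarity metric used at query time
)

# Offline phase: every chunk of external knowledge becomes a stored vector.
chunks = ["Firewall rules are evaluated on a first-match basis...",
          "NAT port forwards map WAN ports to internal hosts..."]
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=[embed(c) for c in chunks],
)

# Online phase: the user query is embedded and matched against the store.
hits = collection.query(query_embeddings=[embed("How do I set up NAT?")], n_results=4)
print(hits["documents"][0])
```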
3.5 Retriever

Vector stores can hold multiple representations of the same documents. We employ a parent-document retrieval strategy that generates and indexes two distinct embeddings: one for the child chunks and another for the parent documents obtained as in Section 3.2. Then, cosine similarity is computed between the embeddings of the input query and the child nodes to retrieve the most relevant matches; these are then used to retrieve the larger information segments from the parent nodes, which are finally used in the augmentation process. The number and size of the retrieved documents must be balanced according to the LLM's context window and the available hardware resources. A larger context window increases the computational load. For our purposes, we retrieve the top four relevant parent chunks, each consisting of 1700 characters, to be fed into a context window of size 8192, i.e., the lower bound of the maximum context window sizes supported by the employed models. Hence, we ensure that the context window is not saturated, thus leaving room for generation. As shown in Figure 1, embeddings of chunks related to the same topic exhibit spatial proximity.

Figure 1: Parent embeddings per topic computed using the model outlined in Section 3.3.
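The strategy above maps directly onto LangChain's ParentDocumentRetriever; the sketch below assumes that stack (the paper confirms the parent-document strategy, the chunk sizes, the embedding model, and the top-4 retrieval, but not these specific classes or model ids):

```python
# Hypothetical sketch of the parent-document retrieval strategy of Section 3.5,
# written against recent LangChain APIs.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

vectorstore = Chroma(
    collection_name="pfsense_children",
    embedding_function=OllamaEmbeddings(model="mxbai-embed-large"),
)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,   # indexes the child-chunk embeddings
    docstore=InMemoryStore(),  # holds the full parent chunks
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=100),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1700, chunk_overlap=100),
    # k applies to the child-vector search; matched children are then
    # mapped back to their parent chunks.
    search_kwargs={"k": 4},
)

retriever.add_documents([Document(page_content="...pfSense book text...")])

# Up to four parent chunks (4 x 1700 characters) fit the 8192 context window.
parents = retriever.invoke("How do I enable the captive portal on an interface?")
```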

4 EXPERIMENTAL EVALUATION

4.1 Materials and Methods

4.1.1 Datasets

To evaluate stand-alone LLM capabilities in both the computer networks and cybersecurity domains, the following two datasets have been used: (i) NetConfEval (Wang et al., 2024), available at https://github.com/NetConfEval/NetConfEval; (ii) CyberMetric (Tihanyi et al., 2024), available at https://github.com/cybermetric/CyberMetric. Then, to assess the responses of the entire RAG pipeline against the golden answers, a custom test set comprising 330 question-golden answer pairs, 30 for each topic in Table 1, was constructed using the appropriate RAGAS (Es et al., 2024) utility. Due to the uneven distribution of content lengths across topics, as depicted in Table 1, it was necessary to develop an appropriately balanced test set. Taking into account possible noise, duplicates, and errors while generating the test set using the RAGAS utility, we started from a target number of 50 question-answer pairs for each topic. Then, we pre-processed the test set by removing duplicate, empty, and invalid golden answers. Finally, each question was manually inspected, reducing the number of QA pairs to 30 samples per topic. This was achieved by selecting the test samples based on the following criteria: (i) topic coherence; (ii) specificity, i.e., we retained the most specific and accurate QA pairs by discarding samples for which the related question does not require external knowledge to be answered. Given the same set of pfSense documentation used by the RAG chain, the LLMs involved in the generation of the synthetic test set were chosen to be different from, but comparable in size to, the LLMs in Section 4.1.2: (i) Gemma-7B was exploited as the generator; (ii) Gemma2-9B was used as a critic model to validate the generation process; (iii) Mxbai-embed-large-335M was used to produce the embeddings, as in Section 3.3. The generator-critic model pair is set so that the former is an older release than the latter, as recommended by the RAGAS utility.
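For concreteness, the generator/critic setup described above looks roughly like the following sketch, written against the ragas 0.1.x API (later ragas releases reworked this interface); the model ids, document contents, and serving via Ollama are all assumptions:

```python
# Hypothetical sketch of the synthetic test-set generation step of
# Section 4.1.1, assuming the ragas 0.1.x TestsetGenerator API.
from langchain_core.documents import Document
from langchain_ollama import ChatOllama, OllamaEmbeddings
from ragas.testset.generator import TestsetGenerator

generator_llm = ChatOllama(model="gemma:7b")   # generator: older release
critic_llm = ChatOllama(model="gemma2:9b")     # critic: newer release
embeddings = OllamaEmbeddings(model="mxbai-embed-large")

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

docs = [Document(page_content="...chapter text for one pfSense topic...")]

# Target 50 QA pairs per topic; duplicates and invalid golden answers are
# filtered afterwards, and manual inspection keeps 30 samples per topic.
testset = generator.generate_with_langchain_docs(docs, test_size=50)
print(testset.to_pandas()[["question", "ground_truth"]].head())
```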
4.1.2 Large Language Models Evaluated

From the models evaluated in (Tihanyi et al., 2024), we selected the latest versions with available instruct models to translate requirements into formal specifications for reachability policies, emulating firewall traffic management. Consequently, the lightweight local LLMs assessed for the text generation component of GOLLUM are: (i) Llama3-8B and Llama3.1-8B; (ii) Mistral-7B and Neural-Chat-7B.


4.1.3 Metrics

First, to evaluate the effectiveness of LLMs in dealing with the translation of network specifications, the accuracy is calculated by dividing the number of correctly translated requirements by the expected ones, given a fixed batch size $b$ (number of specifications). In particular, each test case comprised $m = b_{MAX}/b$ samples, where, in our case, $b_{MAX} = 10^2$, as $b \in [1, 5, 10, 20, 50, 100]$. On the other hand, because CyberMetric consists of a multiple-choice QA test, we measured the average capability of the LLM to select the correct answer among the four different options (accuracy); each experiment was repeated five times to ensure good statistical significance, as done by the authors of the dataset in their experiments. In addition, the mean inference time was recorded for both tasks, because we were interested in models that are accurate and quick to respond in their practical employment. Then, according to (Adlakha et al., 2024), QA instruction-following and context-specialized models should be evaluated mainly along two aspects, which aim to point out the correctness of the provided answer and how the information conveyed by the agent is sourced from the external knowledge provided. The paradigm adopted for such an evaluation is the so-called LLM-as-a-judge (Zheng et al., 2023). However, to ensure a more impartial judgment, we formed a panel of judges consisting of an ensemble of three local medium-sized LLMs not involved in any setup adopted so far, i.e.: Phi3-14B, Vicuna-13B, and Qwen2.5-14B. We chose a trio of models given the nature of the judgment requested from each LLM, that is, binary; an odd-sized panel is necessary to implement a majority voting schema that infers definitive decisions, since at least two models always provide agreeing judgments. To realize the evaluation purposes, the following metrics were considered:

• Answer correctness (AC), that is, an indicator of how well the generated answer compares to the ground truth. First, to compute such a measure, the factual correctness (FC) is derived as the F1 score calculated on the number of: (i) statements shared by the answer and the ground truth (true positives (TPs)); (ii) facts in the generated answer that do not belong to the expected answer (false positives (FPs)); (iii) statements found in the ground truth but not in the generated answer (false negatives (FNs)). These measures are derived according to the decisions inferred by the aforementioned majority voting scheme. It should be noted that TPs and FPs are mutually exclusive, which makes the decision binary. Second, the cosine similarity between the answer ($e_A$) and the ground truth ($e_{GT}$), encoded as embeddings, is calculated and then averaged with the FC:

$$AC = \frac{1}{2}\left(\frac{|TP|}{|TP| + \frac{1}{2}\,(|FP| + |FN|)} + \frac{e_A \cdot e_{GT}}{\lVert e_A \rVert\,\lVert e_{GT} \rVert}\right) \quad (1)$$

• Faithfulness (FF), i.e., how consistent the generated answer is with respect to the given context. To realize this, the generated answer is initially split into $N_A$ single statements. Then, the panel of LLM judges determines how many statements ($n_{FF}$) are effectively grounded in the retrieved context:

$$FF = \frac{n_{FF}}{N_A} \quad (2)$$
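The two metrics reduce to a few lines once the per-statement judge votes are available; the sketch below is a direct transcription of Eqs. (1) and (2), with the inputs (vote lists, statement counts, embeddings) as illustrative placeholders rather than the paper's actual harness:

```python
# Sketch of the evaluation metrics in Eqs. (1) and (2). Judge votes are
# binary; a statement counts only if at least two of the three panel LLMs
# agree (majority voting).
import numpy as np

def majority(votes: list[bool]) -> bool:
    return sum(votes) >= 2  # panel of three judges

def answer_correctness(tp: int, fp: int, fn: int,
                       e_a: np.ndarray, e_gt: np.ndarray) -> float:
    # Factual correctness: F1 over judged statements (left term of Eq. 1).
    fc = tp / (tp + 0.5 * (fp + fn))
    # Semantic similarity: cosine between answer and ground-truth embeddings.
    cos = float(e_a @ e_gt / (np.linalg.norm(e_a) * np.linalg.norm(e_gt)))
    return (fc + cos) / 2

def faithfulness(statement_votes: list[list[bool]]) -> float:
    # Eq. (2): fraction of the N_A answer statements the panel deems
    # grounded in the retrieved context (n_FF / N_A).
    n_ff = sum(majority(v) for v in statement_votes)
    return n_ff / len(statement_votes)

# Toy usage with random embeddings of dimension 1024 (mxbai-embed-large).
rng = np.random.default_rng(0)
e_a, e_gt = rng.normal(size=1024), rng.normal(size=1024)
print(answer_correctness(tp=6, fp=2, fn=1, e_a=e_a, e_gt=e_gt))
print(faithfulness([[True, True, False], [True, False, False]]))
```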
4.2 Results and Discussion

4.2.1 LLMs Knowledge of Computer Networks and Cybersecurity Analysis

Figure 2: LLM accuracy and inference time (in seconds) achieved on NetConfEval per different weight quantization and batch size.

Figure 2 displays the average accuracy and inference time trends achieved by the evaluated LLMs for different $b$ values. For $b = b_{MAX}$, $m = 1$, which justifies the absence of the confidence interval in correspondence of $b = 100$. However, in accordance with the results obtained by the authors of NetConfEval, for $b = 100$ we noticed that the LLM outputs produced are always cut, which leads to performance degradation. Ranging over the performance of each model, the following findings are derived:


• Llama3-8B achieved high accuracy at smaller batch sizes without quantization. As the batch size increases (beyond 20), there is a rapid drop in accuracy. The quantized model also follows a similar pattern but starts slightly lower. The inference time is relatively low at smaller batch sizes and increases gradually as the batch size grows. The fp16 weight precision requires slightly more time compared to int4 quantization, which indicates that the latter offers faster inference while retaining similar accuracy.

• Llama3.1-8B demonstrated relatively high accuracy at smaller batch sizes for both the fp16 and int4 configurations. This improvement is significant compared to Llama3-8B for b = 50. As previously observed, the accuracy decreased as the batch size increased, with a significant drop-off beyond batch size 50. The int4 version demonstrated better accuracy than fp16 (for b ∈ [1, 5, 20]) and produced the highest average accuracy among all compared models (higher than 60%) for b = 50. The inference time remained low and stable for all batch sizes, demonstrating that this language model was more efficient in handling larger batches than the other models. The int4 configuration remained slightly faster than the fp16.

• Except for small batch sizes, Mistral-7B achieved low accuracy scores among all compared models (from b = 10), regardless of whether model quantization is adopted or not. The model fails completely even at b = 20. In addition, it requires a longer inference time than the Llama models for both the fp16 and int4 setups.

• Neural-Chat-7B showed a sharp increase in inference time with increasing batch sizes, peaking at batch size 50 (around one minute) and stabilizing afterward. The int4 configuration demonstrated a faster inference time (except for b ≥ 50) than the fp16 configuration, although the speed gains were accompanied by low, fluctuating accuracy. This measure is the major drawback of this model because it is under 20% for b ≥ 20.

As a general overview, the accuracy and inference time appear to decrease and increase, respectively, with increasing b. Considering the trade-off between accuracy and inference time, the models belonging to the Meta Llama3 family appear to be the best among those evaluated in providing network policy translations. This result is critical because it opens up the scenario of adopting these models for the network policy analysis task, which is primary in firewall applications. Furthermore, adopting quantization does not lead to average performance degradation, while ensuring faster inferences and, inevitably, lower GPU memory consumption.
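The fp16/int4 comparison above corresponds to loading the same instruct checkpoint at two weight precisions. The paper does not specify its serving stack, so the following is only an assumed configuration, sketched with Hugging Face transformers and bitsandbytes:

```python
# Hypothetical sketch of the int4 vs. fp16 setups compared in Section 4.2.1.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # int4 weights
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # drop this argument for the fp16 baseline
    device_map="auto",
)

inputs = tokenizer("Translate this reachability policy ...", return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```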
Figure 3: LLM accuracy and inference time (in seconds) achieved on CyberMetric per different weight quantization.

Figure 3 indicates that the accuracy achieved by Mistral-7B is close to ε = 73.65 (human accuracy) if quantization is enabled, and is the lowest among those achieved by all compared models in the case of the fp16 weight precision. Therefore, this model is the worst performer on the 80 questions of the CyberMetric dataset, also considering the inference time obtained, which is the longest compared to competitors for either quantization setting. The remaining three models outperformed human accuracy, and this result is notable for the quantized Neural-Chat, which achieved the highest average accuracy score (with a negligible standard error) in these experiments. Despite this, its average inference time was longer than that achieved by the Llama models in both quantization settings. Finally, Llama3 and Llama3.1 achieve comparable performance in terms of accuracy and inference time (equal for this metric). Models without quantization require an average inference time of approximately 10 s over the 80-question benchmark, which indicates that they can answer ∼480 questions in just a minute. Moreover, this rate almost doubles when quantization is adopted, while acceptable accuracy is maintained. Intersecting the results in both Figures 2 and 3, the Llama3 models appear to be more suitable for understanding contexts related to computer networks and security, producing accurate results in a very reactive manner.

4.2.2 RAG-Based Analysis

Equipping GOLLUM with Llama3-8B or Llama3.1-8B results in a shift in the AC and FF trends, as shown in Figure 4. To be specific:

• Llama3-8B achieves ∼76% AC with a standard error of approximately ±0.02. With regard to the alignment between the generated content and the retrieved context, this language model results in FF ∼75%.

• Llama3.1-8B yields ∼78% AC, i.e., the upper bound of the aforementioned model's AC confidence interval, again with a standard deviation of ±2%.


In this case, despite the increase in accuracy compared to Llama3-8B, a drop in FF is evident, which appeared to be close to 70%. Similar to Llama3-8B, the standard deviation of FF was wider than that of AC.

Figure 4: RAG metric scores per LLM leveraged by GOLLUM.

According to the two points above, GOLLUM is more correct and faithful if Llama3-8B is used as the language model, i.e., a more balanced AC-FF trade-off is obtained compared to using Llama3.1-8B. Establishing a balance between AC and FF can help ensure that the response not only provides factually correct information but also accurately represents the intended message, reasoning, or data. The achievement of 76% correctness implies that the model is accurate in three out of four answers, which is a promising starting point. This shows that the retrieval pipeline is generally effective; however, it could still be improved to capture more precise and relevant answers. A 75% faithfulness score means that, on average, one out of four responses may include unfaithful or hallucinated content.

Figure 5 depicts the average performance achieved by GOLLUM per topic. Specifically, Llama3.1-8B outperforms Llama3-8B in answer correctness on topics such as Captive Portal, Installation, and Multiple WANs, while the opposite trend is observed for VPN/IPsec. Similarly, the FF achieved by GOLLUM equipped with Llama3.1-8B is better for the Firewall/NAT and DNS topics than the same score obtained by Llama3-8B, which provides more faithful answers on all the remaining topics. A remarkable result concerns the AC achieved on questions related to the Configuration topic, which exceeds 80% in both cases, denoting how GOLLUM can provide accurate guidelines on configuration tasks. Similarly, very promising results are obtained for questions concerning the Routing and Bridging topic, validating the efficiency in understanding network reachability requirements, as previously assessed with the NetConfEval benchmark, in a more specialized context (as the top FF score is exhibited). The consistency in model performance on various topics highlights the potential of GOLLUM to serve as a robust, context-sensitive support tool in network security.

Figure 5: RAG metric scores per LLM achieved by GOLLUM on different topics.

5 CONCLUSION

Configuring firewalls is a challenging and error-prone task, and mistakes can compromise network security. Given the complexity, especially in large-scale and dynamic networks, human operators can benefit from intelligent, context-aware tools for accurate firewall setup guidance. In this paper, we introduced GOLLUM, an AI-powered assistant designed to mitigate configuration challenges using a proper RAG pipeline. The proposed method exploits generative models that incorporate adequate prior knowledge of computer networks and cybersecurity. By combining LLMs with an extensive structured knowledge base, GOLLUM provides reliable and context-specific support to network administrators. Based on the experiments, GOLLUM equipped with Llama3-8B demonstrated more balanced performance, with considerable thoroughness in providing support on the topic of pfSense configurations. Future developments may include expanding the knowledge base with additional network security resources and further optimizing the retrieval mechanism.

ACKNOWLEDGEMENTS


This work was supported in part by the Fondo Europeo di Sviluppo Regionale Puglia Programma Operativo Regionale (POR) Puglia 2014-2020-Axis I-Specific Objective 1a-Action 1.1 (Research and Development)-Project Title: CyberSecurity and Security Operation Center (SOC) Product Suite by BV TECH S.p.A., under Grant CUP/CIG B93G18000040007.

REFERENCES

Adlakha, V., BehnamGhader, P., Lu, X. H., Meade, N., and Reddy, S. (2024). Evaluating correctness and faithfulness of instruction-following models for question answering. Transactions of the Association for Computational Linguistics, 12:681–699.

Alicea, M. and Alsmadi, I. (2021). Misconfiguration in firewalls and network access controls: Literature review. Future Internet, 13(11).

Anwar, R. W., Abdullah, T., and Pastore, F. (2021). Firewall best practices for securing smart healthcare environment: A review. Applied Sciences, 11(19).

Chen, Y., Li, R., Zhao, Z., Peng, C., Wu, J., Hossain, E., and Zhang, H. (2024). NetGPT: An AI-native network architecture for provisioning beyond personalized generative services. IEEE Network.

Coscia, A., Dentamaro, V., Galantucci, S., Maci, A., and Pirlo, G. (2023). An innovative two-stage algorithm to optimize firewall rule ordering. Computers & Security, 134:103423.

Coscia, A., Dentamaro, V., Galantucci, S., Maci, A., and Pirlo, G. (2024). Progesi: A proxy grammar to enhance web application firewall for SQL injection prevention. IEEE Access, 12:107689–107703.

Coscia, A., Maci, A., and Tamma, N. (2025). FROG: A firewall rule order generator for faster packet filtering. Computer Networks, 257:110962.

Es, S., James, J., Espinosa Anke, L., and Schockaert, S. (2024). RAGAs: Automated evaluation of retrieval augmented generation. In 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158.

Ferrag, M. A., Ndhlovu, M., Tihanyi, N., Cordeiro, L. C., et al. (2024). Revolutionizing cyber threat detection with large language models: A privacy-preserving BERT-based lightweight model for IoT/IIoT devices. IEEE Access, 12:23733–23750.

Huang, Y., Du, H., Zhang, X., Niyato, D., Kang, J., Xiong, Z., Wang, S., and Huang, T. (2024). Large language models for networking: Applications, enabling techniques, and challenges. IEEE Network.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474.

Loevenich, J. F., Adler, E., Mercier, R., Velazquez, A., and Lopes, R. R. F. (2024). Design of an autonomous cyber defence agent using hybrid AI models. In 2024 International Conference on Military Communication and Information Systems (ICMCIS), pages 1–10.

Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen, T. H., Sainz, O., Agirre, E., Heintz, I., and Roth, D. (2023). Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2).

Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2023). MTEB: Massive text embedding benchmark. In 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744.

Paduraru, C., Patilea, C., and Stefanescu, A. (2024). CyberGuardian: An interactive assistant for cybersecurity specialists using large language models. In 19th International Conference on Software Technologies - Volume 1: ICSOFT, pages 442–449.

Ross, S. I., Martinez, F., Houde, S., Muller, M., and Weisz, J. D. (2023). The programmer's assistant: Conversational interaction with a large language model for software development. In 28th International Conference on Intelligent User Interfaces, pages 491–514.

Sarker, I. H. (2024). Generative AI and Large Language Modeling in Cybersecurity, pages 79–99. Springer Nature Switzerland.

Tihanyi, N., Ferrag, M. A., Jain, R., Bisztray, T., and Debbah, M. (2024). CyberMetric: A benchmark dataset based on retrieval-augmented generation for evaluating LLMs in cybersecurity knowledge. In 2024 IEEE International Conference on Cyber Security and Resilience (CSR), pages 296–302.

Wang, C., Scazzariello, M., Farshin, A., Ferlin, S., Kostić, D., and Chiesa, M. (2024). NetConfEval: Can LLMs facilitate network configuration? Proceedings of the ACM on Networking, 2(CoNEXT2).

Wu, D., Wang, X., Qiao, Y., Wang, Z., Jiang, J., Cui, S., and Wang, F. (2024). NetLLM: Adapting large language models for networking. In ACM SIGCOMM 2024 Conference, pages 661–678.

Xu, M., Niyato, D., Kang, J., Xiong, Z., Mao, S., Han, Z., Kim, D. I., and Letaief, K. B. (2024). When large language model agents meet 6G networks: Perception, grounding, and alignment. IEEE Wireless Communications, pages 1–9.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623.

Zientara, D. (2018). Learn pfSense 2.4: Get up and running with pfSense and all the core concepts to build firewall and routing solutions. Packt Publishing.
