GOLLUM: Guiding cOnfiguration of firewaLL Through aUgmented Large Language Models
Roberto Lorusso, Antonio Maci and Antonio Coscia
Cybersecurity Laboratory, BV TECH S.p.A., Milan, Italy
{roberto.lorusso, antonio.maci, antonio.coscia}@bvtech.com
Keywords: Computer Networks, Conversational Agent, Cybersecurity, Firewall Configuration, Large Language Model,
RAGAS, Retrieval Augmented Generation.
Abstract: Artificial intelligence (AI) tools offer significant potential in network security, particularly for addressing is-
sues like firewall misconfiguration, which can lead to security flaws. Configuration support services can help
prevent errors by providing clear general-purpose language instructions, thus minimizing the need for manual
references. Large language models (LLMs) are AI-based agents that use deep neural networks to understand
and generate human language. However, LLMs are generalists by construction and may lack the knowledge
needed in specific fields, thereby requiring links to external sources to perform highly specialized tasks. To
meet these needs, this paper proposes GOLLUM, a conversational agent designed to guide firewall config-
urations using augmented LLMs. GOLLUM integrates the pfSense firewall documentation via a retrieval
augmented generation approach, providing an example of actual use. The generative models used in GOL-
LUM were selected based on their performance on the state-of-the-art NetConfEval and CyberMetric datasets.
Additionally, to assess the effectiveness of the proposed application, an automated evaluation pipeline, involv-
ing RAGAS as test dataset generator and a panel of LLMs for judgment, was implemented. The experimental
results indicate that GOLLUM, powered by Llama3-8B, provides accurate and faithful support in three out
of four cases, while achieving > 80% answer correctness on configuration-related queries.
Lorusso, R., Maci, A. and Coscia, A.
GOLLUM: Guiding cOnfiguration of firewaLL Through aUgmented Large Language Models.
DOI: 10.5220/0013221900003890
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025) - Volume 1, pages 489-496
ISBN: 978-989-758-737-5; ISSN: 2184-433X
Proceedings Copyright © 2025 by SCITEPRESS – Science and Technology Publications, Lda.
these abnormal setups, as they would hinder the device from being used in safety-critical contexts governed by strict standards (Anwar et al., 2021). In this setting, LLMs can serve as powerful engines to assist operators with network configurations (Huang et al., 2024). Guiding the setup of specific firewall functionalities can be accomplished by employing the LLM as an agent that replies to specific questions posed by the operator, i.e., performing a question answering (QA) task (Ouyang et al., 2022). As a general rule, LLMs embody the knowledge acquired during training, thus limiting their reliability when employed in unknown contexts and potentially leading to a phenomenon called hallucination (Ji et al., 2023). This problem can be mitigated by expanding the model knowledge through fine-tuning, i.e., updating its weights using task-specific data. However, fine-tuning requires more time and computational resources when scaled up to larger models. A viable alternative is the so-called retrieval augmented generation (RAG) technique (Lewis et al., 2020), which consists of dynamically retrieving external knowledge from custom documents in response to an input query. This approach mitigates hallucinations by producing more accurate models that are also tolerant of fluctuations in context-specific information, ensuring consistency. In response to this challenge, and leveraging the capabilities of the increasingly disruptive AI paradigm, this article proposes an application to assist network administrators in configuring firewall functionalities, namely, guiding configuration of firewall using augmented large language models (GOLLUM). To achieve this, the prior knowledge of small-sized LLMs, in their corresponding instruct versions, was evaluated to assess their expertise in suggesting network configurations and answering cybersecurity questions, using the NetConfEval and CyberMetric datasets, respectively (Wang et al., 2024; Tihanyi et al., 2024). Through an ad hoc RAG pipeline, the most accurate and fastest models on the two datasets were equipped with external knowledge provided by the pfSense documentation. In summary, the main contributions of this paper are:

• The proposal of a conversational agent, namely GOLLUM, which can assist network administrators in firewall configurations.

• The implementation of an automated evaluation pipeline for the RAG chain that exploits LLMs in both the test case generation and the judgment phases.

2 LITERATURE REVIEW

Due to the rapid diffusion of LLMs, recent studies have evaluated their deployment in emerging telecommunication technologies, such as 6G (Xu et al., 2024). In such a scenario, several collaborative LLMs, each with a different scope, are distributed among the network infrastructure to perform user-agent interactive tasks. In (Wu et al., 2024), a low-rank data-driven network context adaptation method was proposed to significantly minimize the fine-tuning effort of LLMs employed in complex network-related tasks. This framework, called NetLLM, turned out to be useful in three specific networking use cases. Likewise, a study by (Chen et al., 2024) proposes the application of language models to the management of distributed models in cloud edge computing. This proposal, called NetGPT (after the LLM leveraged), has the objective of releasing a collaborative framework for effective and adaptive network resource management and orchestration. The comprehensive investigations conducted by (Huang et al., 2024) discuss how LLMs impact and can enhance networking tasks, such as computer network diagnosis, design and configuration, and security. In (Ferrag et al., 2024), a Bidirectional Encoder Representations from Transformers (BERT)-based architecture is leveraged as a threat detector. It was fine-tuned on Internet-of-Things (IoT) data generated through a combination of novel encoding and tokenization strategies. The agent presented in (Loevenich et al., 2024) is equipped with an AI-based chatbot that leverages an LLM and a knowledge graph-based RAG technique to provide a human-machine interface capable of performing the QA task according to the findings of the autonomous agent. In (Paduraru et al., 2024), an augmented LLama2 model provides a chatbot to support security analysts with the latest trends in information and network security. This was achieved by combining RAG and safeguarding with the LLM capabilities. According to the review we conducted, there appears to be a strong need to focus research activities on testing LLMs across various networking contexts. In addition, the same tools can greatly stimulate support activities, thereby fostering the achievement of a better cyber posture by users. To the best of our knowledge, no previous study has investigated the use of LLMs for assisting in the configuration of critical network security devices, such as firewalls. In such a scenario, a user could ask an AI-based agent for support in: (i) configuring network policies; (ii) indicating the steps to follow to configure a specific functionality according to the manual of a specific product (e.g., pfSense); (iii) understanding the attack vectors related to a line of defense (e.g., what SQL injection (SQLi) attacks are and how to prevent them by enabling a web application firewall (WAF) proxy-based plugin like (Coscia et al., 2024)). Depending on the operational context and the intended purpose, it is essential to develop safety-oriented agents, preferably based on local LLMs, since these can be deployed in closed private scenarios.

3 GOLLUM DESIGN

GOLLUM consists of a RAG pipeline equipped with a parent document retriever that augments the context understanding capabilities of a language model. Regarding the language model adopted in GOLLUM, a deeper discussion on the selection criterion is provided in Section 4.1.2.

3.1 Knowledge Base

The knowledge base is the main source of information during retrieval. It complements the implicit knowledge encoded by the model parameters with structured or unstructured information, such as textual data or even images. To provide a real-world use case of GOLLUM, we refer to the pfSense firewall. In particular, we retained the instruction-oriented text of interest based on its clarity, simplicity, and conciseness, stripped of any reference not relevant to pure firewall topics. Moreover, each book's content was pre-processed so that only chapter content is retained, thereby removing prefaces, frontispieces, and any outlines. As a consequence of the overall content analysis, and to avoid redundancies, the book (Zientara, 2018) was chosen as the final knowledge base in view of its substantial textual content and the limited presence of figures and tables. A descriptive analysis of the topics covered in this book is presented in Table 1.

Table 1: Descriptive analysis of knowledge base.

Topic                  No. pages   No. tokens
Captive Portal         35          49536
Configuration          38          50648
DNS                    31          48465
Firewall/NAT           51          72184
Installation           40          62118
Multiple WANs          29          45267
NTP/SNMP               12          16578
Routing and Bridging   43          62523
Traffic Shaping        27          53073
Troubleshooting        74          101840
VPN/IPsec              49          68859

3.2 Document Chunker

Chunking documents is a standard practice that aims to improve the accuracy of information retrieval (IR) systems and ensures that the augmentation process does not saturate the context window of LLMs. However, fine-grained chunks may lose important contextual information; therefore, it is important to balance the trade-off in chunk length. A common approach involves linking smaller fragments to the original full documents or to larger chunks, with the goal of maximizing the accuracy of the retrieval process while preserving the broader context to be used in the generation process. Following this procedure, we employ a recursive character splitter with a chunk size of 256 for child nodes and 1700 for parent nodes, with an overlap of 100 characters to preserve continuity between adjacent chunks. The length of the parent nodes was chosen to be close to the mean length of the documents. In addition, the chunk length was set so as to exploit as much of the context window of the adopted language model as possible, while avoiding saturating it with the number of chunks that can be retrieved.

3.3 Embedding Model

In a RAG pipeline, the embedding model plays a crucial role, as it transforms text into a searchable format that allows efficient and relevant information retrieval. Such an encoded structure is a vector (a high-dimensional numerical representation) that captures the semantic meaning of a sequence of text (e.g., a query), ensuring that similar concepts are close to each other in the vector space, even if they are phrased differently. The embedding model selected for GOLLUM was mxbai-embed-large-335M (d = 1024), as it stably appears among the top lightweight performers on the massive text embedding benchmark (MTEB) leaderboard (Muennighoff et al., 2023). In addition, as stated by Mixedbread, such an embedding model is appropriate for RAG-related use cases. As with user queries, the embedding model also encodes the knowledge base (this phase happens offline); thus, all external knowledge documents are transformed into vectors that are stored in a vector database for fast lookup.

3.4 Vector Store

Vector stores represent a fundamental component in RAG applications, as they are needed for storing, indexing, and managing any type of data in the form of high-dimensional, dense numerical vectors, com-
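The parent/child chunking strategy described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the splitter below is purely character-based (the separator hierarchy of a real recursive character splitter is omitted), the substring scoring merely stands in for embedding similarity, and all function names are our own.

```python
# Sketch of parent/child chunking: parent chunks (~1700 characters)
# preserve context for generation, while child chunks (256 characters,
# 100-character overlap) are what retrieval actually matches against.

def split_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks with optional overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def build_index(document: str):
    """Split into parents, then map each child chunk to its parent id."""
    parents = split_text(document, chunk_size=1700)
    index = []  # list of (child_chunk, parent_id)
    for pid, parent in enumerate(parents):
        for child in split_text(parent, chunk_size=256, overlap=100):
            index.append((child, pid))
    return parents, index

def retrieve_parent(query: str, parents, index) -> str:
    """Toy retrieval: score the query against child chunks (here by
    naive word overlap instead of embeddings), but return the *parent*
    chunk so the LLM receives the broader context."""
    def score(child: str) -> int:
        return sum(1 for w in query.lower().split() if w in child.lower())
    _, pid = max(index, key=lambda entry: score(entry[0]))
    return parents[pid]
```

In GOLLUM the child-level matching is performed with embedding similarity in a vector store rather than this toy word-overlap score; the point of the sketch is only the fine-match / coarse-return asymmetry of a parent document retriever.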
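The embedding-model and vector-store interplay described above can likewise be sketched in a few lines: documents are encoded offline into normalized dense vectors, a query is encoded the same way at lookup time, and the store ranks documents by cosine similarity. The hashing "embedder" below is a stand-in for a real model such as mxbai-embed-large (d = 1024) and is purely illustrative, as are all names.

```python
import math
import zlib

def embed(text: str, dim: int = 256) -> list[float]:
    """Map text to an L2-normalized vector via token hashing
    (illustrative substitute for a real embedding model)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

class VectorStore:
    """Minimal in-memory vector store with top-k cosine lookup."""

    def __init__(self) -> None:
        self._entries: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        # Indexing happens offline, once per knowledge-base chunk.
        self._entries.append((embed(text), text))

    def top_k(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._entries,
                        key=lambda entry: cosine(q, entry[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

In GOLLUM this role is filled by mxbai-embed-large-335M and a production vector database; the sketch only illustrates the encode-offline / query-online flow.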
racy compared to Llama3-8B, a drop in FF is evident, which appeared to be close to 70%. Similar to Llama3-8B, the standard deviation was wider than the AC.

According to the two above points, GOLLUM is more correct and faithful if Llama3-8B is used as the language model, i.e., a more balanced AC-FF trade-off is obtained compared to Llama3.1-8B usage. Establishing a balance between AC and FF can help ensure that a response not only provides factually correct information but also accurately represents the intended message, reasoning, or data. The achievement of 76% correctness implies that the model is accurate in three out of four answers, which is a promising starting point. This shows that the retrieval pipeline is generally effective; however, it could still be improved to capture more precise and relevant answers. A 75% faithfulness score means that, on average, one in four responses may include unfaithful or hallucinated content.

Figure 4: RAG metric scores per LLM leveraged by GOLLUM.

Figure 5 depicts the average performance achieved by GOLLUM per topic. Specifically, Llama3.1-8B outperforms Llama3-8B in answer correctness on topics such as Captive Portal, Installation, and Multiple WANs, while the opposite trend is observed for VPN/IPsec. Similarly, the FF achieved by GOLLUM equipped with Llama3.1-8B is better for the Firewall/NAT and DNS topics than the same score obtained by Llama3-8B, which provides more faithful answers on all the remaining topics. A remarkable result concerns the AC achieved on questions related to the Configuration topic, which exceeds 80% in both cases, denoting how GOLLUM can provide accurate guidelines on configuration tasks. Similarly, very promising results are obtained for questions concerning the Routing and Bridging topic, validating the efficiency in understanding network reachability requirements, as previously assessed with the NetConfEval benchmark, in a more specialized context (as the top FF score is exhibited). The consistency in model performance across various topics highlights the potential of GOLLUM to serve as a robust, context-sensitive support tool in network security.

5 CONCLUSION

Configuring firewalls is a challenging and error-prone task in which mistakes can compromise network security. Given the complexity, especially in large-scale and dynamic networks, human operators can benefit from intelligent, context-aware tools for accurate firewall setup guidance. In this paper, we introduced GOLLUM, an AI-powered assistant designed to mitigate configuration challenges using a dedicated RAG pipeline. The proposed method exploits generative models that incorporate adequate prior knowledge of computer networks and cybersecurity. By combining LLMs with an extensive structured knowledge base, GOLLUM provides reliable and context-specific support to network administrators. Based on the experiments, GOLLUM equipped with Llama3-8B demonstrated more balanced performance, with considerable thoroughness in providing support on the topic of pfSense configurations. Future developments may include expanding the knowledge base with additional network security resources and further optimizing the retrieval mechanism.
Operativo Regionale (POR) Puglia 2014-2020-Axis I-Specific Objective 1a-Action 1.1 (Research and Development)-Project Title: CyberSecurity and Security Operation Center (SOC) Product Suite by BV TECH S.p.A., under Grant CUP/CIG B93G18000040007.

REFERENCES

Adlakha, V., BehnamGhader, P., Lu, X. H., Meade, N., and Reddy, S. (2024). Evaluating correctness and faithfulness of instruction-following models for question answering. Transactions of the Association for Computational Linguistics, 12:681–699.

Alicea, M. and Alsmadi, I. (2021). Misconfiguration in firewalls and network access controls: Literature review. Future Internet, 13(11).

Anwar, R. W., Abdullah, T., and Pastore, F. (2021). Firewall best practices for securing smart healthcare environment: A review. Applied Sciences, 11(19).

Chen, Y., Li, R., Zhao, Z., Peng, C., Wu, J., Hossain, E., and Zhang, H. (2024). NetGPT: An AI-native network architecture for provisioning beyond personalized generative services. IEEE Network.

Coscia, A., Dentamaro, V., Galantucci, S., Maci, A., and Pirlo, G. (2023). An innovative two-stage algorithm to optimize firewall rule ordering. Computers & Security, 134:103423.

Coscia, A., Dentamaro, V., Galantucci, S., Maci, A., and Pirlo, G. (2024). ProGesi: A proxy grammar to enhance web application firewall for SQL injection prevention. IEEE Access, 12:107689–107703.

Coscia, A., Maci, A., and Tamma, N. (2025). FROG: A firewall rule order generator for faster packet filtering. Computer Networks, 257:110962.

Es, S., James, J., Espinosa Anke, L., and Schockaert, S. (2024). RAGAs: Automated evaluation of retrieval augmented generation. In 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158.

Ferrag, M. A., Ndhlovu, M., Tihanyi, N., Cordeiro, L. C., et al. (2024). Revolutionizing cyber threat detection with large language models: A privacy-preserving BERT-based lightweight model for IoT/IIoT devices. IEEE Access, 12:23733–23750.

Huang, Y., Du, H., Zhang, X., Niyato, D., Kang, J., Xiong, Z., Wang, S., and Huang, T. (2024). Large language models for networking: Applications, enabling techniques, and challenges. IEEE Network.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474.

Loevenich, J. F., Adler, E., Mercier, R., Velazquez, A., and Lopes, R. R. F. (2024). Design of an autonomous cyber defence agent using hybrid AI models. In 2024 International Conference on Military Communication and Information Systems (ICMCIS), pages 1–10.

Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen, T. H., Sainz, O., Agirre, E., Heintz, I., and Roth, D. (2023). Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2).

Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2023). MTEB: Massive text embedding benchmark. In 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744.

Paduraru, C., Patilea, C., and Stefanescu, A. (2024). CyberGuardian: An interactive assistant for cybersecurity specialists using large language models. In 19th International Conference on Software Technologies - Volume 1: ICSOFT, pages 442–449.

Ross, S. I., Martinez, F., Houde, S., Muller, M., and Weisz, J. D. (2023). The programmer's assistant: Conversational interaction with a large language model for software development. In 28th International Conference on Intelligent User Interfaces, pages 491–514.

Sarker, I. H. (2024). Generative AI and Large Language Modeling in Cybersecurity, pages 79–99. Springer Nature Switzerland.

Tihanyi, N., Ferrag, M. A., Jain, R., Bisztray, T., and Debbah, M. (2024). CyberMetric: A benchmark dataset based on retrieval-augmented generation for evaluating LLMs in cybersecurity knowledge. In 2024 IEEE International Conference on Cyber Security and Resilience (CSR), pages 296–302.

Wang, C., Scazzariello, M., Farshin, A., Ferlin, S., Kostić, D., and Chiesa, M. (2024). NetConfEval: Can LLMs facilitate network configuration? Proceedings of the ACM on Networking, 2(CoNEXT2).

Wu, D., Wang, X., Qiao, Y., Wang, Z., Jiang, J., Cui, S., and Wang, F. (2024). NetLLM: Adapting large language models for networking. In ACM SIGCOMM 2024 Conference, pages 661–678.

Xu, M., Niyato, D., Kang, J., Xiong, Z., Mao, S., Han, Z., Kim, D. I., and Letaief, K. B. (2024). When large language model agents meet 6G networks: Perception, grounding, and alignment. IEEE Wireless Communications, pages 1–9.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623.

Zientara, D. (2018). Learn pfSense 2.4: Get up and running with pfSense and all the core concepts to build firewall and routing solutions. Packt Publishing.