
Rule-ATT&CK Mapper (RAM): Mapping SIEM Rules to TTPs Using LLMs

Prasanna N. Wudali, Ehud Malul, Parth A. Gandhi, Yuval Elovici, Asaf Shabtai (Ben-Gurion University of the Negev); Moshe Kravchik (Rafael Advanced Defense Systems)

arXiv:2502.02337v1 [cs.CR] 4 Feb 2025

Abstract

The growing frequency of cyberattacks has heightened the demand for accurate and efficient threat detection systems. Security information and event management (SIEM) platforms are important for analyzing log data and detecting adversarial activities through rule-based queries, also known as SIEM rules. The efficiency of the threat analysis process relies heavily on mapping these SIEM rules to the relevant attack techniques in the MITRE ATT&CK framework. Inaccurate annotation of SIEM rules can result in the misinterpretation of attacks, increasing the likelihood that threats will be overlooked. Such misinterpretation can expose an organization's systems and networks to potential damage and security breaches. Existing solutions for annotating SIEM rules with MITRE ATT&CK technique and sub-technique labels have notable limitations: manual annotation of SIEM rules is both time-consuming and prone to errors, and machine learning-based approaches mainly focus on annotating unstructured free text sources (e.g., threat intelligence reports) rather than structured data like SIEM rules. Structured data often contains limited information, further complicating the annotation process and making it a challenging task. To address these challenges, we propose Rule-ATT&CK Mapper (RAM), a novel framework that leverages large language models (LLMs) to automate the mapping of structured SIEM rules to MITRE ATT&CK techniques. RAM's multi-stage pipeline, which was inspired by the prompt chaining technique, enhances mapping accuracy without requiring LLM pretraining or fine-tuning. Using the Splunk Security Content dataset, we evaluate RAM's performance using several LLMs, including GPT-4-Turbo, Qwen, IBM Granite, and Mistral. Our evaluation highlights GPT-4-Turbo's superior performance, which derives from its enriched knowledge base, and an ablation study emphasizes the importance of external contextual knowledge in overcoming the limitations of LLMs' implicit knowledge for domain-specific tasks. These findings demonstrate RAM's potential in automating cybersecurity workflows and provide valuable insights for future advancements in this field.

Keywords

SIEM rules, LLMs, MITRE ATT&CK

1 Introduction

The rapid advancement of technology and widespread adoption of digital applications have resulted in a significant increase in cyberattacks [5]. To gain visibility into their digital ecosystems, organizations deploy security information and event management (SIEM) systems in their networks. These systems store and analyze log data generated by various digital entities in the network [11]. SIEM systems enable threat detection by allowing users to execute search queries, referred to as rules, on the ingested log data.

Each SIEM platform employs its own rule definition language (RDL), a schema-based structure for defining these rules that standardizes the creation and execution of SIEM rules, making them inherently structured data and a foundational component of modern cybersecurity operations. Examples of such schemas include the search processing language (SPL) from Splunk, the Lucene query language by Elasticsearch, and the Kusto query language (KQL) by Microsoft.

Security alerts are triggered when the execution of SIEM rules yields search results. When such alerts are generated, security analysts must examine each alert individually, performing tasks such as triage, analysis, and interpretation, and determine whether the alert corresponds to an actual attack. A critical aspect of effective threat detection and hunting is the precise mapping and understanding of the tactics, techniques, and procedures (TTPs) employed by adversaries, as defined in the MITRE ATT&CK framework.¹ Incorporating MITRE ATT&CK techniques in the analysis provides valuable insights, enabling analysts to discern potential attack flows. Such mapping enhances security professionals' ability to anticipate and mitigate the strategies employed by cyber adversaries.

Mapping SIEM rules to specific MITRE ATT&CK techniques is a complex manual process that is prone to errors and can be time-consuming. Cybero, a leading cybersecurity company, reported [8] that "organizations collect sufficient log data to potentially detect 94% of techniques outlined in the MITRE ATT&CK framework; however, only 24% of these techniques are effectively covered due to gaps in detection rules, with an additional 12% of SIEM rules rendered non-functional or misconfigured." In its best practices guide [7] to MITRE ATT&CK mapping, CISA, an American cyber defense agency, listed (i) leaping to conclusions (i.e., prematurely deciding on a mapping based on insufficient evidence or examination of the facts), (ii) missing opportunities (i.e., not considering, being unaware of, or overlooking other potential technique mappings based on implied or unclear information), and (iii) miscategorization (i.e., the selection of incorrect techniques due to misinterpreting, misreading, or inadequately understanding the techniques, specifically the difference between two techniques) as common mistakes committed by security analysts when manually performing the mapping task. Given the above, there is a need to automate the mapping process and thereby reduce the workload on security analysts and increase the speed and accuracy of threat detection.

Recent cybersecurity research has explored various techniques for mapping unstructured data from cyber threat intelligence (CTI) reports to the MITRE ATT&CK framework [3, 4, 19, 22, 30]. While these methods have demonstrated effectiveness in handling unstructured data, they have a limited ability to adapt to structured data use cases, such as intrusion detection system and SIEM rules. Also, these methods use supervised learning-based approaches to

¹ https://attack.mitre.org/
classify structured data (i.e., intrusion detection system and SIEM rules) to MITRE ATT&CK technique classes, which requires retraining when new threats emerge. Their reliance on retraining limits their scalability and efficiency in dynamic threat landscapes. Mărmureanu et al. [20] proposed a method to map structured data, specifically Splunk rules, to the MITRE ATT&CK framework. This approach utilizes a BERT model trained as a classifier to categorize Splunk rules into 14 high-level MITRE ATT&CK tactic classes. However, this method shares the same limitations as other supervised learning approaches discussed earlier, particularly the need for retraining with updated data to address new threats. Furthermore, the task of mapping rules to high-level tactics is comparatively easier than mapping them to MITRE ATT&CK techniques and sub-techniques, which involve around 670 distinct classes and present a much greater challenge. Despite focusing on this simplified task, the method failed to achieve high performance in their evaluation, due to its inherent limitations. In a recent study, Fayyazi et al. [12] employed large language models (LLMs) to map CTIs in the form of unstructured text to MITRE ATT&CK techniques, while Nir et al. [9] employed them to map Snort intrusion detection rules to MITRE ATT&CK techniques.

These investigations highlight the potential of LLMs in cybersecurity tasks but also underscore their limitations. Solely relying on the implicit knowledge of LLMs has proven insufficient for addressing the domain-specific requirements of cybersecurity. This gap highlights the need for more adaptable and scalable methodologies tailored to the dynamic nature of cyber threats. To produce accurate and reliable predictions, LLMs require additional contextual information that is not inherently available to them.

To address these shortcomings, we propose RAM, a novel LLM-based framework for analyzing SIEM rules and recommending relevant MITRE ATT&CK techniques. RAM eliminates dependence on training data, utilizes LLM agents to retrieve supplementary contextual information, and transforms structured rules into unstructured natural language to preserve the syntactic and semantic meaning of the rule. This innovative approach ensures reliable and accurate predictions while overcoming the limitations of existing methods.

LLMs, with their advanced natural language processing (NLP) capabilities, can process and analyze structured data, automatically identify patterns, and understand the syntactic meaning of the data, but they often fall short in understanding the semantic meaning of the data. This study leverages LLMs to autonomously map structured data in the form of SIEM rules to MITRE ATT&CK techniques, enabling the automation of cybersecurity threat detection and investigation.

RAM is a multi-stage AI agent pipeline (see Figure 1) inspired by the prompt chaining technique [27] and designed to enhance the understanding and application of SIEM rules. The pipeline begins with the extraction of indicators of compromise (IoCs) from the rule (e.g., process names, file names, registry keys and values, IP addresses, network ports). Then, a web search LLM agent retrieves additional contextual information related to the IoCs identified in the rule. Leveraging the information gathered in the preceding stages, the next AI agent translates the rule into natural language text, providing a comprehensive description. This textual description is then used by an LLM to identify the data source [15] of the logs or the mitigation strategy being applied upon which the rule operates. This natural language representation, along with the data source or mitigation-related information, serves as input to another LLM that maps the rule in question to probable MITRE ATT&CK techniques. In the final stage, the pipeline refines the mapping and provides reasoning by extracting the most relevant techniques from the list of potential matches, facilitating precise alignment of the rule with the MITRE ATT&CK framework.

We conducted a comprehensive series of experiments to evaluate RAM's ability to map SIEM rules to the MITRE ATT&CK framework. The evaluation focused on common metrics such as precision and recall, which are indicators of the method's accuracy and completeness in correctly classifying the SIEM rules to relevant techniques within the framework. Various LLMs were examined, including Qwen, IBM Granite, Mistral, and GPT-4-Turbo, and we evaluated RAM's effectiveness when each LLM was employed in the pipeline. We used the threat detection rules published in the Splunk Security Content dataset² in our experiments; to ensure that the rules were not already known to the LLM, we carefully selected rules for the dataset based on their creation or modification dates. Specifically, we included only those rules with dates later than the knowledge cut-off date of the LLMs utilized in our experiments.

Using various configuration settings, we aimed to identify the optimal strategies for maximizing the performance of these models. Our study not only demonstrates the potential of LLMs in automating threat analysis but also provides insights into the most effective configurations for deploying these models in real-world cybersecurity environments. This study is among the first to explore leveraging LLMs to map structured data to the MITRE ATT&CK framework, and our results, which highlight RAM's potential, leave room for further refinement in future research. We also provide valuable insights regarding the challenges encountered during this study, which can guide subsequent advancements in this domain (for example, the lack of a completely labeled SIEM rules dataset).

The main contributions of this paper are as follows:
• We demonstrate the feasibility of using LLMs to automate the mapping of SIEM rules to MITRE ATT&CK techniques and provide reasoning, which could significantly enhance the capabilities of current cybersecurity tools.
• We propose an AI agent-based framework that utilizes both implicit and explicit knowledge in automating the mapping of structured SIEM rules to MITRE ATT&CK techniques.
• We demonstrate the effective utilization of LLMs without the need for pretraining or fine-tuning, thereby eliminating the need for any training data.
• We provide a practical guide for deploying LLMs in cybersecurity, by identifying the optimal configurations for these models.
• We present valuable insights regarding the challenges encountered during the experimentation process, providing increased understanding of the obstacles and considerations that shaped our research and findings.

² https://github.com/splunk/security_content/tree/develop/detections

Figure 1: Overview of our AI Agent-based RAM pipeline.
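The agents in the pipeline above build on the REACT (REasoning & ACTing) pattern described in Section 2.3, in which a model alternates between reasoning about a task and executing a tool, feeding each observation back into the next reasoning step. A minimal, generic sketch of such a loop follows; `llm_plan` and the tool functions are hypothetical placeholders, not the paper's implementation.

```python
# Minimal REACT-style agent loop (illustrative only): reason, act on a tool,
# observe the result, and repeat until the planner decides to finish.

def react_agent(task: str, llm_plan, tools: dict, max_steps: int = 5) -> str:
    """llm_plan(history) -> (thought, action, action_input);
    action == "finish" ends the loop and returns action_input as the answer."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, action_input = llm_plan("\n".join(history))
        history.append(f"Thought: {thought}")
        if action == "finish":  # the planner decides the task is done
            return action_input
        observation = tools[action](action_input)  # act: run the chosen tool
        history.append(f"Action: {action}({action_input})")
        history.append(f"Observation: {observation}")
    return "no answer within step budget"
```

In RAM's case, the tool would be a web search used to gather context about extracted IoCs, and the final answer would be the enriched information passed to the translation stage.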

2 Background

LLMs, due to their advanced NLP and generation capabilities, are well-suited to analyze structured data by understanding its syntactical meaning, as well as for providing human-understandable reasoning. However, effectively mapping structured data, such as SIEM rules, to the MITRE ATT&CK framework [2] requires not only syntactical understanding but also a deep comprehension of the semantic meaning of the rules. LLMs must discern the semantic content of a rule and identify its alignment with the techniques and sub-techniques. Relying solely on the implicit knowledge of LLMs is insufficient for this complex task.

To address this challenge, various prompt engineering techniques [23] were implemented to enhance the LLMs' ability to understand the semantic nuances of SIEM rules and propose relevant MITRE ATT&CK techniques and sub-techniques. These prompt engineering techniques are discussed in detail in this section. Additionally, to provide context for readers unfamiliar with the MITRE ATT&CK framework, a brief overview of its structure and purpose is also included in this section.

2.1 MITRE ATT&CK Framework

The MITRE ATT&CK framework [2] is a comprehensive, globally recognized knowledge base that provides detailed insights into the tactics, techniques, and procedures (TTPs) used by adversaries in cyberattacks. It is designed to aid cybersecurity professionals in understanding adversarial behaviors, identifying attack patterns, and strengthening defensive strategies. By systematically categorizing adversarial actions, the framework helps organizations enhance their detection, response, and mitigation capabilities.
The framework is structured around three primary components: tactics, techniques, and procedures. Tactics represent the high-level objectives that adversaries aim to achieve during an attack. These objectives outline the "why" behind an adversary's actions and are categorized into stages of an attack lifecycle, such as Initial Access, Execution, Persistence, Privilege Escalation, and Exfiltration. Techniques define the specific methods used to accomplish these tactical goals. These represent the "how" of an attack and include actions like phishing (to gain initial access), command and scripting interpreter usage (for execution), or credential dumping (to gain access to sensitive credentials). Complementing these, procedures provide real-world examples of how techniques are operationalized, offering context on specific tools, scripts, or strategies used in documented adversarial campaigns.

In addition to TTPs, the MITRE ATT&CK framework introduces two critical concepts: data sources and mitigations. Data sources refer to the various types of telemetry and system-generated data that can be collected and analyzed to detect adversarial techniques. Mitigations describe the preventive or corrective actions that can be implemented to neutralize threats or limit their impact.

2.2 Prompt Engineering

Prompt engineering [23] is the practice of crafting precise and effective input prompts to optimize the performance and output of LLMs. It involves designing queries or instructions that guide the model's understanding and execution of tasks. This technique has become pivotal in ensuring that LLMs generate accurate, contextually appropriate, and reliable results, especially in tasks requiring nuanced reasoning or complex problem-solving.

2.2.1 Prompt Chaining. Prompt chaining [27] is a technique that decomposes complex tasks into a sequence of smaller, logically ordered steps. In this approach, the output from one step serves as the input for the next, enabling a modular and iterative resolution of intricate problems. For instance, when generating a report, the initial prompt could request an outline, subsequent prompts could expand individual sections, and a final prompt could synthesize the results into a cohesive document. This method improves the clarity and manageability of multifaceted tasks.

2.2.2 Chain-of-thought Prompting. Another notable technique is chain-of-thought (CoT) prompting [25], which implements step-by-step reasoning. By explicitly including intermediate reasoning steps within the prompt or instructing the model to generate these steps, CoT prompting enhances the AI's capability to address tasks that require logical inference or multi-step computation.

2.3 LLM Agents and the REACT Framework

LLM agents, or AI agents [17], leverage the capabilities of large language models to perform tasks autonomously by reasoning, planning, and acting based on input instructions. They are typically used in applications like chatbots, decision-making systems, or task automation. These agents operate by combining LLMs with external tools, APIs, or environments to handle complex tasks that require more than natural language generation.

The REACT (REasoning & ACTing) framework [28] enhances the functionality of LLM agents by combining logical reasoning with actionable operations in a unified system. REACT agents dynamically integrate high-level reasoning and decision-making with the execution of actions in an iterative feedback loop. This enables the agent to automatically analyze tasks, plan appropriate steps, execute actions, and adapt to new information or evolving contexts. The REACT framework operates through a structured workflow: (1) the LLM interprets the input query or instruction, performs logical analysis, and generates a plan of action; (2) the agent executes the planned actions, such as querying an API or controlling an external system; (3) the results of the actions are analyzed, allowing the agent to refine its reasoning and plan subsequent steps; and (4) the final output integrates the outcomes of reasoning and acting, delivering a comprehensive response to the user.

3 Related Work

Table 1 summarizes the related work on mapping cybersecurity data to attack tactics and techniques. As can be seen, previous works mainly focused on mapping unstructured data, such as threat intelligence reports, and semi-structured data, such as event logs, to MITRE ATT&CK techniques. These efforts proposed both rule-based methods [18] and machine learning (ML)-based approaches [3, 4, 12, 14, 19, 22, 29, 30].

The rule-based method proposed by Kryukov et al. [18] aimed to map security events in SIEM to the MITRE ATT&CK framework using pre-defined rules based on threat patterns. A key limitation of their method lies in its heavy reliance on a rule (or pattern) database. While these patterns are essential to carry out accurate mapping, the method's reliance on them constrains its adaptability to newly emerging threats within the dynamic and ever-evolving cybersecurity landscape. As a result, increased false positives (misidentification of benign activities as threats) and false negatives (failure to detect actual threats) are produced, particularly when new attack patterns or techniques have not been added to the database.

Recent advancements have marked a significant shift from traditional rule-based methods to the adoption of ML-based approaches, particularly the use of language models, for this task. Language models, such as BERT and GPT, offer the ability to process unstructured text with minimal feature engineering due to their powerful contextual understanding and pre-trained embeddings.

You et al. [29] introduced a classification-based model for mapping unstructured data in the form of CTI reports to the MITRE ATT&CK framework. Their approach utilized a combination of a bi-directional LSTM model and a CNN model to classify unstructured data into just six technique classes, significantly simplifying the task. However, the complexity of mapping unstructured data increases substantially when the model must predict across the entire set of technique classes in the MITRE ATT&CK framework, which includes approximately 670 classes. Additionally, it is important to note that an attack pattern or rule can be mapped to multiple techniques, and a single technique may fall under multiple tactics, further complicating the mapping process. This limitation may restrict the method's applicability in real-world scenarios where comprehensive coverage of TTP classes is essential.

Liu et al. [19] proposed a novel approach to map unstructured CTI to the MITRE ATT&CK framework. Their methodology, referred to as ATHRNN (attention-based transformer hierarchical recurrent

neural network), employs a two-step classification process: first, classifying the unstructured text into MITRE ATT&CK tactics and then further classifying it into MITRE ATT&CK techniques. In other work, Alves et al. [4] explored the application of BERT models for the classification of unstructured text into MITRE ATT&CK techniques. The study uses eleven different BERT models to map unstructured texts to the MITRE ATT&CK framework, aiming to enhance automation in cyber threat intelligence.

Similarly, Alam et al. [3] proposed LADDER, a framework designed to enhance cybersecurity by automatically extracting attack patterns from CTI sources. LADDER uses different BERT-based models for extracting attack patterns from unstructured texts and then mapping these patterns to the MITRE ATT&CK framework. Rani et al. [22] proposed TTPXHunter, a method designed for the automated mapping of attack patterns extracted from cyber threat reports to the MITRE ATT&CK framework. This method is an extension of TTPHunter [21], improving its ability to cover a broader range of techniques from the MITRE ATT&CK framework and its precision with the help of a cyber domain-specific language model called SecureBERT. Sentences are transformed into embeddings using SecureBERT and then sent to a linear classifier for TTP prediction.

Fayyazi et al. [12] evaluated how well LLMs, specifically encoder-only (e.g., RoBERTa) and decoder-only (e.g., GPT-3.5) models, can summarize and map cyberattack procedures to the appropriate ATT&CK tactics. The authors compared various mapping approaches. However, they focused on mapping cyberattack procedures to MITRE ATT&CK tactics (which represent higher-level categorizations in the MITRE ATT&CK framework). While this is useful, it is comparatively easier than mapping cyberattack procedures to MITRE ATT&CK techniques and sub-techniques, which are more granular and detailed, offering deeper insights into the specific actions and methods employed in an attack.

Fengrui et al. [14] introduced a method combining data augmentation and instruction supervised fine-tuning using LLMs to classify TTPs effectively in scenarios with limited data. Similarly, Zhang et al. [30] introduced a novel framework for constructing attack knowledge graphs (KGs) from CTI reports, by leveraging LLMs.

While these methods demonstrate remarkable progress in handling unstructured data, their applicability to structured data use cases, such as mapping SIEM rules, is limited. These approaches are specifically designed for unstructured data, where relationships between entities and contextual information are often explicitly defined, simplifying the mapping process to the MITRE ATT&CK framework. Adapting these methods to structured formats like SIEM rules would require extensive modifications, reducing their effectiveness and suitability for such scenarios.

To the best of our knowledge, only two studies have specifically focused on mapping structured data, such as IDS rules and SIEM rules, to the relevant MITRE ATT&CK techniques (or sub-techniques) using language models.

Nir et al. [9] investigated the integration of LLMs, specifically ChatGPT, into cybersecurity workflows to automate the association of network intrusion detection system (NIDS) rules with corresponding MITRE ATT&CK techniques. While their method represents one of the first applications of LLMs for this purpose, their findings highlight the necessity of incorporating additional contextual information to enhance the accuracy and reliability of LLM predictions. Mărmureanu et al. [20] proposed a method to map structured data in the form of Splunk rules to the MITRE ATT&CK framework. The authors proposed the use of ML classifiers to map the Splunk rules to tactics specified in the MITRE ATT&CK framework. A significant limitation of these methods is their dependence on supervised learning approaches to train the machine learning models within their frameworks. These models are unable to dynamically adapt to evolving threat landscapes or newly introduced MITRE ATT&CK techniques without undergoing retraining with updated datasets. This retraining process is not only time-consuming but also resource-intensive, substantially restricting the methods' ability to keep pace with the rapid evolution of cyber threats.

In summary, the limitations of prior studies can be categorized as follows: (1) reliance on supervised learning tasks, (2) inability to adapt to structured texts, and (3) dependence on additional contextual information to effectively interpret rules. To address these shortcomings, we propose RAM, a novel LLM-based approach that eliminates the need for training data, coherently processes structured text into natural language, and employs LLM agents to retrieve supplementary contextual information, enabling reliable and accurate predictions.

4 Methodology

Each SIEM system uses its own RDL to define threat detection rules, and each RDL has its own schema. For example, the Splunk SIEM uses the SPL to define its threat detection rules. The task of understanding threat detection rules and recommending relevant MITRE ATT&CK techniques (or sub-techniques) requires complex reasoning skills. In the case of LLMs, this can be achieved with a technique called prompt chaining, in which each task is divided into multiple sub-tasks in order to understand the complex reasoning behind the task. Therefore, we employ a multi-phase architecture based on prompt chaining that leverages the power of LLMs to take a SIEM rule defined in any RDL and map it to relevant MITRE ATT&CK techniques. Our approach is based on the following intuitions:
• LLMs' implicit knowledge: LLMs possess a deep understanding of diverse RDLs. This enables them to interpret any rule, regardless of the RDL it is defined in, and convert it into comprehensible natural language text.
• LLMs' similarity comparison capability: LLMs are adept at analyzing and comparing textual descriptions. They can intelligently assess the similarity between two textual inputs to establish a meaningful connection.

RAM has two main phases: (1) the rule to text translation phase, and (2) the MITRE ATT&CK techniques recommendation phase. These two phases in the pipeline include six key steps to determine relevant TTPs, as illustrated in Figure 1.

Although LLMs excel at translating SIEM rules into natural language, they often lack critical domain-specific contextual information related to IoCs in the rules. To overcome this limitation, the rule to text translation phase includes three steps: IoC extraction, contextual information retrieval, and natural language translation.

The workflow begins with the extraction of IoCs from the rules (for example, processes, log source, event codes, and file names) that the rule searches for in the logs (step 1). In the next step, a web
Table 1: Summary of previous work.

Paper (year) | Mapping | Input data (& type) | Method | Comments
Kryukov et al. [18] (2022) | Techniques | SIEM alert logs (semi-structured) | Rule-based mapping | Coverage of all technique classes. No quantitative metric was provided.
Alves et al. [4] (2022) | Techniques | MITRE Procedures (unstructured) | Train BERT model as classifier | Achieved classification accuracy of 0.82 on the test dataset. Coverage of only 253 techniques in their evaluation.
You et al. [29] (2022) | Techniques | CTI reports (unstructured) | Use pre-trained Sentence-BERT for embeddings; train bi-LSTM with attention coupled with CNN for classification | Evaluation performed on only six techniques, which extremely simplifies the classification task. Classification accuracy on six techniques is 0.94.
Liu et al. [19] (2022) | Techniques | CTI reports (unstructured) | Train transformer and RNN-based model as classifier | Coverage of all technique classes. Achieved an AUC score of 0.76 during the classification task.
Fayyazi et al. [12] (2023) | Tactics | MITRE Procedures (unstructured) | RAG-based approach to improve LLM performance | Only 14 classes available for classification. Achieved a high F1 score of 0.95 when using RAG to fetch external data.
Alam et al. [3] (2023) | Techniques | CTI reports (unstructured) | Train BERT model as classifier | Achieved TTP classification recall of 0.63.
Rani et al. [22] (2024) | Techniques | CTI reports (unstructured) | Use SecureBERT for embeddings and train a linear classifier | The dataset consisted of only 193 technique classes. Achieved a recall of 0.96 on an augmented test dataset.
Fengrui et al. [14] (2024) | Techniques | MITRE Procedures (unstructured) | LLM fine-tuning with MITRE data | Achieved a recall of 0.89 when the number of samples in the fine-tuning dataset is more than 33; with fewer samples, the recall achieved is 0.43.
Zhang et al. [30] (2024) | Techniques | CTI reports (unstructured) | Use LLM for similarity matching | Overall recall achieved for technique identification is 0.59.
Nir et al. [9] (2024) | Techniques | NIDS rules (structured) | Use LLM's implicit knowledge | Maximum recall achieved with ChatGPT-4 is 0.68.
Mărmureanu et al. [20] (2023) | Tactics | SIEM rules (structured) | Train BERT model as classifier | Only 14 classes available for classification. With a weight-based ensemble learning strategy, achieved a recall of 0.72.
Our method (RAM) | Techniques | SIEM rules (structured) | Use prompt engineering techniques and implement LLM agents to enhance LLM performance | Coverage of all technique-level classes. Achieved a recall of 0.75 on the test rules.

search agent performs the task of obtaining additional contextual information about the discovered IoCs (step (2)). By incorporating this additional domain-specific information, the pipeline enhances the language translation, resulting in a more accurate and meaningful interpretation of SIEM rules. The rule itself and the IoCs' additional contextual information from the previous stage are then used to translate the rule from RDL to natural language (step (3)).

The MITRE ATT&CK techniques recommendation phase of the pipeline includes the following three steps. The rule is processed in the data source identification step, in which the probable origin of the data is identified (step (4)). The description of the rule is then used to determine probable MITRE ATT&CK techniques based on the implicit knowledge of the LLM (step (5)). Finally, using chain-of-thought [25] prompting, the most relevant techniques are extracted from the list of probable techniques (step (6)). Each step of our method is described in detail below.

4.1 IoC Extraction
The context associated with a SIEM detection rule is crucial for its accurate interpretation and effective application. Obtaining this contextual understanding requires comprehensive analysis of the IoCs embedded in the SIEM rule. In the first step, RAM systematically identifies and extracts all IoCs, identifying the types of IoCs and their corresponding values that form the foundational elements of the detection rules. Leveraging the LLM's inherent understanding of rule structures and IoCs, we employ a zero-shot prompting approach for this task. Zero-shot prompting enables the direct extraction of IoCs from the rules without requiring extensive pre-training on specific datasets.

The result of this stage is a dictionary structure, where:
• Keys represent types of IoC, such as processes, files, IP addresses, and log sources.
• Values are lists containing specific IoC details, such as process names, file names, IP addresses, and log source identifiers.

In the example depicted in Figure 2(a), the pipeline processes a rule for which relevant MITRE ATT&CK techniques need to be recommended. The IoC extractor LLM produces a dictionary structure as output, organizing the IoCs in a structured format to support subsequent stages in the analysis pipeline.

4.2 Contextual Information Retrieval
In this step, an LLM agent is employed to retrieve relevant information pertaining to the IoCs extracted from the rule. A ReAct agent [1] was used in this case to generate both reasoning traces and task-specific actions in an interleaved manner. ReAct agents interact with external tools to retrieve additional information that leads to more factual and reliable responses. The LLM agent conducts a systematic search across web resources to gather additional contextual information for each IoC value present in the rule. This step addresses LLMs' lack of up-to-date knowledge or specialized domain expertise (which is critical to understanding the role and significance of the IoCs in the rule), without the need for retraining or fine-tuning. Figure 2(b) presents an example in which the rule includes the process name soaphound.exe as an IoC. As can be seen, the web search results indicate that soaphound.exe is being

Figure 2: An illustration of the different steps in RAM.
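As a concrete illustration of the IoC extraction output (Section 4.1), the dictionary for the rule in Figure 2(a) might look as follows. This is only a sketch: the keys and values shown are our own illustrative assumptions, not RAM's actual output.

```python
# Hypothetical sketch of the dictionary produced by the IoC extractor LLM
# (step (1)) for the SoapHound rule of Figure 2(a). Keys are IoC types;
# values are lists of concrete IoC details found in the rule.
iocs = {
    "processes": ["soaphound.exe"],
    "command_line_arguments": ["--buildcache", "--certdump"],
    "log_sources": ["Endpoint.Processes"],
}

for ioc_type, values in iocs.items():
    print(f"{ioc_type}: {values}")
```

Each value list then drives one web search per IoC in the contextual information retrieval step (step (2)).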

used for active directory (AD) enumeration, which is important for the understanding of the attack.

4.3 Natural Language Translation
The translation of detection rules into natural language textual descriptions fulfills three key objectives:
(1) Ensures that RAM is format-agnostic: It converts rules defined in various RDL formats into a generic, unstructured text format, ensuring compatibility with different SIEM systems, regardless of the specific rule format.
(2) Provides contextual explanation: It includes all relevant contextual information to produce a concise and comprehensible explanation of the rule.
(3) Enhances the comprehension for LLMs: It enables LLMs to more effectively compare the translated rule with descriptions in the MITRE ATT&CK framework by providing a unified textual representation.
To achieve these objectives, a zero-shot prompting technique is employed. The input to the LLM comprises two components:
• Syntactical information: The rule itself, providing the structural and operational details.
• Contextual information: Details of the IoCs extracted from the rule, providing semantic insights into the rule's intent and function.
The LLM utilizes these inputs to generate a natural language textual description of the rule. This transformation not only ensures a more interpretable representation but also facilitates further steps of analysis and comparison, particularly in aligning the rule with MITRE ATT&CK techniques and sub-techniques.

4.4 Data Source or Mitigation Identification
Identifying the most relevant data component or mitigation associated with the rule description in this step is critical for filtering out irrelevant MITRE ATT&CK techniques (or sub-techniques) in subsequent steps of the pipeline. In the MITRE ATT&CK framework, data sources represent various categories of information that can be gathered from sensors or logs. These data sources include data components, which are specific attributes or properties within a data source that are directly relevant to detecting a particular technique or sub-technique. For example, in the context of the rule described in Figure 2(a), the term Endpoint.Processes indicates that the activity is happening on an endpoint. The presence of terms such as soaphound.exe, --buildcache, and --certdump indicates that the rule searches for command line execution of an executable named soaphound.exe with specific parameters. Therefore, the appropriate data source in this example is Command, with the corresponding data component being Command Execution. Additionally, mitigations are defined as categories of technologies or strategies that can prevent or reduce the impact of specific techniques or sub-techniques. The MITRE ATT&CK framework explicitly establishes relationships between data components, mitigations, and techniques (or sub-techniques), enabling a systematic approach for identifying relevant elements.

To identify the most relevant data component or mitigation associated with a given rule description, we utilize agentic retrieval augmented generation (RAG), which incorporates an AI agent-based implementation of the RAG framework. Data from the MITRE ATT&CK framework, specifically related to data components and mitigations, is stored in a vector database (e.g., ChromaDB). The process begins with the rule description from the previous stage,
which serves as the input to the AI agent. The LLM-powered agent automatically generates a search query tailored to retrieve relevant information from the RAG database.

For each query, the system retrieves the five most similar documents from the database, each containing contextual information about data components or mitigations. These documents are then utilized by the LLM agent to contextualize the rule description. By comparing the content of these retrieved documents with the rule description, the LLM agent determines and outputs the most relevant data component or mitigation, along with a chain-of-thought as to why the data component or mitigation is related to the rule.

4.5 Probable Technique Recommendation
In this step, an LLM agent is utilized to propose probable MITRE ATT&CK techniques (and sub-techniques) that may be relevant to the description of the provided rule. We used a ReAct agent in this step as well, to utilize both implicit and explicit knowledge during reasoning. For explicit knowledge, the agent searches the MITRE ATT&CK framework to obtain the list of probable techniques (and sub-techniques). The natural language description of the rule from the previous step serves as input to the LLM agent. The output of this stage consists of a list of JSON objects, each containing the MITRE technique ID, technique name, and technique description, as seen in Figure 2(c).

Throughout our experiments, we observed that as the number of recommendations (k) increases, both the framework's average recall and precision initially improve; however, beyond a certain threshold of k, the precision begins to decline. Based on these observations (see Table 5), we selected a k-value of 11 to ensure a high recall.

4.6 Relevant Technique Extraction
In this step, RAM refines the set of probable MITRE ATT&CK techniques identified in the previous stage by eliminating irrelevant entries. This step in the pipeline serves two primary purposes: (1) to enhance precision while maintaining the recall achieved in the previous step, and (2) to provide a clear rationale for the selection of the labels, ensuring transparency and interpretability of the mapping process. This refinement process is grounded in the assumption that LLMs are effective for text similarity matching tasks.

The process comprises two key steps:
• Rule-technique comparison: The description of each technique in the set of probable techniques is compared with the rule description. A chain-of-thought technique is then applied to elucidate the reasoning behind the association of each technique with the rule.
• Confidence calculation: The generated chain-of-thought rationale for each technique (or sub-technique) is compared with the rule description to compute a relevance (or confidence) score, as done in prior work [16].
Techniques with higher confidence scores are deemed more relevant to the rule. Conversely, techniques with scores falling below a predefined threshold are excluded. The techniques retained after this filtering step represent the most relevant techniques corresponding to the given rule's description.

The chain-of-thought (CoT) rationale generated during the comparison of each rule to its probable techniques is also provided as an output in this step. This rationale offers a detailed natural language explanation, articulating why a particular technique is relevant to the given rule. Such explanations are highly valuable for security analysts, as they provide clear and transparent reasoning behind the mapping, enabling analysts to better understand and validate the association between the rule and the technique. Other classification models proposed in previous works within this domain also suffer from the limitation of being black-box models, which lack the ability to provide clear reasoning or explanations. Unlike RAM, these models fail to generate transparent CoT rationales that explain why a particular rule is mapped to a specific technique, making them less interpretable and less useful for security analysts.

5 Evaluation
5.1 Dataset
The Splunk Security Content dataset is a comprehensive repository of security resources dedicated to improving the detection and mitigation of threats, maintained by Splunk Inc.'s research team. This dataset features over 1,600 analytic rules, all written in Splunk's Search Processing Language (SPL), to effectively identify malicious behaviors.

These analytic rules are systematically organized into distinct domains, such as endpoint, network, and application, based on their intended scope of applicability. For instance, the endpoint domain comprises rules specifically crafted to detect malicious activities occurring on endpoint devices. Moreover, each rule is annotated with corresponding MITRE ATT&CK technique identifiers, thereby offering a ground-truth framework for our experiments. We chose the endpoint domain as it presents a higher diversity of MITRE ATT&CK technique labels than other domains in the dataset.

To ensure experimental integrity, we carefully evaluated the knowledge cut-off dates of the models utilized. Among the hosted models, GPT-4-Turbo had the most recent cut-off date of December 2023,3 while Granite, the latest local model, was released in October 2024. To prevent any potential data leakage, we exclusively analyzed rules from the Splunk Security Content dataset that were created or modified on or after November 2024. During our review of the Splunk Security Content repository in December 2024, we identified 360 rules within the endpoint domain that met our criteria. None of the hosted or local models employed in our experiments were capable of inherently performing web searches, since they were used via an API [6].

5.2 Evaluation Setup
We implemented our method RAM in Python 3.9, leveraging multiple libraries and frameworks for model interaction. We utilized the transformers library [26] for local models, and for hosted models, we employed the LangChain and LangGraph frameworks. The hosted models used in our experiments are GPT-4 Turbo, GPT-4o, and GPT-4o-mini, whereas the local models are Mistral-7B, IBM Granite-3.0-8B, and Qwen-2.5-7B. Table 2 provides a summary of all the models used in our experiments.

3 https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models?tabs=global-standard%2Cstandard-chat-completions#gpt-4-and-gpt-4-turbo-models

We configured our experimental setup with an Intel Core i7 processor, 32 GB RAM, and an NVIDIA RTX 4090 GPU (24 GB VRAM). Given the GPU memory constraints, we specifically selected local models with up to 8 billion parameters that could be run at full precision without quantization, ensuring optimal model performance and a reliable comparison baseline. This hardware configuration allowed us to maintain consistent computational precision across all local model experiments.
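A quick back-of-the-envelope check (our own arithmetic, assuming the models' native 16-bit weights at 2 bytes per parameter) shows why models of up to about 8 billion parameters fit a 24 GB GPU without quantization:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumes 16-bit (2-byte) native weights; activations and the KV cache
# add further overhead on top of this figure.
def model_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / 1024**3

print(round(model_memory_gb(8e9), 1))  # ~14.9 GB of weights on a 24 GB GPU
```

Larger models would exceed the available VRAM or force quantization, which would undermine the full-precision comparison baseline.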
With the technical infrastructure in place, we next focused on
developing a standardized approach to model interaction through
careful prompt engineering. Our approach follows a carefully de-
signed template that maximizes model performance while main-
taining consistency across different models. The prompts used in all
steps of RAM follow a four-component structure described below:
• Context: This provides the background or relevant details that
set the stage for the task. It includes information on the topic,
scenario, or purpose, ensuring that the LLM understands the
larger situation.
• Instruction: This part specifies the exact task the LLM is expected
to perform. It provides a clear, concise, and actionable explana-
tion to effectively guide the LLM.
• Guidelines: These are the rules or constraints that the LLM must
follow when completing the task. They ensure that the output is
aligned with the desired tone, format, or style.
• Input: This includes any user-provided data, queries, or material
required for the LLM to complete the task. It serves as the starting
point or raw material for generating the output.
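The four-component structure above can be sketched as a simple template. The wording here is our paraphrase of the structure, not the paper's exact prompt text:

```python
# Illustrative four-component prompt skeleton; the placeholder wording
# is an assumption for demonstration purposes.
PROMPT_TEMPLATE = """Context: You are a cybersecurity specialist analyzing SIEM detection rules.
Instruction: {instruction}
Guidelines: Respond only with a JSON list of technique objects.
Input: {rule_description}"""

prompt = PROMPT_TEMPLATE.format(
    instruction="Identify the MITRE ATT&CK techniques/sub-techniques relevant to the rule.",
    rule_description="Detects execution of soaphound.exe with --buildcache or --certdump.",
)
print(prompt)
```

Keeping the four components in a fixed order makes prompts comparable across the different models evaluated.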
An example of a prompt used in the experiments is illustrated in Figure 3. In this example, the prompt is structured into four key components to effectively guide the LLM. The context specifies that the task belongs to the cybersecurity domain and that the LLM should approach it as a cybersecurity specialist. Following this, the instruction clearly defines the task, requiring the LLM to identify and output the relevant attack techniques/sub-techniques corresponding to the provided rule description. The guidelines set additional constraints, specifying that the response must be formatted in JSON. Lastly, the actual rule description is presented as the input for the LLM to process and analyze.

Figure 3: Overview of prompt structure used in all steps of the pipeline.

5.3 Baselines
We used three baseline methods to compare RAM with:
(1) A single GPT-4-Turbo LLM with zero-shot prompting. The input to the LLM was the rule as-is, and the LLM was asked to analyze the rule and output all the MITRE ATT&CK techniques or sub-techniques relevant to the rule.
(2) BERT-based classifiers. Building on the research by Mărmureanu et al. [20], we trained a BERT model [10] as a baseline for comparison in our experiments. BERT's ability to capture bidirectional context through its masked language modeling (MLM) and next sentence prediction (NSP) objectives makes it particularly effective for classification tasks [24]. Additionally, we implemented CodeBERT [13] and an adaptation of the BERT model as baseline models for comparison in our experiments.
(3) TTPxHunter. By implementing the publicly available code from the research proposed by Rani et al. [22], we established an additional baseline for comparison with our method. This approach utilizes SecureBERT to generate embeddings for the input data, which are then processed by a linear classifier to produce the final output.

5.4 Evaluation Metrics
To evaluate our framework, we employed evaluation metrics commonly used in classic recommender systems. Specifically, we computed the average recall (AR) and average precision (AP) across the entire test dataset. In recommender systems, relevance is inherently user-specific, and both AP and AR metrics effectively capture the varying importance of individual recommendations. This characteristic makes these metrics particularly well-suited for multi-class classification problems like MITRE ATT&CK technique prediction, where a single rule can be associated with multiple valid labels (techniques or sub-techniques). For instance, the rule illustrated in Figure 2(a) has multiple techniques and sub-techniques mapped to it, including T1087.002, T1069.001, T1482, T1087.001, T1087, T1069.002, and T1069.

For computing AR, we first calculated the recall for each sample (rule) in the dataset. The AR was then calculated by averaging the recall values across all samples in the dataset. Similarly, to
Table 2: Summary of the models and their configurations.

Model Type | Model Name | Parameter Size | Context Window | Max Tokens | Knowledge Cut-off
Hosted | GPT-4-Turbo | ∼1.8 Trillion | 128,000 | 4,096 | December 2023
Hosted | GPT-4o | ∼2 Trillion | 128,000 | 16,384 | October 2023
Hosted | GPT-4o-mini | ∼8 Billion | 128,000 | 16,384 | October 2023
Local | IBM Granite | ∼8 Billion | 128,000 | 8,192 | Not available
Local | Mistral | ∼7 Billion | 32,000 | 4,096 | Not available
Local | Qwen | ∼3 Billion | 128,000 | 8,192 | Not available

compute the AP, we first calculated the precision for each sample in the dataset, and then the AP was determined by averaging the precision values across all samples in the dataset.

In addition, the task of mapping a SIEM rule to the MITRE ATT&CK framework is inherently a multi-class classification problem, where the number of techniques associated with each sample varies. Appendix A provides an analysis of the distribution of samples based on the number of techniques they contain. Therefore, we also compute the weighted average precision (WAP) and weighted average recall (WAR) metrics. These metrics apply weights based on the relative number of techniques for each sample.

5.5 Results
Comparison of RAM with baselines. We compared RAM with the baseline methods described in the previous section, and the results of this comparison are presented in Table 3. As evident from the table, RAM outperformed the baseline methods when utilizing GPT-4-Turbo and GPT-4o as the underlying models. These findings underscore the critical role of both implicit and explicit domain-specific knowledge in enabling LLMs to deliver optimal results for the task of mapping SIEM rules to the MITRE ATT&CK framework.

Table 3: RAM's performance using various hosted and local models.

Model Type | Model Name | AR (WAR) | AP (WAP)
Hosted | GPT-4-Turbo | 0.75 (0.724) | 0.52 (0.51)
Hosted | GPT-4o | 0.71 (0.69) | 0.49 (0.47)
Hosted | GPT-4o-mini | 0.38 (0.36) | 0.29 (0.275)
Local | IBM Granite | 0.34 (0.31) | 0.28 (0.24)
Local | Mistral | 0.12 (0.09) | 0.08 (0.05)
Local | Qwen | 0.23 (0.19) | 0.16 (0.15)
Baselines | Zero-shot LLM | 0.46 (0.43) | 0.31 (0.30)
Baselines | BERT | 0.68 (0.65) | 0.39 (0.38)
Baselines | CodeBERT | 0.65 (0.63) | 0.47 (0.45)
Baselines | TTPxHunter [22] | 0.59 (0.56) | 0.42 (0.41)

RAM's performance with different LLMs. RAM's pipeline was executed using several LLMs to evaluate its ability to recommend MITRE ATT&CK techniques (or sub-techniques) for a given SIEM rule. As can be seen in Table 3, GPT-4-Turbo demonstrated superior performance, delivering the most accurate and relevant recommendations when compared to other hosted and local models. These results provided insights into how the model configuration, including model size and architectural differences, influences the quality of recommendations generated by RAM.

Effect of model's size and context window length on performance. As can be seen in Table 3, RAM's performance is significantly influenced by the size of the LLMs used in its pipeline (see Table 2 for the different models used and their sizes and context window lengths). Larger models, such as GPT-4-Turbo and GPT-4o, demonstrated superior performance compared to their smaller counterparts like GPT-4o-mini or local models with fewer parameters. This performance disparity highlights the ability of larger models to better understand complex relationships and patterns in the input data. The large number of parameters and greater context window allow them to capture nuanced information that smaller models might overlook, leading to more accurate and reliable recommendations. The context window of an LLM refers to the maximum amount of text, measured in tokens, that the model can process at a time. This parameter plays a crucial role in determining the model's ability to handle tasks requiring extensive context. In the case of RAM, the performance using the Mistral model was poorer compared to other models. This can be attributed to the significantly smaller context window of the Mistral model, which limited its ability to process and utilize the full breadth of information required for effective recommendations. In contrast, GPT-4o-mini, a similar-sized model with a larger context window, performed better, as it could handle more comprehensive inputs, leading to improved accuracy and reliability in mapping SIEM rules to MITRE ATT&CK techniques.

Ablation study. In order to analyze the importance of the different components and steps in RAM, we performed an ablation study, mainly focusing on the impact of language translation and additional contextual information on the recommendations made by the LLM. We performed the ablation study by running the experiment in three different scenarios. In the first scenario, the rule was provided as-is to the next stage of the pipeline; in this case, RAM achieved an AR of approximately 0.46 and an AP of around 0.39. In the second scenario, the rule was first translated into a natural language description, without adding any contextual information, before being passed to subsequent stages of the pipeline. This improved the AR to around 0.54 and the AP to approximately 0.42. Finally, in the third scenario, the rule was translated into a natural language description enriched with contextual information before being processed by the later stages of the pipeline. This setup yielded the best results, with an AR of around 0.75 and an

AP of approximately 0.51. In these experiments, GPT-4-Turbo was used, given its superior performance in the experiments described above. A summary of the results is presented in Table 4.

These findings highlight the importance of contextual information in enhancing RAM's performance, demonstrating that enriching natural language translations with domain-specific insights leads to improved recall and precision.

Table 4: Impact of contextual information that enriches natural language translation with domain-specific insights on RAM's performance.

S.No. | Experiment | AR (WAR) | AP (WAP)
1 | Rule as-is | 0.46 (0.42) | 0.39 (0.36)
2 | Rule description w/o contextual information | 0.54 (0.49) | 0.42 (0.40)
3 | Rule description with contextual information | 0.75 (0.724) | 0.52 (0.51)

Effect of k on relevant recommendations. Our preliminary experiments demonstrated that limiting k substantially influenced RAM's performance. Larger k values improved recall by capturing a broader range of potentially relevant recommendations, but this came at the cost of reduced precision. This trade-off highlights the necessity of carefully selecting k to balance precision and recall according to the specific needs of the application.

To address this, we replaced the hard limit on k with a filtering mechanism based on the confidence (relevance) score generated in the final stage of the pipeline. Recommendations with scores below a predefined threshold were excluded. We used a threshold of 0.8, which effectively filtered low-confidence recommendations while retaining the most relevant results. This approach improved the overall performance of the model by dynamically adjusting the number of recommendations based on their relevance. RAM's results for different k values, using the abovementioned filtering mechanism, are presented in Table 5.

Table 5: RAM's performance for various values of k.

S.No. | k | AR (WAR) | AP (WAP) | F1-Score
1 | 1 | 0.45 (0.44) | 0.29 (0.26) | 0.35
2 | 3 | 0.62 (0.58) | 0.34 (0.3) | 0.44
3 | 5 | 0.66 (0.636) | 0.45 (0.43) | 0.53
4 | 7 | 0.71 (0.68) | 0.42 (0.40) | 0.527
5 | 9 | 0.74 (0.72) | 0.40 (0.35) | 0.52
6 | 11 | 0.75 (0.728) | 0.39 (0.38) | 0.51
7 | 13 | 0.75 (0.72) | 0.35 (0.31) | 0.48
8 | dynamic-k | 0.75 (0.724) | 0.52 (0.51) | 0.62

Figure 4: Average precision vs average recall curve.

Reasoning. One major advantage of using LLMs over other language models like BERT or CodeBERT is their inherent ability to provide reasoning for the tasks they perform. In RAM's pipeline, the Rule-Technique Comparer step plays a crucial role by comparing the rule to the technique in question and generating a chain-of-thought explanation. This explanation, expressed in natural language, provides the relationship between the rule and the technique. Such transparency allows security analysts to interpret the rationale behind the mapping, enabling them to assess whether RAM's output can be trusted or requires further scrutiny. For example, consider the SIEM rule shown in Figure 2(a) as input. This rule relates to the execution of an executable file called soaphound.exe on a system. One of the labels for this rule is "T1482 - Domain Trust Discovery". Figure 5 provides the CoT explanation as to why the technique T1482 is relevant to the SIEM rule.

Figure 5: Chain-of-thought reasoning provided by RAM.

6 Discussion
The use of RAM for mapping SIEM rules to the MITRE ATT&CK framework offers several distinct advantages. Unlike traditional
approaches, RAM does not require any training data, making it par- where All_Changes.result="*locked out*" by
ticularly well-suited for the cybersecurity domain, where publicly All_Changes.user All_Changes.result
available data is often scarce. Additionally, the incorporation of |`drop_dm_object_name("All_Changes")`
LLMs within the pipeline enables the generation of natural language |`drop_dm_object_name("Account_Management")`
reasoning, facilitating easier interpretation of results. This trans- | `security_content_ctime(firstTime)`
parency enhances the trust of security analysts on the framework, | `security_content_ctime(lastTime)`
as they can better understand and validate the mapping process. | search count > 5
While RAM demonstrated promising results, our study has cer-
tain limitations. One of the primary concerns is confidentiality, as This rule aims to identify instances where excessive user account
RAM’s pipeline relies on hosted models, which may involve sending lockouts are being recorded. The label assigned to this rule in the
sensitive data to external servers. Additionally, the use of prompt dataset is "T1078 - Valid Accounts," which is an accurate label.
chaining in the pipeline leads to longer response times compared to However, the dataset does not include all the relevant labels for
baseline methods. The duration of processing varies based on the this rule. In this case, the technique "T1110 - Brute Force" is also a
resources available when employing local models, while for hosted relevant technique that should have been mapped to the rule but
models, it depends on factors such as network bandwidth and the was not included in the dataset. Despite these imperfections in the
tokens-per-second (TPS) rate of the model. labeling, the Splunk Security Content dataset provided the most
The results of our study also revealed several challenges associ- reliable ground truth available for our study.
ated with the research, which are discussed below: Technique vs. Sub-Technique Prediction Challenges: In some
Insufficient Rule-Specific Information: The information em- cases, the LLM predicted a technique instead of a sub-technique
bedded within a rule alone was found to be inadequate for fully (or vice versa). Under stricter evaluation criteria that was used
understanding its purpose. Additional contextual data such as the to evaluate our experiments, such discrepancies were categorized
associated common vulnerabilities and exposures (CVE) 4 ID of- as mismatches, which negatively influenced RAM’s performance
ten proved necessary to interpret the rule accurately. For example, metrics of RAM. These observations underline the need for more
consider the following rule:

| tstats `security_content_summariesonly` count
min(_time) as firstTime max(_time) as lastTime
FROM datamodel=Endpoint.Registry
where Registry.registry_path="*\InProcServer32\*"
Registry.registry_value_data="*\FORMS\*"
by Registry.registry_path Registry.registry_key_name
Registry.registry_value_name
Registry.registry_value_data Registry.dest
Registry.process_guid Registry.user
| `drop_dm_object_name(Registry)`
| `security_content_ctime(firstTime)`
| `security_content_ctime(lastTime)`

From the information provided in the rule alone, it is impossible to infer that it searches for phishing activity carried out through the outlook.exe process unless the related CVE ID, "CVE-2024-21378" (see https://cve.mitre.org/), is known.

Impact of Textual Similarities on LLM Accuracy: The high degree of similarity in the textual descriptions of various MITRE ATT&CK techniques and sub-techniques led to instances of hallucination by the language models. This overlap in descriptions posed a significant challenge to ensuring accurate predictions.

Dataset Mislabeling: The relevance of a technique (or sub-technique) to a specific rule varied depending on the annotator’s perspective. This led to inconsistencies in how the dataset was labeled, with several cases being mislabeled. These errors highlight that determining relevance is often subjective. As an example, consider the following rule:

| tstats `security_content_summariesonly`
count min(_time) as firstTime max(_time)
as lastTime from datamodel=Change.All_Changes

[...] precise distinction mechanisms during prediction tasks.

7 Conclusion and Future Work

We proposed RAM, an LLM-based approach for mapping SIEM threat detection rules to MITRE ATT&CK techniques. While LLMs possess implicit knowledge derived from publicly available data, their direct application in cybersecurity contexts is often limited by domain-specific challenges. Our experiments showed that RAM’s performance improves significantly when additional contextual information is integrated.

We identified two primary strategies for incorporating such information: (i) explicitly supplying contextual data in real time through LLM agents, and (ii) fine-tuning the LLM with domain-specific information. In this study, we adopted the first approach, enriching the pipeline in real time with publicly available contextual data sourced from the Internet, since fine-tuning an LLM requires a labeled dataset.

As part of future work, we plan to enhance RAM by incorporating organization-specific contextual information, which can further tailor the model to specific operational environments. Additionally, we aim to explore fine-tuning LLMs with organization-specific contextual data as an alternative approach to further improve RAM’s prediction accuracy. Future research will also investigate optimization techniques such as hyperparameter tuning and ensemble methods to further enhance the performance of the proposed method. With these enhancements, RAM will serve as a reliable and adaptable solution for mapping SIEM rules to MITRE ATT&CK techniques in dynamic and complex cybersecurity landscapes.

The adoption of RAM will contribute greatly to the automation of the cybersecurity incident response pipeline. It also provides a roadmap for integrating advanced LLMs into the defensive strategies of organizations worldwide.
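As a concrete illustration of strategy (i), the fragment below sketches real-time enrichment of a rule before it reaches the LLM: CVE identifiers are pulled out of the rule text and their descriptions are appended to the prompt. The helper names, the regex, and the canned CVE snippet are illustrative assumptions, not RAM’s actual code; in a full pipeline an LLM agent would fetch the description from a public source such as the NVD.

```python
import re

# Illustrative sketch of real-time context enrichment (not RAM's code).
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE)

def extract_cve_ids(rule_text: str) -> list[str]:
    """Return the unique CVE identifiers mentioned in a rule, uppercased."""
    return sorted({m.upper() for m in CVE_PATTERN.findall(rule_text)})

def build_enriched_prompt(rule_text: str, context_snippets: list[str]) -> str:
    """Combine the raw rule with fetched context into one LLM prompt."""
    context = "\n".join(f"- {s}" for s in context_snippets) or "- (none found)"
    return (
        "Map the following SIEM detection rule to MITRE ATT&CK techniques.\n\n"
        f"Rule:\n{rule_text}\n\n"
        f"Context gathered from public sources:\n{context}\n\n"
        "Answer with technique IDs only (e.g., T1566.001)."
    )

rule = 'Registry.registry_value_data="*\\FORMS\\*" ... related to "CVE-2024-21378"'
cves = extract_cve_ids(rule)  # ['CVE-2024-21378']
# An agent would fetch each CVE's description here (e.g., from the NVD
# REST API); a canned snippet stands in for that network call.
context = [f"{c}: Microsoft Outlook remote code execution vulnerability"
           for c in cves]
prompt = build_enriched_prompt(rule, context)
```

With such context in the prompt, the model can connect the InProcServer32/FORMS registry pattern to Outlook-based phishing, which the bare rule text does not reveal.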
A Labels Distribution

Figure 6: Distribution of length of labels in test samples.
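For reference, a label-length distribution like the one plotted in Figure 6 can be recomputed from any labeled rule set in a few lines. The samples below are synthetic stand-ins, not the paper’s test set:

```python
from collections import Counter

# Hypothetical test samples: each maps a rule ID to its list of
# annotated ATT&CK technique labels (synthetic examples).
samples = {
    "rule_001": ["T1566", "T1566.001"],
    "rule_002": ["T1112"],
    "rule_003": ["T1059", "T1059.001", "T1027"],
    "rule_004": ["T1547.001"],
}

# Count how many samples have 1 label, 2 labels, 3 labels, etc.
length_distribution = Counter(len(labels) for labels in samples.values())
print(sorted(length_distribution.items()))  # [(1, 2), (2, 1), (3, 1)]
```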