
AI-Driven Guided Response for Security Operation Centers with Microsoft Copilot for Security

Scott Freitas, Jovan Kalajdjieski, Amir Gharib, Robert McCann
Microsoft Security Research, Redmond, WA, USA

arXiv:2407.09017v4 [cs.LG] 26 Nov 2024
ABSTRACT

Security operation centers contend with a constant stream of security incidents, ranging from straightforward to highly complex. To address this, we developed Microsoft Copilot for Security Guided Response (CGR), an industry-scale ML architecture that guides security analysts across three key tasks—(1) investigation, providing essential historical context by identifying similar incidents; (2) triaging, to ascertain the nature of the incident—whether it is a true positive, false positive, or benign positive; and (3) remediation, recommending tailored containment actions. CGR is integrated into the Microsoft Defender XDR product and deployed worldwide, generating millions of recommendations across thousands of customers. Our extensive evaluation, incorporating internal assessments, collaboration with security experts, and customer feedback, demonstrates that CGR delivers high-quality recommendations across all three tasks. We provide a comprehensive overview of the CGR architecture, setting a precedent as the first cybersecurity company to openly discuss these capabilities in such depth. Additionally, we release GUIDE, the largest public collection of real-world security incidents, spanning 13M evidences across 1M incidents annotated with ground-truth triage labels by customer security analysts. This dataset represents the first large-scale cybersecurity resource of its kind, supporting the development and evaluation of guided response systems and beyond.

CCS CONCEPTS

• Computing methodologies → Machine learning; • Security and privacy; • Applied computing;

KEYWORDS

Microsoft Copilot, guided response, machine learning, cybersecurity, security operation centers

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'17, July 2017, Washington, DC, USA
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/18/06
https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION

In the rapidly evolving cybersecurity landscape, the sharp rise in threat actors has overwhelmed enterprise security operation centers (SOCs) with an unprecedented volume of incidents to triage [9]. This surge requires solutions that can either partially or fully automate the remediation process. Fully automated systems demand an exceptionally high confidence threshold (e.g., 99%) to ensure correct actions are taken and to avoid inadvertently disabling critical enterprise assets. Consequently, attaining such a high level of confidence often renders full automation impractical.

This challenge has catalyzed the development of guided response (GR) systems that support SOC analysts by facilitating informed decision-making. Extended Detection and Response (XDR) products are ideally positioned to deliver precise, context-rich guided response recommendations thanks to their comprehensive visibility across the entire enterprise security landscape. By consolidating telemetry across endpoints, network devices, cloud environments, email systems, and more, XDR systems can harness a wide array of data to provide historical context, generate detailed insights into the nature of threats, and recommend tailored remediation actions.

Guided response challenges. Scalable and accurate GR systems face several key challenges that require a combination of innovative ML system design and a deep understanding of cybersecurity:

(1) Complexity of security incidents. The extensive variety of security products, each with thousands of custom and built-in detectors, creates a complex incident landscape further compounded by a scarcity of labeled data.
(2) High precision and recall. Analysts require reliable guidance, necessitating systems that deliver high precision and recall across investigation, triaging, and remediation tasks.
(3) Scalable architecture. Generating recommendations at the million scale across terabytes of data requires a robust and scalable ML architecture.
(4) Adaptive to unique SOC preferences. The system must be able to adapt to the specific operational workflows, product configurations, and detection logic of individual SOCs.
(5) Continuous learning and improvement. To remain effective against evolving cyber threats and changes in the security product landscape, the system must continuously learn and improve autonomously.

Figure 1: Overview of the Copilot Guided Response architecture. Train Pipeline: Running weekly, this process trains grade
and action recommendation models based on historical SOC telemetry. Inference Pipeline: Running every 15 minutes, this
process generates grade, action, and similar incident recommendations for incoming incidents by leveraging the models created
in the train pipeline. Embedding Pipeline: Running every 30 minutes until 180 days of historical embeddings exist, this job
creates historical embeddings of SOC incidents for the similar incident recommendation algorithm in the inference pipeline.

1.1 Contributions

We introduce Copilot Guided Response (Fig. 1), an ML framework designed to tackle guided response at scale. Our framework makes significant contributions in the following areas:

• Copilot Guided Response (CGR). The Copilot Guided Response architecture transforms cybersecurity guided response by detailing the first geo-distributed, industry-scale framework capable of processing millions of incidents each day with a batch latency of just a few minutes. Our ML system scalably delivers three core SOC capabilities—(1) investigation, (2) triaging, and (3) remediation—seamlessly adapting to a range of scenarios, from single alerts to complex incidents involving hundreds of alerts, where each alert is categorized into one of hundreds of thousands of distinct classes, with new classes continuously added.

• Largest Cybersecurity Incident Dataset. We introduce GUIDE, the largest publicly available collection of real-world cybersecurity incidents, released under the permissive CDLA-2.0 license. This extensive dataset includes over 13 million pieces of evidence across 1.6 million alerts and 1 million incidents annotated with ground-truth triage labels by customer security analysts, making it an unparalleled resource for the development and evaluation of GR systems and beyond. By enabling researchers to study real-world data, GUIDE advances the state of cybersecurity and supports the development of next-generation ML systems.

• Extensive Evaluation of CGR. We provide a comprehensive evaluation of CGR's performance, focusing on its industrial applicability rather than comparisons with alternative designs or competing security products, which are often undisclosed. The release of the GUIDE dataset enables researchers to explore new architectures for maximum performance. Our evaluation spans internal testing, expert collaborations, and customer feedback. Internal assessments on hundreds of thousands of unseen incidents show triage models achieve 87% precision and 41% recall, while action models reach 99% precision and 62% recall. Collaboration with Microsoft security experts confirms the efficacy of our similar incident recommendations, with 94% of incidents deemed relevant, and 98% of incidents containing one or more recommendations. Customer feedback further underscores its effectiveness, with 89% of interactions rated positively.

• Impact to Microsoft Customers and Beyond. CGR is integrated into the Microsoft Defender XDR product, a leader in the market [30], and is deployed to hundreds of thousands of organizations worldwide. The introduction of CGR to Microsoft Defender XDR has significantly enhanced the operational capabilities of SOCs by streamlining the decision-making process and providing actionable insights across investigation, triaging, and remediation tasks. As a result, Microsoft Defender XDR customers benefit from a more resilient security response, fortified by adaptive, ML-driven guided responses that are tailored to the nuances of their specific security environments.

2 BACKGROUND

We provide an overview of Microsoft Copilot for Security and review literature relevant to guided response. To enhance readability, Table 1 details the terminology used in this paper.

Table 1: Terminology and definitions.

Alert — Potential security threat that was detected
Detector — A security rule or ML model that generates alerts
Entity — File, IP, or other evidence associated with an alert
Correlation — A link between two alerts based on a shared entity
Incident — Related alerts that are correlated together
SOC — Security operation centers (SOCs) protect enterprise organizations from threat actors
XDR — Extended Detection and Response (XDR) platforms are used by SOCs to protect organizations across the entire enterprise landscape

2.1 Microsoft Copilot for Security

Microsoft Copilot for Security is an AI-driven solution that enhances security professionals' workflows by offering real-time insights and recommendations across Microsoft Defender XDR, Microsoft Sentinel, and Microsoft Intune. At launch, it introduced five skills: incident summarization, script analyzer, incident report, Kusto query assistant, and guided response [8]. While the first four leverage LLMs with security-specific plugins and post-processing [31], guided response includes three machine learning sub-skills—grade recommendation, action recommendation, and similar incident recommendation—tailored to SOC preferences for precise, context-specific insights.

2.2 Guided Response

With the market introduction of Microsoft Copilot for Security Guided Response, the concept of guided response was formally defined as "machine learning capabilities to contextualize an incident and learn from previous investigations to generate appropriate response actions" [17]. Our analysis of academic and industry literature contextualizes relevant contributions within the domain of guided response into three distinct categories: (1) investigation, which suggests next steps for further analysis; (2) triaging, to determine whether an incident is a true positive, false positive, or benign positive (e.g., informational); and (3) remediation, which proposes specific response actions to contain and resolve incidents.

Investigation. Assisting security analysts in their investigation of incidents is a pivotal aspect of cybersecurity. Machine learning assisted investigation typically encompasses: (1) similar incident identification [23, 28, 46], (2) investigation assistance [14, 35, 38], and (3) playbook recommendation [5, 25, 26]. Our research focuses on similar incident recommendation, an understudied topic in the context of SOCs. While a few industry solutions offer similar incident identification capabilities through methods such as comparing and counting identical artifacts [32, 33], details regarding their methodologies and performance metrics are scarce. We address this gap by detailing the first industry-scale architecture for similar incident recommendation in SOCs.

Triaging. Incident triaging is a vital and time-intensive task typically performed by junior analysts to identify incidents requiring further investigation. This involves prioritizing incidents for deeper review [1, 2, 4, 29, 37] and filtering them based on the likelihood of being true versus false positives [6, 18, 19, 42]. While some industry solutions use ML to automate triaging [21], there is limited public information on their architecture or performance. Although some companies provide insights [12, 36, 42], these are often limited to controlled scenarios and lack critical deployment details. Our work addresses these gaps by detailing a scalable, adaptable ML triaging architecture that is deployed globally to thousands of SOCs.

Remediation. The majority of remediation research centers on intrusion response systems (IRS), designed to notify SOC analysts or dynamically respond to detected intrusions. These systems employ various decision-making models, including rule-based approaches [13], multi-objective optimization [27, 41, 45], game theory [20, 39, 47], human-in-the-loop [34], reinforcement learning [3, 24, 43], and many others [7, 22, 40] to determine optimal responses based on the system state, the nature of the attack, and countermeasure impact. Despite advancements, there is limited discussion of how to scale these systems to handle industry demands, such as managing millions of incidents, handling complex scenarios with numerous alerts, and customizing responses for SOC preferences. Our research bridges this gap, enhancing the scalability and transparency of industry-scale guided remediation systems.

3 ARCHITECTURE OVERVIEW

We detail the Copilot Guided Response (CGR) architecture, organized around three key pipelines: train, inference, and embedding, as illustrated in Figure 1. We utilize PySpark's distributed computational engine whenever possible, while reserving Python for last-mile recommendation tasks that do not have PySpark support. Below, we detail how these pipelines synergize to guide security analysts through the processes of investigating, triaging, and remediating security incidents across the enterprise landscape.

• Train pipeline (Section 4). Running weekly, this process trains the grade and action recommendation models using historical SOC telemetry to provide tailored responses. This procedure is detailed across ten steps, T1 through T10.
• Inference pipeline (Section 5). Operating every 15 minutes, this pipeline generates grade and action recommendations for incoming incidents by leveraging the models developed in the train pipeline. Additionally, it provides similar incident recommendations by matching new incidents with historically similar incidents generated in the embedding pipeline.
• Embedding pipeline (Section 6). This pipeline runs every 30 minutes until 180 days of incident embeddings are generated. These embeddings form the foundational data that allows the similar incident recommendation algorithm in the inference pipeline to effectively identify similar incidents.
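To make the incident data model concrete: per Table 1, alerts are correlated when they share an entity, and an incident is a set of correlated alerts. A minimal sketch of that grouping logic is below—purely illustrative, since the production correlation engine in Defender XDR is not disclosed at this level of detail; the function name and input layout are assumptions.

```python
from collections import defaultdict

def correlate_alerts(alerts):
    """Group alerts into incidents via shared entities (illustrative sketch).

    `alerts` maps alert_id -> set of entity identifiers (files, IPs, ...).
    Two alerts are correlated when they share an entity; an incident is a
    connected component of correlated alerts.
    """
    # Map each entity to the alerts that reference it.
    by_entity = defaultdict(set)
    for alert_id, entities in alerts.items():
        for entity in entities:
            by_entity[entity].add(alert_id)

    # Union-find over alert ids to merge correlated alerts.
    parent = {a: a for a in alerts}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for linked in by_entity.values():
        linked = sorted(linked)
        for other in linked[1:]:
            union(linked[0], other)

    # Each connected component is one incident.
    incidents = defaultdict(set)
    for a in alerts:
        incidents[find(a)].add(a)
    return sorted(map(sorted, incidents.values()))
```

Here, two alerts sharing an IP end up in one incident, while an unrelated alert forms its own.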

To adhere to privacy regulations, CGR is replicated across geographic regions, utilizing Synapse to ensure consistency and compliance. Consequently, the following sections focus on development from the perspective of a single geographic region.

Algorithm 1: Copilot Guided Response Training

Input: Alert data A, minimum cardinality c, principal components k, max incidents sampled per IncidentHash m, max incidents stored per IncidentHash s, grid search parameters G
Output: Incident embeddings I and trained models Mt, Mr

  // Feature Engineering
  A ← FeatureEngineering(A)                // T1
  A ← FeatureSpaceCompression(A, c)        // T2
  A ← OneHotEncoding(A)                    // T3
  I ← FormIncidents(A)                     // T4
  I ← SampleIncidents(I, m)                // T5
  I, A ← PCA(I, k), PCA(A, k)              // T6
  StoreEmbeddings(I, s)                    // T7
  // Model Training
  I′, A′ ← ConvertToPandas(I, A)           // T8
  Mt ← TrainTriageModel(I′, G)             // T9
  Mr ← TrainRemediationModel(A′, G)        // T9
  ValidateAndStoreModels(Mt, Mr)           // T10

4 TRAIN PIPELINE

Copilot Guided Response's training architecture is detailed across two subsections—Section 4.1, which presents the key steps for collecting and preparing the data, and Section 4.2, which discusses the model training and validation process. A step-by-step overview of the entire training process is provided in Algorithm 1.

4.1 Preprocessing

We detail the ten-step process (T1–T10) for creating the alert and incident dataframes leveraged across all three pipelines.

T1—Feature engineering. We collect alert telemetry from multiple Azure Data Lake Storage (ADLS) tables and join them into a PySpark alert dataframe. Each row in the alert dataframe contains columns for unique alert and incident identifiers, complemented by the customer-provided grade and remediation action, when available. Additionally, each row contains 5 categorical feature columns—OrganizationId, DetectorId, ProductId, Category, and Severity—along with 67 engineered numerical feature columns, developed in close collaboration with Microsoft security research experts. We retain rows that lack a customer grade or action, as these alerts can merge with other alerts to form incidents that do contain labels.

T2—Feature space compression. Before converting the categorical columns to one-hot-encoded representations, we must address the challenge of high cardinality in the DetectorId and OrgId columns. In various geographic regions, DetectorIds can exceed 100k and OrgIds can reach up to 50k, creating an extremely large and sparse feature space that often leads to failures during dimensionality reduction in the PySpark cluster. To mitigate this, we aggregate the feature space by substituting infrequent values—those associated with fewer than 10 alerts—with a generic value. This method, while resulting in some information loss, ensures the system remains within the computational boundaries of the cluster.

T3—One-hot-encoding. With the preliminary adjustments to our feature space, we can convert all 6 categorical feature columns into their one-hot-encoded (OHE) form. This transformation includes key columns such as OrgId, ProductId, and DetectorId, which allows the models to capture SOC-specific tendencies as well as product- and detector-specific characteristics that evolve over time. We bifurcate the data and establish a secondary PySpark alert dataframe that only contains alerts with remediation actions, while retaining all alerts in the original dataframe. Finally, we store the PySpark OHE pipeline in an Azure Blob Storage container so that it can be used in the inference process to transform the categorical columns.

T4—Forming incidents. To enhance our ability to make precise incident-level triaging decisions and investigation recommendations, we create a separate incident dataframe. This is achieved by aggregating alert rows based on shared IncidentIds from the alert dataframe containing all alerts, and summing their respective numerical columns. For incidents with multiple grades, the majority label is applied, with ties going to the true positive class. We remove any incidents without a triage grade at this stage.

T5—Sampling incidents. Given that incident processing steps are significantly more memory-intensive than remediation actions in the alert dataframe, we employ random sampling on the incident dataframe to mitigate out-of-memory issues during downstream processing steps. This sampling strategy involves creating a unique IncidentHash identifier for each incident by arranging the DetectorIds of an incident into an ordered list and hashing it using SHA1. We can then cap the number of incidents for each unique IncidentHash and triage grade to a maximum of 1,000.

T6—Dimensionality reduction. We independently apply principal component analysis (PCA) to both the incident and alert dataframes, each containing tens of thousands of columns. PCA is chosen for its performance and availability within PySpark's native machine learning library, which supports distributed computing and allows for efficient feature space reduction. This mitigates the risk of out-of-memory errors that occur when centralizing large dataframes on the primary PySpark node for subsequent scikit-learn model training. Our objective is to condense the feature space to k principal components that capture 95% of the original variance in each dataframe. Empirically, we find that setting k = 40 meets this requirement. The resulting PCA weights are saved in an Azure Blob Storage container, enabling reuse within the inference and embedding pipelines.

T7—Store embeddings. The final step is to save the incident embeddings to an ADLS table to enhance similar incident recommendations within the inference pipeline. A strategic product decision dictates that only the top five most similar incidents are displayed for any given incident. Consequently, we only store up to five instances of each incident, categorized by triage grade and the unique set of DetectorIds that comprise the IncidentHash.
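The IncidentHash construction in T5 can be sketched as follows. The paper specifies only "ordered list of DetectorIds, hashed with SHA1"; the separator and encoding below are illustrative assumptions, since any stable serialization works.

```python
import hashlib

def incident_hash(detector_ids):
    # Arrange the incident's DetectorIds into an ordered list and hash it
    # with SHA-1 (T5). The comma separator and UTF-8 encoding are
    # assumptions for illustration.
    ordered = ",".join(sorted(detector_ids))
    return hashlib.sha1(ordered.encode("utf-8")).hexdigest()
```

Because the DetectorIds are sorted first, two incidents that trigger the same set of detectors map to the same hash regardless of the order in which their alerts fired.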

4.2 Model Training

We train two models—a triage model to predict incident grades and an action model to determine remediation actions for incident-related alerts—across three key steps: (1) converting the PySpark dataframes to Pandas and performing a stratified train-val-test split; (2) optimizing random forest models with a grid search; and (3) validating new models against previous versions before storage. While PySpark's native MLlib is an option, our experiments show it results in a 10% decrease in macro-F1 score compared to scikit-learn due to missing core capabilities. For instance, MLlib's random forest model is limited to a depth of 30, significantly constraining its ability to capture complex patterns.

T8—Dataset formation. We begin by converting the alert and incident PySpark dataframes created in Section 4.1 into Pandas. For each Pandas dataframe, we conduct a standard 70-10-20 train, validation, and test set split of the data, stratified by grade and action labels, respectively.

T9—Training process. We select a random forest model due to its efficiency on our CPU-based PySpark infrastructure and its reliable performance on tabular data. We conduct a grid search over four key model parameters: n_estimators = {100, 200, 300, 400}, max_depth = {30, 50, 75, 100}, min_samples_split = {5, 10, 15}, and class_weight = {'balanced', None}, and select the model with the highest macro-F1 score on the validation set.

T10—Validation and model storage. Once the best model has been identified for both the triage and remediation tasks, we compare them with previous models to ensure quality between training cycles remains within a few percentage points. We do not require each model to have a higher macro-F1 score, since new detectors and security products are onboarded over time, which can cause fluctuations in performance. After validation, the models are saved to an Azure Blob Storage container for use in the inference pipeline.

5 INFERENCE PIPELINE

Building on the infrastructure in Section 4, the inference pipeline processes batched data to provide guided response recommendations (Sec 5.1). For each new or updated incident, the trained models and historical embeddings operate across three phases: triage to predict grades (Sec 5.2), investigation to find similar incidents (Sec 5.3), and remediation to recommend response actions (Sec 5.4). Recommendations dynamically update as incidents evolve and are stored in a table, ensuring rapid access for customers. Algorithm 2 provides a step-by-step overview.

Algorithm 2: Copilot Guided Response Inference

Input: Batched alert data A, max incidents stored per IncidentHash s, triage and remediation models Mt, Mr with confidence thresholds ct, cr
Output: Triage, investigation, and remediation recommendations

  // Preprocessing
  A ← FeatureEngineering(A)                // T1
  A ← FeatureSpaceCompression(A)           // T2
  A ← OneHotEncoding(A)                    // T3
  I ← FormIncidents(A)                     // T4
  I, A ← PCA(I), PCA(A)                    // T6
  StoreEmbeddings(I, s)                    // T7
  I′, A′ ← ConvertToPandas(I, A)
  // Triage Recommendations
  Rtri ← Mt(I)
  Rtri ← FilterByConfidence(Rtri, ct)
  // Investigation Recommendations
  H ← RetrieveHistoricalEmbeddings(180 days)
  Rinv ← ∅
  foreach Inew ∈ Itri do
      Rinv ← Rinv ∪ ExactHashMatch(Inew, H)
      Rinv ← Rinv ∪ CosineSimilarityMatch(Inew, H)
  Rinv ← SelectTopK(Rinv, 5)
  // Remediation Recommendations
  Rrem ← Mr(A)
  Rrem ← FilterByConfidence(Rrem, cr)
  Rrem ← IdentifyEntities(Rrem)
  Rrem ← AggregateRecommendations(Rrem, I)

5.1 Preprocessing

The initial phase of the inference pipeline prepares real-time batched alert data, retrieving the last 15 minutes of telemetry and loading it into a PySpark dataframe. These alerts are then processed using the feature space compression and one-hot encoding techniques outlined in Section 4.1. Next, we bifurcate the alert data into two distinct PySpark dataframes—one dedicated to generating remediation predictions and the other for aggregating alerts into incidents for similar incident and triage recommendations, utilizing the aggregation process described in Section 4.1. Afterwards, we apply the latest PCA models from the training pipeline to reduce the dimensionality of both dataframes to form alert and incident embeddings. The incident embeddings are stored in an ADLS table to enhance the similar incident recommendations, with a limit of five instances per incident, categorized by triage grade and DetectorId. Finally, we convert both PySpark dataframes into Pandas dataframes to facilitate subsequent triage, investigation, and remediation processes.

5.2 Triage Recommendations

We leverage the incident embeddings produced during preprocessing, along with the latest version of the triage model, to generate triage recommendations. Each incident is evaluated by the model and given a prediction of true positive (TP), false positive (FP), or benign positive (BP), the latter considered an informational incident. The confidence of each recommendation is assessed against a precision threshold of 0.9 to ensure that only reliable recommendations are sent to SOC analysts.

5.3 Investigation Recommendations

We utilize the incident embeddings produced during the preprocessing step to generate recommendations for similar incidents. This process begins with the retrieval of historical incident embeddings from our Azure Data Lake Storage (ADLS), going back up to 180 days. These embeddings capture past incidents in a vectorized format, enabling efficient comparison. The core of our approach is

to match new incidents with historically relevant incidents within 7.1 Setup
the same organization through a three-step matching process: We present results for the triage and remediation models across a
(1) Exact hash matching. We begin by identifying historical inci- sample of 12 regions, where each regional dataset is divided into
dents that share the same IncidentHash and triage recommenda- three stratified subsets: training (70%), validation (10%), and testing
tion. If less than five matches are found, we take incidents with (20%). Each run of model training job includes two key parameter
the same IncidentHash but differing triage recommendations. optimizations, performed independently for incident triage and
(2) Approximate matching with cosine similarity. If less than alert remediation embeddings:
five exact matches were found, we search for historical inci- (1) PCA component selection. We retain the top-k principal
dents based on the cosine similarity of their embeddings. This components capturing 95% of the data variance, with 𝑘 ranging
approach helps to identify incidents that share significant char- up to 100. While the optimal 𝑘 varies across runs and regions,
acteristics with the current incident. we find that 𝑘 = 40 performs well across most scenarios.
(3) Top-k similar incident selection. We select the top-k most (2) Random forest parameter tuning. We optimize key random
similar incidents, up to a maximum of five. Exact and cosine forest parameters using a grid search (see Section 4.2) We select
similarity matches are ordered, with a higher priority given to the model with the highest macro-F1 score on the validation
exact matches to ensure the most germane comparisons, and set and report precision and recall metrics, as is standard for
ties going to the most recent incident. imbalanced datasets [10, 11, 15, 16].

5.4 Remediation Recommendations Table 2 summarizes model training and performance statistics
over two weeks across select regions, including the count of unique
Using alert embeddings from preprocessing and the latest remedia-
DetectorIds, and alert/incident volumes. Regional variations are
tion model, we generate targeted response actions—contain user,
significant, with up to 31k detectors across thousands of organiza-
isolate machine, or stop virtual machine—for each alert with confi-
tions, complicating the training process. Incident size distribution
dence above a 0.9 precision threshold. The system identifies entities
(Figure 2) is long-tailed, with most incidents comprised of a few
(e.g., users, devices, VMs) using encoded rules based on security
alerts, with larger incidents representing a greater challenge for
domain knowledge, and then aggregates the individual alert rec-
triage and similar incident recommendations due to their rarity.
ommendations into comprehensive incident recommendations.
Limitations. This work focuses on presenting a scalable, unified
6 EMBEDDING PIPELINE framework for incident triaging, remediation, and similar incident
We generate historical incident embeddings that allow the similar recommendations, rather than evaluating alternative models or
incident recommendation algorithm to leverage up to 180 days of competing security products, which are often undisclosed. The
historical data when making recommendations. Due to the limita- release of the GUIDE dataset enables researchers to explore and
tions of the training pipeline in processing huge volumes of incident optimize new architectures for maximum performance. In addition,
telemetry across regions, we developed a specialized mechanism to CGR is limited to providing recommendations for existing detectors
generate historical embeddings. This ensures that our similar in- and does not address zero-day or other unmonitored attack vectors.
cident recommendation algorithm rapidly reaches comprehensive
historical coverage each time the inference pipeline is executed.
Continuous embedding generation. The embedding pipeline
operates in a continuous loop, with each iteration processing data
one day further back than the last. Leveraging the preprocessing
steps outlined in Section 5.1, we integrate a deduplication process
that loads historical incident embeddings and compares incident
hashes and triage recommendations to eliminate redundancy. We
store any IncidentHash and triage recommendation pairs from the
current batch, including those without a triage grade, provided they
do not exceed five stored embeddings—aligning with our policy of
recommending no more than five similar incidents at a time. New
incident embeddings are then saved to the ADLS table for use in
the inference pipeline. This procedure repeats until we have 180
days of historical incident embedding telemetry.
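A minimal sketch of the deduplication and cap-of-five policy described above, assuming illustrative record fields (`IncidentHash` appears in the paper; `TriageGrade` and `Embedding` are hypothetical names, not the production schema):

```python
MAX_STORED = 5  # matches the policy of recommending at most five similar incidents

def merge_day(store, batch):
    """Merge one day's incident embeddings into the historical store.

    `store` maps IncidentHash -> list of (triage_grade, embedding)
    entries. A (hash, grade) pair is kept only once, including pairs
    whose grade is None, and each hash keeps at most MAX_STORED
    embeddings.
    """
    for rec in batch:
        entries = store.setdefault(rec["IncidentHash"], [])
        # Deduplicate on the (IncidentHash, triage grade) pair.
        if any(grade == rec.get("TriageGrade") for grade, _ in entries):
            continue
        if len(entries) < MAX_STORED:
            entries.append((rec.get("TriageGrade"), rec["Embedding"]))
    return store
```

The cap mirrors the recommendation policy: since at most five similar incidents are ever shown, storing more than five embeddings per incident hash adds no value.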

7 EXPERIMENTS
We evaluate CGR's performance across three tasks: (1) triaging, assessing the model's ability to classify incidents as benign, malicious, or informational; (2) investigation, analyzing the relevance of similar incident recommendations; and (3) remediation, evaluating its ability to predict effective threat mitigation actions.

Figure 2: Sampled distribution of incident size—measured by the number of alerts per incident—exhibits a long-tailed pattern where the majority of incidents have only a few alerts.
AI-Driven Guided Response for Security Operation Centers with Microsoft Copilot for Security Conference’17, July 2017, Washington, DC, USA

Train Statistics Triage Results Remediation Results


Region # Rules # Alerts # Inc Supp % TP % FP % BP Pr Re F1 Supp % CA % ID % VM Pr Re F1
1 31k 18.4M 5.1M 49k 16 32 52 .86 .85 .85 113k 86 14 - 1 1 1
2 31k 14.5M 4.7M 46k 14 34 52 .92 .90 .91 166k 86 14 1 1 1 1
3 26k 11.9M 4.3M 97k 21 25 54 .85 .82 .84 180k 86 14 - .93 .99 .96
4 18k 9.2M 3.2M 83k 23 29 48 .87 .85 .86 82k 88 12 - 1 1 1
5 14k 6M 1.9M 99k 20 38 43 .88 .86 .86 46k 90 10 - 1 1 1
6 15k 6M 2M 116k 23 38 39 .87 .87 .87 21k 73 24 2 1 1 1
7 6.7k 2.5M 748k 138k 29 39 32 .86 .86 .86 11k 69 31 - 1 1 1
8 6.8k 2.1M 657k 139k 21 33 46 .88 .87 .88 9.8k 63 37 - 1 1 1
9 6.3k 1.7M 744k 87k 13 17 70 .90 .85 .87 12k 84 16 - .99 .99 1
10 3k 511k 250k 106k 14 38 48 .88 .86 .87 1.9k 45 55 - 1 1 1
11 798 109k 71k 23k 20 52 28 .84 .85 .84 - - - - - - -
12 579 25k 10k 8.9k 17 46 37 .87 .88 .87 270 45 55 - 1 1 1
Table 2: Left: Statistics on the number of unique DetectorIds, volume of alerts, and volume of incidents across 12 sampled
regions over two weeks. Middle: Triage model statistics, including the number of graded incidents (“Supp”), the distribution of
grades (i.e., TP, FP, and BP), and model performance metrics evaluated through macro precision, recall, and F1 score. The triage
statistics are sourced from numerous first and third party providers, with a large proportion of BP and FP incidents coming
from third party providers. Right: Remediation model statistics, including the number of alerts with an action label (“Supp”),
the distribution of contain account, isolate device, and stop virtual machine actions, and model performance metrics.
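The macro precision, recall, and F1 metrics reported in Table 2 weight each class equally. A minimal sketch of that computation from a raw confusion matrix:

```python
def macro_scores(cm):
    """Macro-averaged precision, recall, and F1 from a confusion matrix.

    `cm[i][j]` counts incidents whose actual class is i and whose
    predicted class is j. Macro averaging computes each metric per
    class and takes the unweighted mean.
    """
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        pred_k = sum(cm[i][k] for i in range(n))    # column sum: predicted as k
        actual_k = sum(cm[k][j] for j in range(n))  # row sum: actually k
        p = tp / pred_k if pred_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    mean = lambda xs: sum(xs) / len(xs)
    return mean(precisions), mean(recalls), mean(f1s)
```

As a consistency check, applying this to the Region 2 confusion matrix in Table 3 yields roughly 0.92 precision, 0.90 recall, and 0.91 F1, matching the Region 2 row of Table 2.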

7.2 GUIDE Dataset

GUIDE includes over 13M pieces of evidence across 33 entity types, encompassing 1.6M alerts and 1M incidents over a two-week period. Each incident is annotated with a ground-truth triage label by customer security analysts, along with 26k alerts containing labels of remediation actions taken by customers. The dataset is derived from Region 2 (see Table 2) and includes telemetry across 6.1k organizations, featuring 9.1k unique custom and built-in DetectorIds across numerous security products, covering 441 MITRE ATT&CK techniques [44]. We divide the dataset into a train set (70%) and a test set (30%), both available as CSV files on Kaggle. The release of GUIDE provides an unparalleled opportunity for the development of GR systems and beyond, with CGR providing a foundational baseline. Additional details on GUIDE can be found in the Appendix.

7.3 Triage

The telemetry collection period for triage ranges from 7 to 180 days, varying by region due to computational constraints in incident preprocessing. Table 2 shows that the number of SOC-graded incidents (abbreviated as "Supp" for support) ranges from 8.9k to 139k, with an imbalance among the three triage classes of 19% true positive, 35% false positive, and 46% benign positive (informational). While larger regions tend to have more graded incidents, several factors complicate the number of graded incidents reported during training: (1) larger regions with more organizations and detectors require more memory during OHE, reducing capacity for additional alerts; (2) the frequency and types of grading vary significantly across SOCs and regions; and (3) we limit the number of graded examples per IncidentHash and triage label to 1k.

Offline results. In Table 2, we present the cross-region model triage performance at the point of maximum macro-F1 score on the precision-recall curve. The results show an average macro-F1 score of 0.87, with a precision of 0.87 and a recall of 0.86. This performance demonstrates the model's ability to effectively manage the complexities of the triage task, ranging from incidents involving a single alert to those comprising hundreds of alerts across numerous detectors and security products.

To evaluate CGR's effectiveness at a more granular level, we examine performance across individual triage classifications in Region 2. The precision-recall curves in Figure 3 show that CGR consistently achieves high AUC scores across triage labels. The majority of misclassifications (81.1%) occur when incidents are categorized as BP instead of TP or FP, or vice versa—a discrepancy mainly due to the lack of a standardized definition for BP incidents, leading to classification inconsistencies across and within SOCs. However, the more critical misclassification of TP incidents as BP or FP is rare (2.4%), as shown in Table 3.

Figure 3: PR curves of triage performance in Region 2
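Selecting the operating point at the maximum F1 score on a precision-recall curve can be done with a single sweep over sorted model scores. This binary sketch is illustrative; selecting the maximum macro-F1 point over three triage classes follows the same idea.

```python
def best_f1_threshold(scores, labels):
    """Return (threshold, precision, recall, f1) at the max-F1 point.

    `scores` are model confidences and `labels` are 0/1 ground truth.
    Sorting by score lets precision and recall be updated incrementally
    as the threshold is lowered past each example.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    best = (0.0, 0.0, 0.0, 0.0)
    tp = fp = 0
    for score, label in pairs:
        tp += label
        fp += 1 - label
        p = tp / (tp + fp)
        r = tp / total_pos
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best[3]:
            best = (score, p, r, f1)
    return best
```

The same sweep also supports the coverage trade-off discussed later: raising the threshold toward a precision target (e.g., 0.9) trades recall, and therefore coverage, for correctness.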
Conference’17, July 2017, Washington, DC, USA Freitas & Kalajdjieski et al.

Actual / Predicted     TP      BP      FP
TP                    929     170      50
BP                     62   4,748      45
FP                     66     221   2,869

Table 3: Confusion matrix of triage performance in Region 2. Diagonal values reflect correct classifications, highlighting strong model performance across triage classes.

Online results. Triage recommendations are inherently dynamic, adapting as new alerts are added to incidents. However, as shown in Figure 2, since the vast majority of incidents are relatively short-lived and involve only a few alerts, only 2% of incidents undergo a change in their initial triage recommendation. Furthermore, model training pipelines across regions exhibit some bias, as not all types of incidents and alerts are graded by SOC analysts. Coupled with a heightened precision threshold to ensure that recommendations are correct 90% of the time, the triage models effectively cover 41% of incidents across regions.

7.4 Investigation

In collaboration with security experts at Microsoft, we manually evaluated our similar incident recommendations due to the absence of a definitive ground truth. For this evaluation, we randomly selected 1k incidents varying in size, detectors, products, organizations, and regions. Security researchers were then tasked with randomly selecting a similar incident recommendation for each reference incident and judging its relevance based on criteria such as shared attack patterns, indicators of compromise, and entity types.

Offline results. The assessment found that 94% of the recommended incidents were relevant, with only 2% deemed dissimilar. The quality of recommendations tends to diminish for smaller organizations with limited historical data. Similarly, larger incidents characterized by hundreds of alerts across multiple products and detectors also show decreased recommendation quality due to their rarity. To address this, a cosine similarity threshold of 0.9 was identified as the cut-off to ensure that only relevant recommendations are presented to customers. Across regions, we find that 98% of all incidents have one or more recommendations.

7.5 Remediation

We collect 180 days of telemetry, utilizing a preprocessing pipeline for alerts that is significantly simpler than the one used for incident-based triage. Unlike incidents, alerts do not require sampling due to their less computationally intensive preprocessing, allowing for a higher volume of training data, particularly in larger regions as shown in Table 2. The volume of actioned alerts varies widely across regions, from a few hundred to 180k, with a notable imbalance among the remediation classes—67% are contain account (CA), 23% isolate device (ID), and less than 1% stop virtual machine (VM). Future work to support additional remediation actions, such as quarantining files, deleting emails, and blocking IPs/URLs, is ongoing.

Offline results. We achieve an impressive average cross-region macro-F1 score of 0.99. This high score reflects the relative simplicity of predicting the appropriate remediation action for a single alert compared to the complexities of incident triaging in our dataset.

Online results. While we achieve notable offline results, the collected data does not fully capture the SOC experience. A majority of alerts are not actioned by analysts, and not all alerts fit the three predefined remediation action types. As a result, the model's coverage averages 62% across regions. However, as we integrate new remediation actions, the system's coverage will naturally increase.

8 DEPLOYMENT

Copilot Guided Response has been successfully deployed across the world, serving thousands of Microsoft Defender XDR customers since its launch in April 2024. CGR has generated millions of guided response recommendations for triage, investigation, and remediation tasks, receiving a ‘positive’ user response rate of 89%, based on the confirmation or dismissal of recommendations.

Our deployment infrastructure leverages a Synapse-based PySpark cluster, customized to each geographical region. This infrastructure includes: (a) an ADLS database ensuring both accessibility and secure management of telemetry; (b) an Azure Synapse backend that provides a robust framework for deployment; (c) an XXL PySpark pool featuring 60 executors, each equipped with 64 CPU cores and 400GB of RAM; (d) autoscaling to adjust executors based on fluctuating load; and (e) automated re-execution of failed jobs to ensure continuous coverage. Due to the absence of native support for model monitoring, versioning, and storage within Synapse, we developed a custom infrastructure to support these capabilities.

9 CONCLUSION

Copilot Guided Response (CGR) represents the first time a cybersecurity company has openly discussed an industry-scale guided response framework. CGR significantly enhances SOC operations by guiding security analysts through crucial investigation, triaging, and remediation tasks, adeptly handling everything from simple alerts to complex incidents. The performance of CGR has been rigorously evaluated through internal testing, collaboration with Microsoft security experts, and extensive customer feedback, demonstrating its effectiveness across all three tasks. Deployed globally within Microsoft Defender XDR, CGR generates millions of guided response recommendations weekly, with 89% of user interactions receiving positive feedback. In addition, we release GUIDE, the largest publicly available collection of real-world security incidents, comprising 13 million pieces of evidence across one million incidents, each annotated with ground-truth triage labels by customer security analysts. As the first resource of its kind, GUIDE sets a new standard for advancing the development and evaluation of guided response systems and beyond.

ACKNOWLEDGMENTS

We thank our colleagues who supported this research, including Shira Shacham, Yuval Derman, Oren Saban, Noa Bratman, Inbar Rotem, Omri Kantor, Itamar Karavani, Mari Mishel, Yuval Katav, Niv Zohar, Lior Camri, Leeron Luzzatto, Anna Karp, Ido Nitzan, Corina Feurstein, Ya’ara Cohen, Nadia Tkach Mendes, Gil Shmaya, Pawel Partyka, Shachaf Levy, Blake Strom, along with many others.
compared to the complexities of incident triaging in our dataset. Pawel Partyka, Shachaf Levy, Blake Strom, along with many others.

REFERENCES
[1] Khalid Alsubhi, Issam Aib, and Raouf Boutaba. 2012. FuzMet: A fuzzy-logic based alert prioritization engine for intrusion detection systems. International Journal of Network Management 22, 4 (2012), 263–284.
[2] Khalid Alsubhi, Ehab Al-Shaer, and Raouf Boutaba. 2008. Alert prioritization in intrusion detection systems. In NOMS 2008-2008 IEEE Network Operations and Management Symposium. IEEE, 33–40.
[3] Hilala Alturkistani and Mohammed A El-Affendi. 2022. Optimizing cybersecurity incident response decisions using deep reinforcement learning. International Journal of Electrical and Computer Engineering 12, 6 (2022), 6768.
[4] Muhamad Erza Aminanto, Tao Ban, Ryoichi Isawa, Takeshi Takahashi, and Daisuke Inoue. 2020. Threat alert prioritization using isolation forest and stacked auto encoder with day-forward-chaining analysis. IEEE Access 8 (2020), 217977–217986.
[5] Andy Applebaum, Shawn Johnson, Michael Limiero, and Michael Smith. 2018. Playbook oriented cyber response. In 2018 National Cyber Summit (NCS). IEEE, 8–15.
[6] Tao Ban, Ndichu Samuel, Takeshi Takahashi, and Daisuke Inoue. 2021. Combat security alert fatigue with ai-assisted techniques. In Proceedings of the 14th Cyber Security Experimentation and Test Workshop. 9–16.
[7] May Bashendy, Ashraf Tantawy, and Abdelkarim Erradi. 2023. Intrusion response systems for cyber-physical systems: A comprehensive survey. Computers & Security 124 (2023), 102984.
[8] Joey Caparas, Samantha Robertson, Aditi Srivastava, Denise Vangel, Stephanie Savell, and Macky Cruz. 2024. What is Microsoft Copilot for Security?
[9] Forrester Consulting. 2020. The 2020 State Of Security Operations.
[10] Rahul Duggal, Scott Freitas, Sunny Dhamnani, Duen Horng Chau, and Jimeng Sun. 2021. Har: Hardness aware reweighting for imbalanced datasets. In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 735–745.
[11] Rahul Duggal, Scott Freitas, Cao Xiao, Duen Horng Chau, and Jimeng Sun. 2020. REST: Robust and Efficient Neural Networks for Sleep Monitoring in the Wild. In Proceedings of The Web Conference 2020. 1704–1714.
[12] Charles Feng, Shuning Wu, and Ningwei Liu. 2017. A user-centric machine learning framework for cyber security operations center. In 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE, 173–175.
[13] Bingrui Foo, Y-S Wu, Y-C Mao, Saurabh Bagchi, and Eugene Spafford. 2005. ADEPTS: Adaptive intrusion response using attack graphs in an e-commerce environment. In 2005 International Conference on Dependable Systems and Networks (DSN'05). IEEE, 508–517.
[14] Muriel Figueredo Franco, Bruno Rodrigues, Eder John Scheid, Arthur Jacobs, Christian Killer, Lisandro Zambenedetti Granville, and Burkhard Stiller. 2020. SecBot: a business-driven conversational agent for cybersecurity planning and management. In 2020 16th international conference on network and service management (CNSM). IEEE, 1–7.
[15] Scott Freitas, Yuxiao Dong, Joshua Neil, and Duen Horng Chau. 2021. A Large-Scale Database for Graph Representation Learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[16] Scott Freitas, Rahul Duggal, and Duen Horng Chau. 2022. MalNet: A large-scale image database of malicious software. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3948–3952.
[17] Dianne Gali, Samantha Robertson, Daniel Simpson, and Stephanie Savell. 2024. Triage and investigate incidents with guided responses from Microsoft Copilot in Microsoft Defender.
[18] Wajih Ul Hassan, Adam Bates, and Daniel Marino. 2020. Tactical provenance analysis for endpoint detection and response systems. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1172–1189.
[19] Wajih Ul Hassan, Shengjian Guo, Ding Li, Zhengzhang Chen, Kangkook Jee, Zhichun Li, and Adam Bates. 2019. Nodoze: Combatting threat alert fatigue with automated provenance triage. In Network and Distributed Systems Security Symposium.
[20] Shuang Huang, Chun-Jie Zhou, Shuang-Hua Yang, and Yuan-Qing Qin. 2015. Cyber-physical system security for networked industrial processes. International Journal of Automation and Computing 12, 6 (2015), 567–578.
[21] IBM. 2020. IBM Automated Cyber Threat Triage and Response. https://events.afcea.org/afceacyber20/CUSTOM/pdf/IBM%20Automated%20Cyber%20Threat%20Triage%20and%20Response%20Solution%20Datasheet.pdf
[22] Zakira Inayat, Abdullah Gani, Nor Badrul Anuar, Muhammad Khurram Khan, and Shahid Anwar. 2016. Intrusion response systems: Foundations, design, and challenges. Journal of Network and Computer Applications 62 (2016), 53–74.
[23] Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, et al. 2024. Xpert: Empowering incident management with query recommendations via large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
[24] Joseph Khoury and Mohamed Nassar. 2020. A hybrid game theory and reinforcement learning approach for cyber-physical systems security. In NOMS 2020-2020 IEEE/IFIP Network Operations and Management Symposium. IEEE, 1–9.
[25] Irina Kraeva and Gulnara Yakhyaeva. 2021. Application of the metric learning for security incident playbook recommendation. In 2021 IEEE 22nd International Conference of Young Professionals in Electron Devices and Materials (EDM). IEEE, 475–479.
[26] Ryuta Kremer, Prasanna N Wudali, Satoru Momiyama, Toshinori Araki, Jun Furukawa, Yuval Elovici, and Asaf Shabtai. 2023. IC-SECURE: Intelligent System for Assisting Security Experts in Generating Playbooks for Automated Incident Response. arXiv preprint arXiv:2311.03825 (2023).
[27] Xuan Li, Chunjie Zhou, Yu-Chu Tian, and Yuanqing Qin. 2018. A dynamic decision-making approach for intrusion response in industrial control systems. IEEE Transactions on Industrial Informatics 15, 5 (2018), 2544–2554.
[28] Tao Lin. 2018. A Data Triage Retrieval System for Cyber Security Operations Center. (2018).
[29] Yushan Liu, Xiaokui Shu, Yixin Sun, Jiyong Jang, and Prateek Mittal. 2022. RAPID: Real-Time Alert Investigation with Context-aware Prioritization for Efficient Threat Discovery. In Proceedings of the 38th Annual Computer Security Applications Conference. 827–840.
[30] Allie Mellen, Joseph Blankenship, Sarah Morana, and Michael Belden. 2024. The Forrester Wave™: Extended Detection And Response Platforms, Q2 2024.
[31] Microsoft. 2024. Microsoft Copilot for Security.
[32] Palo Alto Networks. 2024. Cortex Related Incidents. https://docs-cortex.paloaltonetworks.com/r/Cortex-XSOAR/6.5/Cortex-XSOAR-Administrator-Guide/Manage-Related-Incidents
[33] Palo Alto Networks. 2024. Cortex XSOAR Incident Similarity. https://xsoar.pan.dev/docs/reference/scripts/d-bot-find-similar-incidents
[34] Palo Alto Networks. 2024. Suggested Remediation. https://docs-cortex.paloaltonetworks.com/r/Cortex-XDR/Cortex-XDR-Pro-Administrator-Guide/Remediate-Changes-from-Malicious-Activity
[35] Tien N Nguyen and Raymond Choo. 2021. Human-in-the-loop xai-enabled vulnerability detection, investigation, and mitigation. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1210–1212.
[36] Jonathan Oliver, Raghav Batta, Adam Bates, Muhammad Adil Inam, Shelly Mehta, and Shugao Xia. 2024. Carbon Filter: Real-time Alert Triage Using Large Scale Clustering and Fast Search. arXiv preprint arXiv:2405.04691 (2024).
[37] Alina Oprea, Zhou Li, Robin Norris, and Kevin Bowers. 2018. Made: Security analytics for enterprise threat detection. In Proceedings of the 34th Annual Computer Security Applications Conference. 124–136.
[38] Vihanga Heshan Perera, Amila Nuwan Senarathne, and Lakmal Rupasinghe. 2019. Intelligent soc chatbot for security operation center. In 2019 International Conference on Advancements in Computing (ICAC). IEEE, 340–345.
[39] Yuanqing Qin, Qi Zhang, Chunjie Zhou, and Naixue Xiong. 2018. A risk-based dynamic decision-making approach for cybersecurity protection in industrial control systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems 50, 10 (2018), 3863–3870.
[40] Alireza Shameli-Sendi, Naser Ezzati-Jivan, Masoume Jabbarifar, and Michel Dagenais. 2012. Intrusion response systems: survey and taxonomy. Int. J. Comput. Sci. Netw. Secur 12, 1 (2012), 1–14.
[41] Alireza Shameli-Sendi, Habib Louafi, Wenbo He, and Mohamed Cheriet. 2016. Dynamic optimal countermeasure selection for intrusion response system. IEEE Transactions on Dependable and Secure Computing 15, 5 (2016), 755–770.
[42] Awalin Sopan, Matthew Berninger, Murali Mulakaluri, and Raj Katakam. 2018. Building a Machine Learning Model for the SOC, by the Input from the SOC, and Analyzing it for the SOC. In 2018 IEEE Symposium on Visualization for Cyber Security (VizSec). IEEE, 1–8.
[43] Zheni S Stefanova and Kandethody M Ramachandran. 2018. Off-policy q-learning technique for intrusion response in network security. World Academy of Science, Engineering and Technology, International Science Index 136 (2018), 262–268.
[44] Blake E Strom, Andy Applebaum, Doug P Miller, Kathryn C Nickels, Adam G Pennington, and Cody B Thomas. 2018. Mitre att&ck: Design and philosophy. In Technical report. The MITRE Corporation.
[45] Thomas Toth and Christopher Kruegel. 2002. Evaluating the impact of automated intrusion response mechanisms. In 18th Annual Computer Security Applications Conference, 2002. Proceedings. IEEE, 301–310.
[46] Chen Zhong, Tao Lin, Peng Liu, John Yen, and Kai Chen. 2018. A cyber security data triage operation retrieval system. Computers & Security 76 (2018), 12–31.
[47] Saman A Zonouz, Himanshu Khurana, William H Sanders, and Timothy M Yardley. 2013. RRE: A game-theoretic intrusion response and recovery engine. IEEE Transactions on Parallel and Distributed Systems 25, 2 (2013), 395–406.

APPENDIX

A Dataset Overview

GUIDE represents the largest publicly available collection of real-world cybersecurity incidents, containing over 13 million pieces of evidence across 1.6 million alerts and 1M annotated incidents. Under the permissive CDLA-2.0 license, this dataset offers a unique opportunity for researchers and practitioners to develop and benchmark advanced ML models on authentic and comprehensive security telemetry. See Table 4 for a description of each dataset field.

We provide three hierarchies of data: (1) evidence, (2) alert, and (3) incident. At the bottom level, evidence supports an alert. For example, an alert may be associated with multiple pieces of evidence such as an IP address, email, and user details, each containing specific supporting metadata. Above that, we have alerts that consolidate multiple pieces of evidence to signify a potential security incident. These alerts provide a broader context by aggregating related evidence to present a more comprehensive picture of the potential threat. At the highest level, incidents encompass one or more alerts, representing a cohesive narrative of a security breach or threat scenario.

B Benchmarking

With the release of GUIDE, we aim to establish a standardized benchmark for guided response systems using real-world data. The primary objective of the dataset is to accurately predict incident triage grades—true positive (TP), benign positive (BP), and false positive (FP)—based on historical customer responses. To support this, we provide a training dataset containing 45 features, labels, and unique identifiers across 1M triage-annotated incidents. We divide the dataset into a train set containing 70% of the data and a test set with 30%, stratified based on triage grade ground-truth, OrgId, and DetectorId. We ensure that incidents are stratified together within the train and test sets to ensure the relevance of evidence and alert rows. The CSV files are hosted on Kaggle: https://www.kaggle.com/datasets/Microsoft/microsoft-security-incident-prediction

A secondary objective of GUIDE is to benchmark the remediation capabilities of guided response systems. To this end, we release 26k ground-truth labels for predicting remediation actions for alerts, available at both granular and aggregate levels. The recommended metric for evaluating research using the GUIDE dataset is macro-F1 score, along with details on precision and recall.

C Privacy

To ensure privacy, we implement a stringent anonymization process. Initially, sensitive values are pseudo-anonymized using SHA1 hashing techniques. This step ensures that unique identifiers are obfuscated while maintaining their uniqueness for consistency across the dataset. Following this, we replace these hashed values with randomly generated IDs to further enhance anonymity and prevent any potential re-identification. Additionally, we introduce noise to the timestamps, ensuring that the temporal aspects of the data cannot be traced back to specific events. This multi-layered approach, combining pseudo-anonymization and randomization, safeguards the privacy of all entities involved while maintaining the integrity and utility of the dataset for research and development purposes.

Table 4: Description of each column in the GUIDE dataset

Id: Unique ID for each OrgId-IncidentId pair
OrgId: Organization identifier
IncidentId: Organizationally unique incident identifier
AlertId: Unique identifier for an alert
Timestamp: Time the alert was created
DetectorId: Unique ID for the alert generating detector
AlertTitle: Title of the alert
Category: Category of the alert
MitreTechniques: MITRE ATT&CK techniques involved in alert
IncidentGrade: SOC grade assigned to the incident
ActionGrouped: SOC alert remediation action (high level)
ActionGranular: SOC alert remediation action (fine-grain)
EntityType: Type of entity involved in the alert
EvidenceRole: Role of the evidence in the investigation
Roles: Additional metadata on evidence role in alert
DeviceId: Unique identifier for the device
DeviceName: Name of the device
Sha256: SHA-256 hash of the file
IpAddress: IP address involved
Url: URL involved
AccountSid: On-premises account identifier
AccountUpn: Email account identifier
AccountObjectId: Entra ID account identifier
AccountName: Name of the on-premises account
NetworkMessageId: Org-level identifier for email message
EmailClusterId: Unique identifier for the email cluster
RegistryKey: Registry key involved
RegistryValueName: Name of the registry value
RegistryValueData: Data of the registry value
ApplicationId: Unique identifier for the application
ApplicationName: Name of the application
OAuthApplicationId: OAuth application identifier
ThreatFamily: Malware family associated with a file
FileName: Name of the file
FolderPath: Path of the file folder
ResourceIdName: Name of the Azure resource
ResourceType: Type of Azure resource
OSFamily: Family of the operating system
OSVersion: Version of the operating system
AntispamDirection: Direction of the antispam filter
SuspicionLevel: Level of suspicion
LastVerdict: Final verdict of threat analysis
CountryCode: Country code the evidence appears in
State: State the evidence appears in
City: City the evidence appears in
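The multi-layered anonymization described in Appendix C can be sketched as follows; the field names, ID space, and noise magnitude are illustrative assumptions, not the exact production process.

```python
import hashlib
import random

def anonymize(records, noise_seconds=3600, seed=0):
    """Sketch of the Appendix C anonymization: SHA1 pseudo-anonymization,
    replacement of hashed values with random IDs, and timestamp noise.

    The same sensitive value always maps to the same random ID, which
    preserves cross-record consistency while hiding the original value.
    """
    rng = random.Random(seed)
    id_map = {}  # SHA1 digest -> random replacement ID
    out = []
    for rec in records:
        # Step 1: pseudo-anonymize the sensitive value with SHA1.
        digest = hashlib.sha1(rec["AccountName"].encode()).hexdigest()
        # Step 2: swap the digest for a randomly generated ID.
        if digest not in id_map:
            id_map[digest] = rng.randrange(10**9)
        out.append({
            "AccountName": id_map[digest],
            # Step 3: jitter the timestamp so events cannot be traced exactly.
            "Timestamp": rec["Timestamp"] + rng.randint(-noise_seconds, noise_seconds),
        })
    return out
```

Keeping the digest-to-ID map dataset-wide is what lets anonymized records still join on the same (hidden) entity, preserving utility for research.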
