Figure 1: Overview of the Copilot Guided Response architecture. Train Pipeline: Running weekly, this process trains grade
and action recommendation models based on historical SOC telemetry. Inference Pipeline: Running every 15 minutes, this
process generates grade, action, and similar incident recommendations for incoming incidents by leveraging the models created
in the train pipeline. Embedding Pipeline: Running every 30 minutes until 180 days of historical embeddings exist, this job
creates historical embeddings of SOC incidents for the similar incident recommendation algorithm in the inference pipeline.
To adhere to privacy regulations, CGR is replicated across geographic regions, utilizing Synapse to ensure consistency and compliance. Consequently, the following sections focus on development from the perspective of a single geographic region.

4 TRAIN PIPELINE
Copilot Guided Response's training architecture is detailed across two subsections: Section 4.1 presents the key steps for collecting and preparing the data, and Section 4.2 discusses the model training and validation process. A step-by-step overview of the entire training process is provided in Algorithm 1.

Algorithm 1: Copilot Guided Response Training
Input: Alert data A, minimum cardinality c, principal components k, max incidents sampled per IncidentHash m, max incidents stored per IncidentHash s, grid search parameters G
Output: Incident embeddings I & trained models Mt, Mr
Feature Engineering
    A ← FeatureEngineering(A)            ; // T1
    A ← FeatureSpaceCompression(A, c)    ; // T2
    A ← OneHotEncoding(A)                ; // T3
    I ← FormIncidents(A)                 ; // T4
    I ← SampleIncidents(I, m)            ; // T5
    I, A ← PCA(I, k), PCA(A, k)          ; // T6
    StoreEmbeddings(I, s)                ; // T7
Model Training
    I′, A′ ← ConvertToPandas(I, A)       ; // T8
    Mt ← TrainTriageModel(I′, G)         ; // T9
    Mr ← TrainRemediationModel(A′, G)    ; // T9
    ValidateAndStoreModels(Mt, Mr)       ; // T10

4.1 Preprocessing
We detail the ten-step process (T1–T10) for creating the alert and incident dataframes leveraged across all three pipelines.
T1—Feature engineering. We collect alert telemetry from multiple Azure Data Lake Storage (ADLS) tables and join them into a PySpark alert dataframe. Each row in the alert dataframe contains columns for unique alert and incident identifiers, complemented by customer-provided grade and remediation action, when available. Additionally, each row contains 5 categorical feature columns—OrganizationId, DetectorId, ProductId, Category, and Severity—along with 67 engineered numerical feature columns, developed in close collaboration with Microsoft security research experts. We retain rows that lack a customer grade or action, as these alerts can merge with other alerts to form incidents that do contain labels.
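For illustration, a minimal PySpark sketch of this join is shown below; the ADLS paths, file format, and join keys are assumptions for illustration, not the production schema.

```python
# Hypothetical table locations and keys; shown only to make the T1 join concrete.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

alerts = spark.read.parquet("abfss://telemetry@lake.dfs.core.windows.net/alerts")
grades = spark.read.parquet("abfss://telemetry@lake.dfs.core.windows.net/grades")

# A left join keeps ungraded alerts, since they can still merge into labeled incidents (T4).
alert_df = alerts.join(grades, on=["AlertId", "IncidentId"], how="left")
```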
T2—Feature space compression. Before converting the categorical columns to one-hot-encoded representations, we must address the challenge of high cardinality in the DetectorId and OrgId columns. In various geographic regions, DetectorIds can exceed 100k and OrgIds can reach up to 50k, creating an extremely large and sparse feature space that often leads to failures during dimensionality reduction in the PySpark cluster. To mitigate this, we aggregate the feature space by substituting infrequent values—those associated with fewer than 10 alerts—with a generic value. This method, while resulting in some information loss, ensures the system remains within the computational boundaries of the cluster.
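A minimal PySpark sketch of this compression, assuming the alert_df from the T1 example, might look as follows; the 10-alert threshold comes from the text, while the helper name and generic token are illustrative.

```python
from pyspark.sql import functions as F

def compress_column(df, col, min_count=10, other="RARE"):
    # Values appearing on at least min_count alerts are kept; the rest become `other`.
    frequent = (df.groupBy(col).count()
                  .filter(F.col("count") >= min_count)
                  .select(F.col(col).alias(f"{col}_keep")))
    # A null after the left join marks an infrequent value.
    return (df.join(frequent, df[col] == frequent[f"{col}_keep"], "left")
              .withColumn(col, F.when(F.col(f"{col}_keep").isNull(), other)
                               .otherwise(F.col(col)))
              .drop(f"{col}_keep"))

for c in ["DetectorId", "OrgId"]:
    alert_df = compress_column(alert_df, c)
```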
T3—One-hot-encoding. With the preliminary adjustments to our feature space, we can convert all categorical feature columns into their one-hot-encoded (OHE) form. This transformation includes key columns such as OrgId, ProductId, and DetectorId, which allows the models to capture SOC-specific tendencies as well as product- and detector-specific characteristics that evolve over time. We bifurcate the data and establish a secondary PySpark alert dataframe that only contains alerts with remediation actions, while retaining all alerts in the original dataframe. Finally, we store the PySpark OHE pipeline in an Azure Blob Storage container so that it can be used in the inference process to transform the categorical columns.
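A minimal sketch of this step, assuming the compressed alert_df from the previous examples and the categorical columns named in T1, is shown below; the storage path is a placeholder.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

cat_cols = ["OrgId", "DetectorId", "ProductId", "Category", "Severity"]
stages = []
for c in cat_cols:
    # Index each categorical column, then expand the index into an OHE vector.
    stages += [StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep"),
               OneHotEncoder(inputCol=f"{c}_idx", outputCol=f"{c}_ohe")]

ohe_model = Pipeline(stages=stages).fit(alert_df)
alert_df = ohe_model.transform(alert_df)

# Persist the fitted pipeline so inference applies identical transforms (placeholder path).
ohe_model.write().overwrite().save("wasbs://models@account.blob.core.windows.net/ohe_pipeline")
```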
T4—Forming incidents. To enhance our ability to make precise incident-level triaging decisions and investigation recommendations, we create a separate incident dataframe. This is achieved by aggregating alert rows based on shared IncidentIds from the alert dataframe containing all alerts, and summing their respective numerical columns. For incidents with multiple grades, the majority label is applied, with ties going to the true positive class. We remove any incidents without a triage grade at this stage.

T5—Sampling incidents. Given that incident processing steps are significantly more memory-intensive than remediation actions in the alert dataframe, we employ random sampling on the incident dataframe to mitigate out-of-memory issues during downstream processing steps. This sampling strategy involves creating a unique IncidentHash identifier for each incident by arranging the DetectorIds of an incident into an ordered list and hashing it using SHA1. We can then cap the number of incidents for each unique IncidentHash and triage grade to a maximum of 1,000.

T6—Dimensionality reduction. We independently apply principal component analysis (PCA) to both the incident and alert dataframes, each containing tens of thousands of columns. PCA is chosen for its performance and availability within PySpark's native machine learning library, which supports distributed computing and allows for efficient feature space reduction. This mitigates the risk of out-of-memory errors that occur when centralizing large dataframes on the primary PySpark node for subsequent scikit-learn model training. Our objective is to condense the feature space to k principal components that capture 95% of the original variance in each dataframe. Empirically, we find that setting k = 40 meets this requirement. The resulting PCA weights are saved in an Azure Blob Storage container, enabling reuse within the inference and embedding pipelines.

T7—Store embeddings. The final step is to save the incident embeddings to an ADLS table to enhance similar incident recommendations within the inference pipeline. A strategic product decision dictates that only the top five most similar incidents are displayed for any given incident. Consequently, we only store up to five instances of each incident, categorized by triage grade and the unique set of DetectorIds that comprise the IncidentHash.
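To make the T5 hashing and capping concrete, a minimal PySpark sketch follows, assuming the alert_df and incident_df built in T1–T4; the Grade column name is an assumption.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# SHA1 over the ordered list of an incident's DetectorIds yields the IncidentHash.
hashes = (alert_df.groupBy("IncidentId")
          .agg(F.sha1(F.concat_ws(",", F.sort_array(F.collect_set("DetectorId"))))
                .alias("IncidentHash")))
incident_df = incident_df.join(hashes, on="IncidentId")

# Randomly cap each (IncidentHash, triage grade) group at m = 1,000 incidents.
w = Window.partitionBy("IncidentHash", "Grade").orderBy(F.rand())
incident_df = (incident_df.withColumn("rn", F.row_number().over(w))
                          .filter(F.col("rn") <= 1000)
                          .drop("rn"))
```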
to match new incidents with historically relevant incidents within the same organization through a three-step matching process (sketched in code after the list):
(1) Exact hash matching. We begin by identifying historical incidents that share the same IncidentHash and triage recommendation. If fewer than five matches are found, we take incidents with the same IncidentHash but differing triage recommendations.
(2) Approximate matching with cosine similarity. If fewer than five exact matches were found, we search for historical incidents based on the cosine similarity of their embeddings. This approach helps to identify incidents that share significant characteristics with the current incident.
(3) Top-k similar incident selection. We select the top-k most similar incidents, up to a maximum of five. Exact and cosine similarity matches are ordered, with a higher priority given to exact matches to ensure the most germane comparisons, and ties going to the most recent incident.
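A compact sketch of this matching logic is shown below; it assumes each historical record carries its hash, grade, timestamp, and PCA embedding, and uses the 0.9 cosine cut-off discussed in Section 7.4.

```python
import numpy as np

def recommend_similar(incident, history, k=5, threshold=0.9):
    # 1) Exact IncidentHash matches: same triage grade first, then differing grades,
    #    most recent first within each group.
    exact = [h for h in history if h["hash"] == incident["hash"]]
    exact.sort(key=lambda h: (h["grade"] != incident["grade"], -h["timestamp"]))
    if len(exact) >= k:
        return exact[:k]
    # 2) Approximate matches by cosine similarity of embeddings.
    q = incident["emb"] / np.linalg.norm(incident["emb"])
    rest = [h for h in history if h["hash"] != incident["hash"]]
    for h in rest:
        h["sim"] = float(q @ (h["emb"] / np.linalg.norm(h["emb"])))
    approx = sorted((h for h in rest if h["sim"] >= threshold),
                    key=lambda h: (-h["sim"], -h["timestamp"]))
    # 3) Exact matches outrank approximate ones; ties go to the most recent incident.
    return (exact + approx)[:k]
```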
5.4 Remediation Recommendations
Using alert embeddings from preprocessing and the latest remediation model, we generate targeted response actions—contain user, isolate machine, or stop virtual machine—for each alert with confidence above a 0.9 precision threshold. The system identifies entities (e.g., users, devices, VMs) using encoded rules based on security domain knowledge, and then aggregates the individual alert recommendations into comprehensive incident recommendations.
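A simplified pandas sketch of this thresholding and roll-up is given below; the column names and score column are illustrative assumptions.

```python
import pandas as pd

def incident_recommendations(alerts: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    # Keep alert-level actions whose score clears the precision-calibrated threshold.
    confident = alerts[alerts["ActionScore"] >= threshold]
    # One recommendation per (incident, action, entity), e.g. ("contain user", "alice").
    return (confident.groupby(["IncidentId", "Action", "Entity"], as_index=False)
                     .agg(MaxScore=("ActionScore", "max")))
```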
6 EMBEDDING PIPELINE
We generate historical incident embeddings that allow the similar incident recommendation algorithm to leverage up to 180 days of historical data when making recommendations. Due to the limitations of the training pipeline in processing huge volumes of incident telemetry across regions, we developed a specialized mechanism to generate historical embeddings. This ensures that our similar incident recommendation algorithm rapidly reaches comprehensive historical coverage each time the inference pipeline is executed.
Continuous embedding generation. The embedding pipeline
operates in a continuous loop, with each iteration processing data
one day further back than the last. Leveraging the preprocessing
steps outlined in Section 5.1, we integrate a deduplication process
that loads historical incident embeddings and compares incident
hashes and triage recommendations to eliminate redundancy. We
store any IncidentHash and triage recommendation pairs from the
current batch, including those without a triage grade, provided they
do not exceed five stored embeddings—aligning with our policy of
recommending no more than five similar incidents at a time. New
incident embeddings are then saved to the ADLS table for use in
the inference pipeline. This procedure repeats until we have 180
days of historical incident embedding telemetry.
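The loop can be pictured with the following schematic, where load_day_embeddings and append_to_adls are hypothetical stand-ins for the Section 5.1 preprocessing and the ADLS write.

```python
from collections import defaultdict
from datetime import date, timedelta

MAX_PER_KEY, HORIZON_DAYS = 5, 180  # five stored embeddings per key, 180-day lookback

def backfill(load_day_embeddings, append_to_adls, today=None):
    today = today or date.today()
    stored = defaultdict(int)  # (IncidentHash, triage grade) -> embeddings kept so far
    for offset in range(1, HORIZON_DAYS + 1):
        # Each iteration processes data one day further back than the last.
        for row in load_day_embeddings(today - timedelta(days=offset)):
            key = (row["incident_hash"], row["triage_grade"])  # grade may be None
            if stored[key] < MAX_PER_KEY:  # deduplicate against what is already stored
                append_to_adls(row)
                stored[key] += 1
```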
7 EXPERIMENTS
We evaluate CGR's performance across three tasks: (1) triaging, assessing the model's ability to classify incidents as benign, malicious, or informational; (2) investigation, analyzing the relevance of similar incident recommendations; and (3) remediation, evaluating its ability to predict effective threat mitigation actions.

7.1 Setup
We present results for the triage and remediation models across a sample of 12 regions, where each regional dataset is divided into three stratified subsets: training (70%), validation (10%), and testing (20%). Each run of the model training job includes two key parameter optimizations, performed independently for the incident triage and alert remediation embeddings:
(1) PCA component selection. We retain the top-k principal components capturing 95% of the data variance, with k ranging up to 100. While the optimal k varies across runs and regions, we find that k = 40 performs well across most scenarios.
(2) Random forest parameter tuning. We optimize key random forest parameters using a grid search (see Section 4.2). We select the model with the highest macro-F1 score on the validation set and report precision and recall metrics, as is standard for imbalanced datasets [10, 11, 15, 16]. A sketch of this selection protocol is shown after this list.
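In the sketch below, the parameter grid is an assumption for illustration rather than the actual Section 4.2 grid, and the splits come from the stratified 70/10/20 partition.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import ParameterGrid

def select_model(X_train, y_train, X_val, y_val, X_test, y_test):
    grid = {"n_estimators": [100, 200], "max_depth": [None, 20], "min_samples_leaf": [1, 5]}
    best_f1, best_model = -1.0, None
    for params in ParameterGrid(grid):
        model = RandomForestClassifier(**params, n_jobs=-1).fit(X_train, y_train)
        # Keep the candidate with the highest macro-F1 on the validation split.
        f1 = f1_score(y_val, model.predict(X_val), average="macro")
        if f1 > best_f1:
            best_f1, best_model = f1, model
    # Report per-class precision and recall on the held-out test split.
    print(classification_report(y_test, best_model.predict(X_test)))
    return best_model
```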
Table 2 summarizes model training and performance statistics over two weeks across select regions, including the count of unique DetectorIds and alert/incident volumes. Regional variations are significant, with up to 31k detectors across thousands of organizations, complicating the training process. The incident size distribution (Figure 2) is long-tailed, with most incidents comprising only a few alerts; larger incidents represent a greater challenge for triage and similar incident recommendations due to their rarity.

Figure 2: Sampled distribution of incident size—measured by the number of alerts per incident—exhibits a long-tailed pattern where the majority of incidents have only a few alerts.

Limitations. This work focuses on presenting a scalable, unified framework for incident triaging, remediation, and similar incident recommendations, rather than evaluating alternative models or competing security products, which are often undisclosed. The release of the GUIDE dataset enables researchers to explore and optimize new architectures for maximum performance. In addition, CGR is limited to providing recommendations for existing detectors and does not address zero-day or other unmonitored attack vectors.

7.2 GUIDE Dataset
GUIDE includes over 13M pieces of evidence across 33 entity types, encompassing 1.6M alerts and 1M incidents over a two-week period. Each incident is annotated with a ground-truth triage label by customer security analysts, along with 26k alerts containing labels of remediation actions taken by customers. The dataset is derived from Region 2 (see Table 2) and includes telemetry across 6.1k organizations, featuring 9.1k unique custom and built-in DetectorIds across numerous security products, covering 441 MITRE ATT&CK techniques [44]. We divide the dataset into a train (70%) and test set (30%), both available as CSV files on Kaggle. The release of GUIDE provides an unparalleled opportunity for the development of GR systems and beyond, with CGR providing a foundational baseline. Additional details on GUIDE can be found in the Appendix.
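Loading the splits is a one-liner in pandas; the file and label-column names below are assumptions about the Kaggle release, so check the dataset card for the exact schema.

```python
import pandas as pd

train = pd.read_csv("GUIDE_Train.csv")  # 70% split (hypothetical file name)
test = pd.read_csv("GUIDE_Test.csv")    # 30% split (hypothetical file name)
print(train["IncidentGrade"].value_counts())  # assumed ground-truth triage label column
```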
7.3 Triage
The telemetry collection period for triage ranges from 7 to 180 days, varying by region due to computational constraints in incident preprocessing. Table 2 shows that the number of SOC-graded incidents (abbreviated as "Supp" for support) ranges from 8.9k to 139k, with an imbalance among the three triage classes of 19% true positive, 35% false positive, and 46% benign positive (informational). While larger regions tend to have more graded incidents, several factors complicate the number of graded incidents reported during training: (1) larger regions with more organizations and detectors require more memory during OHE, reducing capacity for additional alerts; (2) the frequency and types of grading vary significantly across SOCs and regions; and (3) we limit the number of graded examples per IncidentHash and triage label to 1k.
Offline results. We achieve an average cross-region macro-F1 score of 0.87, with a precision of 0.87 and a recall of 0.86. This performance demonstrates the model's ability to effectively manage the complexities of the triage task, ranging from incidents involving a single alert to those comprising hundreds of alerts across numerous detectors and security products.

To evaluate CGR's effectiveness at a more granular level, we examine performance across individual triage classifications in Region 2. The precision-recall curves in Figure 3 show that CGR consistently achieves high AUC scores across triage labels. The majority of misclassifications (81.1%) occur when incidents are categorized as BP instead of TP or FP, or vice versa—a discrepancy mainly due to the lack of a standardized definition for BP incidents, leading to classification inconsistencies across and within SOCs. However, the more critical misclassification of TP incidents as BP or FP is rare (2.4%), as shown in Table 3.

Actual / Predicted    TP     BP       FP
TP                    929    170      50
BP                    62     4,748    45
FP                    66     221      2,869

Table 3: Confusion matrix of triage performance in Region 2. Diagonal values reflect correct classifications, highlighting strong model performance across triage classes.
Online results. Triage recommendations are inherently dynamic, adapting as new alerts are added to incidents. However, as shown in Figure 2, since the vast majority of incidents are relatively short-lived and involve only a few alerts, only 2% of incidents undergo a change in their initial triage recommendation. Furthermore, model training pipelines across regions exhibit some bias, as not all types of incidents and alerts are graded by SOC analysts. Coupled with a heightened precision threshold to ensure that recommendations are correct 90% of the time, the triage models effectively cover 41% of incidents across regions.
7.4 Investigation
In collaboration with security experts at Microsoft, we manually evaluated our similar incident recommendations due to an absence of a definitive ground truth. For this evaluation, we randomly selected 1k incidents varying in size, detectors, products, organizations, and regions. Security researchers were then tasked with randomly selecting a similar incident recommendation for each reference incident, and judging its relevance based on criteria such as shared attack patterns, indicators of compromise, and entity types.
Offline results. The assessment found that 94% of the recommended incidents were relevant, with only 2% deemed dissimilar. The quality of recommendations tends to diminish for smaller organizations with limited historical data. Similarly, larger incidents characterized by hundreds of alerts across multiple products and detectors also show decreased recommendation quality due to their rarity. To address this, a cosine similarity threshold of 0.9 was identified as the cut-off to ensure that only relevant recommendations are presented to customers. Across regions, we find that 98% of all incidents have one or more recommendations.
7.5 Remediation
We collect 180 days of telemetry, utilizing a preprocessing pipeline for alerts that is significantly simpler than the one used for incident-based triage. Unlike incidents, alerts do not require sampling due to their less computationally intensive preprocessing, allowing for a higher volume of training data, particularly in larger regions, as shown in Table 2. The volume of actioned alerts varies widely across regions, from a few hundred to 180k, with a notable imbalance among the remediation classes—67% are contain account (CA), 23% isolate device (ID), and less than 1% stop virtual machine (VM). Future work to support additional remediation actions, such as quarantining files, deleting emails, and blocking IPs/URLs, is ongoing.
Offline results. We achieve an impressive average cross-region macro-F1 score of 0.99. This high score reflects the relative simplicity of predicting the appropriate remediation action for a single alert compared to the complexities of incident triaging in our dataset.

Online results. While we achieve notable offline results, the collected data does not fully capture the SOC experience. A majority of alerts are not actioned by analysts, and not all alerts fit the three predefined remediation action types. As a result, the model's coverage averages 62% across regions. However, as we integrate new remediation actions, the system's coverage will naturally increase.

8 DEPLOYMENT
Copilot Guided Response has been successfully deployed across the world, serving thousands of Microsoft Defender XDR customers since its launch in April 2024. CGR has generated millions of guided response recommendations for triage, investigation, and remediation tasks, receiving a 'positive' user response rate of 89%, based on the confirmation or dismissal of recommendations.

Our deployment infrastructure leverages a Synapse-based PySpark cluster, customized to each geographical region. This infrastructure includes: (a) an ADLS database ensuring both accessibility and secure management of telemetry; (b) an Azure Synapse backend that provides a robust framework for deployment; (c) an XXL PySpark pool featuring 60 executors, each equipped with 64 CPU cores and 400GB of RAM; (d) autoscaling to adjust executors based on fluctuating load; and (e) automated re-execution of failed jobs to ensure continuous coverage. Due to the absence of native support for model monitoring, versioning, and storage within Synapse, we developed a custom infrastructure to support these capabilities.

9 CONCLUSION
Copilot Guided Response (CGR) represents the first time a cybersecurity company has openly discussed an industry-scale guided response framework. CGR significantly enhances SOC operations by guiding security analysts through crucial investigation, triaging, and remediation tasks, adeptly handling everything from simple alerts to complex incidents. The performance of CGR has been rigorously evaluated through internal testing, collaboration with Microsoft security experts, and extensive customer feedback, demonstrating its effectiveness across all three tasks. Deployed globally within Microsoft Defender XDR, CGR generates millions of guided response recommendations weekly, with 89% of user interactions receiving positive feedback. In addition, we release GUIDE, the largest publicly available collection of real-world security incidents, comprising 13 million pieces of evidence across one million incidents, each annotated with ground-truth triage labels by customer security analysts. As the first resource of its kind, GUIDE sets a new standard for advancing the development and evaluation of guided response systems and beyond.

ACKNOWLEDGMENTS
We thank our colleagues who supported this research, including Shira Shacham, Yuval Derman, Oren Saban, Noa Bratman, Inbar Rotem, Omri Kantor, Itamar Karavani, Mari Mishel, Yuval Katav, Niv Zohar, Lior Camri, Leeron Luzzatto, Anna Karp, Ido Nitzan, Corina Feurstein, Ya'ara Cohen, Nadia Tkach Mendes, Gil Shmaya, Pawel Partyka, Shachaf Levy, Blake Strom, along with many others.
REFERENCES
[1] Khalid Alsubhi, Issam Aib, and Raouf Boutaba. 2012. FuzMet: A fuzzy-logic based alert prioritization engine for intrusion detection systems. International Journal of Network Management 22, 4 (2012), 263–284.
[2] Khalid Alsubhi, Ehab Al-Shaer, and Raouf Boutaba. 2008. Alert prioritization in intrusion detection systems. In NOMS 2008 - 2008 IEEE Network Operations and Management Symposium. IEEE, 33–40.
[3] Hilala Alturkistani and Mohammed A El-Affendi. 2022. Optimizing cybersecurity incident response decisions using deep reinforcement learning. International Journal of Electrical and Computer Engineering 12, 6 (2022), 6768.
[4] Muhamad Erza Aminanto, Tao Ban, Ryoichi Isawa, Takeshi Takahashi, and Daisuke Inoue. 2020. Threat alert prioritization using isolation forest and stacked auto encoder with day-forward-chaining analysis. IEEE Access 8 (2020), 217977–217986.
[5] Andy Applebaum, Shawn Johnson, Michael Limiero, and Michael Smith. 2018. Playbook oriented cyber response. In 2018 National Cyber Summit (NCS). IEEE, 8–15.
[6] Tao Ban, Ndichu Samuel, Takeshi Takahashi, and Daisuke Inoue. 2021. Combat security alert fatigue with AI-assisted techniques. In Proceedings of the 14th Cyber Security Experimentation and Test Workshop. 9–16.
[7] May Bashendy, Ashraf Tantawy, and Abdelkarim Erradi. 2023. Intrusion response systems for cyber-physical systems: A comprehensive survey. Computers & Security 124 (2023), 102984.
[8] Joey Caparas, Samantha Robertson, Aditi Srivastava, Denise Vangel, Stephanie Savell, and Macky Cruz. 2024. What is Microsoft Copilot for Security?
[9] Forrester Consulting. 2020. The 2020 State Of Security Operations.
[10] Rahul Duggal, Scott Freitas, Sunny Dhamnani, Duen Horng Chau, and Jimeng Sun. 2021. HAR: Hardness aware reweighting for imbalanced datasets. In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 735–745.
[11] Rahul Duggal, Scott Freitas, Cao Xiao, Duen Horng Chau, and Jimeng Sun. 2020. REST: Robust and Efficient Neural Networks for Sleep Monitoring in the Wild. In Proceedings of The Web Conference 2020. 1704–1714.
[12] Charles Feng, Shuning Wu, and Ningwei Liu. 2017. A user-centric machine learning framework for cyber security operations center. In 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE, 173–175.
[13] Bingrui Foo, Y-S Wu, Y-C Mao, Saurabh Bagchi, and Eugene Spafford. 2005. ADEPTS: Adaptive intrusion response using attack graphs in an e-commerce environment. In 2005 International Conference on Dependable Systems and Networks (DSN'05). IEEE, 508–517.
[14] Muriel Figueredo Franco, Bruno Rodrigues, Eder John Scheid, Arthur Jacobs, Christian Killer, Lisandro Zambenedetti Granville, and Burkhard Stiller. 2020. SecBot: A business-driven conversational agent for cybersecurity planning and management. In 2020 16th International Conference on Network and Service Management (CNSM). IEEE, 1–7.
[15] Scott Freitas, Yuxiao Dong, Joshua Neil, and Duen Horng Chau. 2021. A Large-Scale Database for Graph Representation Learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[16] Scott Freitas, Rahul Duggal, and Duen Horng Chau. 2022. MalNet: A large-scale image database of malicious software. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3948–3952.
[17] Dianne Gali, Samantha Robertson, Daniel Simpson, and Stephanie Savell. 2024. Triage and investigate incidents with guided responses from Microsoft Copilot in Microsoft Defender.
[18] Wajih Ul Hassan, Adam Bates, and Daniel Marino. 2020. Tactical provenance analysis for endpoint detection and response systems. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1172–1189.
[19] Wajih Ul Hassan, Shengjian Guo, Ding Li, Zhengzhang Chen, Kangkook Jee, Zhichun Li, and Adam Bates. 2019. NoDoze: Combatting threat alert fatigue with automated provenance triage. In Network and Distributed Systems Security Symposium.
[20] Shuang Huang, Chun-Jie Zhou, Shuang-Hua Yang, and Yuan-Qing Qin. 2015. Cyber-physical system security for networked industrial processes. International Journal of Automation and Computing 12, 6 (2015), 567–578.
[21] IBM. 2020. IBM Automated Cyber Threat Triage and Response. https://events.afcea.org/afceacyber20/CUSTOM/pdf/IBM%20Automated%20Cyber%20Threat%20Triage%20and%20Response%20Solution%20Datasheet.pdf
[22] Zakira Inayat, Abdullah Gani, Nor Badrul Anuar, Muhammad Khurram Khan, and Shahid Anwar. 2016. Intrusion response systems: Foundations, design, and challenges. Journal of Network and Computer Applications 62 (2016), 53–74.
[23] Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, et al. 2024. Xpert: Empowering incident management with query recommendations via large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
[24] Joseph Khoury and Mohamed Nassar. 2020. A hybrid game theory and reinforcement learning approach for cyber-physical systems security. In NOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium. IEEE, 1–9.
[25] Irina Kraeva and Gulnara Yakhyaeva. 2021. Application of the metric learning for security incident playbook recommendation. In 2021 IEEE 22nd International Conference of Young Professionals in Electron Devices and Materials (EDM). IEEE, 475–479.
[26] Ryuta Kremer, Prasanna N Wudali, Satoru Momiyama, Toshinori Araki, Jun Furukawa, Yuval Elovici, and Asaf Shabtai. 2023. IC-SECURE: Intelligent System for Assisting Security Experts in Generating Playbooks for Automated Incident Response. arXiv preprint arXiv:2311.03825 (2023).
[27] Xuan Li, Chunjie Zhou, Yu-Chu Tian, and Yuanqing Qin. 2018. A dynamic decision-making approach for intrusion response in industrial control systems. IEEE Transactions on Industrial Informatics 15, 5 (2018), 2544–2554.
[28] Tao Lin. 2018. A Data Triage Retrieval System for Cyber Security Operations Center. (2018).
[29] Yushan Liu, Xiaokui Shu, Yixin Sun, Jiyong Jang, and Prateek Mittal. 2022. RAPID: Real-Time Alert Investigation with Context-aware Prioritization for Efficient Threat Discovery. In Proceedings of the 38th Annual Computer Security Applications Conference. 827–840.
[30] Allie Mellen, Joseph Blankenship, Sarah Morana, and Michael Belden. 2024. The Forrester Wave™: Extended Detection And Response Platforms, Q2 2024.
[31] Microsoft. 2024. Microsoft Copilot for Security.
[32] Palo Alto Networks. 2024. Cortex Related Incidents. https://docs-cortex.paloaltonetworks.com/r/Cortex-XSOAR/6.5/Cortex-XSOAR-Administrator-Guide/Manage-Related-Incidents
[33] Palo Alto Networks. 2024. Cortex XSOAR Incident Similarity. https://xsoar.pan.dev/docs/reference/scripts/d-bot-find-similar-incidents
[34] Palo Alto Networks. 2024. Suggested Remediation. https://docs-cortex.paloaltonetworks.com/r/Cortex-XDR/Cortex-XDR-Pro-Administrator-Guide/Remediate-Changes-from-Malicious-Activity
[35] Tien N Nguyen and Raymond Choo. 2021. Human-in-the-loop XAI-enabled vulnerability detection, investigation, and mitigation. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1210–1212.
[36] Jonathan Oliver, Raghav Batta, Adam Bates, Muhammad Adil Inam, Shelly Mehta, and Shugao Xia. 2024. Carbon Filter: Real-time Alert Triage Using Large Scale Clustering and Fast Search. arXiv preprint arXiv:2405.04691 (2024).
[37] Alina Oprea, Zhou Li, Robin Norris, and Kevin Bowers. 2018. MADE: Security analytics for enterprise threat detection. In Proceedings of the 34th Annual Computer Security Applications Conference. 124–136.
[38] Vihanga Heshan Perera, Amila Nuwan Senarathne, and Lakmal Rupasinghe. 2019. Intelligent SOC chatbot for security operation center. In 2019 International Conference on Advancements in Computing (ICAC). IEEE, 340–345.
[39] Yuanqing Qin, Qi Zhang, Chunjie Zhou, and Naixue Xiong. 2018. A risk-based dynamic decision-making approach for cybersecurity protection in industrial control systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems 50, 10 (2018), 3863–3870.
[40] Alireza Shameli-Sendi, Naser Ezzati-Jivan, Masoume Jabbarifar, and Michel Dagenais. 2012. Intrusion response systems: Survey and taxonomy. Int. J. Comput. Sci. Netw. Secur 12, 1 (2012), 1–14.
[41] Alireza Shameli-Sendi, Habib Louafi, Wenbo He, and Mohamed Cheriet. 2016. Dynamic optimal countermeasure selection for intrusion response system. IEEE Transactions on Dependable and Secure Computing 15, 5 (2016), 755–770.
[42] Awalin Sopan, Matthew Berninger, Murali Mulakaluri, and Raj Katakam. 2018. Building a Machine Learning Model for the SOC, by the Input from the SOC, and Analyzing it for the SOC. In 2018 IEEE Symposium on Visualization for Cyber Security (VizSec). IEEE, 1–8.
[43] Zheni S Stefanova and Kandethody M Ramachandran. 2018. Off-policy Q-learning technique for intrusion response in network security. World Academy of Science, Engineering and Technology, International Science Index 136 (2018), 262–268.
[44] Blake E Strom, Andy Applebaum, Doug P Miller, Kathryn C Nickels, Adam G Pennington, and Cody B Thomas. 2018. MITRE ATT&CK: Design and philosophy. Technical report. The MITRE Corporation.
[45] Thomas Toth and Christopher Kruegel. 2002. Evaluating the impact of automated intrusion response mechanisms. In 18th Annual Computer Security Applications Conference, 2002. Proceedings. IEEE, 301–310.
[46] Chen Zhong, Tao Lin, Peng Liu, John Yen, and Kai Chen. 2018. A cyber security data triage operation retrieval system. Computers & Security 76 (2018), 12–31.
[47] Saman A Zonouz, Himanshu Khurana, William H Sanders, and Timothy M Yardley. 2013. RRE: A game-theoretic intrusion response and recovery engine. IEEE Transactions on Parallel and Distributed Systems 25, 2 (2013), 395–406.