Assignment

The ReadyAssist project aims to enhance Security Operations Centers (SOCs) by developing a machine learning model to classify cybersecurity incidents into triage grades: true positive (TP), benign positive (BP), and false positive (FP), using the GUIDE dataset. The project involves data exploration, preprocessing, model training, evaluation, and documentation to ensure the model's effectiveness in real-world applications. Success will be measured through macro-F1 score, precision, and recall, with the goal of improving the overall security posture of enterprise environments.

Project Title:

ReadyAssist: Classifying Cybersecurity Incidents with Machine Learning

Domain: Cybersecurity and Machine Learning

Problem Statement:

Imagine you are working as a data scientist at ReadyAssist, tasked with enhancing the
efficiency of Security Operations Centers (SOCs) by developing a machine learning
model that can accurately predict the triage grade of cybersecurity incidents. Utilizing
the comprehensive GUIDE dataset, your goal is to create a classification model that
categorizes incidents as true positive (TP), benign positive (BP), or false positive (FP)
based on historical evidence and customer responses. The model should be robust
enough to support guided response systems in providing SOC analysts with precise,
context-rich recommendations, ultimately improving the overall security posture of
enterprise environments.

You need to train the model using the train.csv dataset and provide evaluation
metrics—macro-F1 score, precision, and recall—based on the model's performance on
the test.csv dataset. This ensures that the model is not only well-trained but also
generalizes effectively to unseen data, making it reliable for real-world applications.

Business Use Cases:

The solution developed in this project can be implemented in various business scenarios, particularly in the field of cybersecurity. Some potential applications include:

●​ Security Operations Centers (SOCs): Automating the triage process by accurately classifying cybersecurity incidents, thereby allowing SOC analysts to prioritize their efforts and respond to critical threats more efficiently.
●​ Incident Response Automation: Enabling guided response systems to
automatically suggest appropriate actions for different types of incidents, leading
to quicker mitigation of potential threats.
●​ Threat Intelligence: Enhancing threat detection capabilities by incorporating
historical evidence and customer responses into the triage process, which can
lead to more accurate identification of true and false positives.
●​ Enterprise Security Management: Improving the overall security posture of
enterprise environments by reducing the number of false positives and ensuring
that true threats are addressed promptly.

Approach:

1.​ Data Exploration and Understanding:
a.​ Initial Inspection: Start by loading the train.csv dataset and performing an
initial inspection to understand the structure of the data, including the
number of features, types of variables (categorical, numerical), and the
distribution of the target variable (TP, BP, FP). (Minimal code sketches
illustrating several of these steps follow this list.)
b.​ Exploratory Data Analysis (EDA): Use visualizations and statistical
summaries to identify patterns, correlations, and potential anomalies in the
data. Pay special attention to class imbalances, as they may require
specific handling strategies later on.
2.​ Data Preprocessing:
a.​ Handling Missing Data: Identify any missing values in the dataset and
decide on an appropriate strategy, such as imputation, removing affected
rows, or using models that can handle missing data inherently.
b.​ Feature Engineering: Create new features or modify existing ones to
improve model performance. For example, combining related features,
deriving new features from timestamps (like hour of the day or day of the
week), or normalizing numerical variables.
c.​ Encoding Categorical Variables: Convert categorical features into
numerical representations using techniques like one-hot encoding, label
encoding, or target encoding, depending on the nature of the feature and
its relationship with the target variable.
3.​ Data Splitting:
a.​ Train-Validation Split: Before diving into model training, split the
train.csv data into training and validation sets. This allows for tuning
and evaluating the model before final testing on test.csv. Typically, a
70-30 or 80-20 split is used, but this can vary depending on the dataset's
size.
b.​ Stratification: If the target variable is imbalanced, consider using
stratified sampling to ensure that both the training and validation sets have
similar class distributions.
4.​ Model Selection and Training:
a.​ Baseline Model: Start with a simple baseline model, such as a logistic
regression or decision tree, to establish a performance benchmark. This
helps in understanding how complex the model needs to be.
b.​ Advanced Models: Experiment with more sophisticated models such as
Random Forests, Gradient Boosting Machines (e.g., XGBoost, LightGBM),
and Neural Networks. Each model should be tuned using techniques like
grid search or random search over hyperparameters.
c.​ Cross-Validation: Implement cross-validation (e.g., k-fold
cross-validation) to ensure the model's performance is consistent across
different subsets of the data. This reduces the risk of overfitting and
provides a more reliable estimate of the model's performance.
5.​ Model Evaluation and Tuning:
a.​ Performance Metrics: Evaluate the model using the validation set,
focusing on macro-F1 score, precision, and recall. Analyze these metrics
across different classes (TP, BP, FP) to ensure balanced performance.
b.​ Hyperparameter Tuning: Based on the initial evaluation, fine-tune
hyperparameters to optimize model performance. This may involve
adjusting learning rates, regularization parameters, tree depths, or the
number of estimators, depending on the model type.
c.​ Handling Class Imbalance: If class imbalance is a significant issue,
consider techniques such as SMOTE (Synthetic Minority Over-sampling
Technique), adjusting class weights, or using ensemble methods to boost
the model's ability to handle minority classes effectively.
6.​ Model Interpretation:
a.​ Feature Importance: After selecting the best model, analyze feature
importance to understand which features contribute most to the
predictions. This can be done using methods like SHAP values,
permutation importance, or model-specific feature importance measures.
b.​ Error Analysis: Perform an error analysis to identify common
misclassifications. This can provide insights into potential improvements,
such as additional feature engineering or refining the model's complexity.
7.​ Final Evaluation on Test Set:
a.​ Testing: Once the model is finalized and optimized, evaluate it on the
test.csv dataset. Report the final macro-F1 score, precision, and recall
to assess how well the model generalizes to unseen data.
b.​ Comparison to Baseline: Compare the performance on the test set to
the baseline model and initial validation results to ensure consistency and
improvement.
8.​ Documentation and Reporting:
a.​ Model Documentation: Thoroughly document the entire process,
including the rationale behind chosen methods, challenges faced, and
how they were addressed. Include a summary of key findings and model
performance.
b.​ Recommendations: Provide recommendations on how the model can be
integrated into SOC workflows, potential areas for future improvement,
and considerations for deployment in a real-world setting.
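
To make steps 1-3 concrete, here is a minimal sketch of loading the data, light preprocessing, and a stratified train-validation split. Column names such as Timestamp and IncidentGrade, and the literal label values TP, BP, and FP, are assumptions about the GUIDE schema and should be adapted to the actual columns in train.csv.

# Sketch of data loading, light preprocessing, and a stratified split.
# Assumes pandas and scikit-learn are installed; column names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")

# Initial inspection: shape, dtypes, and target distribution (TP / BP / FP balance).
print(df.shape)
print(df.dtypes)
print(df["IncidentGrade"].value_counts(normalize=True))

# Handle missing data: here, simply drop rows without a label.
df = df.dropna(subset=["IncidentGrade"])

# Feature engineering from a timestamp column (assumed to exist).
df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
df["hour"] = df["Timestamp"].dt.hour
df["day_of_week"] = df["Timestamp"].dt.dayofweek

# Encode the target and separate it from the features.
y = df["IncidentGrade"].map({"TP": 0, "BP": 1, "FP": 2})
X = df.drop(columns=["IncidentGrade", "Timestamp"])

# Stratified 80-20 train-validation split to preserve class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)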
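
For steps 4, 5, and 7, the sketch below trains a baseline and a stronger model, uses cross-validation, offsets class imbalance with class weights, and reports the macro-averaged metrics named in this brief. It assumes the categorical columns in X_train and X_val have already been encoded to numeric values (see the preprocessing sketch after the Data Set Explanation section); the choice of RandomForestClassifier is illustrative rather than prescribed.

# Baseline and ensemble models with cross-validation and macro-averaged metrics.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

# Baseline: logistic regression with class weights to offset imbalance.
baseline = LogisticRegression(max_iter=1000, class_weight="balanced")
print("baseline macro-F1 (5-fold):",
      cross_val_score(baseline, X_train, y_train, cv=5, scoring="f1_macro").mean())

# Stronger model; the hyperparameters are placeholders for a proper grid or random search.
model = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Validation metrics, macro-averaged so TP, BP, and FP count equally.
pred = model.predict(X_val)
print("macro-F1 :", f1_score(y_val, pred, average="macro"))
print("precision:", precision_score(y_val, pred, average="macro"))
print("recall   :", recall_score(y_val, pred, average="macro"))
print(classification_report(y_val, pred))

# The same three metric calls, applied to predictions on test.csv, give the final report.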
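
For step 6, one way to inspect feature importance without extra dependencies is scikit-learn's permutation importance; SHAP values are a reasonable alternative but require the separate shap package. Here, model, X_val, and y_val are taken from the previous sketch.

# Permutation importance: how much macro-F1 drops when each feature is shuffled.
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_val, y_val, scoring="f1_macro", n_repeats=10, random_state=42
)
ranked = sorted(zip(X_val.columns, result.importances_mean),
                key=lambda item: item[1], reverse=True)
for name, drop in ranked[:15]:
    print(f"{name}: {drop:.4f}")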

Results:

By the end of the project, the candidate should aim to achieve the following outcomes:

●​ A machine learning model capable of accurately predicting the triage grade of cybersecurity incidents (TP, BP, FP) with high macro-F1 score, precision, and recall.
●​ A comprehensive analysis of model performance, including insights into which
features are most influential in the prediction process.
●​ Documentation that details the model development process, including data
preprocessing, model selection, evaluation, and potential deployment strategies.

Project Evaluation Metrics:

The success and effectiveness of the project will be evaluated based on the following
metrics:

●​ Macro-F1 Score: A balanced metric that accounts for the performance across all
classes (TP, BP, FP), ensuring that each class is treated equally.
●​ Precision: Measures the accuracy of the positive predictions made by the
model, which is crucial for minimizing false positives.
●​ Recall: Measures the model's ability to correctly identify all relevant instances
(true positives), which is important for ensuring that real threats are not missed.
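
As a small, self-contained illustration of how these metrics are computed with scikit-learn (the labels below are made up, not taken from the dataset):

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["TP", "BP", "FP", "TP", "FP", "BP", "TP", "FP"]
y_pred = ["TP", "BP", "TP", "TP", "FP", "FP", "TP", "FP"]

# average="macro" takes the unweighted mean of the per-class scores,
# so TP, BP, and FP contribute equally regardless of class frequency.
print(f1_score(y_true, y_pred, average="macro"))
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))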

Data Set Overview:

We provide three hierarchies of data: (1) evidence, (2) alert, and (3) incident. At
the bottom level, evidence supports an alert. For example, an alert may be
associated with multiple pieces of evidence such as an IP address, email, and
user details, each containing specific supporting metadata. Above that, we have
alerts that consolidate multiple pieces of evidence to signify a potential security
incident. These alerts provide a broader context by aggregating related
evidence to present a more comprehensive picture of the potential threat. At the
highest level, incidents encompass one or more alerts, representing a cohesive
narrative of a security breach or threat scenario.

The primary objective of the dataset is to accurately predict incident triage grades—true positive (TP), benign positive (BP), and false positive (FP)—based
on historical customer responses. To support this, we provide a training dataset
containing 45 features, labels, and unique identifiers across 1M triage-annotated
incidents. We divide the dataset into a train set containing 70% of the data and a
test set with 30%, stratified based on triage grade ground-truth, OrgId, and
DetectorId. We ensure that all rows belonging to an incident stay together in either the train or the test set, so that the associated evidence and alert rows remain relevant.
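
The 70/30 split described above ships with the dataset, so it does not need to be recreated. If a comparable split ever has to be reproduced locally, a group-aware splitter keeps every row of an incident on the same side; the sketch below shows the idea, assuming an IncidentId column, and deliberately ignores the additional stratification by triage grade, OrgId, and DetectorId.

# Group-aware split: all rows sharing an IncidentId land in the same partition.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("train.csv")
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["IncidentId"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]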

Data Set Explanation:

The GUIDE dataset contains records of cybersecurity incidents along with their
corresponding triage grades (TP, BP, FP) based on historical evidence and customer
responses. Preprocessing steps may include:

●​ Handling Missing Data: Identifying and addressing any missing values in the
dataset.
●​ Feature Engineering: Creating new features or modifying existing ones to
improve model performance, such as combining related features or encoding
categorical variables.
●​ Normalization/Standardization: Scaling numerical features to ensure that all
input data is on a similar scale, which can be important for certain machine
learning models.
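
A minimal sketch of how the scaling and encoding steps listed above can be bundled into one preprocessing object and chained with a model; the column lists are placeholders to be filled in from the actual schema.

# Combined preprocessing: scale numeric columns, one-hot encode categorical ones.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["hour", "day_of_week"]          # placeholder column names
categorical_cols = ["Category", "EntityType"]   # placeholder column names

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# The preprocessor and estimator fit together as a single pipeline object.
clf = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(class_weight="balanced", random_state=42)),
])
# clf.fit(X_train, y_train) once the training frame is prepared.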

Dataset Link: Dataset

Project Deliverables:

●​ Source Code: Well-documented code that includes all steps from data
preprocessing to model evaluation.
●​ Model File: The trained machine learning model ready for deployment.
