Assignment
Problem Statement:
Imagine you are working as a data scientist at Ready Assist, tasked with enhancing the
efficiency of Security Operation Centers (SOCs) by developing a machine learning
model that can accurately predict the triage grade of cybersecurity incidents. Utilizing
the comprehensive GUIDE dataset, your goal is to create a classification model that
categorizes incidents as true positive (TP), benign positive (BP), or false positive (FP)
based on historical evidence and customer responses. The model should be robust
enough to support guided response systems in providing SOC analysts with precise,
context-rich recommendations, ultimately improving the overall security posture of
enterprise environments.
You need to train the model using the train.csv dataset and provide evaluation
metrics—macro-F1 score, precision, and recall—based on the model's performance on
the test.csv dataset. This ensures that the model is not only well-trained but also
generalizes effectively to unseen data, making it reliable for real-world applications.
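The train/test protocol described above can be sketched as follows. This is a minimal illustration, not the required solution: the label column name IncidentGrade, the Category column, the in-memory rows, and the MajorityBaseline class are all assumptions made for the example. A majority-class baseline stands in for a real classifier so the sketch needs only the standard library.

```python
import csv
import io
from collections import Counter

TARGET = "IncidentGrade"  # assumed name of the label column


def load_rows(handle):
    """Read CSV rows as dictionaries from an open file-like object."""
    return list(csv.DictReader(handle))


class MajorityBaseline:
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, rows):
        self.label_ = Counter(r[TARGET] for r in rows).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.label_ for _ in rows]


# Tiny in-memory stand-ins for train.csv and test.csv (made-up rows).
train_csv = io.StringIO(
    "Category,IncidentGrade\nMalware,TP\nPhishing,BP\nMalware,TP\nRecon,FP\n"
)
test_csv = io.StringIO("Category,IncidentGrade\nMalware,TP\nPhishing,BP\n")

train_rows = load_rows(train_csv)
test_rows = load_rows(test_csv)

model = MajorityBaseline().fit(train_rows)  # fit on training data only
predictions = model.predict(test_rows)      # test data is used only for scoring
accuracy = sum(p == r[TARGET] for p, r in zip(predictions, test_rows)) / len(test_rows)
```

With real files, the StringIO objects would be replaced by open("train.csv", newline="") and open("test.csv", newline=""). The point of the sketch is the separation: nothing learned from test.csv feeds back into fitting.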
Approach:
Results:
By the end of the project, the candidate should aim to achieve the following outcomes:
The success and effectiveness of the project will be evaluated based on the following
metrics:
● Macro-F1 Score: A balanced metric that accounts for the performance across all
classes (TP, BP, FP), ensuring that each class is treated equally.
● Precision: Measures the accuracy of the positive predictions made by the
model, which is crucial for minimizing false positives.
● Recall: Measures the model's ability to correctly identify all relevant instances
(true positives), which is important for ensuring that real threats are not missed.
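As a rough sketch of how the three metrics above relate in the three-class setting, the following computes per-class precision, recall, and F1 in a one-vs-rest fashion and then takes the unweighted (macro) mean. The function names and the toy labels are assumptions for illustration. Note that the class label "FP" (a triage grade) is distinct from the false-positive count used inside the precision formula.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} treating each label one-vs-rest."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in pairs)
        false_pos = sum(t != label and p == label for t, p in pairs)
        false_neg = sum(t == label and p != label for t, p in pairs)
        predicted = true_pos + false_pos
        actual = true_pos + false_neg
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean over classes, so every class counts equally."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Made-up triage grades for six incidents.
y_true = ["TP", "TP", "BP", "BP", "FP", "FP"]
y_pred = ["TP", "BP", "BP", "BP", "FP", "TP"]

scores = per_class_scores(y_true, y_pred, labels=["TP", "BP", "FP"])
macro_precision, macro_recall, macro_f1 = macro_average(scores)
```

Because the macro average weights each class equally, a model that ignores a rare grade is penalized even if its overall accuracy looks high.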
The data is organized into three hierarchical levels: (1) evidence, (2) alert, and
(3) incident. At the bottom level, evidence supports an alert. For example, an alert
may be associated with multiple pieces of evidence such as an IP address, email, and
user details, each containing specific supporting metadata. Above that, we have
alerts that consolidate multiple pieces of evidence to signify a potential security
incident. These alerts provide a broader context by aggregating related
evidence to present a more comprehensive picture of the potential threat. At the
highest level, incidents encompass one or more alerts, representing a cohesive
narrative of a security breach or threat scenario.
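One way to picture the evidence, alert, and incident levels is as nested records. The sketch below is illustrative only: the class names, field names, and example values are assumptions, not the dataset's actual schema. It also shows how evidence could be rolled up into incident-level count features.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Evidence:
    entity_type: str  # hypothetical values such as "Ip", "User", "MailMessage"
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Alert:
    alert_id: int
    title: str
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class Incident:
    incident_id: int
    alerts: List[Alert] = field(default_factory=list)

    def summary(self) -> Dict[str, int]:
        """Roll evidence up to incident level as simple count features."""
        entity_types = {e.entity_type for a in self.alerts for e in a.evidence}
        return {
            "n_alerts": len(self.alerts),
            "n_evidence": sum(len(a.evidence) for a in self.alerts),
            "n_entity_types": len(entity_types),
        }


# Made-up incident with two alerts, each backed by two pieces of evidence.
incident = Incident(
    incident_id=1,
    alerts=[
        Alert(10, "Suspicious sign-in",
              [Evidence("Ip", {"address": "203.0.113.7"}), Evidence("User")]),
        Alert(11, "Phishing email", [Evidence("MailMessage"), Evidence("User")]),
    ],
)
features = incident.summary()
```

Aggregates of this kind are one possible input to an incident-level classifier. Whether to predict per evidence row or per incident is a design choice the problem statement leaves open.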
The GUIDE dataset contains records of cybersecurity incidents along with their
corresponding triage grades (TP, BP, FP) based on historical evidence and customer
responses. Preprocessing steps may include:
● Handling Missing Data: Identifying and addressing any missing values in the
dataset.
● Feature Engineering: Creating new features or modifying existing ones to
improve model performance, such as combining related features or encoding
categorical variables.
● Normalization/Standardization: Scaling numerical features to ensure that all
input data is on a similar scale, which can be important for certain machine
learning models.
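The three preprocessing steps listed above might look like the following in plain Python. This is a sketch under stated assumptions: the column names Category and AlertCount, the toy rows, and the helper functions are hypothetical, and mode/mean imputation is only one of several reasonable strategies.

```python
from collections import Counter
from statistics import mean, pstdev

# Made-up training rows; None marks a missing value.
rows = [
    {"Category": "Malware", "AlertCount": 3.0},
    {"Category": None, "AlertCount": 1.0},
    {"Category": "Phishing", "AlertCount": None},
    {"Category": "Malware", "AlertCount": 5.0},
]


def impute(rows, column, fill):
    """Replace missing values in one column with a fixed fill value."""
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column, categories):
    """Replace a categorical column with one 0/1 column per known category."""
    out = []
    for r in rows:
        encoded = {k: v for k, v in r.items() if k != column}
        for c in categories:
            encoded[f"{column}={c}"] = 1.0 if r[column] == c else 0.0
        out.append(encoded)
    return out


def standardize(rows, column, mu, sigma):
    """Scale a numeric column to zero mean and unit variance."""
    sigma = sigma or 1.0  # guard against a constant column
    return [{**r, column: (r[column] - mu) / sigma} for r in rows]


# 1. Handle missing data: mode for the categorical, mean for the numeric column.
mode = Counter(r["Category"] for r in rows if r["Category"] is not None).most_common(1)[0][0]
fill_value = mean([r["AlertCount"] for r in rows if r["AlertCount"] is not None])
rows = impute(impute(rows, "Category", mode), "AlertCount", fill_value)

# 2. Encode the categorical variable; 3. standardize the numeric one.
categories = sorted({r["Category"] for r in rows})
values = [r["AlertCount"] for r in rows]
mu, sigma = mean(values), pstdev(values)
processed = standardize(one_hot(rows, "Category", categories), "AlertCount", mu, sigma)
```

In practice the fill values, category list, mean, and standard deviation would be computed on train.csv only and then reused unchanged on test.csv, so that no test information leaks into training.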
Project Deliverables:
● Source Code: Well-documented code that includes all steps from data
preprocessing to model evaluation.
● Model File: The trained machine learning model ready for deployment.