Automated Emerging Cyber Threat Identification and Profiling Based On Natural Language Processing
The time window between the disclosure of a new cyber vulnerability and its use by
cybercriminals has been shrinking over time. Recent episodes, such
as the Log4j vulnerability, exemplify this well. Within hours of the exploit being
released, attackers started scanning the internet for vulnerable hosts on which to
deploy threats such as cryptocurrency miners and ransomware.
Thus, it becomes imperative for the cybersecurity defense strategy to detect threats
and their capabilities as early as possible to maximize the success of prevention
actions. Although crucial, discovering new threats is a challenging activity for
security analysts due to the immense volume of data and information sources to be
analyzed for signs that a threat is emerging. To this end, we present a framework for
automatic identification and profiling of emerging threats using Twitter messages as
a source of events and MITRE ATT&CK as a source of knowledge for threat
characterization. The framework comprises three main parts: identification of cyber
threats and their names; profiling the identified threat in terms of its intentions or
goals by employing two machine learning layers to filter and classify tweets; and
alarm generation based on the threat’s risk. The main contribution of our work is the
approach to characterize or profile the identified threats in terms of their intentions or
goals, providing additional context on the threat and avenues for mitigation. In our
experiments, the profiling stage reached an F1 score of 77% in correctly profiling
discovered threats.
EXISTING SYSTEM
OSINT is the collection, analysis, and use of data from openly available sources for
intelligence purposes [21]. Examples of sources for OSINT are public blogs, dark
and deep websites, forums, and social media. On such platforms, any person or entity
on the Internet can publish, in real time, information in natural language related to
cyber security, including incidents, new threats, and vulnerabilities. Among the
OSINT sources for cyber threat intelligence, we can highlight the social media platform
Twitter as one of the most representative [22]. Cyber security experts, system
administrators, and hackers constantly use Twitter to discuss technical details about
cyber attacks and share their experiences [4].
The use of OSINT to automatically identify cyber threats via social media, forums,
and other openly available sources using text analytics has been proposed in several
studies [1], [23], [7], [24], [25], [26], [13], [27], and [28]. However, most
proposals focus on identifying important events related to cyber threats or
vulnerabilities but do not propose identifying and profiling cyber threats.
Among these studies, [13] proposes an early cyber threat warning system that mines
online chatter from cyber actors on social media, security blogs, and dark web
forums to identify words that signal potential cyber-attacks. The framework
comprises two main components: text mining and warning generation. The text
mining phase consists of pre-processing the input data to identify potential threat
names by discarding “known” terms and selecting “unknown” terms that recur across
different sources, as they may be the name of a new or discovered cyber
threat. The second component, warning generation, is responsible for issuing alarms
for unknown terms that meet certain requirements, such as appearing twice in a given
period of time. The approach presented in this research uses keyword filtering as the
only strategy to identify cyber threat names, which may result in false positives, as
unknown words may appear in tweets or other content not necessarily related to
cyber security. Additionally, this research does not profile the identified cyber threat,
nor does the proposed approach name the identified threat. Naming the threat is
an important step in cyber threat intelligence, as it may allow analysts to identify
and mitigate campaigns based on the historic
modus operandi employed by a given threat or group.
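The warning logic described for [13] can be illustrated with a minimal sketch. It assumes posts arrive as (timestamp, text) pairs and that a vocabulary of known terms is available; the function names, the tokenizer, the one-day window, and the sample posts are all hypothetical and are not taken from [13].

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta


def tokenize(text):
    """Lowercase the text and return word-like tokens of two or more characters."""
    return re.findall(r"[a-z0-9][a-z0-9_-]+", text.lower())


def unknown_term_warnings(posts, known_terms, min_mentions=2,
                          window=timedelta(days=1)):
    """Return unknown terms seen at least `min_mentions` times within `window`.

    `posts` is an iterable of (timestamp, text) pairs. The threshold and the
    window are illustrative assumptions, not values reported in [13].
    """
    sightings = defaultdict(list)
    for timestamp, text in posts:
        # Count each term once per post and skip every "known" term.
        for term in set(tokenize(text)):
            if term not in known_terms:
                sightings[term].append(timestamp)

    warnings = set()
    for term, stamps in sightings.items():
        stamps.sort()
        # Look for `min_mentions` sightings that all fall inside one window.
        for i in range(len(stamps) - min_mentions + 1):
            if stamps[i + min_mentions - 1] - stamps[i] <= window:
                warnings.add(term)
                break
    return warnings


# Made-up sample data, for illustration only.
known = {"new", "exploit", "seen", "in", "the", "wild", "scanning", "is",
         "increasing", "patch", "released"}
posts = [
    (datetime(2021, 12, 10, 9, 0), "new exploit log4shell seen in the wild"),
    (datetime(2021, 12, 10, 15, 0), "log4shell scanning is increasing"),
    (datetime(2021, 12, 20, 9, 0), "new patch released"),
]
print(unknown_term_warnings(posts, known))
```

Because any unrecognized word that recurs within the window raises a warning, the sketch also shows why keyword filtering alone is prone to false positives.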
In [1], a framework for automatically gathering cyber threat intelligence from Twitter
is presented. The framework utilizes a novelty detection model to classify the tweets
as relevant or irrelevant to cyber threat intelligence. The novelty classifier learns the
features of cyber threat intelligence from the threat descriptions in the Common
Vulnerabilities and Exposures (CVE) database and classifies a new, unseen tweet as
normal or abnormal in relation to cyber threat intelligence. Normal tweets are
considered relevant to cyber threats, while abnormal tweets are considered
irrelevant. The paper evaluates the framework on a data set created
from tweets collected over a period of twelve months in 2018 from 50 influential
cyber security-related accounts. During the evaluation, the framework achieved a
best F1-score of 0.643 for classifying cyber
threat tweets. According to the authors, the proposed approach outperformed several
baselines, including binary classification models. The authors also analyzed the correctly
classified cyber threat tweets and found that 81 of them do not contain a CVE
identifier. Of these 81 tweets, 34 can be associated
with a CVE identifier included in the top 10 most similar CVE descriptions of each
tweet. Despite presenting a proposal to distinguish between relevant and irrelevant
tweets, the work does not address the identification of threats and their intentions.
Those are important requirements for Cyber Threat Intelligence in formulating
defense strategies against emerging threats.
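The idea of judging a tweet by its closeness to known vulnerability descriptions can be sketched as follows. This is a simplified stand-in that uses bag-of-words cosine similarity with a fixed threshold; it is not the novelty detection model of [1]. The function names, the threshold of 0.3, and the reference descriptions are hypothetical (the descriptions are made up and are not real CVE entries).

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to a term-frequency vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k_similar(tweet, descriptions, k=10):
    """Return the k (score, description) pairs most similar to the tweet."""
    tweet_vec = bag_of_words(tweet)
    scored = [(cosine(tweet_vec, bag_of_words(d)), d) for d in descriptions]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def is_relevant(tweet, descriptions, threshold=0.3):
    """Treat a tweet as 'normal' (relevant) if it is close to any description.

    The threshold is an arbitrary assumption chosen for this illustration.
    """
    ranked = top_k_similar(tweet, descriptions, k=1)
    return bool(ranked) and ranked[0][0] >= threshold


# Made-up reference descriptions, for illustration only.
descriptions = [
    "remote code execution in logging library via crafted lookup string",
    "sql injection in login form allows authentication bypass",
]
security_tweet = ("crafted lookup string triggers remote code execution "
                  "in logging library")
other_tweet = "great coffee this morning"

print(is_relevant(security_tweet, descriptions))
print(is_relevant(other_tweet, descriptions))
```

The same ranking function mirrors the authors' analysis of the top 10 most similar CVE descriptions for tweets that carry no CVE identifier.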
The tool proposed in [23] collects tweets from a selected subset of accounts using the
Twitter streaming API, and then, by using keyword-based filtering, it discards tweets
unrelated to the monitored infrastructure assets. To classify and extract information
from tweets, the paper uses a sequence of two deep neural networks. The first is a
binary classifier based on a Convolutional Neural Network (CNN) architecture used
for Natural Language Processing (NLP) [29]. It receives tweets that may be
referencing an asset from the monitored infrastructure and labels them as either
relevant when the tweets contain security-related information, or irrelevant
otherwise.
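The first two stages described for [23], asset-based keyword filtering followed by a binary relevance decision, can be sketched as below. The CNN is replaced here by a placeholder keyword rule purely so the example runs on its own. The function names, the asset list, and the sample tweets are hypothetical and do not come from [23].

```python
def mentions_asset(tweet, assets):
    """Return True if the tweet mentions any monitored asset (case-insensitive)."""
    text = tweet.lower()
    return any(asset.lower() in text for asset in assets)


def toy_classifier(tweet):
    """Placeholder for the CNN binary classifier; NOT the model used in [23].

    It flags a tweet as security-related if it contains one of a few cue words.
    """
    text = tweet.lower()
    return any(cue in text for cue in ("vulnerability", "exploit", "cve"))


def triage(tweets, assets, classify):
    """Drop tweets unrelated to the monitored assets, then label the rest."""
    results = []
    for tweet in tweets:
        if not mentions_asset(tweet, assets):
            continue  # discarded by the keyword-based filter
        label = "relevant" if classify(tweet) else "irrelevant"
        results.append((tweet, label))
    return results


# Made-up sample data, for illustration only.
assets = ["apache", "openssl"]
tweets = [
    "New exploit targets Apache servers",
    "Apache conference tickets on sale",
    "Exploit found in some router",
]
for tweet, label in triage(tweets, assets, toy_classifier):
    print(label, "|", tweet)
```

Passing the classifier as a function keeps the filtering stage independent of the model, so a trained network could be substituted without changing `triage`.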
PROPOSED SYSTEM
The overall goal of this work is to propose an approach to automatically identify and
profile emerging cyber threats based on OSINT (Open Source Intelligence) in order
to generate timely alerts to cyber security engineers. To achieve this goal, we
propose a solution whose macro steps are listed below.
SYSTEM REQUIREMENTS
SOFTWARE REQUIREMENTS:
Front-End : Python.
Back-End : Django-ORM