Automated Emerging Cyber Threat Identification and Profiling Based On Natural Language Processing

The document discusses existing systems for identifying cyber threats from social media data. It notes limitations in existing approaches, including not naming identified threats, relying on external classifiers, and not profiling threats. The proposed system aims to address these limitations by automatically identifying and profiling emerging threats using Twitter data and the MITRE ATT&CK framework.

Uploaded by

Venkatesh Rachamalla

Automated Emerging Cyber Threat Identification and Profiling Based on Natural Language Processing

ABSTRACT

The time window between the disclosure of a new cyber vulnerability and its use by
cybercriminals has been shrinking over time. Recent episodes, such
as the Log4j vulnerability, exemplify this well. Within hours of the exploit being
released, attackers started scanning the internet for vulnerable hosts on which to
deploy threats such as cryptocurrency miners and ransomware.
Thus, it becomes imperative for the cybersecurity defense strategy to detect threats
and their capabilities as early as possible to maximize the success of prevention
actions. Although crucial, discovering new threats is a challenging activity for
security analysts due to the immense volume of data and information sources to be
analyzed for signs that a threat is emerging. To this end, we present a framework for
automatic identification and profiling of emerging threats using Twitter messages as
a source of events and MITRE ATT&CK as a source of knowledge for threat
characterization. The framework comprises three main parts: identification of cyber
threats and their names; profiling the identified threat in terms of its intentions or
goals by employing two machine learning layers to filter and classify tweets; and
alarm generation based on the threat’s risk. The main contribution of our work is the
approach to characterize or profile the identified threats in terms of their intentions or
goals, providing additional context on the threat and avenues for mitigation. In our
experiments, the profiling stage reached an F1 score of 77% in correctly profiling
discovered threats.
EXISTING SYSTEM

Cybersecurity is becoming an ever-increasing concern for most organizations, and
much research has been conducted in this field over the last few years. Inside these
organizations, the Security Operations Center (SOC) is the central nervous system
that provides the necessary security against cyber threats. However, to be effective,
the SOC requires timely and relevant threat intelligence to accurately and properly
monitor, maintain, and secure an IT infrastructure. This leads security analysts to
strive for threat awareness by collecting and reading various information feeds.
However, if done manually, this is a tedious and extensive task that may
yield little knowledge given the large amount of irrelevant
information. Research has shown that Open Source Intelligence (OSINT) provides
useful information to identify emerging cyber threats.

OSINT is the collection, analysis, and use of data from openly available sources for
intelligence purposes [21]. Examples of sources for OSINT are public blogs, dark
and deep websites, forums, and social media. In such platforms, any person or entity
on the Internet can publish, in real-time, information in natural language related to
cyber security, including incidents, new threats, and vulnerabilities. Among the
OSINT sources for cyber threat intelligence, the social media platform
Twitter stands out as one of the most representative [22]. Cyber security experts, system
administrators, and hackers constantly use Twitter to discuss technical details about
cyber attacks and share their experiences [4].

The use of OSINT to automatically identify cyber threats via social media, forums,
and other openly available sources using text analytics has been proposed in several
studies [1], [23], [7], [24], [25], [26], [13], [27], and [28]. However, most
proposals focus on identifying important events related to cyber threats or
vulnerabilities but do not propose identifying and profiling cyber threats.

Among these studies, [13] proposes an early cyber threat warning system that mines
online chatter from cyber actors on social media, security blogs, and dark web
forums to identify words that signal potential cyber-attacks. The framework
comprises two main components: text mining and warning generation. The text
mining phase consists of pre-processing the input data to identify potential threat
names by discarding “known” terms and selecting “unknown” terms that repeat across
different sources, as they can potentially be the name of a new or discovered cyber
threat. The second component, warning generation, is responsible for issuing alarms
for unknown terms that meet certain requirements, such as appearing twice in a given
period of time. The approach presented in that research uses keyword filtering as the
only strategy to identify cyber threat names, which may result in false positives, as
unknown words may appear in tweets or other content not necessarily related to
cyber security. Additionally, that research does not profile the identified cyber threat.
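
The text-mining and warning-generation idea described for [13] can be illustrated with a minimal sketch. It is not the implementation from [13]: the vocabulary KNOWN_TERMS, the function names, the sample posts, and the rule "seen in at least two distinct sources" are all assumptions made for illustration, and the time window mentioned above is left out for brevity.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary of "known" terms (assumption); a real system
# would rely on a large dictionary rather than this tiny set.
KNOWN_TERMS = {
    "new", "ransomware", "is", "spreading", "the", "campaign", "seen",
    "again", "in", "wild", "patch", "your", "servers",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unknown_term_warnings(posts, known_terms=KNOWN_TERMS, min_sources=2):
    """Return unknown terms that appear in at least `min_sources` sources.

    `posts` is an iterable of (source, text) pairs.
    """
    sources_by_term = defaultdict(set)
    for source, text in posts:
        for token in tokenize(text):
            if token not in known_terms:
                sources_by_term[token].add(source)
    return sorted(
        term for term, sources in sources_by_term.items()
        if len(sources) >= min_sources
    )


# Invented sample posts; "blackfoo" is a made-up threat name.
posts = [
    ("twitter", "New ransomware blackfoo is spreading"),
    ("blog", "The blackfoo campaign seen again in the wild"),
    ("forum", "Patch your servers"),
]
print(unknown_term_warnings(posts))  # ['blackfoo']
```

Because the filter only checks whether a word is unknown, any unfamiliar word repeated across sources would raise a warning, which is the false-positive risk noted above.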

In [26], an approach for identifying and classifying cyber threat indicators in the
Twitter stream is presented. The research proposes a data-driven approach for
modeling and classification of tweets using a cascaded Convolutional Neural
Network (CNN) architecture to both classify tweets as related or not to cyber security
and classify the cyber-related tweets into a fixed list of cyber threats. The proposed
solution includes a pre-processing phase that uses IBM’s Watson Natural Language
API to identify tweets related to cyber security according to Watson classification
results. Additionally, in the pre-processing phase, there is a pre-labeling step
performed by simple string matching on the raw tweet text. The threat types
considered were: “vulnerability”, “DDoS”, “ransomware”, “botnet”, “data
leak”, “zero-day” and “general”. Further, the proposed approach uses CNN
models trained to classify each tweet as relevant or irrelevant to cyber security. The
relevant tweets are passed to a second CNN layer to be classified as one of the 8
different threat types mentioned above. Our proposal differs from this one in
several important ways.

First, the approach in [26] does not name the identified threat. Naming the threat is
an important step in cyber threat intelligence, as it may allow analysts to identify and
mitigate campaigns based on the historic modus operandi employed by a given threat
or group.

Second, the approach in [26] relies on an external component to classify tweets as
related or not to cyber security, whereas our approach proposes a
component to classify tweets using machine learning trained with the evolving
knowledge from MITRE ATT&CK. Third, instead of using a keyword match to
pre-filter threats and a fixed list of threat types, we present an approach to profile the
identified cyber threat to spot in which phase or phases of the cyber kill chain the
given threat operates. This is important for a cyber threat analyst, as he or she may
employ the necessary mitigation steps depending on the threat profile.
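
For reference, the pre-labeling step by simple string matching described for [26] could look roughly like the sketch below. The function name pre_label and the choice to fall back to "general" when no keyword matches are assumptions for illustration; the source does not specify how [26] assigns that label.

```python
# Threat types listed in the description of [26]; "general" is handled
# separately as a fallback label (an assumption of this sketch).
THREAT_TYPES = [
    "vulnerability", "DDoS", "ransomware", "botnet", "data leak", "zero-day",
]


def pre_label(tweet):
    """Return the first threat type whose name occurs in the tweet text."""
    text = tweet.lower()
    for threat_type in THREAT_TYPES:
        if threat_type.lower() in text:
            return threat_type
    return "general"


print(pre_label("Massive DDoS hits bank"))         # DDoS
print(pre_label("Security conference next week"))  # general
```

A fixed keyword list like this cannot label a threat type it has never seen, which is one motivation for the profiling approach proposed in this work.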

In [1], a framework for automatically gathering cyber threat intelligence from Twitter
is presented. The framework utilizes a novelty detection model to classify the tweets
as relevant or irrelevant to cyber threat intelligence. The novelty classifier learns the
features of cyber threat intelligence from the threat descriptions in the Common
Vulnerabilities and Exposures (CVE) database and classifies a new unseen tweet as
normal or abnormal in relation to cyber threat intelligence. The normal tweets are
considered cyber threat-relevant, while the abnormal tweets are considered
cyber threat-irrelevant. The paper evaluates the framework on a data set created
from tweets collected over a period of twelve months in 2018 from 50 influential
cyber security-related accounts. During the evaluation, the framework achieved its
highest performance, an F1-score of 0.643, when classifying cyber
threat tweets. According to the authors, the proposed approach outperformed several
baselines, including binary classification models. The authors also analyzed the
correctly classified cyber threat tweets and discovered that 81 of them do not contain
a CVE identifier, and that 34 of those 81 tweets can be associated
with a CVE identifier included in the top 10 most similar CVE descriptions of each
tweet. Despite presenting a way to distinguish between relevant and irrelevant
tweets, the proposal does not address the identification of threats and their intentions.
Those are important requirements for Cyber Threat Intelligence in formulating
defense strategies against emerging threats.
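
The novelty-detection idea can be sketched as follows. This is a deliberately simplified stand-in, not the model used in [1]: it builds a bag-of-words centroid from a few invented CVE-style descriptions and treats a tweet as "normal" (relevant) when its cosine similarity to that centroid reaches a threshold. The class name, the threshold of 0.2, and the sample texts are all assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidNoveltyDetector:
    """Learns only from 'normal' examples (threat descriptions)."""

    def __init__(self, threshold=0.2):  # threshold is an arbitrary assumption
        self.threshold = threshold
        self.centroid = Counter()

    def fit(self, descriptions):
        for description in descriptions:
            self.centroid.update(bag_of_words(description))
        return self

    def is_relevant(self, tweet):
        """True if the tweet looks 'normal' relative to the training data."""
        return cosine(bag_of_words(tweet), self.centroid) >= self.threshold


# Invented descriptions written in the style of CVE entries.
detector = CentroidNoveltyDetector().fit([
    "Buffer overflow in the web server allows remote attackers to execute arbitrary code",
    "SQL injection vulnerability allows remote attackers to read sensitive data",
])
print(detector.is_relevant("Remote attackers can execute arbitrary code via buffer overflow"))  # True
print(detector.is_relevant("Great coffee and sunshine this morning"))  # False
```

The key property shared with the approach in [1] is that training uses only threat descriptions, so no labeled irrelevant tweets are needed.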

The tool proposed in [23] collects tweets from a selected subset of accounts using the
Twitter streaming API, and then, by using keyword-based filtering, it discards tweets
unrelated to the monitored infrastructure assets. To classify and extract information
from tweets, the tool uses a sequence of two deep neural networks. The first is a
binary classifier based on a Convolutional Neural Network (CNN) architecture used
for Natural Language Processing (NLP) [29]. It receives tweets that may be
referencing an asset from the monitored infrastructure and labels them as either
relevant when the tweets contain security-related information, or irrelevant
otherwise.

Relevant tweets are processed for information extraction by a Named Entity
Recognition (NER) model, implemented as a Bidirectional Long Short-Term
Memory (BiLSTM) neural network [30]. This network labels each word in a tweet
with one of six entities used to locate relevant information. Furthermore, the authors
chose deep learning techniques because of their advantages in
the NLP domain [31]. Thus, they propose an end-to-end threat intelligence tool that
relies on neural networks with no feature engineering.
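
The three-stage flow described for [23] (keyword filter, binary relevance classifier, entity labeling) can be sketched as a simple pipeline. The two neural networks are replaced here by toy stand-in functions, and the label names "ASSET" and "O" are hypothetical; the source does not list the six entities used in [23].

```python
def mentions_asset(tweet, assets):
    """Keyword-based filter: keep tweets that mention a monitored asset."""
    text = tweet.lower()
    return any(asset.lower() in text for asset in assets)


def run_pipeline(tweets, assets, is_relevant, tag_entities):
    """Apply the filter, then the classifier, then the entity tagger."""
    results = []
    for tweet in tweets:
        if not mentions_asset(tweet, assets):
            continue  # unrelated to the monitored infrastructure
        if not is_relevant(tweet):
            continue  # no security-related information
        results.append((tweet, tag_entities(tweet)))
    return results


# Toy stand-ins (assumptions) for the CNN classifier and the BiLSTM tagger.
def toy_is_relevant(tweet):
    return "vulnerability" in tweet.lower()


def toy_tag_entities(tweet):
    return [(word, "ASSET" if word.lower() == "apache" else "O")
            for word in tweet.split()]


tweets = [
    "Critical vulnerability found in Apache today",
    "Apache meetup was fun",
    "Vulnerability in SomeOtherProduct",
]
results = run_pipeline(tweets, ["apache"], toy_is_relevant, toy_tag_entities)
for tweet, tags in results:
    print(tweet, tags)
```

Passing the classifier and tagger as functions keeps the pipeline structure visible while leaving the actual models open.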
Disadvantages
• The existing systems do not implement multi-class machine learning (ML)
algorithms for the next steps in the pipeline.
• The existing systems do not implement a method to process identified and
classified threats.

PROPOSED SYSTEM

The overall goal of this work is to propose an approach to automatically identify and
profile emerging cyber threats based on OSINT (Open Source Intelligence) in order
to generate timely alerts to cyber security engineers. To achieve this goal, we
propose a solution whose macro steps are listed below.

1) Continuously monitoring and collecting posts from prominent people and
companies on Twitter to mine unknown terms related to cyber threats and malicious
campaigns;
2) Using Natural Language Processing (NLP) and Machine Learning (ML) to
identify those terms most likely to be threat names and discard those least likely;
3) Leveraging the procedure examples of MITRE ATT&CK techniques to identify the
tactic most likely employed by the discovered threat;
4) Generating timely alerts for new or developing threats along with their
characterization or goals, associated with a risk rate based on how fast the threat is
evolving since its identification.
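
Steps 3 and 4 can be illustrated with a minimal sketch. It is not the proposed system's implementation, which uses machine learning layers: here a tweet is matched to a tactic by token overlap (Jaccard similarity), and the risk rate is taken to be mentions per hour since the first mention. The tactic names are real MITRE ATT&CK tactics, but the procedure-style sentences, the function names, and the risk formula are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta

# Invented procedure-style sentences keyed by ATT&CK tactic (assumption).
PROCEDURE_EXAMPLES = {
    "Initial Access": ["sends spearphishing emails with malicious attachments"],
    "Impact": ["encrypts files on the victim machine and demands a ransom"],
    "Exfiltration": ["uploads stolen data to a remote server"],
}


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_likely_tactic(tweet, examples=PROCEDURE_EXAMPLES):
    """Return the tactic whose example sentence overlaps most with the tweet."""
    tweet_tokens = tokens(tweet)
    best_tactic, best_score = None, 0.0
    for tactic, sentences in examples.items():
        for sentence in sentences:
            score = jaccard(tweet_tokens, tokens(sentence))
            if score > best_score:
                best_tactic, best_score = tactic, score
    return best_tactic


def risk_rate(mention_times, now):
    """Mentions per hour since the first mention (hypothetical risk measure)."""
    if not mention_times:
        return 0.0
    hours = max((now - min(mention_times)).total_seconds() / 3600, 1.0)
    return len(mention_times) / hours


start = datetime(2024, 1, 1)
mentions = [start + timedelta(hours=h) for h in range(4)]
print(most_likely_tactic("blackfoo encrypts files and demands a ransom payment"))  # Impact
print(risk_rate(mentions, now=start + timedelta(hours=4)))  # 1.0
```

An alert could then combine the threat name, the inferred tactic, and the risk rate, with higher rates indicating a threat that is spreading faster.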
Advantages

To conduct a cyber-attack, malicious actors typically have to:

1) identify vulnerabilities,
2) acquire the necessary tools and tradecraft to successfully exploit them,
3) choose a target and recruit participants,
4) create or purchase the infrastructure needed, and
5) plan and execute the attack.

Other actors, such as system administrators, security analysts, and even victims, may
discuss vulnerabilities or coordinate a response to attacks.

SYSTEM REQUIREMENTS