
© IEEE 2019

Accepted for the 2019 International Joint Conference on Neural Networks

Cyberthreat Detection from Twitter using Deep Neural Networks
Nuno Dionísio, Fernando Alves, Pedro M. Ferreira and Alysson Bessani
LASIGE, Faculdade de Ciências, Universidade de Lisboa
Lisboa 1749-016, Portugal
Email: {ndionisio, falves}@lasige.di.fc.ul.pt, {pmf, anbessani}@ciencias.ulisboa.pt

arXiv:1904.01127v1 [cs.LG] 1 Apr 2019

Abstract—To be prepared against cyberattacks, most organizations resort to security information and event management systems to monitor their infrastructures. These systems depend on the timeliness and relevance of the latest updates, patches and threats provided by cyberthreat intelligence feeds. Open source intelligence platforms, namely social media networks such as Twitter, are capable of aggregating a vast amount of cybersecurity-related sources. To process such information streams, we require scalable and efficient tools capable of identifying and summarizing relevant information for specified assets. This paper presents the processing pipeline of a novel tool that uses deep neural networks to process cybersecurity information received from Twitter. A convolutional neural network identifies tweets containing security-related information relevant to assets in an IT infrastructure. Then, a bidirectional long short-term memory network extracts named entities from these tweets to form a security alert or to fill an indicator of compromise. The proposed pipeline achieves an average 94% true positive rate and 91% true negative rate for the classification task and an average F1-score of 92% for the named entity recognition task, across three case study infrastructures.

Index Terms—deep neural networks, threat detection, indicators of compromise, Twitter, OSINT

I. INTRODUCTION

Cybersecurity is becoming an ever-increasing concern for most organizations, with studies estimating the cost of cybercrime to be up to 0.8 percent of the global GDP [1]. A Security Operations Center (SOC) requires timely and relevant threat intelligence to accurately and properly monitor, maintain, and secure an IT infrastructure. This leads security analysts to strive for threat awareness by collecting and reading various information feeds. However, if done manually, this is a tedious and extensive task that may yield little knowledge, given the large amount of irrelevant information an analyst may have to go through. Although there is the option of subscribing to a paid service to receive curated feeds, research has shown that Open Source Intelligence (OSINT) provides useful information to create Indicators of Compromise (IoC) [2]–[4].

In this paper, we present a threat intelligence tool that employs deep neural network architectures to process a data stream, identifying relevant security-related information and extracting relevant entities, thus synthesizing valuable knowledge that can be used by a SOC.

Although other text-based sources could leverage our approach, we opted to focus on Twitter due to its ability to act as a natural aggregator of multiple sources [5]. This social media platform offers a large and diverse pool of users, high accessibility, and timeliness, thus producing a large volume of data. These properties also hold in the cybersecurity field [4]. From security researchers, companies and enthusiasts to hacker groups, there is a rich and timely flow of security-related data.

The tool proposed in this paper collects tweets from a selected subset of accounts using the Twitter streaming API, and then, by using keyword-based filtering, it discards tweets unrelated to the monitored infrastructure assets. To classify and extract information from tweets we use a sequence of two deep neural networks. The first is a binary classifier based on a Convolutional Neural Network (CNN) architecture used for Natural Language Processing (NLP) [6]. It receives tweets that may be referencing an asset from the monitored infrastructure and labels them as either relevant, when the tweets contain security-related information, or irrelevant otherwise. Relevant tweets are processed for information extraction by a Named Entity Recognition (NER) model, implemented as a Bidirectional Long Short-Term Memory (BiLSTM) neural network [7]. This network labels each word in a tweet with one of six entities used to locate relevant information.

We are seeking a complete end-to-end architecture with no requirement for feature engineering or extra components in the processing pipeline. Furthermore, we have chosen to pursue the application of deep learning techniques because of their advantages in the NLP domain [8]. Thus, we propose an end-to-end threat intelligence tool which relies on neural networks with no feature engineering. The pipeline is capable of receiving tweets relevant to an infrastructure, selecting those which appear to contain relevant information regarding an asset's security, and extracting valuable entities which can be used to issue a security alert or to improve an IoC.

To evaluate our approach we collected tweets corresponding to three IT infrastructures defined by three private organizations: a nation-wide power utility, a large company's cybersecurity consultancy department, and a worldwide travel services provider. We establish a methodology through which, according to a defined evaluation metric, we compare several variations of the deep learning architectures to select the model which provides the best performance. Furthermore, we compare the proposed models to other well-known classifiers and provide a detailed analysis of the results obtained.
The evaluation shows that our approach is capable of finding, on average, more than 92% of the relevant tweets, and matching cybersecurity-relevant labels to named entities within these tweets with an average F1-score above 90%. Based on the best models obtained in our experiments, we retrieved tweets where the NER models were capable of extracting relevant entities and performed a brief analysis which demonstrates the timeliness of Twitter as a valuable source for relevant cyberthreat awareness.

Fig. 1. Twitter threat detection pipeline.

II. RELATED WORK

In this section, we review previous work that used Twitter or other OSINT sources for cybersecurity, and employed machine learning for cyberthreat detection. Additionally, we review previous research on deep learning techniques to process tweets and perform NER or other NLP tasks.

A. Machine learning and OSINT for cyberthreat detection

Previous work has explored the possibility of using Twitter as an information source for cyberthreat awareness. Le Sceller et al. [2] proposed SONAR, an automatic keyword-centric self-learned framework that can detect, geolocate and categorize cybersecurity events in near real-time, based on a Twitter stream. The authors describe the framework in three phases. The first is similar to ours, as the framework queries Twitter with a list of keywords to retrieve a continuous stream of tweets. However, instead of querying Twitter as a whole, we focus on a pre-defined set of accounts that are more likely to tweet security-related content, thus avoiding a possible performance decrease due to noisy content. Moreover, the keywords we use are not cyberthreat related (e.g., attacks and vulnerabilities). Instead, we query for a list of keywords that define the assets in the monitored infrastructure. The second phase of SONAR is about event detection. A clustering technique is used to aggregate similar tweets that may report the same cybersecurity event. Given that the system relies on keywords being up-to-date, the third phase aims to find new keywords based on their co-occurrence with other previously defined tweets. These two phases differ from ours in scope, since our aim is not to build a general information security news feed but rather a tool to gather the most recent information related to the security of a pre-defined list of assets.

Sabottke et al. [4] proposed a Twitter-based exploit detector using an SVM classifier. The detector is capable of extracting vulnerability-related information from Twitter, augmenting it with additional sources and predicting whether the vulnerability is exploitable in a real-world scenario. An interesting feature of this work is the consideration of adversarial interference to deceive the classifier. These adversaries vary in their knowledge of the classifier and in the complexity of their poisoning attacks. Currently, our system uses sets of pre-defined user accounts based on the likelihood that they tweet about the security of the protected IT infrastructure.

Regarding IoC extraction from OSINT, Liao et al. [3] developed iACE, an automated tool capable of IoC extraction. It extracts information from technical articles, which tend to provide information in a more structured and formal manner when compared to Twitter. The authors take advantage of some properties of these articles, such as a set of context terms and their grammatical relations. Unlike iACE, our approach aims to explore the use of deep learning techniques to extract grammatical relations automatically, removing the necessity for manual feature engineering.

Concurrently with this work, Zhou et al. [9] employed a NER architecture for extracting IoC from cybersecurity reports. Differently from our architecture, their proposal integrates manually engineered features and an attention mechanism with the BiLSTM.

B. Deep learning for tweet analysis

Using a similar architecture to the one we use, Severyn et al. [10] perform sentiment analysis on Twitter data taken from the SemEval 2015 Twitter Sentiment Analysis challenge. In comparison to other systems submitted to the challenge, their solution ranked in the first two positions in both tasks.

Badjatiya et al. [11] experimented with several classifiers including, but not limited to, deep learning methods to detect hate speech in tweets. The authors report that deep learning techniques perform significantly better than other methods.

Regarding applications of deep learning for NER tasks using Twitter data, Jimeno-Yepes et al. [12] implemented a sequence-to-sequence LSTM architecture for the annotation of medical entities to support public health surveillance. The architecture presented outperforms the previous state-of-the-art. Regarding future work, the authors plan on using the architecture proposed by Lample et al. [7], which is the one on which we have based our NER approach.

III. TOOL ARCHITECTURE

Our tool follows the high-level three-stage pipeline architecture depicted in Figure 1 to process data from Twitter and extract relevant cybersecurity information. The first stage collects tweets through the Twitter API, filters them based on a set of keywords and normalizes tweets to a specific format. In the second stage, a binary classifier labels tweets as either relevant, meaning the tweets are likely to contain valuable information about an asset, or irrelevant otherwise. Finally, in the information extraction stage, relevant tweets are processed by an NER network. The information extracted can be used to issue a security alert or to enrich an existing IoC in a threat intelligence platform such as MISP [13]. The following sections describe the three stages.
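The three stages can be pictured as a small composition of functions. The following is only a minimal sketch, not the authors' implementation: the function names are hypothetical, the keyword match is a plain substring test (an assumption), the normalization follows the rules given in Section III-A but keeps only ASCII letters and digits (an assumption), and the classifier and NER network are passed in as stand-in callables.

```python
import re
from typing import Callable, Iterable, Iterator, List, Tuple

URL_RE = re.compile(r"https?://\S+")
# Keep letters, digits, whitespace and the four characters preserved in Section III-A.
# Restricting letters to ASCII is an assumption of this sketch.
DROP_RE = re.compile(r"[^a-z0-9\s.\-_:]")


def mentions_asset(text: str, keywords: Iterable[str]) -> bool:
    """Stage 1 filter: keep a tweet only if it mentions a monitored keyword."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def normalize(text: str) -> str:
    """Stage 1 pre-processing: lower-case, strip hyperlinks and special characters."""
    text = URL_RE.sub(" ", text.lower())
    text = DROP_RE.sub(" ", text)
    return " ".join(text.split())


def run_pipeline(
    tweets: Iterable[str],
    keywords: Iterable[str],
    is_relevant: Callable[[str], bool],             # stand-in for the CNN classifier
    extract_entities: Callable[[str], List[str]],   # stand-in for the BiLSTM NER model
) -> Iterator[Tuple[str, List[str]]]:
    keywords = list(keywords)
    for raw in tweets:
        if not mentions_asset(raw, keywords):
            continue                                # dropped by the keyword filter
        clean = normalize(raw)
        if is_relevant(clean):                      # stage 2: classification
            yield clean, extract_entities(clean)    # stage 3: information extraction


# Toy usage with trivial stand-ins for the two neural networks.
sample = [
    "Critical RCE in Apache Struts CVE-2017-5638! https://t.co/abc",
    "Nice weather today",
]
alerts = list(
    run_pipeline(
        sample,
        keywords=["struts"],
        is_relevant=lambda t: "cve" in t,
        extract_entities=lambda t: [w for w in t.split() if w.startswith("cve-")],
    )
)
```

In a real deployment the two lambdas would be replaced by the trained models; the sketch only shows the order in which tweets are dropped or passed on.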
A. Data collection

The data collection stage comprises the execution of three tasks: data collection, keyword-based pre-filtering, and text pre-processing.

1) Collector: The collector requires a set of Twitter accounts whose tweets it retrieves through Twitter's streaming API. These accounts can be from companies, security vendors, hacker groups, security analysts, and researchers. Ideally, they are chosen based on their likelihood of tweeting about components present in the monitored IT infrastructure. This approach of retrieving tweets from a specific set of accounts can drastically reduce the amount of irrelevant information that is gathered by the collector.

2) Filter: By assuming that a tweet about an infrastructure asset has to mention its properties and components, the filter module employs a set of user-defined keywords describing the infrastructure being monitored to drop irrelevant tweets, thus further decreasing the amount of information that flows through the pipeline. For instance, if an analyst desires to be informed about potential threats to a web service hosted on a cloud platform, the set of keywords should include operating systems, the cloud platform being used and all other components supporting the asset in question.

3) Pre-processing: At the end of the data collection stage, a pre-processing module transforms tweets that passed the filter, making their representation uniform before subsequent processing by the classifier and NER modules. Characters are converted to lower-case, and hyperlinks and special characters are removed, except for '.', '-', '_', and ':', as they are often used in IDs, version numbers, or component names.

B. Classification

The classification stage aims to detect tweets that contain security-related information so that only these will proceed to the Named Entity Recognition (NER) stage. To perform this task, we implemented a binary classifier using a CNN [6] whose architecture can be described by five layers: input, embedding, convolution, max-over-time-pooling, and output. Figure 2 shows this architecture.

Fig. 2. Convolutional Neural Network architecture for sentence classification.

1) Input layer: The CNN receives a sentence represented by a sequence of $n$ integers, each representing a word token, as shown in Figure 2.

2) Embedding layer: Each integer corresponds to a $d$-dimensional numeric vector representing the semantic value of the corresponding word. These word vectors can be randomly initialized or extracted from previously trained language models (e.g., GloVe [14] or Word2Vec [15]), providing a starting point for the semantic value. In both cases, the learning algorithm may further adjust the word vectors. As a result of this layer, we have an $n \times d$ sentence embedding matrix $S$.

3) Convolution layer: With a set of $k$ learnable kernels, each containing $f$ filters with height $h$ and width $d$, the convolution operation slides these kernels down the embedded matrix $S$, producing $k \times f$ feature maps. The height and width of the filters correspond to the number of words covered and to the dimension of the word vectors, respectively. As the filter slides down the matrix, the resulting feature map will have a length of $n - h + 1$, where $n$ is the number of words in the sentence.

A single feature $c_i$, produced by filter $W$, is computed as

$$c_i = \sigma(W \odot S_{i:i+h-1} + b), \qquad (1)$$

where $\sigma$ is a non-linear function, $\odot$ denotes the Hadamard product, $b$ is a bias term, and $S_{i:i+h-1}$ denotes the sub-matrix of $S$ formed by rows $i$ to $i + h - 1$. Given that this operation computes one feature for each stride of the filter over the embedded matrix, the convolutional layer outputs a set of $k \times f$ feature map vectors, each containing $n - h + 1$ features.

4) Max-over-time-pooling layer: Intuitively, if one feature map results from sliding a filter over the embedding matrix, the feature values composing that map denote how strong the feature is within a specific input window. We reduce each feature map to a single value providing information about the presence or absence of a feature, through the max-pooling operation [16]. In the end, this layer helps to filter redundant information, thus preventing overfitting and decreasing the computational burden. As each feature map gets reduced to a single value, this layer outputs a $k \times f$ feature vector.

5) Output layer: Before using the selected features in the output layer of the network, dropout [17] is applied to the feature vector. This procedure sets a proportion of the feature vector elements to zero, which prevents gradients from propagating through these elements and discourages co-adaptation of the nodes. Dropout acts as a form of regularization [18], preventing overfitting and promoting generalization. Finally, feature nodes are used by a fully-connected softmax layer which outputs the probability that a tweet contains relevant information. If relevant, the tweet moves to the next stage of the pipeline; otherwise it is dropped.

C. Named entity recognition

In the NER phase, we aim to extract information from tweets that have been considered relevant by the classifier. We have based our model on a BiLSTM neural network [7]. Figure 3 depicts its architecture.
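As a recap of the classifier in Section III-B before the NER layers are detailed, the shapes produced by the convolution and max-over-time pooling steps can be checked with a small pure-Python sketch. It is illustrative only and rests on stated assumptions: the element-wise product in Eq. (1) is summed to a scalar (as is usual for this kind of text CNN), ReLU stands in for the unspecified non-linearity $\sigma$, and all weights are random rather than learned. The names `conv_feature_map` and `max_over_time` are hypothetical.

```python
import random


def conv_feature_map(S, W, b=0.0):
    """Slide one h x d filter W down the n x d embedding matrix S (Eq. 1)."""
    n, h = len(S), len(W)
    feature_map = []
    for i in range(n - h + 1):                      # one feature per window of h words
        window = S[i:i + h]
        total = sum(
            w * s
            for w_row, s_row in zip(W, window)
            for w, s in zip(w_row, s_row)
        )                                           # summed element-wise product (assumption)
        feature_map.append(max(0.0, total + b))     # ReLU as the non-linearity (assumption)
    return feature_map


def max_over_time(feature_maps):
    """Keep only the strongest response of each filter."""
    return [max(fm) for fm in feature_maps]


random.seed(0)
n, d = 12, 8                                        # 12 words, 8-dimensional embeddings
S = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

kernel_heights = [3, 5, 7]                          # k = 3 kernels
f = 4                                               # f = 4 filters per kernel
filters = [
    [[random.gauss(0, 1) for _ in range(d)] for _ in range(h)]
    for h in kernel_heights
    for _ in range(f)
]

feature_maps = [conv_feature_map(S, W) for W in filters]
features = max_over_time(feature_maps)              # k * f values fed to the output layer
```

With these toy sizes, the filters of height 3 yield feature maps of length 12 - 3 + 1 = 10, those of height 7 yield length 6, and pooling leaves a vector of k × f = 12 values regardless of the tweet length.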
Fig. 3. Bidirectional Long Short-Term Memory architecture for named entity recognition.

This network locates and labels valuable security-related entities such as monitored infrastructure assets, vulnerabilities, attacks, and vulnerability repository IDs mentioned in tweets. To the extent possible, we defined security-related entities according to descriptions from the ENISA risk management glossary [19]. In the following, we briefly describe the network layers: input, embedding, word-level BiLSTM, tweet-level BiLSTM, and output.

1) Input layer: In addition to the $n$ integers representing the word tokens applied to the input layer of the classifier, this network receives secondary sequences of $n_c$ integers, one for each word at the input. Each secondary sequence represents the characters that form a word token. These sequences allow the network to be less constrained by the vocabulary used during training.

2) Embedding layer: This layer's functionality is similar to the classifier's regarding the sequence of $n$ integers that correspond to the words in tweets. These are converted to $d_w$-dimensional numeric vectors, thus providing a word-level semantic representation of the tweet. This results in an $n \times d_w$ matrix representation where $n$ is the number of words and $d_w$ is the length of the embedded word vectors.

Additionally, the $n_c$ integers in each secondary sequence are converted to $d_c$-dimensional numeric vectors, providing a character-level representation of the tweet. Therefore, besides the word-level embedded matrix, each word has a corresponding $c \times d_c$ character-level matrix representation, where $c$ is the number of characters in a word and $d_c$ is the length of the embedded character vectors.

3) Word-level BiLSTM layer: The embedded character matrix corresponding to each word is fed to a BiLSTM network, containing two cells that read the sequence of input character vectors in opposite directions. Both cells possess an $h_c$-dimensional hidden state vector that is updated at every time step (i.e., at every character read). After reading all the characters in the embedded character matrix, the cell states are extracted and concatenated. These vectors hold a character-level representation from both left-to-right and right-to-left readings. As a result, this layer generates an $n \times (2 \times h_c)$ state matrix, for concatenation with the word-level embedding matrix.

4) Tweet-level BiLSTM layer: At this stage we have a matrix of $n$ words, each word represented by a numeric vector with dimension $d_w + (2 \times h_c)$.

Similar to the process described for the word-level BiLSTM, we feed this tweet representation to another BiLSTM layer, word by word. However, while in the previous layer we only retrieve the final hidden state, in this layer we read it at every time step (every word representation read).

Thus, the output of this layer is an $n \times h_t$ matrix, where $h_t$ is the length of the BiLSTM's cell state vector.

5) Output layer: In the final layer of the NER model, we have a fully-connected neural network and a Conditional Random Fields (CRF) module [20]. The fully-connected neural network contains $m$ neurons, $m$ being the number of possible labels that the model can predict. As an input to this layer, we have the $n \times h_t$ matrix provided by the previous layer, therefore creating an $n \times m$ output matrix. This matrix can be thought of as a score matrix, where the activations of the $m$ neurons produce a score for each label.

Then, this scoring matrix is used to compute the final sequence of labels. Instead of choosing the highest label score for each word, thus modeling labeling decisions independently, the CRF module allows them to be modeled jointly. Given a sequence of words, the corresponding score vectors, and candidate label sequences, a linear-chain CRF computes a single score for each candidate sequence of labels. By selecting the highest single score, this layer outputs the corresponding candidate sequence of $n$ labels that match the input sequence of $n$ words at the input layer.

IV. EXPERIMENTAL SETUP

This section describes the experimental work conducted to design both the classification and NER neural networks. Additionally, we describe the approach employed to obtain results for other state-of-the-art or well-established methodologies, for comparison purposes.

A. Datasets

To build our system, we continuously collect tweets from two sets of accounts, denoted S1 and S2. The tweets are related to IT infrastructures specified by three private organizations: a worldwide travel services provider, a global-company cybersecurity department, and a nation-wide power utility. Each of the three infrastructures, denoted A, B, and C, is defined by a set of assets, which in turn are specified by a variable number of keywords. For a subset of tweets collected over four months, tweets were analyzed and labelled relevant (or not) for the security of the designated assets.¹

¹ https://github.com/ndionysus/twitter-cyberthreat-detection

Each infrastructure dataset was split and organized into three subset groups, as described in Table I, one for training the models (A1, B1, C1), another for validation (A2, B2, C2) and the remaining one for testing (A3, B3, C3).
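The split described above is chronological, with a second account set added only for testing. A minimal sketch of such an assignment rule follows, using the interval boundaries listed in Table I. The function name is hypothetical, and two details are assumptions of this sketch rather than statements from the paper: how a tweet dated exactly on a shared boundary day is assigned, and that S2 tweets outside the test interval are simply discarded.

```python
from datetime import date

# Interval boundaries taken from Table I (written there as dd/mm/yyyy).
TRAIN_START = date(2016, 11, 21)
TRAIN_END = date(2017, 1, 27)
VALID_END = date(2017, 2, 27)
TEST_END = date(2017, 3, 27)


def assign_subset(day, account_set):
    """Return 'train', 'validation', 'test', or None for one labelled tweet.

    Boundary days belong to the later interval (an assumption of this sketch).
    """
    if TRAIN_START <= day < TRAIN_END:
        return "train" if account_set == "S1" else None
    if TRAIN_END <= day < VALID_END:
        return "validation" if account_set == "S1" else None
    if VALID_END <= day <= TEST_END:
        return "test"              # tweets from both S1 and S2 are admitted
    return None


subset = assign_subset(date(2017, 3, 1), "S2")   # a tweet from the second account set
```

Splitting by time rather than at random means the validation and test sets contain only tweets published after the training data, which is the situation a deployed model would face.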
TABLE I
DATASETS CORRESPONDING TO COMPANIES A, B AND C.

Dataset  Time interval             Accounts  Positives  Negatives  Total
A1       21/11/2016 to 27/01/2017  S1        1074       694        1768
B1       21/11/2016 to 27/01/2017  S1        1201       638        1839
C1       21/11/2016 to 27/01/2017  S1        1293       1473       2766
A2       27/01/2017 to 27/02/2017  S1        282        767        1049
B2       27/01/2017 to 27/02/2017  S1        387        671        1058
C2       27/01/2017 to 27/02/2017  S1        325        592        917
A3       27/02/2017 to 27/03/2017  S1 + S2   219        313        532
B3       27/02/2017 to 27/03/2017  S1 + S2   250        247        497
C3       27/02/2017 to 27/03/2017  S1 + S2   289        358        647

The validation subset is distinguished solely by the time interval, being collected one month after the training subset. More importantly, the testing subset is distinguished from the others not only by the time interval but also because it includes tweets from an additional set of accounts. We have chosen this strategy so that the assessment of the models' performance and generalization capability considers both tweets in the future of the training data and a combination of that with previously unseen Twitter accounts.

Table II displays the labels considered for the NER task. Although we defined the entities considering descriptions from the ENISA risk management glossary [19], we adopted broad definitions, so that the labels become more distinguishable by the network. For example, the label VUL includes both vulnerabilities and threats. The reason for this is twofold. On the one hand, we have limited data for training. Thus, more specific entity definitions could result in too few instances in the datasets. On the other hand, as Twitter is an informal OSINT source, many users of interest do not follow any international standard when referring to security matters. Once more data is available and labeled, the number of entities could be extended, allowing for a more accurate taxonomy.

TABLE II
NAMED ENTITIES TO BE EXTRACTED FROM A TWEET.

Label  Description
O      Does not contain useful information.
ORG    Company or organization.
PRO    A product or asset.
VER    A version number, possibly from the identified asset or product.
VUL    May be referencing the existence of a threat or a vulnerability.
ID     An identifier, either from a public repository such as the National Vulnerability Database (NVD) [21], or from an update or patch.

B. Training and evaluation methodology

As baselines for the deep-learning classifier described in previous sections, we used a linear approach, the Support Vector Machine (SVM) [22], and a non-linear one, the Multi-Layer Perceptron (MLP) [23], [24]. The described deep-learning NER methodology was compared to the Stanford CRF NER approach [25], trained with the procedure and parameters suggested by the authors.

The two deep architecture models described in previous sections were implemented using TensorFlow [26]. Training was conducted with the TensorFlow implementation of the Adam optimizer [27] using the default parameters. The procedure was executed for 100 epochs with a batch size of 256, using the validation dataset for early stopping. The SVM and MLP models were implemented using the Apache Spark Machine Learning library [28]. Training was carried out using the Stochastic Gradient Descent (SGD) [29] optimizer for the SVM model, and the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) [30] optimizer for the MLP. In both cases, most parameters have been kept equal to the package defaults, except those considered in a grid search executed to select appropriate models. For the SVM, 100 iterations were employed, whereas 200 were used for the MLP.

As a grid search was used to select appropriate hyper-parameter values and design variable settings for these models, in all cases a 10-fold cross-validation methodology was employed. The 10-fold results were averaged and used to select the best models and variants for comparison. Regarding the classification task, besides using the CNN described, we also adapted the BiLSTM architecture for classification. The adaptation consists of removing the CRF layer, using only the last state of both LSTMs, and having only two neurons at the output for binary classification.

Having picked the best configuration of each model type and variant for each infrastructure, we train the models with the full training set, using the validation sets for early stopping, and then test on the testing sets (A3, B3, and C3).

Next, we describe the grid search procedure executed, highlighting the hyper-parameter values and design variable settings considered.

C. Model optimization using grid search

1) Embedding vectors and BiLSTM hidden states: Regarding the initialization of the embedding vectors of deep neural architectures, we tested randomly initialized embedding, GloVe pre-trained embedding [14], and Word2Vec pre-trained embedding [15]. For the random initialization case, we tested with vector dimensions of 100, 200, and 300. The vector dimension applied in pre-trained embedding approaches was 300. Regarding the NER BiLSTM approach, character embedding was done only using random initialization with vector dimension varied within {25, 50, 100}.

For the NER model, the sizes of both state vectors varied within the same alternatives as the randomly initialized embedding vectors.

2) CNN convolutional layer: In the convolutional layer of the CNN, we varied the kernels in number, height, and number of filters, in the following way:
• Number of kernels: varied between 3 and 6;
• Kernel heights: varied incrementally according to the number of kernels, either in a sequential manner (e.g., 2,3,4) or by parity (e.g., 3,5,7, or 2,4,6);
• Number of filters: varied within {64, 128, 192, 256}.
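The convolutional-layer search space listed above can be enumerated mechanically. The sketch below is one possible reading, not the authors' code: the paper gives only examples of height patterns, so the starting heights (2 for the sequential and even patterns, 3 for the odd pattern) are assumptions, and the helper names are hypothetical. Embedding and dropout settings, which the paper varies separately, are left out.

```python
from itertools import product


def kernel_height_options(num_kernels):
    """Height patterns for a given number of kernels (starting values are assumptions)."""
    sequential = [2 + i for i in range(num_kernels)]    # e.g. 2,3,4
    odd = [3 + 2 * i for i in range(num_kernels)]       # e.g. 3,5,7
    even = [2 + 2 * i for i in range(num_kernels)]      # e.g. 2,4,6
    return [sequential, odd, even]


def cnn_grid():
    """Yield every (kernel heights, number of filters) combination."""
    for num_kernels in range(3, 7):                     # 3 to 6 kernels
        options = kernel_height_options(num_kernels)
        for heights, num_filters in product(options, (64, 128, 192, 256)):
            yield {"kernel_heights": heights, "num_filters": num_filters}


grid = list(cnn_grid())
print(len(grid))   # 4 kernel counts x 3 height patterns x 4 filter counts = 48
```

Under these assumptions the grid has 48 convolutional configurations, each of which would then be scored with the 10-fold cross-validation described in Section IV-B.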
TABLE III
BINARY CLASSIFIER RESULTS ACROSS ALL INFRASTRUCTURES.

Model              A1           A2           A3           B1           B2           B3           C1           C2           C3
                   TPR   TNR    TPR   TNR    TPR   TNR    TPR   TNR    TPR   TNR    TPR   TNR    TPR   TNR    TPR   TNR    TPR   TNR
SVM                0.953 0.970  0.915 0.866  0.785 0.831  0.977 0.943  0.995 0.450  0.996 0.575  0.968 0.972  0.951 0.586  0.989 0.555
MLP                0.976 0.987  0.867 0.951  0.739 0.926  0.992 0.977  0.858 0.910  0.818 0.882  0.993 0.992  0.843 0.878  0.888 0.862
CNN + Random       0.988 0.995  0.916 0.904  0.810 0.886  0.998 0.992  0.918 0.891  0.925 0.890  0.989 0.995  0.873 0.871  0.954 0.901
CNN + GloVe        0.979 0.979  0.936 0.919  0.839 0.901  0.993 0.995  0.959 0.930  0.956 0.927  0.993 0.993  0.890 0.864  0.966 0.904
CNN + Word2Vec     0.998 0.999  0.943 0.932  0.849 0.927  0.994 0.979  0.949 0.911  0.947 0.915  0.995 0.999  0.917 0.875  0.958 0.911
BiLSTM + Random    0.999 0.971  0.915 0.906  0.913 0.904  0.978 0.978  0.920 0.949  0.912 0.964  0.990 0.932  0.865 0.872  0.938 0.911
BiLSTM + GloVe     0.986 0.979  0.897 0.940  0.863 0.946  0.973 0.983  0.951 0.918  0.936 0.943  0.985 0.975  0.874 0.870  0.955 0.911
BiLSTM + Word2Vec  0.995 0.995  0.901 0.897  0.890 0.911  0.977 0.985  0.907 0.951  0.912 0.951  0.997 0.989  0.880 0.872  0.934 0.919
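Section V ranks classifiers by how close their (TPR, TNR) pair lies to the ideal point (1, 1) in the Euclidean sense. As a small worked example, the sketch below applies that rule to the A3 columns of Table III. The values are copied from the table; the helper name `distance_to_ideal` is hypothetical.

```python
from math import hypot

# (TPR, TNR) on test set A3, copied from Table III.
a3_results = {
    "SVM": (0.785, 0.831),
    "MLP": (0.739, 0.926),
    "CNN + Random": (0.810, 0.886),
    "CNN + GloVe": (0.839, 0.901),
    "CNN + Word2Vec": (0.849, 0.927),
    "BiLSTM + Random": (0.913, 0.904),
    "BiLSTM + GloVe": (0.863, 0.946),
    "BiLSTM + Word2Vec": (0.890, 0.911),
}


def distance_to_ideal(tpr, tnr):
    """Euclidean distance from (tpr, tnr) to the ideal result (1, 1)."""
    return hypot(1.0 - tpr, 1.0 - tnr)


best = min(a3_results, key=lambda name: distance_to_ideal(*a3_results[name]))
print(best)   # BiLSTM + Random
```

The outcome agrees with the exception reported in Section V-A for the infrastructure A testing set, where a BiLSTM with randomly initialized embeddings gives the best compromise.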

3) Dropout regularization: Considering the CNN classifica- TABLE IV


tion model, the dropout rate varied within {0.3, 0.5, 0.6}. The A RCHITECTURES OF THE BEST PERFORMING MODELS FOR EACH DATASET.
NER BiLSTM models used two dropout layers, the first after Dataset Kernels Number of Filters Dropout Rate
the word-level BiLSTM and the second after the sentence-level A [3,5,7] 128 0.5
BiLSTM. We considered only two settings: no dropout or 0.5 B [3,5,7] 256 0.5
dropout probability, which resulted in four possible variants. C [3,5,7,9] 256 0.5
4) SVM and MLP models: The SVM and MLP classifiers
use the Term Frequency - Inverse Document Frequency
(TF-IDF) approach to obtain a numeric representation of tested both SVM and MLP models with concatenated pre-
tweets. The TF-IDF feature vector size was varied within the trained word-vector inputs. However, this approach offered
set: {30, 50, 80, 100, 200, 300, 500, 750, 1000, 1500, 3000}.
For the SVM, the regularization parameter C was varied within
{0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5}, while the SGD step size was
varied within {0.1, 0.5, 1, 1.5, 2, 5}. Regarding the MLP, the number of
layers varied from 2 to 8 and the number of neurons within
{5, 7, 10, 12, 14, 16, 18, 20}.

V. RESULTS

Regarding the tweet classification models, we used the True Positive Rate
(TPR) and True Negative Rate (TNR) metrics for evaluation, as these carry
useful information for SOC analysts. For the NER task, the models were
evaluated using the precision, recall, and F1 metrics.
The results presented show the best model found in the grid search
executed for each model type and variant. In the classification task, we
consider best the model whose (TPR, TNR) pair is closest in the Euclidean
sense to the ideal result (1, 1). For NER, we selected the model with the
highest F1-score.

A. Tweet classification
Table III summarizes the results obtained by the different classification
models tested, highlighting the different word embedding approaches. The
result pairs in boldface denote the best compromise between TPR and TNR in
each dataset.
The results indicate that across the three infrastructures the CNN
achieves the best overall results and that pre-trained embedding vectors
are preferable to randomly initialized ones. The exception occurs in the
infrastructure A testing set, in which case a BiLSTM model with randomly
initialized embedding vectors achieved the best result. Well-established
techniques such as SVMs or MLPs produced worse results, being incapable of
achieving a good balance between TPR and TNR in several cases. Using
frequency-based features instead of semantic word vectors may have
influenced the results obtained. We [...] poor results, therefore it was
not further explored. Regarding generalization, we observe that the
performance degradation is small, leaving the results comfortably above
90% in most cases. Importantly, we conclude that there is no significant
performance degradation resulting from the addition of tweets from a
second set of accounts in the testing sets.
Focusing on the CNN results, Table IV displays the best configurations
that resulted from the grid search for each infrastructure. The results
favour models with a small number of kernels with oddly spaced heights
([3, 5, 7] means three filters with heights 3, 5, and 7), a large number
of filters and 0.5 dropout probability.

B. Named entity recognition
Table V displays the best performing models, including the results
obtained by the Stanford CRF NER approach.
In comparison to the Stanford CRF, in general, the BiLSTM architectures
achieved better results across all infrastructures. The exception is the
Precision metric achieved in datasets A2 and C2, where the Stanford CRF
showed equal performance. In general, BiLSTM networks correctly detect the
target entities in more than 90% of the cases.
Except for infrastructure A, the use of pre-trained word embedding vectors
provided a slight gain. However, the improvement is not significant enough
to provide a definitive conclusion. Unlike in the tweet classification
task, where both security-related and non-security-related tweets are
present, in the NER task only the former are present. Thus, the advantage
of using pre-trained embedding vectors based on non-security-related text
may not be as pronounced.
Although in infrastructures A and C the best models obtained the highest
F1-score in both validation and testing sets, in the case of
infrastructure B a model different from the best performing model of the
validation set obtained a better F1-score. However, the difference between
these is negligible.
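The selection rule used for the classification task in this section (pick the model whose (TPR, TNR) pair lies closest, in the Euclidean sense, to the ideal point (1, 1)) can be sketched as follows. This is a minimal illustration only: the function names and the candidate values are hypothetical and are not taken from the paper's grid search.

```python
import math


def distance_to_ideal(tpr, tnr):
    """Euclidean distance from a (TPR, TNR) pair to the ideal point (1, 1)."""
    return math.hypot(1.0 - tpr, 1.0 - tnr)


def select_best(candidates):
    """Return the name of the candidate whose (TPR, TNR) is closest to (1, 1).

    `candidates` maps a configuration name to its (TPR, TNR) pair.
    """
    return min(candidates, key=lambda name: distance_to_ideal(*candidates[name]))


# Hypothetical grid-search outcomes (illustrative numbers, not from the paper).
candidates = {
    "config_a": (0.95, 0.80),
    "config_b": (0.90, 0.91),
    "config_c": (0.99, 0.60),
}

best = select_best(candidates)
print(best, round(distance_to_ideal(*candidates[best]), 4))
```

Under this rule a configuration with balanced rates can be preferred over one with a higher TPR but a much lower TNR, which matches the stated goal of a good compromise between the two metrics.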
TABLE V
NER F1-SCORE RESULTS ACROSS ALL INFRASTRUCTURES.

Model              A1     A2     A3     B1     B2     B3     C1     C2     C3
Stanford CRF       1.000  0.917  0.852  0.999  0.879  0.810  0.999  0.899  0.906
BiLSTM + Random    0.999  0.932  0.906  0.999  0.919  0.882  0.987  0.925  0.928
BiLSTM + GloVe     0.999  0.926  0.894  0.997  0.932  0.888  0.998  0.924  0.932
BiLSTM + Word2Vec  1.000  0.893  0.864  0.998  0.927  0.890  0.984  0.928  0.934
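As a reading aid for Table V, the short sketch below finds the top-scoring model (or models, when tied) for each dataset column. The F1 values are copied from the table; the helper name `best_per_column` is hypothetical and the code is not part of the authors' evaluation pipeline.

```python
COLUMNS = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"]

# F1-scores as reported in Table V, one list per model, ordered as COLUMNS.
F1 = {
    "Stanford CRF":      [1.000, 0.917, 0.852, 0.999, 0.879, 0.810, 0.999, 0.899, 0.906],
    "BiLSTM + Random":   [0.999, 0.932, 0.906, 0.999, 0.919, 0.882, 0.987, 0.925, 0.928],
    "BiLSTM + GloVe":    [0.999, 0.926, 0.894, 0.997, 0.932, 0.888, 0.998, 0.924, 0.932],
    "BiLSTM + Word2Vec": [1.000, 0.893, 0.864, 0.998, 0.927, 0.890, 0.984, 0.928, 0.934],
}


def best_per_column(scores, columns):
    """Map each column name to the list of models that reach its highest F1."""
    best = {}
    for i, col in enumerate(columns):
        top = max(row[i] for row in scores.values())
        best[col] = [model for model, row in scores.items() if row[i] == top]
    return best


for col, models in best_per_column(F1, COLUMNS).items():
    print(col, models)
```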

TABLE VI
VULNERABILITIES PUBLISHED ON TWITTER PRIOR TO BEING DISCLOSED IN NVD

Dataset  NVD date    Tweet date  CVSS  Tweet / labels identified by the NER model
A2       06/02/2017  31/01/2017  7.8   Vuln: Linux Kernel CVE-2017-5546 Local Denial of Service Vulnerability https://t.co/bLEJIb1ZVD
                                       Labels: PRO ID VUL
A3       08/06/2017  27/03/2017  9.9   VMware Player 12.x &lt 12.5.4 Drag-and-Drop Feature Guest-to-Host Code Execution (VMSA-2017-0005) (Linux) https://t.co/xMIP5JlvOZ
                                       Labels: VER VUL PRO ID PRO
A3       27/03/2017  24/03/2017  8.1   Vuln: Broadcom BCM4339 SoC CVE-2017-6957 Stack-Based Buffer Overflow Vulnerability https://t.co/vR6EznOsBi
                                       Labels: PRO ID ID VUL
B2       20/03/2017  29/01/2017  7.1   Vuln: Apache Tomcat CVE-2016-6816 Security Bypass Vulnerability https://t.co/PfOdfDGIfy
                                       Labels: PRO ID VUL
B3       27/07/2018  01/03/2017  4.9   Vuln: Red Hat CloudForms Management Engine CVE-2017-2632 Privilege Escalation Vulnerability https://t.co/Vm0fMMM1Rc
                                       Labels: PRO ID VUL
B3       20/03/2017  16/03/2017  7.1   Vuln: Apache Tomcat CVE-2016-6816 Security Bypass Vulnerability https://t.co/FK5nXKcfy8 #bugtraq
                                       Labels: PRO ID VUL
C2       16/03/2017  06/02/2017  5.9   #Vuln: #Microsoft #Windows CVE-2017-0016 Memory Corruption Vulnerability https://t.co/ZR3DVVgx3j #bugtraq
                                       Labels: ORG PRO ID VUL
C2       15/02/2017  14/02/2017  8.8   ZDI-17-109 : Adobe Flash Player MessageChannel Type Confusion Remote Code Execution Vulnerability https://t.co/hTaiCS671W
                                       Labels: ID PRO VUL
C3       27/07/2018  01/03/2017  4.9   Vuln: Red Hat CloudForms Management Engine CVE-2017-2632 Privilege Escalation Vulnerability https://t.co/Vm0fMMM1Rc
                                       Labels: PRO ID VUL
C3       11/06/2018  08/03/2017  7.5   #Vuln: #Mozilla #Firefox MFSA 2017-05 Multiple Security Vulnerabilities https://t.co/POFeaWjREj #bugtraq
                                       Labels: ORG PRO ID VUL

Thus, the choice of the best model may depend on favoring the testing set
F1-score over the validation set's, or on preferring an average of both to
make the decision.
Regarding the dimension of the character embedding vector and the
character-level hidden state of BiLSTM networks, the best performing
models favored (8 out of 9) the highest available value of 100. For the
word-level hidden state vector dimension, most models (8 out of 9) used
the smallest available value of 100. Finally, regarding the dropout
layers, no clear trend could be observed in the results.

VI. ANALYSIS OF INDICATORS OF COMPROMISE

Regarding the applicability of Twitter for cyberthreat awareness, we
analyzed the tweets labeled relevant by the classifier in the validation
and testing sets. By using the ID label, we analyzed the corresponding NVD
vulnerability entry to verify the existence of tweets mentioning these
vulnerabilities prior to the disclosure date, and to find their severity
according to the Common Vulnerability Scoring System (CVSS) [31].
A sample of such tweets is displayed in Table VI. Each entry shows the
tweet date and the NVD disclosure date, the tweet with the labels
identified by the NER model, and the CVSS score. The number of days from
tweet publication to NVD disclosure ranged from 1 to 148, clearly showing
the timeliness with which the deep neural network-based OSINT processing
pipeline can provide vulnerability information to organizations' SOCs. The
tweets' CVSS scores range from a medium 4.9 score to a critical 9.9 score,
thus showing the relevance of the information found.
The BiLSTM NER model recognized the most important aspects of these
tweets, such as the infrastructure asset, the vulnerability, and useful
identifiers such as the Common Vulnerabilities and Exposures (CVE)
identifier. These identified entities could then have been used to issue a
security warning or to fill an IoC in a threat sharing platform.
Although our current datasets are not large, the results obtained and the
relevance and timeliness of the information justify the possibility of
using Twitter as an OSINT source for cyberthreat discovery. Furthermore,
even though we did not identify any case in the datasets used where the
tweet references a zero-day exploit without mentioning a CVE or similar
identifier, such was the case in late August of 2018. A Twitter user made
public a zero-day vulnerability in Microsoft Windows' task scheduler,
providing a proof-of-concept exploit.² We sent the original tweet through
our pipeline and the classifier and NER models correctly labeled the tweet
as relevant and identified the asset in question to be Microsoft Windows.
The exploit was officially made public only on the 9th of September, even
though the original tweet appeared on the 24th of August.³
Thus, by combining the timeliness of the Twitter information stream with
the ability of deep neural architectures to accurately detect relevant
tweets and identify useful pieces of information therein, OSINT-based
threat intelligence platforms can improve significantly over the current
state of the art, to provide targeted, timely, and relevant threat
intelligence.

² https://www.zdnet.com/article/windows-zero-day-vulnerability-disclosed-through-twitter/
³ https://portal.msrc.microsoft.com/en-US/security-guidance/advisory/CVE-2018-8440
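The lead time between a tweet and the corresponding NVD disclosure, as discussed in this section, can be computed directly from the two dates. The sketch below does so for four rows of Table VI and also maps each CVSS score to a qualitative rating. It assumes the scores follow the CVSS v3 qualitative scale (Low below 4.0, Medium below 7.0, High below 9.0, Critical otherwise), which is consistent with the text calling 4.9 medium and 9.9 critical; the function names are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # day/month/year, as used in Table VI


def lead_time_days(tweet_date, nvd_date):
    """Days between the tweet and the NVD disclosure (positive if the tweet came first)."""
    tweet = datetime.strptime(tweet_date, DATE_FORMAT)
    nvd = datetime.strptime(nvd_date, DATE_FORMAT)
    return (nvd - tweet).days


def cvss_v3_severity(score):
    """Qualitative rating, assuming the CVSS v3 severity scale."""
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


# (tweet date, NVD date, CVSS score) taken from four rows of Table VI.
rows = [
    ("31/01/2017", "06/02/2017", 7.8),  # Linux Kernel CVE-2017-5546
    ("27/03/2017", "08/06/2017", 9.9),  # VMware Player VMSA-2017-0005
    ("24/03/2017", "27/03/2017", 8.1),  # Broadcom BCM4339 CVE-2017-6957
    ("14/02/2017", "15/02/2017", 8.8),  # Adobe Flash Player ZDI-17-109
]

for tweet_date, nvd_date, score in rows:
    print(lead_time_days(tweet_date, nvd_date), cvss_v3_severity(score))
```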
VII. CONCLUSIONS

This paper proposes deep neural network architectures to implement the
core tasks of a processing pipeline to obtain timely, relevant and
targeted security-related information from Twitter. The proposed system is
capable of gathering tweets from a set of accounts, filtering them based
on a set of keywords defining an infrastructure to monitor, selecting the
tweets containing relevant information, and identifying useful pieces of
information in these tweets. For that, we implemented convolutional and
bidirectional long short-term memory neural networks. We compare the
performance of the proposed approach to well-established methodologies,
verifying that the deep neural network architectures outperform those
methodologies. Three case studies specified by one nation-wide and two
world-wide private organizations were used to validate the approach.
Across the three case studies, the convolutional neural network binary
classifier achieved an average TPR and TNR of 92%, while the named entity
recognition BiLSTM model achieved an average F1-score of 92% in detecting
specified labels.
Future research will focus on exploring multi-task learning architectures
to shape our pipeline into a fully end-to-end neural network and to
evaluate its impact on the models' performance and on the requirements for
pipeline adaptation over time.

ACKNOWLEDGMENT

This work was partially supported by the EC through funding of the H2020
DiSIEM project (H2020-700692), and by the LASIGE Research Unit
(UID/CEC/00408/2019).

REFERENCES

[1] Center for Strategic and International Studies (CSIS) and McAfee, “Economic Impact of Cybercrime — No Slowing Down Report,” https://www.csis.org/analysis/economic-impact-cybercrime, accessed: Nov. 2018.
[2] Q. Le Sceller, E. B. Karbab, M. Debbabi, and F. Iqbal, “SONAR: Automatic Detection of Cyber Security Events over the Twitter Stream,” in Proc. of the 12th International Conference on Availability, Reliability and Security (ARES). Association for Computing Machinery, 2017.
[3] X. Liao, K. Yuan, X. Wang, Z. Li, L. Xing, and R. Beyah, “Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence,” in Proc. of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS). Association for Computing Machinery, 2016.
[4] C. Sabottke, O. Suciu, and T. Dumitras, “Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits,” in Proc. of the 24th USENIX Security Symposium (USENIX Security 15). USENIX Association, 2015.
[5] A. Attarwala, S. Dimitrov, and A. Obeidi, “How efficient is Twitter: Predicting 2012 U.S. presidential elections using Support Vector Machine via Twitter and comparing against Iowa Electronic Markets,” in Intelligent Systems Conference, 2017.
[6] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2014.
[7] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” in Proc. of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2016.
[8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, 2015.
[9] S. Zhou, Z. Long, L. Tan, and H. Guo, “Automatic identification of indicators of compromise using neural-based sequence labelling,” Computing Research Repository, 2018.
[10] A. Severyn and A. Moschitti, “Twitter Sentiment Analysis with Deep Convolutional Neural Networks,” in Proc. of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 2015.
[11] P. Badjatiya, S. Gupta, M. Gupta, and V. Varma, “Deep Learning for Hate Speech Detection in Tweets,” in Proc. of the 26th International Conference on World Wide Web Companion (WWW). International World Wide Web Conferences Steering Committee, 2017.
[12] A. J. Yepes and A. MacKinlay, “NER for medical entities in Twitter using sequence to sequence neural networks,” in Proc. of the Australasian Language Technology Association Workshop 2016, 2016.
[13] C. Wagner, A. Dulaunoy, G. Wagener, and A. Iklody, “MISP: The Design and Implementation of a Collaborative Threat Intelligence Sharing Platform,” in Proc. of the 2016 ACM on Workshop on Information Sharing and Collaborative Security (WISCS). Association for Computing Machinery, 2016.
[14] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” in Proc. of the Empirical Methods in Natural Language Processing, 2014.
[15] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013.
[16] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural Language Processing (Almost) from Scratch,” Journal of Machine Learning Research, 2011.
[17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, 2014.
[18] S. Wager, S. Wang, and P. S. Liang, “Dropout Training as Adaptive Regularization,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013.
[19] ENISA, “Risk Management - Glossary,” https://www.enisa.europa.eu/topics/threat-risk-management/risk-management/current-risk/risk-management-inventory/glossary, accessed: Sept. 2018.
[20] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” in Proc. of the 18th International Conference on Machine Learning (ICML). Morgan Kaufmann Publishers Inc., 2001.
[21] Information Technology Laboratory, “National Vulnerability Database,” https://nvd.nist.gov/, accessed: Jan. 2019.
[22] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, 1995.
[23] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, 1958.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” Tech. Rep., 1986.
[25] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling,” in Proc. of the 43rd Annual Meeting on Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2005.
[26] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A System for Large-scale Machine Learning,” in Proc. of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI). USENIX Association, 2016.
[27] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. of the 3rd International Conference on Learning Representations (ICLR), 2015.
[28] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen et al., “MLlib: Machine Learning in Apache Spark,” Journal of Machine Learning Research, 2016.
[29] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. of COMPSTAT’2010. Springer, 2010.
[30] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Mathematical Programming, 1989.
[31] Forum of Incident Response and Security Teams, “Common Vulnerability Scoring System,” https://www.first.org/cvss/, accessed: Jan. 2019.
