Detecting Cyber Security Threats in Weblogs

Using Probabilistic Models

Flora S. Tsai and Kap Luk Chan

School of Electrical & Electronic Engineering,


Nanyang Technological University, Singapore, 639798
[email protected]

Abstract. Organizations and governments are becoming vulnerable to
a wide variety of security breaches against their information infrastructure.
The magnitude of this threat is evident from the increasing rate of
cyber attacks against computers and critical infrastructure. Weblogs, or
blogs, have also rapidly gained in numbers over the past decade. Weblogs
may provide up-to-date information on the prevalence and distribution of
various cyber security threats as well as terrorism events. In this paper,
we analyze weblog posts for various categories of cyber security threats
related to the detection of cyber attacks, cyber crime, and terrorism. Ex-
isting studies on intelligence analysis have focused on analyzing news or
forums for cyber security incidents, but few have looked at weblogs. We
use probabilistic latent semantic analysis to detect keywords from cyber
security weblogs with respect to certain topics. We then demonstrate how
this method can present the blogosphere in terms of topics with measur-
able keywords, hence tracking popular conversations and topics in the
blogosphere. By applying a probabilistic approach, we can improve information
retrieval in weblog search and keyword detection, and provide
an analytical foundation for the future of security intelligence analysis of
weblogs.

Keywords: cyber security, weblog, blog, probabilistic latent semantic
analysis, cyber crime, cyber terrorism, data mining.

1 Introduction
Cyber security is defined as the intersection of computer, network, and informa-
tion security issues which directly affect the national security infrastructure [15].
Cyber security problems are frequent, serious, and global in nature. The number
of cyber attacks by persons and malicious software is increasing rapidly. Many
cyber criminals or hackers may post their ongoing achievements in weblogs, or
blogs, which are websites where entries are made in a reverse chronological order.
In addition, weblogs may provide up-to-date information on the prevalence and
distribution of various cyber security incidents and threats.
Weblogs range in scope from individual diaries to arms of political campaigns,
media programs, and corporations. Weblogs’ explosive growth is generating large
volumes of raw data and is considered by many industry watchers one of the top

C.C. Yang et al. (Eds.): PAISI 2007, LNCS 4430, pp. 46–57, 2007.

© Springer-Verlag Berlin Heidelberg 2007

ten industry trends [3]. Blogosphere is the collective term encompassing all blogs
as a community or social network. Because of the huge volume of existing weblog
posts and their free format nature, information in the blogosphere is rather
random and chaotic, but immensely valuable in the right context. Weblogs can
thus potentially contain usable and measurable information related to cyber
security threats, such as malware, viruses, cyber blackmail, and other cyber
crime.
With the rapid growth of blogs on the web, the blogosphere increasingly
influences the media. Studies on the blogosphere include measuring the influence of the
blogosphere [6], analyzing blog threads to discover important bloggers
[11], determining spatiotemporal theme patterns on blogs [10], taking a
topic-centric view of the blogosphere [1], detecting growing trends in blogs [7],
tracking the propagation of discussion topics in the blogosphere [8], and searching
and detecting topics in corporate blogs [16].
Existing studies have focused on analyzing forums and news articles for cy-
ber threats [12,18,19], but few have looked at weblogs. In this paper, we focus
on analyzing cyber security weblogs, which are blogs providing commentary or
analysis of cyber security threats and incidents.
In our work, we analyzed various weblog posts to detect the keywords of
various topics of the blog entries, hence tracking the trends and topics of conver-
sations in the blogosphere. Probabilistic Latent Semantic Analysis (PLSA) was
used to detect the keywords from various cyber security blog entries with respect
to certain topics. By using PLSA, we can present the blogosphere in terms of
topics with measurable keywords.
The paper is organized as follows. Section 2 reviews the related work on
intelligence analysis and extraction of useful information from weblogs. Section
3 gives an overview of the latent semantic models, namely Latent Semantic
Analysis and Probabilistic Latent Semantic Analysis, used for mining
weblog-related topics. Section 4 presents experimental results, and Section 5 concludes
the paper.

2 Review of Related Work


This section reviews related work in intelligence analysis and extraction of useful
information from weblogs.

2.1 Intelligence Analysis


Intelligence analysis is the process of producing formal descriptions of situations
and entities of strategic importance [17]. Although its practice is found in its
purest form inside intelligence agencies, such as the CIA in the United States
or MI6 in the UK, its methods are also applicable in fields such as business
intelligence or competitive intelligence.
Recent works related to security intelligence analysis include using entity rec-
ognizers to extract names of people, organizations, and locations from news

articles, and applying probabilistic topic models to learn the latent structure
behind the named entities and other words [12]. Another study analyzed the
evolution of terror attack incidents from online news articles using techniques re-
lated to temporal and event relationship mining [18]. In addition, Support Vector
Machines were used for improving document classification for the insider threat
problem within the intelligence community by analyzing a collection of docu-
ments from the Center for Nonproliferation Studies (CNS) related to weapons
of mass destruction [19]. These studies illustrate the growing need for security
intelligence analysis, and the usage of machine learning and information retrieval
techniques to provide such analysis. However, much work has yet to be done in
obtaining intelligence information from the vast collection of weblogs that exist
throughout the world.

2.2 Information Extraction from Weblogs


Current weblog text analysis focuses on extracting useful information from weblog
entry collections and determining trends in the blogosphere. NLP
(Natural Language Processing) algorithms have been used to determine the most
important keywords and proper names within a certain time period from thousands
of active weblogs, which makes it possible to discover trends across blogs automatically, as
well as to detect key persons, phrases, and paragraphs [7]. A study on the propagation
of discussion topics through the social network in the blogosphere developed
algorithms to detect long-term and short-term topics and keywords, which
were then validated with real weblog entry collections [8]. To evaluate
suitable methods of ranking term significance in an evolving RSS feed corpus,
three statistical feature selection methods were implemented: χ², Mutual Information
(MI), and Information Gain (I). The conclusion was that the χ² method
appeared to perform best, but that a full human classification exercise would be
required to evaluate the method further [14]. A probabilistic approach based
on PLSA was proposed in [10] to extract common themes from blogs and to
generate the theme life cycle for each given location and the theme snapshots for
each given time period. PLSA has also been used for weblog search
and mining of corporate blogs [16].
Our work differs from existing studies in two respects: (1) we focus on cyber
security weblog entries, which have not been studied before in the context of
intelligence analysis; and (2) we use probabilistic models to extract popular
keywords for each topic in order to detect themes and trends in cyber threats
and terrorism events.

3 Latent Semantic Models


This section reviews the latent semantic models used in this work: latent
semantic analysis, and probabilistic latent semantic analysis extended for
topic detection in weblogs.

3.1 Latent Semantic Analysis


Latent Semantic Analysis (LSA) [4] is a well-known technique for information
retrieval and document classification. LSA solves two fundamental problems in
natural language processing: synonymy and polysemy:

– In synonymy, different words may have the same meaning. Thus, a person
issuing a query in a search engine may use a different word from what appears
in a document, and may not retrieve the document.
– In polysemy, the same word can have multiple meanings, so a searcher can
get unwanted documents with the alternate meanings.

LSA overcomes the limitations of lexical matching methods by using statistically
derived conceptual indices instead of individual words for retrieval [2]. LSA uses
a term-document matrix (TDM) which describes patterns of term (word) distri-
bution across a set of documents.
LSA then finds a low-rank approximation which is smaller and less noisy than
the original term-document matrix. The downsizing of the matrix is achieved
through the use of singular value decomposition (SVD), where the set of all the
terms is then represented by a vector space of lower dimensionality than the
total number of terms in the vocabulary. The consequence of the rank lowering
is that some dimensions get “merged”.
In LSA, each element of the n × m term-document matrix reflects the occurrence
of a particular word in a particular document, i.e.,

$$A = [a_{ij}], \tag{1}$$

where $a_{ij}$ is the number of times (the frequency) that term i appears in
document j. As each word will not usually appear in every document, the matrix A
is typically sparse with rarely any noticeable nonzero structure [2].
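The construction of the term-document matrix in Equation (1) can be sketched as follows. This is only an illustration under simple assumptions: tokens are obtained by lowercasing and splitting on whitespace, the function name `build_term_document_matrix` is hypothetical, and the three toy documents are invented rather than taken from the corpus.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (vocabulary, A) where A[i][j] is the count of term i in document j."""
    # Assumption: whitespace tokenization after lowercasing.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    per_doc_counts = [Counter(tokens) for tokens in tokenized]
    # Rows are terms, columns are documents, as in A = [a_ij].
    matrix = [[counts[term] for counts in per_doc_counts] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical toy documents, not from the paper's dataset.
docs = [
    "spyware infects windows software",
    "malware and spyware target windows",
    "war in iraq continues",
]
vocab, A = build_term_document_matrix(docs)
for term, row in zip(vocab, A):
    print(f"{term:10s} {row}")
```

Most entries of the toy matrix are zero, which mirrors the sparsity noted above.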
The matrix A is then factored into the product of three matrices using SVD.
Given a matrix A, where rank(A) = r, the SVD of A is defined as:

$$A = U S V^{T}. \tag{2}$$

The columns of U and V are referred to as the left and right singular vectors,
respectively, and the singular values of A are the diagonal elements of S, or the
nonnegative square roots of the n eigenvalues of $AA^{T}$.
As defined by Equation (2), the SVD is used to represent the original
relationships among terms and documents as sets of linearly independent vectors.
Performing truncated SVD by using the k largest singular values and corresponding
singular vectors, the original TDM can be reduced to a smaller collection of
vectors in k-space for conceptual query processing [2].
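A minimal sketch of the truncated SVD step, assuming NumPy is available. The function name `truncated_svd` and the small matrix are hypothetical, and scaling the singular vectors by the singular values to obtain term coordinates is one common convention, not necessarily the one used for the plots in this paper.

```python
import numpy as np


def truncated_svd(A, k):
    """Keep the k largest singular values and their singular vectors."""
    # np.linalg.svd returns singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


# Hypothetical 5-term x 4-document count matrix.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k   # rank-2 approximation of A
term_coords = U_k * s_k           # one 2-D point per term (row)
doc_coords = Vt_k.T * s_k         # one 2-D point per document (row)
print(np.round(term_coords, 3))
```

With k = 2 the rows of `term_coords` can be plotted directly as points in a plane.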

3.2 Probabilistic Latent Semantic Analysis for Weblog Mining


Probabilistic Latent Semantic Analysis (PLSA) [9] is based on a generative prob-
abilistic model that stems from a statistical approach to LSA [4]. PLSA is able to

capture polysemy and synonymy in text for applications in the information
retrieval domain. Similar to LSA, PLSA uses a term-document matrix which
describes patterns of term (word) distribution across a set of documents (blog
entries). PLSA generates topics from the blog entries,
where each topic yields a list of characteristic words. The model is fitted by
maximum likelihood estimation using the expectation maximization (EM) algorithm.
The starting point for PLSA is the aspect model [9]. The aspect model is
a latent variable model for co-occurrence data that associates an unobserved class
variable $z_k \in \{z_1, \ldots, z_K\}$ with each observation, an observation being the
occurrence of a keyword in a particular blog entry. There are three probabilities
used in PLSA:

1. $P(b_i)$ denotes the probability that a keyword occurrence will be observed in
   a particular blog entry $b_i$,
2. $P(w_j|z_k)$ denotes the class-conditional probability of a specific keyword $w_j$
   conditioned on the unobserved class variable $z_k$,
3. $P(z_k|b_i)$ denotes a blog-specific probability distribution over the latent
   variable space.

In the collection, the probability of each blog and the probability of each
keyword are known, while the probability of an aspect given a blog and the
probability of a keyword given an aspect are unknown. Using the above three
probabilities, the generative process consists of three steps:
1. select a blog entry $b_i$ with probability $P(b_i)$,
2. pick a latent class $z_k$ with probability $P(z_k|b_i)$,
3. generate a keyword $w_j$ with probability $P(w_j|z_k)$.
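The three steps above can be illustrated with a short sampling sketch. All probability values, blog identifiers, and topic labels below are hypothetical and chosen only to make the example readable; the function name `sample_observation` is likewise an assumption.

```python
import random


def sample_observation(p_b, p_z_given_b, p_w_given_z, rng):
    """Draw one (blog, latent class, keyword) triple from the aspect model."""
    # Step 1: select a blog entry b with probability P(b).
    blogs = list(p_b)
    b = rng.choices(blogs, weights=[p_b[x] for x in blogs])[0]
    # Step 2: pick a latent class z with probability P(z | b).
    classes = list(p_z_given_b[b])
    z = rng.choices(classes, weights=[p_z_given_b[b][c] for c in classes])[0]
    # Step 3: generate a keyword w with probability P(w | z).
    words = list(p_w_given_z[z])
    w = rng.choices(words, weights=[p_w_given_z[z][x] for x in words])[0]
    return b, z, w


# Hypothetical parameters for two blogs and two latent classes.
p_b = {"blog_1": 0.5, "blog_2": 0.5}
p_z_given_b = {
    "blog_1": {"security": 0.9, "war": 0.1},
    "blog_2": {"security": 0.2, "war": 0.8},
}
p_w_given_z = {
    "security": {"spywar": 0.6, "malwar": 0.4},
    "war": {"iraq": 0.7, "militari": 0.3},
}

rng = random.Random(0)
print([sample_observation(p_b, p_z_given_b, p_w_given_z, rng) for _ in range(5)])
```

In the actual model only the pair (blog, keyword) is observed; the latent class drawn in step 2 is hidden, which is why EM is needed for fitting.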
As a result, a joint probability model is obtained in asymmetric
parameterization:

$$P(b_i, w_j) = P(b_i)\,P(w_j|b_i), \tag{3}$$

$$P(w_j|b_i) = \sum_{k=1}^{K} P(w_j|z_k)\,P(z_k|b_i). \tag{4}$$
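
As a small worked instance of Equations (3) and (4), assume purely hypothetical values for K = 2 latent classes: $P(w_j|z_1) = 0.6$, $P(w_j|z_2) = 0.1$, $P(z_1|b_i) = 0.9$, $P(z_2|b_i) = 0.1$, and $P(b_i) = 0.25$. Then

$$P(w_j|b_i) = 0.6 \times 0.9 + 0.1 \times 0.1 = 0.55, \qquad P(b_i, w_j) = 0.25 \times 0.55 = 0.1375.$$

The keyword probability for a blog entry is thus a mixture of the class-conditional keyword probabilities, weighted by how strongly the entry belongs to each latent class.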

After the aspect model is specified, it is fitted using the EM algorithm.
The EM algorithm involves two steps, namely the expectation (E) step
and the maximization (M) step. The E-step computes the posterior probability
of the latent variable by applying Bayes' formula to the parameterization above:

$$P(z_k|b_i, w_j) = \frac{P(w_j|z_k)\,P(z_k|b_i)}{\sum_{l=1}^{K} P(w_j|z_l)\,P(z_l|b_i)}. \tag{5}$$

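The M-step updates the parameters based on the expected complete-data
log-likelihood, which depends on the posterior probabilities computed in the E-step.
Hence the M-step re-estimates the following two probabilities:

$$P(w_j|z_k) = \frac{\sum_{i=1}^{N} n(b_i, w_j)\,P(z_k|b_i, w_j)}{\sum_{m=1}^{M}\sum_{i=1}^{N} n(b_i, w_m)\,P(z_k|b_i, w_m)}, \tag{6}$$

$$P(z_k|b_i) = \frac{\sum_{j=1}^{M} n(b_i, w_j)\,P(z_k|b_i, w_j)}{n(b_i)}, \tag{7}$$

where $n(b_i, w_j)$ is the number of times keyword $w_j$ occurs in blog entry $b_i$,
and $n(b_i) = \sum_{j=1}^{M} n(b_i, w_j)$ is the total number of keyword occurrences in $b_i$.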
The EM iterations increase the likelihood function and continue until a
stopping condition is met. This condition can be a convergence criterion or a
cut-off on the number of iterations. In either case, EM reaches a
local maximum of the likelihood, rather than necessarily a global maximum.
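The E-step and M-step updates of Equations (5) to (7) can be sketched with NumPy as below. This is an illustrative implementation, not the authors' code: the function name `plsa_em`, the random initialization, the tolerance, the iteration cap, and the toy count matrix are all assumptions. The count matrix is laid out as blogs by keywords, so it is the transpose of the term-document matrix of Section 3.1, and every blog entry is assumed to contain at least one keyword.

```python
import numpy as np


def plsa_em(counts, n_topics, n_iter=200, tol=1e-8, seed=0):
    """Fit P(w|z) and P(z|b) by EM. counts[i, j] = n(b_i, w_j)."""
    counts = np.asarray(counts, dtype=float)
    n_blogs, n_words = counts.shape
    rng = np.random.default_rng(seed)
    # Random positive initialization, normalized to valid distributions.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_b = rng.random((n_blogs, n_topics))
    p_z_b /= p_z_b.sum(axis=1, keepdims=True)
    n_b = counts.sum(axis=1, keepdims=True)  # n(b_i), assumed nonzero
    history = []
    for _ in range(n_iter):
        # E-step, Eq. (5): posterior[i, j, k] = P(z_k | b_i, w_j).
        joint = p_z_b[:, None, :] * p_w_z.T[None, :, :]
        posterior = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-300)
        weighted = counts[:, :, None] * posterior
        # M-step, Eq. (6): re-estimate P(w_j | z_k).
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-300)
        # M-step, Eq. (7): re-estimate P(z_k | b_i).
        p_z_b = weighted.sum(axis=1) / n_b
        # Log-likelihood of the observed counts (constant P(b_i) term omitted).
        p_w_b = p_z_b @ p_w_z
        mask = counts > 0
        history.append(float(np.sum(counts[mask] * np.log(p_w_b[mask]))))
        # Stop when the improvement falls below the tolerance.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return p_w_z, p_z_b, history


# Hypothetical counts: 4 blog entries x 4 keywords with two obvious blocks.
toy_counts = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
])
p_w_z, p_z_b, history = plsa_em(toy_counts, n_topics=2)
print(np.round(p_z_b, 3))
```

Because the result depends on the random starting point, running the sketch with several seeds and keeping the run with the highest final log-likelihood is a common way to cope with local maxima.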
In short, the PLSA model selects the model parameter values that maximize
the probability of the observed data, and returns the relevant probability
distributions by applying the EM algorithm. Word usage analysis is a common
application of the aspect model. Based on the pre-processed
term-document matrix, the blogs are classified into different aspects or
topics. For each aspect, the keyword usage, such as the most probable words in the
class-conditional distribution $P(w_j|z_k)$, is determined. Empirical results
indicate that PLSA reduces perplexity and achieves high
precision and recall in information retrieval [9].

4 Experiments and Results


We have used latent semantic models to analyze weblogs related to cyber security
threats and incidents, and applied probabilistic models for weblog analysis on
our dataset. Dimensionality reduction was performed with latent semantic anal-
ysis to show the similarity plot of weblog terms. We extract the most relevant
categories and show the topics extracted for each category. Experiments show
that the probabilistic model can reveal interesting patterns in the underlying
topics for our dataset of security-related weblogs.

4.1 Data Corpus


For our experiments, we extracted a subset of the Nielsen BuzzMetrics weblog
data corpus¹ that focuses on blogs about cyber security threats and incidents
involving cyber crime and terrorism. The original dataset consists of 14 million
weblog posts collected by Nielsen BuzzMetrics for May 2006. Although the blog
entries span only a short period of time, they are indicative of the amount and
variety of blog posts that exist in different languages throughout the world.
Blog entries in the English language related to cyber security threats such as
malware, cyber crime, and terrorism were extracted and stored for use in our
analysis. Figure 1 shows an excerpt of a weblog post related to cyber blackmail.

¹ http://www.icwsm.org/data.html

—————————————————————————————————————–
Cyber blackmail is on the increase ... Criminal gangs have moved away from the stealth
use of infected computers ... to direct blackmailing of victims. ... Cyber blackmailing
is done ... by encrypting data or by corrupting system information. The criminal then
demands a ransom for its return to the victim. ...
—————————————————————————————————————–

Fig. 1. Excerpt of weblog post related to cyber blackmail and ransom

There are a total of 5493 entries in our dataset, and each weblog entry is saved
as a text file for further text preprocessing. For the preprocessing of the blog
data, HTML tags were removed and lexical analysis was performed by removing
stopwords, stemming, and pruning using the Text to Matrix Generator (TMG)
[20]. The total number of terms after pruning and stopword removal is 797. The
term-document matrix was then input to the LSA and PLSA algorithms.
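
The preprocessing steps can be approximated in a few lines of Python. This sketch is not TMG and does not reproduce its behavior: the stopword list is a tiny hypothetical one, the pruning rule (a minimum document frequency) and its threshold are assumptions, and stemming is deliberately left out because it needs a dedicated stemmer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "is", "to", "and", "by", "on", "for", "its"}


def preprocess(html, stopwords=STOPWORDS):
    """Strip HTML tags, lowercase, tokenize on letters, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", html)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


def prune_vocabulary(token_lists, min_doc_freq=2):
    """Keep terms that occur in at least min_doc_freq documents (assumed rule)."""
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return sorted(term for term, freq in doc_freq.items() if freq >= min_doc_freq)


posts = [
    "<p>Cyber blackmail is on the increase</p>",
    "<div>The criminal then demands a ransom</div>",
    "<p>Cyber blackmailing is done by encrypting data</p>",
]
token_lists = [preprocess(post) for post in posts]
print(token_lists[0])
print(prune_vocabulary(token_lists))
```

Without stemming, "blackmail" and "blackmailing" remain distinct terms here, which is one reason the stemming step matters for a compact vocabulary.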

4.2 Semantic Detection of Terms


We used the LSA model [4] for semantic detection of terms, as LSA is
able to relate weblog entries that use semantically close words.
The results of applying LSA to this term-document matrix (with k = 2) are shown
in Figure 2.
The plot shows the similarity in two-dimensional space of the terms in the
weblog entries. Although many terms are not visible because of the large number
of words, a few groupings are evident from the graph. Some of the visible
terms include the grouping of spyware, malware, and software at the top center
of the plot. Another group visible at the right includes Iraq, war, Bush, and
[Figure 2: scatter plot of stemmed terms in two-dimensional LSA space; horizontal axis from 0.1 to 1, vertical axis from -1 to 0.8.]

Fig. 2. Two-dimensional plot of terms for weblog entries using LSA



[Figure 3: zoomed-in scatter plot of stemmed terms; horizontal axis from 0.1 to 0.5, vertical axis from -0.4 to 0.5.]

Fig. 3. Zoomed-in graph of Figure 2

American. Yet another grouping includes the terms death, prison, and life at the
bottom of the graph. Zooming into the large cluster from Figure 2, Figure 3 shows
a subset of the big cluster of keywords. A larger group of keywords (spyware,
malware, software, data, user, window, privacy, spy, domestic) can be identified,
thus showing the ability to descend through a hierarchical grouping of keywords.
The graphs demonstrate that closely related terms can be visualized
in two-dimensional space. Although the two-dimensional graphs
may oversimplify the dimensionality reduction that takes place,
the plots help to visualize the terms and relate them to the topics produced for the
weblogs.

4.3 Results for Weblog Topic Analysis


We conducted experiments using PLSA on the weblog entries. Tables 1-4
summarize the keywords found for each of the four topics (Computer Security,
Osama bin Laden, Iraq War, and US National Security).
By looking at the various topics listed, we can see that the probabilistic
approach is able to list important keywords of each topic in a quantitative
fashion. The keywords listed can be related back to the original topics. For example,
the keywords detected in the Computer Security topic include items such as
computers, spyware, software, and internet.
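Listing the keywords of a topic amounts to sorting the class-conditional distribution P(w|z) for that topic in descending order. The sketch below uses five of the probabilities reported for Topic 1 in Table 1, entered in shuffled order; the function name `top_keywords` is hypothetical.

```python
def top_keywords(p_w_given_z, n=3):
    """Return the n most probable (keyword, probability) pairs for one topic."""
    ranked = sorted(p_w_given_z.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Five entries of P(w | z) for Topic 1 (Computer Security), from Table 1.
topic_1 = {
    "softwar": 0.014650,
    "comput": 0.023716,
    "secur": 0.014257,
    "spywar": 0.018047,
    "malwar": 0.020509,
}
for keyword, probability in top_keywords(topic_1):
    print(f"{keyword:10s} {probability:.6f}")
```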
Figure 4 shows the graph of the topic-document distribution of the weblog
entries by date. Some of the topics have a higher density of documents distributed
around certain dates. This can be used to match certain events in each topic to
the weblog entries. For example, the heavy clustering of documents for Topic
4 (US National Security) indicates that there was an increase in weblog

Table 1. List of keywords for Topic 1: Computer Security

Keyword       Probability
comput        0.023716
malwar        0.020509
spywar        0.018047
softwar       0.014650
secur         0.014257
window        0.013527
internet      0.013436
http          0.013266
web           0.012022
user          0.011651

Table 2. List of keywords for Topic 2: Osama bin Laden

Keyword       Probability
moussaoui     0.0170900
don           0.0083916
life          0.0083244
bin           0.0079485
laden         0.0078515
osama         0.0074730
prison        0.0074594
peopl         0.0064921
death         0.0063618
god           0.0061734

Table 3. List of keywords for Topic 3: Iraq War

Keyword       Probability
iraq          0.0134810
war           0.0089393
islam         0.0087796
zarqawi       0.0086076
militari      0.0073831
muslim        0.0073576
afghanistan   0.0072725
iran          0.0070231
qaeda         0.0069711
iraqi         0.0065779

Table 4. List of keywords for Topic 4: US National Security

Keyword       Probability
nsa           0.0142720
bush          0.0127100
phone         0.0121350
program       0.0099155
presid        0.0098480
cia           0.0093545
american      0.0088027
record        0.0086708
call          0.0086591
administr     0.0084607

conversations in this topic around the middle of May 2006. This may be due to
US President Bush’s comment on May 11, 2006 about a USA Today report on
a massive NSA database that collects information about all phone calls made
within the United States [5]. This is one example of an event that can trigger
much conversation in the blogosphere.
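
The kind of date clustering described above can be sketched as a simple count of
documents per (date, topic) pair. The paper does not state how Figure 4 was
built, so the rule used here (assign each document to its most probable topic
under P(z|d)) is an assumption, and the dates and probabilities below are
hypothetical toy values, not data from the study.

```python
from collections import Counter


def dominant_topic(p_z_given_d):
    """Return the zero-based index of the most probable topic for a document."""
    return max(range(len(p_z_given_d)), key=lambda z: p_z_given_d[z])


def topic_counts_by_date(docs):
    """Count documents per (date, dominant topic) pair.

    docs is a list of (date_string, P(z|d) list) tuples.
    """
    counts = Counter()
    for date, p_z_given_d in docs:
        counts[(date, dominant_topic(p_z_given_d))] += 1
    return counts


# Hypothetical toy data: four topics, index 3 standing in for Topic 4.
docs = [
    ("2006-05-10", [0.7, 0.1, 0.1, 0.1]),
    ("2006-05-11", [0.1, 0.1, 0.1, 0.7]),
    ("2006-05-11", [0.2, 0.1, 0.1, 0.6]),
    ("2006-05-12", [0.1, 0.2, 0.1, 0.6]),
]

counts = topic_counts_by_date(docs)
for (date, topic), n in sorted(counts.items()):
    print(date, "topic", topic + 1, "documents:", n)
```

A spike in the count for one topic on consecutive dates is the tabular
counterpart of the dense regions visible in a plot such as Figure 4.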
For the topic of Computer Security, we further decompose the topic into
separate subtopics, two of which are shown in Tables 5-6. Subtopic 1 is labeled
Malware based on its keywords; malware includes computer viruses, worms, trojan
horses, spyware, adware, and other malicious software. Subtopic 2 is labeled
Macintosh and reflects the increasing reports of cyber attacks affecting
Macintosh computers. Topics can therefore be classified and decomposed into a
hierarchy of subtopics, which may be useful for examining larger data sets.
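
One plausible way to obtain such a hierarchy is to refit PLSA on only the
documents dominated by a parent topic. The paper does not say how the subtopics
were derived, so this procedure is an assumption. The sketch below uses the
standard asymmetric PLSA EM updates in plain Python; the function names, the
small smoothing constant, and the toy count matrix are all hypothetical.

```python
import random


def _normalize(row):
    total = sum(row)
    return [x / total for x in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a document-by-word count matrix (list of lists).

    Returns (p_w_z, p_z_d) with p_w_z[z][w] = P(w|z) and p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [_normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [_normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        # Tiny constant keeps every row normalizable (an added assumption).
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
                post = _normalize([p_w_z[z][w] * p_z_d[d][z]
                                   for z in range(n_topics)])
                # M-step accumulation of expected counts.
                for z in range(n_topics):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        p_w_z = [_normalize(row) for row in new_w_z]
        p_z_d = [_normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


def dominant_topic(p_z_given_d):
    return max(range(len(p_z_given_d)), key=lambda z: p_z_given_d[z])


def decompose(counts, p_z_d, topic, n_subtopics):
    """Refit PLSA on the documents whose dominant topic is `topic`."""
    subset = [row for row, pz in zip(counts, p_z_d)
              if dominant_topic(pz) == topic]
    return plsa(subset, n_subtopics)


# Hypothetical toy counts: 5 documents over a 6-word vocabulary.
toy_counts = [
    [4, 3, 0, 0, 1, 0],
    [3, 4, 1, 0, 0, 0],
    [0, 1, 4, 3, 0, 0],
    [0, 0, 0, 1, 4, 3],
    [1, 0, 0, 0, 3, 4],
]

p_w_z, p_z_d = plsa(toy_counts, n_topics=2)
parent = dominant_topic(p_z_d[0])  # parent topic of the first document
sub_w_z, sub_z_d = decompose(toy_counts, p_z_d, parent, n_subtopics=2)
```

Hard assignment by dominant topic is the simplest choice; weighting each
document's counts by P(z|d) instead would keep documents that mix several
topics, at the cost of noisier subtopics.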
The power of PLSA in cyber security applications includes the ability to
automatically detect terms and keywords related to cyber security threats and
terror events. By presenting blogs with measurable keywords, we can improve

[Figure 4: plot with axes Topics, Documents, and date (May 1 to May 31, 2006),
labeled "Documents By Date"]
Fig. 4. Topic by document distribution by date

Table 5. List of keywords for Subtopic 1: Malware

Keyword       Probability
spywar        0.0174610
trojan        0.0117580
adwar         0.0092120
scan          0.0088160
anti          0.0087841
free          0.0074914
spybot        0.0074895
remov         0.0072834
download      0.0069242
viru          0.0068496

Table 6. List of keywords for Subtopic 2: Macintosh

Keyword       Probability
mac           0.0078348
secur         0.0056874
appl          0.0054443
microsoft     0.0041262
attack        0.0041169
system        0.0039921
report        0.0037533
cyber         0.0037227
crime         0.0036151
comput        0.0035352

our understanding of cyber security issues in terms of distribution and trends of
current threats and events. This has implications for security agencies wishing
to monitor real-time threats present in weblogs or other related documents.

5 Conclusions
In this paper, we analyzed weblog posts related to cyber security threats,
cyber crime, and cyber terrorism. To our knowledge, this is the first such
study focusing on cyber security weblogs. We use latent semantic analysis to
illustrate similarities among the terms in the weblog dataset. Our experiments
on our dataset of weblogs demonstrate how our probabilistic weblog model can
present the blogosphere in terms of topics with measurable keywords, hence
tracking popular conversations and topics in the blogosphere. By applying a
probabilistic approach, we can improve information retrieval in weblog search
and keyword

detection, and provide an analytical foundation for the future of security
intelligence analysis of weblogs.
Potential applications of this stream of research may include automatically
monitoring and identifying trends in cyber terror and security threats in weblogs.
This can have some significance for government and intelligence agencies wishing
to monitor real-time potential international terror threats present in weblog
conversations and the blogosphere.

References
1. Avesani, P., Cova, M., Hayes, C., Massa, P.: Learning Contextualised Weblog
Topics. WWW ’05 Workshop on the Weblogging Ecosystem: Aggregation, Analysis
and Dynamics (2005)
2. Berry, M., Dumais, S., O’Brien, G.: Using linear algebra for intelligent
information retrieval. SIAM Review 37(4) (1995) 573–595
3. Columbus, L.: Blog Mining Gets Real. CRM Buyer (2005).
4. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, R.: Indexing by
latent semantic analysis. Journal of the American Society for Information Science
41(6) (1990) 391–407
5. Diamond, J.: NSA has massive database of Americans’ phone calls. In USA Today
(May 10, 2006)
6. Gill, K.E.: How Can We Measure the Influence of the Blogosphere? WWW ’04
Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics
(2004)
7. Glance, N.S., Hurst, M., Tomokiyo, T.: BlogPulse: Automated Trend Discovery for
Weblogs. WWW ’04 Workshop on the Weblogging Ecosystem: Aggregation, Analysis
and Dynamics (2004)
8. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information Diffusion Through
Blogspace. WWW ’04 (2004)
9. Hofmann, T.: Probabilistic Latent Semantic Indexing. SIGIR’99 (1999)
10. Mei, Q., Liu, C., Su, H., Zhai, C.: A Probabilistic Approach to Spatiotemporal
Theme Pattern Mining on Weblogs. WWW ’06 (2006)
11. Nakajima, S., Tatemura, J., Hino, Y., Hara, Y., Tanaka, K.: Discovering Important
Bloggers based on Analyzing Blog Threads. WWW ’05 Workshop on the Weblogging
Ecosystem: Aggregation, Analysis and Dynamics (2005)
12. Newman, D., Chemudugunta, C., Smyth, P., Steyvers, M.: Analyzing Entities and
Topics in News Articles Using Statistical Topic Models. ISI ’06 (2006)
13. Pikas, C.K.: Blog Searching for Competitive Intelligence, Brand Image, and
Reputation Management. Online 29(4) (2005) 16–21
14. Prabowo, R., Thelwall, M.: A Comparison of Feature Selection Methods for an
Evolving RSS Feed Corpus, Information Processing and Management, 42 (2006)
1491–1512
15. Tsai, F.S., Chan, C.K. (eds): Cyber Security, Pearson Education, Singapore (2006)
16. Tsai, F.S., Chen, Y., Chan, K.L.: Probabilistic Latent Semantic Analysis for Search
and Mining of Corporate Blogs (2007)
17. Wikipedia contributors: Intelligence Analysis. In: Wikipedia, The Free
Encyclopedia, http://en.wikipedia.org/wiki/Intelligence_analysis (accessed Nov 7,
2006)

18. Yang, C.C., Shi, X., Wei, C.-P.: Tracing the Event Evolution of Terror Attacks
from On-Line News. ISI ’06 (2006)
19. Yilmazel, O., Symonenko, S., Balasubramanian, N., Liddy, E.D.: Leveraging
One-Class SVM and Semantic Analysis to Detect Anomalous Content. ISI ’05 (2005)
20. Zeimpekis, D., Gallopoulos, E.: TMG: A MATLAB Toolbox for generating
term-document matrices from text collections. Grouping Multidimensional Data:
Recent Advances in Clustering. Springer (2005) 187–210
