Detecting Cyber Security Threats in Weblogs
1 Introduction
Cyber security is defined as the intersection of computer, network, and information
security issues which directly affect the national security infrastructure [15].
Cyber security problems are frequent, serious, and global in nature. The number
of cyber attacks by persons and malicious software is increasing rapidly. Many
cyber criminals or hackers may post their ongoing achievements in weblogs, or
blogs, which are websites where entries are made in a reverse chronological order.
In addition, weblogs may provide up-to-date information on the prevalence and
distribution of various cyber security incidents and threats.
Weblogs range in scope from individual diaries to arms of political campaigns,
media programs, and corporations. Weblogs’ explosive growth is generating large
volumes of raw data and is considered by many industry watchers one of the top
C.C. Yang et al. (Eds.): PAISI 2007, LNCS 4430, pp. 46–57, 2007.
© Springer-Verlag Berlin Heidelberg 2007
Detecting Cyber Security Threats in Weblogs Using Probabilistic Models 47
ten industry trends [3]. Blogosphere is the collective term encompassing all blogs
as a community or social network. Because of the huge volume of existing weblog
posts and their free format nature, information in the blogosphere is rather
random and chaotic, but immensely valuable in the right context. Weblogs can
thus potentially contain usable and measurable information related to cyber
security threats, such as malware, viruses, cyber blackmail, and other cyber
crime.
With the rapid growth of blogs on the web, the blogosphere has a growing
influence on the media. Studies on the blogosphere include measuring the influence of the
blogosphere [6], analyzing blog threads to discover important bloggers
[11], determining spatiotemporal theme patterns on blogs [10], learning a
topic-centric view of the blogosphere [1], detecting growing trends in blogs [7],
tracking the propagation of discussion topics in the blogosphere [8], and searching
and detecting topics in corporate blogs [16].
Existing studies have focused on analyzing forums and news articles for cy-
ber threats [12,18,19], but few have looked at weblogs. In this paper, we focus
on analyzing cyber security weblogs, which are blogs providing commentary or
analysis of cyber security threats and incidents.
In our work, we analyzed various weblog posts to detect the keywords of
various topics of the blog entries, hence tracking the trends and topics of conver-
sations in the blogosphere. Probabilistic Latent Semantic Analysis (PLSA) was
used to detect the keywords from various cyber security blog entries with respect
to certain topics. By using PLSA, we can present the blogosphere in terms of
topics with measurable keywords.
The paper is organized as follows. Section 2 reviews the related work on
intelligence analysis and extraction of useful information from weblogs. Section
3 gives an overview of latent semantic models, such as Latent Semantic
Analysis and Probabilistic Latent Semantic Analysis, for mining weblog-related
topics. Section 4 presents experimental results, and Section 5 concludes
the paper.
articles, and applying probabilistic topic models to learn the latent structure
behind the named entities and other words [12]. Another study analyzed the
evolution of terror attack incidents from online news articles using techniques re-
lated to temporal and event relationship mining [18]. In addition, Support Vector
Machines were used for improving document classification for the insider threat
problem within the intelligence community by analyzing a collection of docu-
ments from the Center for Nonproliferation Studies (CNS) related to weapons
of mass destruction [19]. These studies illustrate the growing need for security
intelligence analysis, and the usage of machine learning and information retrieval
techniques to provide such analysis. However, much work has yet to be done in
obtaining intelligence information from the vast collection of weblogs that exist
throughout the world.
– In synonymy, different words may have the same meaning. Thus, a person
issuing a query in a search engine may use a different word from what appears
in a document, and may not retrieve the document.
– In polysemy, the same word can have multiple meanings, so a searcher can
get unwanted documents with the alternate meanings.
$$A = [a_{ij}], \tag{1}$$
where $a_{ij}$ is the number of times (the frequency) that term $i$ appears in
document $j$. As each word will not usually appear in every document, the matrix $A$
is typically sparse with rarely any noticeable nonzero structure [2].
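As a small illustration of Equation (1), the following sketch builds a term-document count matrix from already tokenized documents. The function name `term_document_matrix` and the three toy documents are assumptions made for this example; they are not taken from the dataset used in the paper.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, A) where A[i][j] is the count of term i in document j."""
    # Vocabulary: every distinct token, in sorted order so rows are reproducible.
    terms = sorted({token for doc in docs for token in doc})
    counts = [Counter(doc) for doc in docs]
    # Counter returns 0 for missing tokens, which gives the sparse zero entries.
    A = [[c[term] for c in counts] for term in terms]
    return terms, A


# Hypothetical, already stemmed toy documents (not from the paper's data).
docs = [
    ["spywar", "malwar", "softwar"],
    ["malwar", "viru", "malwar"],
    ["nsa", "phone", "record"],
]
terms, A = term_document_matrix(docs)
nonzero = sum(value != 0 for row in A for value in row)
print(terms)
print(A)
print(f"{nonzero} of {len(terms) * len(docs)} entries are nonzero")
```

Even in this tiny example most entries are zero, which is the sparsity the text refers to.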
The matrix $A$ is then factored into the product of three matrices using SVD.
Given a matrix $A$, where $\mathrm{rank}(A) = r$, the SVD of $A$ is defined as:
$$A = U S V^{T}. \tag{2}$$
The columns of $U$ and $V$ are referred to as the left and right singular vectors,
respectively, and the singular values of $A$ are the diagonal elements of $S$, or the
nonnegative square roots of the $n$ eigenvalues of $A A^{T}$.
As defined by Equation (2), the SVD is used to represent the original
relationships among terms and documents as sets of linearly-independent vectors.
By performing a truncated SVD that keeps the $k$ largest singular values and the
corresponding singular vectors, the original term-document matrix (TDM) can be
reduced to a smaller collection of vectors in $k$-space for conceptual query processing [2].
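The truncation step can be sketched with NumPy as follows. This is a minimal illustration, assuming a small dense matrix; the matrix values, the choice $k = 2$, and the helper name `truncated_svd` are hypothetical. A real term-document matrix of this size would normally be stored in sparse form and factored with a sparse solver.

```python
import numpy as np


def truncated_svd(A, k):
    """Keep the k largest singular values and their singular vectors."""
    # numpy returns singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


# Hypothetical 5-term x 4-document count matrix.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 2, 1],
    [1, 0, 0, 1],
], dtype=float)

U_k, s_k, Vt_k = truncated_svd(A, k=2)

# Rank-k reconstruction of the original matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Coordinates of each term in k-space (one row per term); with k = 2
# these can be drawn directly on a two-dimensional scatter plot.
term_coords = U_k * s_k
print(term_coords)
```

Scaling the left singular vectors by the singular values is one common convention for term coordinates; other scalings are also used in practice.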
capture the polysemy and synonymy in text for applications in the information
retrieval domain. Similar to LSA, PLSA uses a term-document matrix which
describes patterns of term (word) distribution across a set of documents (blog
entries). By implementing PLSA, topics are generated from the blog entries,
where each topic produces a list of word usage, with the model parameters fitted by
maximum likelihood estimation using the expectation maximization (EM) algorithm.
The starting point for PLSA is the aspect model [9]. The aspect model is
a latent variable model for co-occurrence data associating an unobserved class
variable $z_k \in \{z_1, \ldots, z_K\}$ with each observation, an observation being the
occurrence of a keyword in a particular blog entry. There are three probabilities
used in PLSA:
In the collection, the probability of each blog and the probability of each
keyword are known, while the probability of an aspect given a blog and the
probability of a keyword given an aspect are unknown. Using the above three
probabilities and conditions, the generative scheme consists of three steps:
1. select a blog entry $b_i$ with probability $P(b_i)$,
2. pick a latent class $z_k$ with probability $P(z_k \mid b_i)$,
3. generate a keyword $w_j$ with probability $P(w_j \mid z_k)$.
As a result, a joint probability model is obtained in asymmetric parameterization:
$$P(w_j \mid b_i) = \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid b_i) \tag{4}$$
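The three generative steps and the mixture in Equation (4) can be sketched numerically. All probability values below are hypothetical (two blog entries, two latent classes, three keywords), and the function name `sample_observation` is introduced only for this illustration.

```python
import numpy as np

# Hypothetical parameters: N = 2 blog entries, K = 2 latent classes, M = 3 keywords.
P_b = np.array([0.5, 0.5])                 # P(b_i)
P_z_given_b = np.array([[0.9, 0.1],        # P(z_k | b_1)
                        [0.2, 0.8]])       # P(z_k | b_2)
P_w_given_z = np.array([[0.7, 0.2, 0.1],   # P(w_j | z_1)
                        [0.1, 0.3, 0.6]])  # P(w_j | z_2)


def sample_observation(rng):
    """Draw one (blog entry, keyword) pair following the three steps."""
    i = rng.choice(len(P_b), p=P_b)                           # step 1
    k = rng.choice(P_z_given_b.shape[1], p=P_z_given_b[i])    # step 2
    j = rng.choice(P_w_given_z.shape[1], p=P_w_given_z[k])    # step 3
    return int(i), int(j)


# Equation (4): summing over the latent class is a matrix product.
P_w_given_b = P_z_given_b @ P_w_given_z    # shape N x M

rng = np.random.default_rng(0)
i, j = sample_observation(rng)
print(P_w_given_b)
```

Each row of `P_w_given_b` is a keyword distribution for one blog entry, so each row sums to one.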
After the aspect model is generated, the model is fitted using the EM algorithm.
The EM algorithm involves two steps, namely the expectation (E) step
and the maximization (M) step. The E-step computes the posterior probability
for the latent variable by applying Bayes' formula, so the parameterization of
the joint probability model is obtained as:
$$P(z_k \mid b_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid b_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid b_i)} \tag{5}$$
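The M-step updates the parameters based on the expected complete data
log-likelihood, which depends on the posterior probability resulting from the E-step.
Hence the M-step re-estimates the following two probabilities:
$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(b_i, w_j)\, P(z_k \mid b_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(b_i, w_m)\, P(z_k \mid b_i, w_m)} \tag{6}$$
$$P(z_k \mid b_i) = \frac{\sum_{j=1}^{M} n(b_i, w_j)\, P(z_k \mid b_i, w_j)}{n(b_i)} \tag{7}$$
Here $n(b_i, w_j)$ is the number of times keyword $w_j$ occurs in blog entry $b_i$, and
$n(b_i) = \sum_{j=1}^{M} n(b_i, w_j)$ is the length of the entry.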
The EM iterations continue to increase the likelihood function until a
stopping condition is met and the program is terminated. This condition can
be a convergence condition or a cut-off point; in either case, EM reaches a
local maximum of the likelihood, rather than a guaranteed global maximum.
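The E-step and M-step can be put together in a compact NumPy sketch. This is an illustration under stated assumptions, not the implementation used in the experiments: the count matrix is laid out with one row per blog entry and one column per keyword (the transpose of the term-document matrix), every keyword occurs at least once, every entry is non-empty, and the function name `plsa`, the random initialisation, and the toy counts are all hypothetical.

```python
import numpy as np


def plsa(n, K, max_iter=200, tol=1e-6, seed=0):
    """Fit the aspect model by EM.

    n is an N x M array with n[i, j] = n(b_i, w_j).
    Returns P(w|z) as K x M, P(z|b) as N x K, and the log-likelihood history.
    """
    rng = np.random.default_rng(seed)
    N, M = n.shape
    # Random, strictly positive starting values, normalised to distributions.
    P_w_z = rng.random((K, M)) + 0.1
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_b = rng.random((N, K)) + 0.1
    P_z_b /= P_z_b.sum(axis=1, keepdims=True)

    history = []
    for _ in range(max_iter):
        # E-step: posterior P(z_k | b_i, w_j) by Bayes' formula, shape N x K x M.
        joint = P_z_b[:, :, None] * P_w_z[None, :, :]
        post = joint / joint.sum(axis=1, keepdims=True)

        # M-step: re-estimate P(w_j | z_k) and P(z_k | b_i) from weighted counts.
        weighted = n[:, None, :] * post
        P_w_z = weighted.sum(axis=0)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True)
        P_z_b = weighted.sum(axis=2) / n.sum(axis=1, keepdims=True)

        # Log-likelihood of the counts (the constant P(b_i) term is omitted).
        ll = float((n * np.log(P_z_b @ P_w_z)).sum())
        if history and ll - history[-1] < tol:
            history.append(ll)
            break  # convergence condition
        history.append(ll)
    return P_w_z, P_z_b, history


# Hypothetical counts: 4 blog entries x 6 keywords.
counts = np.array([
    [4, 3, 2, 0, 0, 1],
    [3, 4, 1, 1, 0, 0],
    [0, 1, 0, 4, 3, 2],
    [0, 0, 1, 3, 4, 3],
], dtype=float)
P_w_z, P_z_b, history = plsa(counts, K=2)
print(np.round(P_z_b, 3))
```

The `max_iter` argument plays the role of a cut-off point and `tol` the role of a convergence condition. Because the result depends on the random start, running the sketch with several seeds and keeping the fit with the highest final log-likelihood is a common precaution.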
In short, the PLSA model selects the model parameter values that maximize
the probability of the observed data, and returns the relevant probability
distributions by applying the EM algorithm. Word usage analysis is a common
application of the aspect model. Based on the pre-processed
term-document matrix, the blogs are then classified into different aspects or
topics. For each aspect, the keyword usage, such as the probable words in the
class-conditional distribution $P(w_j \mid z_k)$, is determined. Empirical results
indicate the advantages of PLSA in reducing perplexity, and its high
precision and recall in information retrieval [9].
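Reading the most probable words out of the class-conditional distribution amounts to sorting each row of the fitted matrix. The sketch below assumes a matrix with one row per aspect and one column per keyword; the vocabulary, the probability values, and the helper name `top_keywords` are hypothetical.

```python
import numpy as np


def top_keywords(P_w_given_z, vocab, n_top=3):
    """Return the n_top most probable keywords for each aspect."""
    # Negate so that argsort orders each row from most to least probable.
    order = np.argsort(-P_w_given_z, axis=1)[:, :n_top]
    return [[vocab[j] for j in row] for row in order]


# Hypothetical stemmed vocabulary and class-conditional probabilities.
vocab = ["spywar", "malwar", "softwar", "iraq", "troop", "baghdad"]
P_w_given_z = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.45, 0.25, 0.20],
])

for k, words in enumerate(top_keywords(P_w_given_z, vocab), start=1):
    print(f"Topic {k}: {', '.join(words)}")
```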
—————————————————————————————————————–
Cyber blackmail is on the increase ... Criminal gangs have moved away from the stealth
use of infected computers ... to direct blackmailing of victims. ... Cyber blackmailing
is done ... by encrypting data or by corrupting system information. The criminal then
demands a ransom for its return to the victim. ...
—————————————————————————————————————–
There are a total of 5493 entries in our dataset, and each weblog entry is saved
as a text file for further text preprocessing. For the preprocessing of the blog
data, HTML tags were removed and lexical analysis was performed by removing
stopwords, stemming, and pruning using the Text to Matrix Generator (TMG)
[20]. The total number of terms after pruning and stopword removal is 797. The
term-document matrix was then input to the LSA and PLSA algorithms.
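The preprocessing in the paper was carried out with TMG, a MATLAB toolbox. As a rough Python analogue, the sketch below strips HTML tags, removes stopwords, and prunes rare terms. The stopword list, the pruning rule (keep terms that occur in at least `min_docs` documents), and the function names are assumptions made for illustration and are not TMG's actual settings; stemming is left out because the Python standard library has no stemmer.

```python
import re
from collections import Counter
from html import unescape

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "of", "is", "on", "to", "and", "by", "for"}


def preprocess(raw_html):
    """Strip HTML tags, lowercase, tokenize, and drop stopwords."""
    text = unescape(re.sub(r"<[^>]+>", " ", raw_html))
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def prune(docs, min_docs=2):
    """Keep only terms that appear in at least min_docs documents."""
    doc_freq = Counter(t for doc in docs for t in set(doc))
    keep = {t for t, c in doc_freq.items() if c >= min_docs}
    return [[t for t in doc if t in keep] for doc in docs]


# Hypothetical raw entries, loosely modelled on the sample entry above.
raw_entries = [
    "<p>Cyber blackmail is on the increase</p>",
    "<div>The criminal demands a ransom: cyber blackmail</div>",
    "<p>Spyware is on the increase</p>",
]
tokenized = [preprocess(r) for r in raw_entries]
pruned = prune(tokenized)
print(pruned)
```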
[Figure 2: two-dimensional plot of stemmed weblog keywords; horizontal axis ticks from 0.1 to 1, vertical axis ticks from −1 to 0.8. The plotted term labels and the caption were not recoverable from the extracted text.]
[Figure 3: two-dimensional plot of a subset of the stemmed weblog keywords; vertical axis ticks from −0.4 to 0.5. The plotted term labels and the caption were not recoverable from the extracted text.]
American. Yet another grouping includes the terms death, prison, and life at the
bottom of the graph. Zooming into the large cluster from Figure 2, Figure 3 shows
a subset of the big cluster of keywords. A larger group of keywords (spyware,
malware, software, data, user, window, privacy, spy, domestic) can be identified,
thus showing the ability to descend through a hierarchical grouping of keywords.
The graphs demonstrate that closely related terms can be visualized in
two-dimensional space. Although the two-dimensional graphs
may be an over-simplification of the dimensionality reduction that takes place,
the plots can help to visualize the terms and relate them to the topics produced for the
weblogs.
Table 1. List of keywords for Topic 1: Computer Security
Table 2. List of keywords for Topic 2: Osama bin Laden
Table 3. List of keywords for Topic 3: Iraq War
Table 4. List of keywords for Topic 4: US National Security
conversations in this topic around the middle of May 2006. This may be due to
US President Bush’s comment on May 11, 2006 about a USA Today report on
a massive NSA database that collects information about all phone calls made
within the United States [5]. This is one example of an event that can trigger
much conversation in the blogosphere.
For the topic of Computer Security, we further decompose the topic into separate
subtopics, two of which are shown in Tables 5-6. Malware, which includes
computer viruses, worms, trojan horses, spyware, adware, and other malicious
software, is the topic derived from examining the keywords in Subtopic 1. Subtopic
2 is classified as Macintosh, and reflects the increasing reports of cyber attacks
affecting Macintosh computers. In this way, we can classify and decompose the
topics into a hierarchy of subtopics, which may be useful for examining larger
data sets.
The power of PLSA in cyber security applications includes the ability to
automatically detect terms and keywords related to cyber security threats and
terror events. By presenting blogs with measurable keywords, we can improve
[Figure 4: plot with axes Topics, Documents (ticks from 1000 to 5000), and Documents By Date (May 1 to May 31, 2006)]
Fig. 4. Topic by document distribution by date
Table 5. List of keywords for Subtopic 1: Malware
Table 6. List of keywords for Subtopic 2: Macintosh
5 Conclusions
In this paper, we analyzed weblog posts for various categories of cyber security
threats, including cyber crime and cyber
terrorism. To our knowledge, this is the first such study focusing on cyber security
weblogs. We use latent semantic analysis to illustrate similarities among the
terms in the weblog dataset. Our experiments on our
dataset of weblogs demonstrate how our probabilistic weblog model can present
the blogosphere in terms of topics with measurable keywords, hence tracking
popular conversations and topics in the blogosphere. By applying a probabilistic
approach, we can improve information retrieval in weblog search and keyword
56 F.S. Tsai and K.L. Chan
detection, and provide an analytical foundation for the future of security intel-
ligence analysis of weblogs.
Potential applications of this stream of research may include automatically
monitoring and identifying trends in cyber terror and security threats in weblogs.
This may be significant for government and intelligence agencies wishing
to monitor, in real time, potential international terror threats present in weblog
conversations and the blogosphere.
References
1. Avesani, P., Cova, M., Hayes, C., Massa, P.: Learning Contextualised Weblog Top-
ics. WWW ’05 Workshop on the Weblogging Ecosystem: Aggregation, Analysis
and Dynamics (2005)
2. Berry, M., Dumais, S., O’Brien, G.: Using linear algebra for intelligent
information retrieval. SIAM Review 37(4) (1995) 573–595
3. Columbus, L.: Blog Mining Gets Real. CRM Buyer (2005).
4. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, R.: Indexing by
latent semantic analysis. Journal of the American Society for Information Science
41(6) (1990) 391–407
5. Diamond, J.: NSA has massive database of Americans’ phone calls. In USA Today
(May 10, 2006)
6. Gill, K.E.: How Can We Measure the Influence of the Blogosphere? WWW ’04
Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics
(2004)
7. Glance, N.S., Hurst, M., Tomokiyo, T.: BlogPulse: Automated Trend Discovery for
Weblogs. WWW ’04 Workshop on the Weblogging Ecosystem: Aggregation, Analysis
and Dynamics (2004)
8. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information Diffusion Through
Blogspace. WWW ’04 (2004)
9. Hofmann, T.: Probabilistic Latent Semantic Indexing. SIGIR’99 (1999)
10. Mei, Q., Liu, C., Su, H., Zhai, C.: A Probabilistic Approach to Spatiotemporal
Theme Pattern Mining on Weblogs. WWW ’06 (2006)
11. Nakajima, S., Tatemura, J., Hino, Y., Hara, Y., Tanaka, K.: Discovering Important
Bloggers based on Analyzing Blog Threads. WWW ’05 Workshop on the Weblogging
Ecosystem: Aggregation, Analysis and Dynamics (2005)
12. Newman, D., Chemudugunta, C., Smyth, P., Steyvers, M.: Analyzing Entities and
Topics in News Articles Using Statistical Topic Models. ISI ’06 (2006)
13. Pikas, C.K.: Blog Searching for Competitive Intelligence, Brand Image, and Rep-
utation Management. Online. 29(4) (2005) 16–21
14. Prabowo, R., Thelwall, M.: A Comparison of Feature Selection Methods for an
Evolving RSS Feed Corpus, Information Processing and Management, 42 (2006)
1491–1512
15. Tsai, F.S., Chan, C.K. (eds): Cyber Security, Pearson Education, Singapore (2006)
16. Tsai, F.S., Chen, Y., Chan, K.L.: Probabilistic Latent Semantic Analysis for Search
and Mining of Corporate Blogs (2007)
17. Wikipedia contributors: Intelligence Analysis. In: Wikipedia, The Free Encyclopedia,
http://en.wikipedia.org/wiki/Intelligence_analysis (accessed Nov 7,
2006)
18. Yang, C.C., Shi, X., Wei, C.-P.: Tracing the Event Evolution of Terror Attacks
from On-Line News. ISI ’06 (2006)
19. Yilmazel, O., Symonenko, S., Balasubramanian, N., Liddy, E.D.: Leveraging One-
Class SVM and Semantic Analysis to Detect Anomalous Content. ISI ’05 (2005)
20. Zeimpekis, D., Gallopoulos, E.: TMG: A MATLAB Toolbox for generating term-
document matrices from text collections. Grouping Multidimensional Data: Recent
Advances in Clustering. Springer (2005) 187–210