Dynamic Topic Modelling Tutorial
Matias Hurtado
Engineering Student, Pontificia Universidad Católica de Chile
[email protected]
Advisor: Denis Parra
Assistant Professor, Pontificia Universidad Católica de Chile
[email protected]
Social Computing and Visualization Group, PUC Chile
http://socialcomputing.ing.puc.cl
Introduction
The following tutorial explains how to use Dynamic Topic Modelling (Blei and Lafferty, 2006), an extension of LDA (Blei et al., 2003). It is based on the implementation by S. Gerrish and D. Blei, available at
https://code.google.com/p/princetonstatisticallearning/downloads/detail?name=dtm_release-0.8.tgz
The procedure described here was used during my work that led to the research article "Twitter in Academic Conferences: Usage, Networking and Participation over Time" (Wen et al., 2014) [1]. There we analyzed tweets posted during academic conferences in the Computer Science domain, and specifically we obtained the topic evolution over five years.
Prerequisite
Following this tutorial may require some basic knowledge of Python, SQL databases and basic command-line instructions.
Required files
You must get Gerrish & Blei's DTM release files. You can download the compiled binary files from
https://github.com/magsilva/dtm
You can download the Python scripts from the dtm_gensim repository on GitHub:
https://github.com/mihurtado/dtm_gensim
If you want the database used in this tutorial, you can ask for it by emailing [email protected].
The Database
Each tweet record in the database has the following fields:
ttID: the tweet id
content: the content of the tweet
conference: the conference name
category_index: the year of the conference
We loaded the data into a MySQL database. You must load the database into a SQL server; we recommend installing MySQL, following this tutorial: http://dev.mysql.com/doc/refman/5.1/en/windows-installation.html. Then load the database using the MySQL command line, with your credentials:
$ mysql -u [uname] -p [pass] [twitter_conferences] < [database.sql]
The Tutorial
1. Getting and transforming the Data
We will use the compiled binary of Blei's DTM, called dtm-win64.exe or dtm-win32.exe, depending on your Windows architecture. In magsilva's GitHub account there are also binaries for Mac OS X. To execute the program, we need two essential files:
prefix-seq.dat: contains the timestamps. The structure of the file must be:
Number_Timestamps
number_docs_time_1
...
number_docs_time_i
...
number_docs_time_NumberTimestamps
In our case (tweets of the conferences), our timestamps are years from 2009 to 2013, so our file must look like:
5
number_docs_2009
number_docs_2010
number_docs_2011
number_docs_2012
number_docs_2013
We are missing the number_docs_year numbers, so we have to calculate them.
prefix-mult.dat: contains the documents themselves, one per line, in the format unique_word_count index1:count1 index2:count2 ... Each line represents a document; in our case, each line is a tweet. The unique_word_count is the total number of unique words in the document, each index is an identifier for a word (mapped through a dictionary), and each count is how many times the word shows up in the document/tweet.
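For example, a tweet whose processed text is reduced to "paper topic model topic" has three unique words; with a hypothetical dictionary that maps paper to index 15, topic to 40 and model to 108, its line in the mult file would look like:
3 15:1 40:2 108:1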
We are also going to construct three more files that will help us interpret the results:
dictionary.dict: a dictionary mapping each word to its index in the prefix-mult.dat file.
vocabulary.dat: stores each word that appears in the tweets.
metadata.dat: stores simple tweet metadata (tweet id, date and content).
We will use Python to collect all the tweets from the SQL database, then we will create a dictionary, and finally we will transform the text of the tweets into a Vector Space Model (VSM) representation.
Assuming that your tweets are in a MySQL database and you want to write them into a file to process them with DTM, go through this part of the tutorial; if not, jump directly to section 3, "Generating the Corpus with gensim".
So first we will get the tweets from the SQL database. We will use simple SQL commands to extract the information we need. We will apply some filters over the content, such as removing hashtags and mentions, removing stop words, stemming and lemmatization, and keeping only nouns and adjectives.
Initial configuration:

#Import some modules for reading and getting data.
#If you don't have these modules, you must install them.
import csv
import MySQLdb
import re
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import os
from gensim import corpora, models, similarities  #to create a dictionary

#Set years, these will be the timestamps
time_stamps = ['2009', '2010', '2011', '2012', '2013']

#Set the conference name to be analyzed
conference = ''

#DB MySQL connection. Put your credentials here.
db_host = 'localhost'                 #Host
db_user = 'user'                      #User
db_pass = 'password'                  #Password
db_database = 'twitter_conferences'   #Database

##Connect...
db = MySQLdb.connect(host=db_host, user=db_user, passwd=db_pass, db=db_database)
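The filtering code in the next step relies on several NLTK resources (the English stop word list, WordNet for lemmatization, and the tokenizer and POS tagger models). If you have never downloaded them, a one-time setup roughly like the following is needed; note that the exact resource names can differ between NLTK versions:

import nltk

nltk.download('stopwords')   #stop word lists used by stopwords.words('english')
nltk.download('wordnet')     #WordNet data used by WordNetLemmatizer
nltk.download('punkt')       #tokenizer models used by nltk.word_tokenize
nltk.download('maxent_treebank_pos_tagger')   #default tagger for nltk.pos_tag in older NLTK releases
                                              #(newer releases use 'averaged_perceptron_tagger')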
Getting the data. Here we will get the data from the database, keep it in memory (the tweets Python list), and save it for future reference in the metadata.dat file.
#Set metadata output file
dat_outfile = open(os.path.join('data', conference, 'metadata.dat'), 'w')
dat_outfile.write('id\tdate\tcontent\n')  #write header

tweets = list()

#Set total_tweets list per year, starting at 0
total_tweets_list = [0 for year in time_stamps]

#Analyze each year...
time_stamps_count = 0
for year in time_stamps:  #For each year
    print('Analyzing year ' + str(year))

    #Set total_tweets to 0
    total_tweets = 0

    #Get tweets with mysql
    cursor = db.cursor()

    #Query
    query = "SELECT ttID, content, category_index FROM con_tweets_filtered WHERE conference = '" \
            + conference + "' and category_index = " + year + " and relevant=1 and lang='en'"

    #Execute query
    cursor.execute(query)
    result = cursor.fetchall()  #store results
    cursor.close()

    #For each result (tweet), get content
    for line in result:
        #Remove @xxxx, #xxxxx and links
        content = [unicode(word.lower(), errors='ignore') for word in line[1].split()
                   if word.find('@') == -1 and word.find('#') == -1 and word.find('http') == -1]

        #join words list to one string
        content = ' '.join(content)

        #remove symbols
        content = re.sub(r'[^\w]', ' ', content)

        #remove stop words, short words and words containing digits
        #(this could also be done in the next step with gensim)
        content = [word for word in content.split()
                   if word not in stopwords.words('english')
                   and len(word) > 3
                   and not any(c.isdigit() for c in word)]

        #Stemming and lemmatization (word by word)
        lmtzr = WordNetLemmatizer()
        content = ' '.join([lmtzr.lemmatize(word) for word in content])

        #Filter only nouns and adjectives
        tokenized = nltk.word_tokenize(content)
        classified = nltk.pos_tag(tokenized)
        content = ' '.join([word for (word, tag) in classified
                            if tag.startswith('NN') or tag.startswith('JJ')])

        tweets.append([line[0], content, line[2]])
        total_tweets += 1

        dat_outfile.write(str(line[0]) + '\t' + str(line[2]) + '\t' + content)
        dat_outfile.write('\n')

    #Add the total tweets to the total tweets per year list
    total_tweets_list[time_stamps_count] += total_tweets
    time_stamps_count += 1

dat_outfile.close()  #Close the metadata file
print('Done collecting tweets')

#Write the seq file (format described above: number of timestamps, then docs per timestamp)
#Note: the file name 'seq.dat' inside the conference folder is an assumption
seq_outfile = open(os.path.join('data', conference, 'seq.dat'), 'w')
seq_outfile.write(str(len(time_stamps)) + '\n')
for count in total_tweets_list:
    seq_outfile.write(str(count) + '\n')
seq_outfile.close()
print('Done writing seq')
We will use the gensim Python package to generate our corpus. You can see the full tutorial and other options on this page: http://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors
Consider that documents is a list containing the tweets, e.g.
["I'm at WWW conference 2013, hope to have a great time!", "This is such an interesting project", "#HT2010", ...]
We will also remove stop words and words that appear only once (you must define a stop words list or use, for instance, the one available in the NLTK library).
stoplist = set('for a of the and to in'.split())

#Construct the dictionary
dictionary = corpora.Dictionary(line[1].lower().split() for line in tweets)

# remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
dictionary.save(os.path.join('data', conference, 'dictionary.dict'))  # store the dictionary

#Save vocabulary
vocFile = open(os.path.join('data', conference, 'vocabulary.dat'), 'w')
for word in dictionary.values():
    vocFile.write(word + '\n')
vocFile.close()
print('Dictionary and vocabulary saved')
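The following steps use the dictionary object that is still in memory. If you run them in a separate session, you can load the saved dictionary back first; a minimal sketch, using the same paths as above:

from gensim import corpora
import os

dictionary = corpora.Dictionary.load(os.path.join('data', conference, 'dictionary.dict'))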
So, our dictionary is finished and ready to be used in our vectorization. We will introduce a class to avoid storing the words of every document in RAM; instead, we will analyze each document's words separately.
class MyCorpus(object):
    def __iter__(self):
        for line in tweets:
            # each element of tweets is [ttID, content, year]; vectorize the content field
            yield dictionary.doc2bow(line[1].lower().split())
And we will create an instance of the class, containing the corpus:

corpus_memory_friendly = MyCorpus()
Now that our corpus is ready and each document will be vectorized when we iterate over it, we can start writing our mult.dat file to use it with DTM. But we have a little problem: the corpus is represented by a list of lists of tuples, like this:
[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], ...]
And we have to write it in the specified format:
unique_word_count index1:count1 index2:count2 ... indexn:countn
Now, we can write the mult.dat file:

multFile = open(os.path.join('data', 'WWW', 'mult.dat'), 'w')
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    multFile.write(str(len(vector)) + ' ')
    for (wordID, weight) in vector:
        multFile.write(str(wordID) + ':' + str(weight) + ' ')
    multFile.write('\n')
multFile.close()
print('mult file saved')
In the example, data/WWW/ is just the name of a folder; you can change it to something else, but don't forget to place the mult.dat file in the folder specified in --corpus_prefix.
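Before the next step, you must actually run the DTM binary over the mult and seq files to produce the output folder read below. As a rough sketch, an invocation following the example script shipped with the DTM release could look like the line below; the binary name, paths and number of topics are placeholders to adapt, and you should check the release's documentation for the exact file-naming convention expected by --corpus_prefix:

$ dtm-win64.exe --mode=fit --ntopics=5 --rng_seed=0 --initialize_lda=true --corpus_prefix=data/WWW/ --outname=data/WWW/output --alpha=0.01 --top_chain_var=0.005 --lda_sequence_min_iter=6 --lda_sequence_max_iter=20 --lda_max_em_iter=10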
To interpret this output data, we will use a Python package for analyzing and visualizing corpus data named Tethne [2]. You might use another package if you want. With this tool, we will be able to load the DTM output and examine the evolution of each topic over time:
#Import to tethne
import tethne.model.corpus.dtmmodel

dtm = tethne.model.corpus.dtmmodel.from_gerrish('data/' + conference + '/output/',
                                                'data/' + conference + '/metadata.dat',
                                                'data/' + conference + '/vocabulary.dat')
Then, we can generate the exports shown in the Tethne documentation. As an example, here we will print the 10 most common words for each topic (in our case 5 topics) and for each year (in our case 5 years), with the probability of appearing in a document, so that we can then plot a topic evolution.
for topic_i in range(5):
    arr = dtm.topic_evolution(topic_i, 10)
    for key in arr[1].keys():
        for year_i in range(5):
            print([conference, topic_i, key, (year_i + 2009), arr[1][key][year_i]])
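To feed these values into R, one convenient option is to write them to a CSV file instead of printing them. A minimal sketch using the csv module imported at the beginning; the file name topic_evolution.csv and the column names are our own choice:

import csv
import os

outpath = os.path.join('data', conference, 'topic_evolution.csv')
with open(outpath, 'wb') as csvfile:   #'wb' because this tutorial uses the Python 2 csv module
    writer = csv.writer(csvfile)
    writer.writerow(['conference', 'topic', 'word', 'year', 'probability'])
    for topic_i in range(5):
        arr = dtm.topic_evolution(topic_i, 10)
        for key in arr[1].keys():
            for year_i in range(5):
                writer.writerow([conference, topic_i, key, year_i + 2009, arr[1][key][year_i]])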
Using the output from the code above, we can generate some interesting plots with R, like the one below, which shows the evolution of three topics over the time span of one conference. We can observe that the conference was very stable in its topics over the five years.
Having the data aggregated into a data frame in this format (here called dfc, with one row per conference, year and topic), you can plot it in R:
library(ggplot2)
library(gridExtra)
library(directlabels)

> str(dfc)
'data.frame': 240 obs. of 8 variables:
 $ year  : Factor w/ 5 levels "2009","2010",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ confID: Factor w/ 16 levels "CHI","CIKM","ECTEL",..: 1 1 1 2 2 2 3 3 3 4 ...
 $ topic : Factor w/ 3 levels "Topic1","Topic2",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ N     : num 2083 2083 2083 37 37 ...
 $ value : num 0.387 0.294 0.319 0.28 0.391 ...
 $ sd    : num 0.44 0.412 0.42 0.425 0.447 ...
 $ se    : num 0.00964 0.00903 0.0092 0.06985 0.07347 ...
 $ ci    : num 0.0189 0.0177 0.018 0.1417 0.149 ...

pd <- position_dodge(.1)
ggplot(dfc, aes(x=year, y=value, colour=topic, group=topic)) +
  geom_point(position=pd) +
  geom_dl(aes(label=topic), size=2.5, list("first.qp", cex=0.5, rot=30)) +
  geom_dl(aes(label=topic), size=2.5, list("last.qp", cex=0.5, rot=30)) +
  geom_smooth(aes(group=topic, ymin=value-se, ymax=value+se)) +
  facet_wrap(~ confID, ncol=4) +
  theme_bw() +
  theme(legend.position="bottom", panel.grid.minor=element_blank()) +
  theme(axis.title.x=element_blank(), axis.title.y=element_blank(),
        axis.text.x=element_text(angle=30, vjust=0.1, hjust=0.1, size=5),
        strip.background=element_blank(), strip.text=element_text(size=7),
        legend.text=element_text(size=4), legend.title=element_text(size=4),
        panel.margin=unit(0, "null"), plot.margin=rep(unit(0.1, "cm"), 4)) +
  labs(x=NULL)
The result will look like this:
[figure: topic evolution over the years, one panel per conference]
Or even this:
# $ Year       : int 2009 2010 2011 2012 2013 2009 2010 2011 2012 2013 ...
# $ Probability: num 0 0 0.0142 0 0 ...
# ===
conf.dtm.df <- dtm.df[dtm.df$Conference == "CHI", ]
gchi <- ggplot(data=conf.dtm.df, aes(x=Year, y=Word))
gchi <- gchi + geom_tile(aes(fill = Probability), colour="black", stat = "identity") +
  scale_fill_gradient(low="white", high="blue") +
  ggtitle(paste("", as.character(conf.dtm.df$Conference), " ")) +
  facet_wrap(~ TopicID, scales="free_y", ncol=5) +
  geom_text(data=conf.dtm.df[conf.dtm.df$Year == 2011,], aes(label=Word), size=4, vjust=0.25) +
  theme_bw() +
  theme(panel.grid.major = element_blank(), legend.position="bottom", panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 60, vjust = 0.1, hjust = 0.1, size = 5),
        strip.background = element_blank(), strip.text = element_text(size = 7),
        legend.text = element_text(size = 4), legend.title = element_text(size = 4),
        panel.margin = unit(0, "null"), plot.margin = rep(unit(0.1, "cm"), 4),
        legend.margin = unit(-0.6, "cm"), legend.key.height = unit(0.4, "cm")) +
  labs(x = NULL)

#visualize
gchi
...that will look like this:
[figure: word-probability heatmap per topic and year]
References
[1] Wen, X., Lin, Y., Trattner, C. and Parra, D.: Twitter in Academic Conferences: Usage, Networking and Participation over Time. In Proceedings of the ACM 2014 International Conference on Hypertext and Social Media (Hypertext 2014), ACM, New York, USA, 2014.
[2] Tethne package for Python: http://diging.github.io/tethne/
[3] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.