
Dynamic Topic Modelling Tutorial

Matias Hurtado
Engineering Student,
Pontificia Universidad Católica de Chile
[email protected]

Advisor: Denis Parra
Assistant Professor, Pontificia Universidad Católica de Chile
[email protected]

Social Computing and Visualization Group, PUC Chile
http://socialcomputing.ing.puc.cl

Introduction
The following tutorial explains how to use Dynamic Topic Modelling (Blei and Lafferty, 2006),
an extension of LDA (Blei et al., 2003). It is based on the implementation by S. Gerrish and D. Blei,
available at
https://code.google.com/p/princeton-statistical-learning/downloads/detail?name=dtm_release-0.8.tgz

The procedure here was used during my work that led to the research article Twitter in
Academic Conferences: Usage, Networking and Participation over Time (Wen et al., 2014).
There we analyzed tweets posted during academic conferences in the Computer Science
domain, and specifically we obtained the topic evolution over five years.

Prerequisite: Following this tutorial may require some basic knowledge of Python, SQL
databases, and basic command line instructions.

Required files
You must get Gerrish & Blei's DTM release files. You can download the compiled binary files
from https://github.com/magsilva/dtm
You can download the python scripts from the dtm_gensim repository on GitHub:
https://github.com/mihurtado/dtm_gensim
If you want the database used, you can ask for it by emailing
[email protected]

The Database

In this tutorial, we will use a small fragment of the database used for the investigation Twitter
in Academic Conferences: Usage, Networking and Participation over Time. This fragment
consists of a SQL database with ~5K tweets posted on the conference WWW. There are
many dimensions of information about each tweet, such as author, date, tweet id, content,
conference, and year. We are going to use:

ttID: the tweet id
content: the content of the tweet
conference: the conference name
category_index: the year of the conference
lang: we have classified the tweets by language, because this DTM tutorial will work only on English tweets.

We'd like to share the datasets of tweets with all this information, but since the Twitter API
Terms of Service do not allow us to, we can share the list of tweet IDs instead; ask for it by emailing
[email protected]

You must load the database into a SQL server. We recommend installing MySQL, following
this tutorial: http://dev.mysql.com/doc/refman/5.1/en/windows-installation.html. Then load the
database using the MySQL command line, with your credentials:

$ mysql -u [uname] -p[pass] [twitter_conferences] < [database.sql]

The Tutorial
1. Getting and transforming the Data
We will use the compiled binary of Blei's DTM, called dtm-win64.exe or dtm-win32.exe,
depending on your Windows architecture. In magsilva's github account there are also binaries
for Mac OS X. To execute the program, we need two essential files:

prefix-seq.dat: contains the timestamps. The structure of the file must be:

Number_Timestamps
number_docs_time_1
...
number_docs_time_i
...
number_docs_time_NumberTimestamps

In our case (tweets of the conference), our timestamps are years from 2009 to 2013,
so our file must look like:

5
number_docs_2009
number_docs_2010
number_docs_2011
number_docs_2012
number_docs_2013

We are missing the number_docs_year numbers, so we have to calculate them.

prefix-mult.dat: contains the vectorized data. The structure of each line in this file must
be like this:

unique_word_count index1:count1 index2:count2 ... indexn:countn

Each line represents a document; in our case, each line is a tweet. The unique_word_count
is the number of unique words in the document, each index is an identifier for a word that will
be resolved through a dictionary, and each count is how many times that word shows up in the
document/tweet. (A small illustrative example of both files is shown right after the list below.)

We are also going to construct three more files that will help us in the interpretation of
the results:
dictionary.dict: we'll create a dictionary where each word is connected with its index in the
prefix-mult.dat file.
vocabulary.dat: it will store each word that appears in the tweets.
metadata.dat: it will store the tweets' simple metadata (tweet id, date and content).
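
As a purely illustrative example (the counts below are made up), a collection of six tweets
spread over the five years could produce files like these:

prefix-seq.dat:
5
2
1
1
1
1

prefix-mult.dat (one line per tweet, six lines in total):
3 0:1 1:2 2:1
2 1:1 3:1
...

The first line of prefix-mult.dat describes a tweet with 3 unique words: word 0 appears once,
word 1 twice and word 2 once, where the numeric ids come from the dictionary we will build below.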

2. Generating the Input Files

If you already have the files in the specified format, you can skip this section and go directly to
section 3, Back to DTM Blei's executable.

We will use Python to collect all the tweets from the SQL database, then we'll create a
dictionary, and finally we'll transform the text of the tweets to a Vector Space Model (VSM)
representation.

2.1 Read From database

Assuming that your tweets are in a MySQL database and you want to write them into a file to
process them with DTM, go through this part of the tutorial; if not, jump directly to section 3,
Generating the Corpus with gensim.

So first we'll get the tweets from the SQL database. We will use simple SQL commands to
extract the information we need. We will apply some filters over the content, such as removing
the hashtags and mentions, removing stop words, stemming and lemmatization, and keeping
only nouns and adjectives.

Initial configuration:

#Import some modules for reading and getting data.
#If you don't have these modules, you must install them.
import csv
import MySQLdb
import re
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import os
from gensim import corpora, models, similarities  #to create a dictionary

#Set years, these will be the timestamps
time_stamps = ['2009', '2010', '2011', '2012', '2013']
#Set the conference name to be analyzed
conference = ''
#DB MySQL connect. Put your credentials here.
db_host = 'localhost'                   #Host
db_user = 'user'                        #User
db_pass = 'password'                    #Password
db_database = 'twitter_conferences'     #Database
##Connect...
db = MySQLdb.connect(host=db_host, user=db_user, passwd=db_pass, db=db_database)

Getting the data. Here we will get the data from the database, keep it in memory (the tweets
python list) and save it for future reference in the metadata.dat file.

#Set metadata output file
dat_outfile = open(os.path.join('data', conference, 'metadata.dat'), 'w')
dat_outfile.write('id\tdate\tcontent\n')  #write header

tweets = list()
#Set total_tweets list per year, starting at 0
total_tweets_list = [0 for year in time_stamps]

#Analyze each year...
time_stamps_count = 0
for year in time_stamps:  #For each year
    print('Analyzing year ' + str(year))
    #Set total_tweets to 0
    total_tweets = 0
    #Get tweets with mysql
    cursor = db.cursor()
    #Query
    query = ("SELECT ttID, content, category_index FROM con_tweets_filtered WHERE conference = '"
             + conference + "' and category_index = " + year + " and relevant=1 and lang='en'")
    #Execute query
    cursor.execute(query)
    result = cursor.fetchall()  #store results
    cursor.close()
    #For each result (tweet), get content
    for line in result:
        #Remove @xxxx, #xxxxx and links
        content = [unicode(word.lower(), errors='ignore') for word in line[1].split()
                   if word.find('@') == -1 and word.find('#') == -1 and word.find('http') == -1]
        #join words list to one string
        content = ' '.join(content)
        #remove symbols
        content = re.sub(r'[^\w]', ' ', content)
        #remove stop words, this could also be done in the next step with gensim
        content = [word for word in content.split()
                   if word not in stopwords.words('english')
                   and len(word) > 3
                   and not any(c.isdigit() for c in word)]
        #join words list to one string
        content = ' '.join(content)
        #Stemming and lemmatization (applied word by word)
        lmtzr = WordNetLemmatizer()
        content = ' '.join(lmtzr.lemmatize(word) for word in content.split())
        #Filter only nouns and adjectives: keep words tagged NN* or JJ* and join them back into one string
        tokenized = nltk.word_tokenize(content)
        classified = nltk.pos_tag(tokenized)
        content = ' '.join(word for (word, tag) in classified
                           if tag.startswith('NN') or tag.startswith('JJ'))
        tweets.append([line[0], content, line[2]])
        total_tweets += 1
        dat_outfile.write(str(line[0]) + '\t' + str(line[2]) + '\t' + content)
        dat_outfile.write('\n')
    #Add the total tweets to the total tweets per year list
    total_tweets_list[time_stamps_count] += total_tweets
    time_stamps_count += 1

dat_outfile.close()  #Close the metadata file
print('Done collecting tweets')

Now that you have all the tweets and you have already counted how many there are, you can
write the prefix-seq.dat file with the specified format.

#Write seq file
seq_outfile = open(os.path.join('data', conference, 'foo-seq.dat'), 'w')
seq_outfile.write(str(len(total_tweets_list)) + '\n')  #number of TimeStamps
for count in total_tweets_list:
    seq_outfile.write(str(count) + '\n')  #write the total tweets per year (timestamp)
seq_outfile.close()
print('Done writing seq')

The next step is to write the mult.dat file. To do this, we must have our documents vectorized
and our dictionary.

3. Generating the Corpus with gensim

We will use the gensim python package to generate our corpus. You can see the full tutorial
and other options on this page:
http://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors

Following the gensim tutorial, we'll first create a dictionary and a vocabulary using the fields
created in the previous step (tweet content filtered with stemming, lemmatization, lowercase,
etc.) and stored in metadata.dat, whose columns are: tweet_id, tweet_date, tweet_content.

Consider that documents is a list containing the tweets, e.g.
["I'm at WWW conference 2013, hope to have a great time!", "This is such an interesting project", "#HT2010", ...]
We will also remove stop words and words that appear only once (you must define a
stopwords list or use, for instance, the one available in the NLTK library).

stoplist = set('for a of the and to in'.split())
#Construct the dictionary
dictionary = corpora.Dictionary(line[1].lower().split() for line in tweets)
#remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  #remove stop words and words that appear only once
dictionary.compactify()  #remove gaps in id sequence after words that were removed
dictionary.save(os.path.join('data', conference, 'dictionary.dict'))  #store the dictionary

#Save vocabulary
vocFile = open(os.path.join('data', conference, 'vocabulary.dat'), 'w')
for word in dictionary.values():
    vocFile.write(word + '\n')
vocFile.close()
print('Dictionary and vocabulary saved')

So, our dictionary is finished and ready to be used in our vectorization. We will introduce a
class to avoid storing the words of every document in RAM; instead, we will analyze each
document's words separately.

class MyCorpus(object):
    def __iter__(self):
        for line in tweets:
            #there is one document per entry of tweets ([ttID, content, category_index]),
            #so vectorize the content field, with tokens separated by whitespace
            yield dictionary.doc2bow(line[1].lower().split())

And we will create an instance of the class, containing the corpus:
corpus_memory_friendly = MyCorpus()

Now that our corpus is ready and each document will be vectorized when we call each line,
we can start writing our mult.dat file to use it in DTM. But we have a little problem: the corpus
is represented by a list of lists of tuples, like this:

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], ...]

And we have to write it in the specified format:

unique_word_count index1:count1 index2:count2 ... indexn:countn

Now, we can write the mult.dat file:

multFile = open(os.path.join('data', 'WWW', 'mult.dat'), 'w')
for vector in corpus_memory_friendly:  #load one vector into memory at a time
    multFile.write(str(len(vector)) + ' ')
    for (wordID, weight) in vector:
        multFile.write(str(wordID) + ':' + str(weight) + ' ')
    multFile.write('\n')
multFile.close()
print('mult file saved')

3. Back to DTM Blei's executable

Once we have the corpus ready (the mult.dat file created in the previous step), we can load it
into Blei's executable to perform DTM topic modelling. To do this, we can call it from the
command line:

$ dtm-win64.exe ./main --ntopics=3 --mode=fit --rng_seed=0 --initialize_lda=true --corpus_prefix=data/WWW/ --outname=data/WWW/output --top_chain_var=0.9 --alpha=0.01 --lda_sequence_min_iter=6 --lda_sequence_max_iter=20 --lda_max_em_iter=20

In the example, data/WWW/ is just the name of a folder; you can change it to something else,
but don't forget to place the mult.dat file in the folder specified in --corpus_prefix.

It is important to notice that there is a parameter named alpha. This parameter allows you to
force the topics over the time span to be similar (when it is close to 0) or dissimilar (when it is
close to 1). You have to decide how similar you want the topics to be over the years; you can
try different values and observe what happens. In our experience, setting this parameter too
high (such as 0.9) returns topics completely unrelated to each other.
In the parameter ntopics, you should put the number of topics that you want to export to the
output. In this case, we will get five topics.
After executing this command, we are ready to proceed to interpreting the output.

4. Interpreting Blei's DTM Output data

After executing DTM, you'll have a list of files. Some of these files are:
topic-xxx-var-e-log-prob.dat: this file presents the distribution of words for topic xxx for each
period analyzed. These values are logarithms.
gam.dat: this file stores the parameters of the variational Dirichlet (the gammas) for each
document.
For interpreting the data, you might generate a csv file with the probabilities of each topic per
year by getting the values from the topic-xxx-var-e-log-prob.dat file. Remember to apply exp()
to the values. You also might want to generate a csv with the topic mixtures for each
document. To do this, you have to analyze the gam.dat file. Each consecutive set of n lines
represents the gammas for the n topics of one document. Dividing the i-th line (topic i) by the
sum of the n lines gives you the topic proportion.
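
As a concrete illustration of that manual route, here is a minimal Python sketch (it is not part of
the original scripts). It assumes that gam.dat holds one gamma value per line, with n consecutive
lines per document as described above, and that the output files sit directly under the folder given
in --outname (data/WWW/output); depending on the DTM build they may be written to a subfolder,
so adjust the paths and the number of topics to match your run.

import math

n_topics = 3  #the value passed to --ntopics
#topic mixtures: group the gammas in chunks of n_topics, one chunk per document
gammas = [float(value) for value in open('data/WWW/output/gam.dat')]
doc_topic_proportions = []
for i in range(0, len(gammas), n_topics):
    doc_gammas = gammas[i:i + n_topics]      #the gammas of one document
    total = sum(doc_gammas)
    doc_topic_proportions.append([g / total for g in doc_gammas])  #divide each gamma by their sum

#word distributions: the values in topic-xxx-var-e-log-prob.dat are logarithms, so apply exp()
log_probs = [float(value) for value in open('data/WWW/output/topic-000-var-e-log-prob.dat')]
word_probs = [math.exp(v) for v in log_probs]  #one value per word and timestamp for topic 000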
To make this task easier, we will use a python package to interpret and visualize the output.
You are free to do this with your own method.

4.1 Using tethne to Interpret and Visualize DTM output

To interpret this output data, we will use a python package to visualize corpus data named
Tethne [2]. You might use another package if you want. With this tool, we will be able to
export the topics' evolution over the time span and the topics' most common words into
simple txt files.

To do this, we use the tethne from_gerrish tool to load the corpus, and then we can generate
the outputs that tethne allows us to produce.
To import the corpus into tethne, just do:

#Import to tethne (this assumes the tethne package is installed)
import tethne.model.corpus.dtmmodel

dtm = tethne.model.corpus.dtmmodel.from_gerrish('data/' + conference + '/output/',
                                                'data/' + conference + '/metadata.dat',
                                                'data/' + conference + '/vocabulary.dat')

Then, we can generate the exports shown in the tethne documentation. As an example, here
we will print the 10 most common words for each topic (in our case 5 topics) and for each
year (in our case 5 years) with the probability of appearing in a document, so that we can then
plot the topic evolution.

for topic_i in range(5):
    arr = dtm.topic_evolution(topic_i, 10)
    for key in arr[1].keys():
        for year_i in range(5):
            print([conference, topic_i, key, (year_i + 2009), arr[1][key][year_i]])

Using the output from the code above, we can generate some interesting plots with R, like
the one below, which shows the evolution of three topics over the time span in one conference.
We can observe that the conference was very stable in its topics over the five years.
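
If you prefer to prepare such a csv directly from Python, here is a minimal sketch (not part of
the original scripts) that writes the same rows printed above to a file. The file name
OutputDTM-HT4.csv and the column names simply mirror the csv read by the second R snippet
in the next section; rename them as you prefer.

import csv

csv_file = open('OutputDTM-HT4.csv', 'w')
writer = csv.writer(csv_file)
writer.writerow(['Conference', 'TopicID', 'Word', 'Year', 'Probability'])
for topic_i in range(5):
    arr = dtm.topic_evolution(topic_i, 10)  #top 10 words per topic, as above
    for key in arr[1].keys():
        for year_i in range(5):
            writer.writerow([conference, topic_i, key, year_i + 2009, arr[1][key][year_i]])
csv_file.close()
print('csv for R saved')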

4.2 Plotting with R

Having the data in this format, you can plot it in R:

library(ggplot2)
library(gridExtra)
library(directlabels)

> str(dfc)
'data.frame': 240 obs. of 8 variables:
 $ year  : Factor w/ 5 levels "2009","2010",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ confID: Factor w/ 16 levels "CHI","CIKM","ECTEL",..: 1 1 1 2 2 2 3 3 3 4 ...
 $ topic : Factor w/ 3 levels "Topic1","Topic2",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ N     : num 2083 2083 2083 37 37 ...
 $ value : num 0.387 0.294 0.319 0.28 0.391 ...
 $ sd    : num 0.44 0.412 0.42 0.425 0.447 ...
 $ se    : num 0.00964 0.00903 0.0092 0.06985 0.07347 ...
 $ ci    : num 0.0189 0.0177 0.018 0.1417 0.149 ...

pd <- position_dodge(.1)
ggplot(dfc, aes(x=year, y=value, colour=topic, group=topic)) +
  geom_point(position=pd) +
  geom_dl(aes(label=topic), size=2.5, list("first.qp", cex=0.5, rot=30)) +
  geom_dl(aes(label=topic), size=2.5, list("last.qp", cex=0.5, rot=30)) +
  geom_smooth(aes(group=topic, ymin=value-se, ymax=value+se)) +
  facet_wrap(~ confID, ncol=4) +
  theme_bw() +
  theme(legend.position="bottom", panel.grid.minor=element_blank()) +
  theme(axis.title.x=element_blank(), axis.title.y=element_blank(),
        axis.text.x=element_text(angle=30, vjust=0.1, hjust=0.1, size=5),
        strip.background=element_blank(), strip.text=element_text(size=7),
        legend.text=element_text(size=4), legend.title=element_text(size=4),
        panel.margin=unit(0, "null"), plot.margin=rep(unit(0.1, "cm"), 4)) +
  labs(x=NULL)

The result will look like this:

[Figure: line plot of the topic evolution over the five years, faceted by conference; not reproduced here]

Or even like this:

dtm.df <- read.csv(file="OutputDTM-HT4.csv", sep=",");
dtm.df$TopicID = as.factor(dtm.df$TopicID)
dtm.df$Year = as.factor(dtm.df$Year)
library(ggplot2)
library(gridExtra)
# 'data.frame': 125 obs. of 5 variables:
# $ Conference : Factor w/ 1 level "ACMMM": 1 1 1 1 1 1 1 1 1 1 ...
# $ TopicID    : int 0 0 0 0 0 0 0 0 0 0 ...
# $ Word       : Factor w/ 23 levels "ahead","award",..: 4 4 4 4 4 7 7 7 7 7 ...
# $ Year       : int 2009 2010 2011 2012 2013 2009 2010 2011 2012 2013 ...
# $ Probability: num 0 0 0.0142 0 0 ...
# ===
conf.dtm.df <- dtm.df[dtm.df$Conference == "CHI", ];
gchi <- ggplot(data=conf.dtm.df, aes(x=Year, y=Word))
gchi <- gchi + geom_tile(aes(fill = Probability), colour="black", stat = "identity") +
  scale_fill_gradient(low="white", high="blue") +
  ggtitle(paste("", as.character(conf.dtm.df$Conference), " ")) +
  facet_wrap(~ TopicID, scales="free_y", ncol=5) +
  geom_text(data=conf.dtm.df[conf.dtm.df$Year == 2011,], aes(label=Word), size=4, vjust=0.25) +
  theme_bw() +
  theme(panel.grid.major = element_blank(), legend.position="bottom",
        panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 60, vjust = 0.1, hjust=0.1, size=5),
        strip.background=element_blank(), strip.text=element_text(size=7),
        legend.text=element_text(size=4), legend.title=element_text(size=4),
        panel.margin = unit(0,"null"), plot.margin = rep(unit(0.1,"cm"),4),
        legend.margin=unit(-0.6,"cm"), legend.key.height = unit(0.4, "cm")) +
  labs(x=NULL)
#visualize
gchi

...that will look like this:

[Figure: heat map of word probabilities per topic, word and year; not reproduced here]

References
[1] Wen, X., Lin, Y., Trattner, C. and Parra, D.: Twitter in Academic Conferences: Usage,
Networking and Participation over Time. In Proceedings of the ACM 2014 International
Conference on Hypertext and Social Media (Hypertext 2014), ACM, New York, USA, 2014.

[2] Tethne package for Python. http://diging.github.io/tethne/

[3] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of
Machine Learning Research, 3, 993-1022.

[4] Blei, D. M., & Lafferty, J. D. (2006, June). Dynamic topic models. In Proceedings of the
23rd International Conference on Machine Learning (pp. 113-120). ACM.
