Dynamic Topic Modelling Tutorial
Matias Hurtado
Engineering Student, Pontificia Universidad Católica de Chile
[email protected]
Advisor: Denis Parra
Assistant Professor, Pontificia Universidad Católica de Chile
[email protected]
Social Computing and Visualization Group, PUC Chile
http://socialcomputing.ing.puc.cl
Introduction
The following tutorial explains how to use Dynamic Topic Modelling (Blei and Lafferty, 2006), an extension of LDA (Blei et al., 2003). It is based on the implementation by S. Gerrish and D. Blei, available at
https://code.google.com/p/princetonstatisticallearning/downloads/detail?name=dtm_release-0.8.tgz
The procedure described here was used during my work that led to the research article "Twitter in Academic Conferences: Usage, Networking and Participation over Time" (Wen et al., 2014) [1]. There we analyzed tweets posted during academic conferences in the Computer Science domain, and specifically we obtained the topic evolution over five years.
Prerequisite
Following this tutorial may require some basic knowledge of Python, SQL databases and basic command-line instructions.
Required files
You must get Gerrish & Blei's DTM release files. You can download the compiled binary files from
https://github.com/magsilva/dtm
You can download the Python scripts from the dtm_gensim repository on GitHub:
https://github.com/mihurtado/dtm_gensim
If you want the database used in this tutorial, you can ask for it by emailing [email protected].
The Database
Each tweet record in the database has the following fields:
ttID: the tweet id
content: the content of the tweet
conference: the conference name
category_index: the year of the conference
We loaded the data into a MySQL database. You must load the database into a SQL server; we recommend installing MySQL, following this tutorial: http://dev.mysql.com/doc/refman/5.1/en/windows-installation.html. Then load the database using the MySQL command line, with your credentials:
$ mysql -u [uname] -p [pass] [twitter_conferences] < [database.sql]
The Tutorial
1. Getting and transforming the Data
We will use the compiled binary of Blei's DTM, called dtm-win64.exe or dtm-win32.exe, depending on your Windows architecture. In magsilva's GitHub account there are also binaries for Mac OS X. To execute the program, we need two essential files:
prefix-seq.dat: contains the timestamps. The structure of the file must be:
Number_Timestamps
number_docs_time_1
...
number_docs_time_i
...
number_docs_time_NumberTimestamps
In our case (tweets of the conferences), our timestamps are years from 2009 to 2013, so our file must look like:
5
number_docs_2009
number_docs_2010
number_docs_2011
number_docs_2012
number_docs_2013
We are missing the number_docs_year numbers, so we have to calculate them.
prefix-mult.dat: contains the documents themselves, one per line, in the format unique_word_count index1:count1 index2:count2 ... Each line represents a document; in our case, each line is a tweet. The unique_word_count is the total number of unique words in the document, each index is an identifier for a word (mapped through a dictionary), and each count is how many times the word shows up in the document/tweet.
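For example, a tweet whose processed text is reduced to "paper topic model topic" has three unique words; with a hypothetical dictionary that maps paper to index 15, topic to 40 and model to 108, its line in the mult file would look like:
3 15:1 40:2 108:1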
We are also going to construct three more files that will help us interpret the results:
dictionary.dict: a dictionary mapping each word to its index in the prefix-mult.dat file.
vocabulary.dat: stores each word that appears in the tweets.
metadata.dat: stores simple tweet metadata (tweet id, date and content).
We will use Python to collect all the tweets from the SQL database, then we will create a dictionary, and finally we will transform the text of the tweets into a Vector Space Model (VSM) representation.
Assuming that your tweets are in a MySQL database and you want to write them into a file to process them with DTM, go through this part of the tutorial; if not, jump directly to section 3, "Generating the Corpus with gensim".
So first we will get the tweets from the SQL database. We will use simple SQL commands to extract the information we need. We will apply some filters over the content, such as removing hashtags and mentions, removing stop words, stemming and lemmatization, and keeping only nouns and adjectives.
Initial configuration:

#Import some modules for reading and getting data.
#If you don't have these modules, you must install them.
import csv
import MySQLdb
import re
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import os
from gensim import corpora, models, similarities  #to create a dictionary

#Set years, these will be the timestamps
time_stamps = ['2009', '2010', '2011', '2012', '2013']

#Set the conference name to be analyzed
conference = ''

#DB MySQL connection. Put your credentials here.
db_host = 'localhost'                 #Host
db_user = 'user'                      #User
db_pass = 'password'                  #Password
db_database = 'twitter_conferences'   #Database

##Connect...
db = MySQLdb.connect(host=db_host, user=db_user, passwd=db_pass, db=db_database)
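The filtering code in the next step relies on several NLTK resources (the English stop word list, WordNet for lemmatization, and the tokenizer and POS tagger models). If you have never downloaded them, a one-time setup roughly like the following is needed; note that the exact resource names can differ between NLTK versions:

import nltk

nltk.download('stopwords')   #stop word lists used by stopwords.words('english')
nltk.download('wordnet')     #WordNet data used by WordNetLemmatizer
nltk.download('punkt')       #tokenizer models used by nltk.word_tokenize
nltk.download('maxent_treebank_pos_tagger')   #default tagger for nltk.pos_tag in older NLTK releases
                                              #(newer releases use 'averaged_perceptron_tagger')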
Getting the data. Here we will get the data from the database, keep it in memory (the tweets Python list), and save it for future reference in the metadata.dat file.
#Set metadata output file
dat_outfile = open(os.path.join('data', conference, 'metadata.dat'), 'w')
dat_outfile.write('id\tdate\tcontent\n')  #write header

tweets = list()

#Set total_tweets list per year, starting at 0
total_tweets_list = [0 for year in time_stamps]

#Analyze each year...
time_stamps_count = 0
for year in time_stamps:  #For each year
    print('Analyzing year ' + str(year))

    #Set total_tweets to 0
    total_tweets = 0

    #Get tweets with mysql
    cursor = db.cursor()

    #Query
    query = "SELECT ttID, content, category_index FROM con_tweets_filtered WHERE conference = '" \
            + conference + "' and category_index = " + year + " and relevant=1 and lang='en'"

    #Execute query
    cursor.execute(query)
    result = cursor.fetchall()  #store results
    cursor.close()

    #For each result (tweet), get content
    for line in result:
        #Remove @xxxx, #xxxxx and links
        content = [unicode(word.lower(), errors='ignore') for word in line[1].split()
                   if word.find('@') == -1 and word.find('#') == -1 and word.find('http') == -1]

        #join words list to one string
        content = ' '.join(content)

        #remove symbols
        content = re.sub(r'[^\w]', ' ', content)

        #remove stop words, short words and words containing digits
        #(this could also be done in the next step with gensim)
        content = [word for word in content.split()
                   if word not in stopwords.words('english')
                   and len(word) > 3
                   and not any(c.isdigit() for c in word)]

        #Stemming and lemmatization (word by word)
        lmtzr = WordNetLemmatizer()
        content = ' '.join([lmtzr.lemmatize(word) for word in content])

        #Filter only nouns and adjectives
        tokenized = nltk.word_tokenize(content)
        classified = nltk.pos_tag(tokenized)
        content = ' '.join([word for (word, tag) in classified
                            if tag.startswith('NN') or tag.startswith('JJ')])

        tweets.append([line[0], content, line[2]])
        total_tweets += 1

        dat_outfile.write(str(line[0]) + '\t' + str(line[2]) + '\t' + content)
        dat_outfile.write('\n')

    #Add the total tweets to the total tweets per year list
    total_tweets_list[time_stamps_count] += total_tweets
    time_stamps_count += 1

dat_outfile.close()  #Close the metadata file
print('Done collecting tweets')

#Write the seq file (format described above: number of timestamps, then docs per timestamp)
#Note: the file name 'seq.dat' inside the conference folder is an assumption
seq_outfile = open(os.path.join('data', conference, 'seq.dat'), 'w')
seq_outfile.write(str(len(time_stamps)) + '\n')
for count in total_tweets_list:
    seq_outfile.write(str(count) + '\n')
seq_outfile.close()
print('Done writing seq')
We will use the gensim Python package to generate our corpus. You can see the full tutorial and other options on this page: http://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors
Consider that documents is a list containing the tweets, e.g.
["I'm at WWW conference 2013, hope to have a great time!", "This is such an interesting project", "#HT2010", ...]
We will also remove stop words and words that appear only once (you must define a stop words list or use, for instance, the one available in the NLTK library).
stoplist = set('for a of the and to in'.split())

#Construct the dictionary
dictionary = corpora.Dictionary(line[1].lower().split() for line in tweets)

# remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
dictionary.save(os.path.join('data', conference, 'dictionary.dict'))  # store the dictionary

#Save vocabulary
vocFile = open(os.path.join('data', conference, 'vocabulary.dat'), 'w')
for word in dictionary.values():
    vocFile.write(word + '\n')
vocFile.close()
print('Dictionary and vocabulary saved')
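The following steps use the dictionary object that is still in memory. If you run them in a separate session, you can load the saved dictionary back first; a minimal sketch, using the same paths as above:

from gensim import corpora
import os

dictionary = corpora.Dictionary.load(os.path.join('data', conference, 'dictionary.dict'))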
So, our dictionary is finished and ready to be used in our vectorization. We will introduce a class to avoid storing the words of every document in RAM; instead, we will analyze each document's words separately.
class MyCorpus(object):
    def __iter__(self):
        for line in tweets:
            # each element of tweets is [ttID, content, year]; vectorize the content field
            yield dictionary.doc2bow(line[1].lower().split())
And we will create an instance of the class, containing the corpus:

corpus_memory_friendly = MyCorpus()
Now that our corpus is ready and each document will be vectorized when we iterate over it, we can start writing our mult.dat file to use it with DTM. But we have a little problem: the corpus is represented by a list of lists of tuples, like this:
[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], ...]
And we have to write it in the specified format:
unique_word_count index1:count1 index2:count2 ... indexn:countn
Now, we can write the mult.dat file:

multFile = open(os.path.join('data', 'WWW', 'mult.dat'), 'w')
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    multFile.write(str(len(vector)) + ' ')
    for (wordID, weight) in vector:
        multFile.write(str(wordID) + ':' + str(weight) + ' ')
    multFile.write('\n')
multFile.close()
print('mult file saved')
In the example, data/WWW/ is just the name of a folder; you can change it to something else, but don't forget to place the mult.dat file in the folder specified in --corpus_prefix.
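Before the next step, you must actually run the DTM binary over the mult and seq files to produce the output folder read below. As a rough sketch, an invocation following the example script shipped with the DTM release could look like the line below; the binary name, paths and number of topics are placeholders to adapt, and you should check the release's documentation for the exact file-naming convention expected by --corpus_prefix:

$ dtm-win64.exe --mode=fit --ntopics=5 --rng_seed=0 --initialize_lda=true --corpus_prefix=data/WWW/ --outname=data/WWW/output --alpha=0.01 --top_chain_var=0.005 --lda_sequence_min_iter=6 --lda_sequence_max_iter=20 --lda_max_em_iter=10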
To interpret this output data, we will use a Python package for analyzing and visualizing corpus data named Tethne [2]. You might use another package if you want. With this tool, we will be able to load the DTM output and examine the evolution of each topic over time:
#Import to tethne
import tethne.model.corpus.dtmmodel

dtm = tethne.model.corpus.dtmmodel.from_gerrish('data/' + conference + '/output/',
                                                'data/' + conference + '/metadata.dat',
                                                'data/' + conference + '/vocabulary.dat')
Then, we can generate the exports shown in the Tethne documentation. As an example, here we will print the 10 most common words for each topic (in our case 5 topics) and for each year (in our case 5 years), with the probability of appearing in a document, so that we can then plot a topic evolution.
for topic_i in range(5):
    arr = dtm.topic_evolution(topic_i, 10)
    for key in arr[1].keys():
        for year_i in range(5):
            print([conference, topic_i, key, (year_i + 2009), arr[1][key][year_i]])
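To feed these values into R, one convenient option is to write them to a CSV file instead of printing them. A minimal sketch using the csv module imported at the beginning; the file name topic_evolution.csv and the column names are our own choice:

import csv
import os

outpath = os.path.join('data', conference, 'topic_evolution.csv')
with open(outpath, 'wb') as csvfile:   #'wb' because this tutorial uses the Python 2 csv module
    writer = csv.writer(csvfile)
    writer.writerow(['conference', 'topic', 'word', 'year', 'probability'])
    for topic_i in range(5):
        arr = dtm.topic_evolution(topic_i, 10)
        for key in arr[1].keys():
            for year_i in range(5):
                writer.writerow([conference, topic_i, key, year_i + 2009, arr[1][key][year_i]])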
Using the output from the code above, we can generate some interesting plots with R, like the one below, which shows the evolution of three topics over the time span of one conference. We can observe that the conference was very stable in its topics over the five years.
Having the data aggregated into a data frame in this format (here called dfc, with one row per conference, year and topic), you can plot it in R:
library(ggplot2)
library(gridExtra)
library(directlabels)

> str(dfc)
'data.frame': 240 obs. of 8 variables:
 $ year  : Factor w/ 5 levels "2009","2010",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ confID: Factor w/ 16 levels "CHI","CIKM","ECTEL",..: 1 1 1 2 2 2 3 3 3 4 ...
 $ topic : Factor w/ 3 levels "Topic1","Topic2",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ N     : num 2083 2083 2083 37 37 ...
 $ value : num 0.387 0.294 0.319 0.28 0.391 ...
 $ sd    : num 0.44 0.412 0.42 0.425 0.447 ...
 $ se    : num 0.00964 0.00903 0.0092 0.06985 0.07347 ...
 $ ci    : num 0.0189 0.0177 0.018 0.1417 0.149 ...

pd <- position_dodge(.1)
ggplot(dfc, aes(x=year, y=value, colour=topic, group=topic)) +
  geom_point(position=pd) +
  geom_dl(aes(label=topic), size=2.5, list("first.qp", cex=0.5, rot=30)) +
  geom_dl(aes(label=topic), size=2.5, list("last.qp", cex=0.5, rot=30)) +
  geom_smooth(aes(group=topic, ymin=value-se, ymax=value+se)) +
  facet_wrap(~ confID, ncol=4) +
  theme_bw() +
  theme(legend.position="bottom", panel.grid.minor=element_blank()) +
  theme(axis.title.x=element_blank(), axis.title.y=element_blank(),
        axis.text.x=element_text(angle=30, vjust=0.1, hjust=0.1, size=5),
        strip.background=element_blank(), strip.text=element_text(size=7),
        legend.text=element_text(size=4), legend.title=element_text(size=4),
        panel.margin=unit(0, "null"), plot.margin=rep(unit(0.1, "cm"), 4)) +
  labs(x=NULL)
The result will look like this:
[figure: topic evolution over the years, one panel per conference]
Or even this:
# $ Year       : int 2009 2010 2011 2012 2013 2009 2010 2011 2012 2013 ...
# $ Probability: num 0 0 0.0142 0 0 ...
# ===
conf.dtm.df <- dtm.df[dtm.df$Conference == "CHI", ]
gchi <- ggplot(data=conf.dtm.df, aes(x=Year, y=Word))
gchi <- gchi + geom_tile(aes(fill = Probability), colour="black", stat = "identity") +
  scale_fill_gradient(low="white", high="blue") +
  ggtitle(paste("", as.character(conf.dtm.df$Conference), " ")) +
  facet_wrap(~ TopicID, scales="free_y", ncol=5) +
  geom_text(data=conf.dtm.df[conf.dtm.df$Year == 2011,], aes(label=Word), size=4, vjust=0.25) +
  theme_bw() +
  theme(panel.grid.major = element_blank(), legend.position="bottom", panel.grid.minor = element_blank()) +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 60, vjust = 0.1, hjust = 0.1, size = 5),
        strip.background = element_blank(), strip.text = element_text(size = 7),
        legend.text = element_text(size = 4), legend.title = element_text(size = 4),
        panel.margin = unit(0, "null"), plot.margin = rep(unit(0.1, "cm"), 4),
        legend.margin = unit(-0.6, "cm"), legend.key.height = unit(0.4, "cm")) +
  labs(x = NULL)

#visualize
gchi
...that will look like this:
[figure: word-probability heatmap per topic and year]
References
[1] Wen, X., Lin, Y., Trattner, C. and Parra, D.: Twitter in Academic Conferences: Usage, Networking and Participation over Time. In Proceedings of the ACM 2014 International Conference on Hypertext and Social Media (Hypertext 2014), ACM, New York, USA, 2014.
[2] Tethne package for Python: http://diging.github.io/tethne/
[3] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.