To cite this article: Daniel Maier, A Waldherr, P Miltner, G Wiedemann, A Niekler, A Keinert, B
Pfetsch, G Heyer, U Reber, T Häussler, H Schmid-Petri & S Adam (2018): Applying LDA topic
modeling in communication research: Toward a valid and reliable methodology, Communication
Methods and Measures, DOI: 10.1080/19312458.2018.1430754
ABSTRACT
Latent Dirichlet allocation (LDA) topic models are increasingly being used in
communication research. Yet, questions regarding reliability and validity of
the approach have received little attention thus far. In applying LDA to
textual data, researchers need to tackle at least four major challenges that
affect these criteria: (a) appropriate pre-processing of the text collection; (b)
adequate selection of model parameters, including the number of topics to
be generated; (c) evaluation of the model’s reliability; and (d) the process of
validly interpreting the resulting topics. We review the research literature
dealing with these questions and propose a methodology that approaches
these challenges. Our overall goal is to make LDA topic modeling more
accessible to communication researchers and to ensure compliance with
disciplinary standards. Consequently, we develop a brief hands-on user
guide for applying LDA topic modeling. We demonstrate the value of our
approach with empirical data from an ongoing research project.
Introduction
Topic modeling with latent Dirichlet allocation (LDA) is a computational content-analysis technique
that can be used to investigate the “hidden” thematic structure of a given collection of texts. The
data-driven and computational nature of LDA makes it attractive for communication research
because it allows for quickly and efficiently deriving the thematic structure of large amounts of
text documents. It combines an inductive approach with quantitative measurements, making it
particularly suitable for exploratory and descriptive analyses (Elgesem, Steskal, & Diakopoulos,
2015; Koltsova & Shcherbak, 2015).
Consequently, LDA topic models are increasingly being used in communication research.
However, communication scholars have not yet developed good-practice guidance for the many
challenges a user faces when applying LDA topic modeling. Important methodological decisions
must be made that are rarely explained at length in application-focused studies. These decisions
relate to at least four challenging questions: (a) How does one pre-process unstructured text data
appropriately? (b) How does one select algorithm parameters appropriately, e.g., the number of
topics to be generated? (c) How can one evaluate and, if necessary, improve reliability and inter-
pretability of the model solution? (d) How can one validate the resulting topics?
These challenges particularly affect the approach’s reliability and validity, both of which are core
criteria for content analysis in communication research (Neuendorf, 2017), but they have, never-
theless, received little attention thus far. This article’s aim is to provide a thorough review and
discussion of these challenges and to propose methods to ensure the validity and reliability of topic
models. Such scrutiny is necessary to make LDA-based topic modeling more accessible and applic-
able for communication researchers.
This article is organized as follows. First, we briefly introduce the statistical background of LDA.
Second, we review how the aforementioned questions are addressed in studies that have applied LDA
in communication research. Third, drawing on knowledge from these studies and our experiences
from an ongoing research project, we propose a good-practice approach that we apply to an
empirical collection of 186,557 web documents. Our proposal comprises detailed explanations and
novel solutions for the aforementioned questions, including a practical guide for users in commu-
nication research. In the concluding section, we briefly summarize how the core challenges of LDA
topic modeling can be practically addressed by communication scholars in future research.
seen as factors that consist of sets of words, and documents incorporate such factors with different
weights (Lötscher, 1987). Topic models draw on the notion of distributional semantics (Turney & Pantel,
2010) and particularly make use of the so-called bag of words assumption, i.e., the ordering of words
within each document is ignored. To grasp the thematic structure of a document, it is sufficient to
describe its distribution of words (Grimmer & Stewart, 2013).
Although it appears fairly obvious what a topic is at first glance, there exists no clear-cut established
definition of topics in communication research (Günther & Domahidi, 2017, p. 3057). Following
Brown and Yule (1983, p. 73), Günther and Domahidi (2017, p. 3057) conclude that a “topic” can only
vaguely be described as “what is being talked/written about”. In the context of LDA topic modeling, the
concept of a topic likewise remains an intuitive and rather “abstract notion” (Blei et al., 2003, p.
995). However, what topic actually means in theoretical terms remains unclear. The meaning of a topic
in an LDA topic model must be assessed empirically instead (Jacobi, Van Atteveldt, & Welbers, 2015,
p. 91) and defined against the background of substantive theoretical concepts, such as “political issues”
or “frames” (Maier, Waldherr, Miltner, Jähnichen, & Pfetsch, 2017).
(1) We assume that each document, d, in a corpus can be described as a probability distribution
over topics. This distribution, called θd (the topic distribution of document d), is drawn from a
Dirichlet distribution with prior parameter α (which must be chosen by the researcher).
(2) Likewise, each topic can be defined as a probability distribution over the entire corpus
vocabulary, i.e., all the different words that appear in the documents. More technically,
for each topic k, we draw ϕk, a distribution over the V words of the vocabulary from a
Dirichlet distribution with prior parameter β (which must be chosen by the researcher).
(3) Within each document (d = 1, . . ., D) and for every word in that document (i = 1, . . ., Nd), in
which i is the index count for each word in document d and Nd is the total length of d, we
sample:
(a) a topic (zd,i) from the respective topic distribution in the document (θd), and
(b) a word (wd,i) from the respective topic’s word distribution ϕk, in which k is zd,i, the topic
we sampled in the previous step.
The core concept of the model implies a statistical creation of a document as a process of randomly
drawing topics (3a), then randomly drawing words associated with these topics (3b). This process
has a crucial function: It explicates the dependency relationship between the observed variables
(words in documents wd,i) and the unobserved variables (word-topic distribution ϕ and document-
topic distribution θ), thereby paving the way for the application of statistical inference (Griffiths &
Steyvers, 2004).
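For illustration, the generative process described in steps (1)–(3) can be simulated with a minimal Python sketch; the values for K, V, D, α, and β below are hypothetical, small-scale settings and not those used in this study:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical, small-scale settings (not the paper's values)
K, V, D = 3, 8, 5                 # topics, vocabulary size, documents
alpha, beta = 0.5, 0.1            # symmetric Dirichlet priors chosen by the researcher
doc_lengths = [10, 12, 8, 15, 9]

# Step (2): one word distribution phi_k per topic, drawn from Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)         # shape (K, V)

documents = []
for d in range(D):
    # Step (1): topic distribution theta_d for document d, drawn from Dirichlet(alpha)
    theta_d = rng.dirichlet(np.full(K, alpha))        # shape (K,)
    words = []
    for i in range(doc_lengths[d]):
        z_di = rng.choice(K, p=theta_d)               # step (3a): sample a topic
        w_di = rng.choice(V, p=phi[z_di])             # step (3b): sample a word from that topic
        words.append(w_di)
    documents.append(words)

print(documents[0])  # word indices of the first simulated document
```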
Although the inference procedures cannot be addressed here in detail, it is essential to understand
that the statistical theory sketches a joint-probability distribution of the observed and latent variables
altogether (see Blei, 2012, pp. 79–80). From this joint-probability distribution, defined by the
generative process, the conditional probability distribution of the latent variables ϕ and θ can be
estimated (see Blei, 2012, pp. 79–80) using variational inference (Blei et al., 2003) or Gibbs sampling
(see Griffiths & Steyvers, 2004). Therefore, for application on an empirical corpus, the algorithm
makes use of the generative process and inverts the aforementioned steps. LDA starts with a random
initialization, i.e., it randomly assigns term probabilities to topics (i.e., the initial state of ϕ) and topic
probabilities to documents (i.e., the initial state of θ). The algorithm then aims to maximize joint
likelihood of the model by iteratively adapting values of the word-topic distribution matrix ϕ and
document-topic distribution matrix θ.
in stochastic processes. Reliability and validity cannot be taken for granted. In the remainder of this
article, we highlight four challenges with LDA topic modeling and propose guidelines as to how to
deal with them.
(1) Before a topic model can even be estimated for an empirical corpus, the text collection must
be sanitized of undesirable components and further pre-processed. Cleaning and pre-
processing affect the input vocabulary and the documents included in the modeling process.
Until now, little is known about the impact of preprocessing on reliability, interpretability,
and validity of topic models. However, recent studies (e.g., Denny & Spirling, 2017) suggest
that preprocessing strongly affects all these criteria. We provide suggestions on how text
data can be cleaned, which pre-processing steps are reasonable to include, and in which
order these steps should be applied.
(2) Three model parameters must be selected (K, α, and β), which affect the dimensions and a
priori defined distribution of the target variables, ϕ and θ. All three parameters (i.e., K, α,
and β) are of substantial importance for the resulting topic model. Thus, the selection of
appropriate prior parameters and the number of topics is crucial to retrieve models that
adequately reflect the data and can be meaningfully interpreted. Thus far, there is no
statistical standard procedure to guide this selection; thus, this remains one of the most
complicated tasks in the application of LDA topic modeling. Our proposal suggests a two-
step approach: In the first step, the prior parameters are calibrated using the model’s mean intrinsic
coherence, a metric that reflects interpretability (Mimno, Wallach, Talley, Leenders, &
McCallum, 2011), in order to find appropriate candidate models with different numbers of
topics K. In the second step, a qualitative investigation of these
candidates follows, which aims to match the models’ results with the theoretical concept
under study.
(3) The random initialization of the model and the sequence of multiple random processes are
integral parts of LDA. The fact that topical contexts are manifested by combining certain
words throughout multiple documents will guide the inference mechanism to assign similar
topics to documents containing similar word distributions. Inference, itself, is also governed
by stochastic random processes to approach a maximum joint probability of the model
based on the evidence in the data. Due to both random initialization and stochastic
inference, the results from topic models are not entirely deterministic. This calls for
reliability checks that indicate the robustness of the topic solutions. We provide an easy-
to-calculate reliability metric (Niekler & Jähnichen, 2012) and show that random initializa-
tion is a weakness in the LDA architecture. It is clearly inferior to non-random initialization
methods, which, as we demonstrate, can improve the reliability of an LDA topic model.
(4) Most importantly, topics are latent variables composed of word distributions. We agree with
DiMaggio, Nag, and Blei (2013, p. 586), who write “[P]roducing an interpretable solution is
the beginning, not the end, of an analysis.” To draw adequate conclusions, the interpretation
of the latent variables must be substantially validated. We advise researchers to use system-
atically structured combinations of existing metrics and in-depth investigation to boost the
significance of the validation process.
The four challenges are not independent of each other. Having a clean text corpus and finding a
parameter setting that generates interpretable topics are important prerequisites for valid interpreta-
tion. Just as well, reliability of the topic solution is an essential precondition for validity.
Literature review
In this section, we systematically review how communication-related research has responded to these
challenges so far. We performed keyword searches in EBSCO Communication Source and Web of
Science (SSCI).2 The search yielded 61 unique results, which two authors classified as focusing on
communication research or other fields of study. Articles were considered further if they applied the
LDA algorithm and set out to answer a question of communication research, or used mass-
communication data (e.g., newspaper articles, public comments, tweets). Some studies have a
substantive thematic research focus, while many others refer to methodological issues. Of the
latter studies, only those that demonstrate the application of topic modeling with a sample corpus
were included in our review, while general descriptions and discussions of the method were ruled
out (e.g., Griffiths, Steyvers, & Tenenbaum, 2007; Günther & Quandt, 2016).
We completed our retrieval of relevant and recent studies by checking Google Scholar and also
revisiting basic literature on topic modeling (e.g., Blei, 2012; Blei et al., 2003). The final collection of
research articles contained 20 publications in communication research (listed in Appendix A), with
12 studies focusing on the method and only 8 studies dealing with thematic research questions. We
reviewed all 20 studies for solutions regarding their approach to (a) preprocessing, (b) parameter
selection, (c) reliability, and (d) validity.
objective is to find substantive topics, this approach also has been termed a substantive search
(Bonilla & Grimmer, 2013, p. 656). Because the overall goal is to generate a topic solution that can be
validly interpreted, some researchers also draw on further external and internal validation criteria
(discussed below) to choose between different candidate models (Baum, 2012; Evans, 2014).
There are also different metrics used to inform the process of model selection. The most widely
applied is the measure of perplexity (used by, e.g., Ghosh & Guha, 2013; Jacobi et al., 2015). The
perplexity metric is a measure used to determine the statistical goodness of fit of a topic model (Blei
et al., 2003). Generally, it estimates how well a model trained on the larger part of the corpus
predicts a smaller, held-out portion of the documents.
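As an illustration, held-out perplexity can be computed with the gensim library; the sketch below assumes that `texts` is a list of tokenized, pre-processed documents, and uses gensim’s per-word likelihood bound (perplexity = 2 to the power of the negative bound):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

split = int(0.9 * len(corpus))
train, held_out = corpus[:split], corpus[split:]      # 90% for training, 10% held out

lda = LdaModel(train, id2word=dictionary, num_topics=50, passes=10, random_state=1)

bound = lda.log_perplexity(held_out)                  # per-word likelihood bound
perplexity = np.exp2(-bound)                          # lower perplexity = better fit
print(perplexity)
```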
Another strategy is to run a non-parametric topic model, such as a Hierarchical Dirichlet Process
(HDP) topic model (see Teh, Jordan, Beal, & Blei, 2006) in which K does not need to be defined in
advance. Instead, a statistically appropriate number of topics is estimated from the data (Bonilla &
Grimmer, 2013). However, for such a model, other even more abstract parameters must be defined
in advance, so that the problem of the model’s granularity is not solved but merely shifted to yet
another parameter.
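For completeness, such a non-parametric model can be fitted with gensim’s HDP implementation, reusing the corpus and dictionary from the sketch above; this is a brief illustration, not part of our own procedure:

```python
from gensim.models import HdpModel

# K is not fixed in advance; the number of prominent topics is estimated from the data
hdp = HdpModel(corpus, id2word=dictionary, random_state=1)
print(hdp.print_topics(num_topics=10, num_words=5))   # inspect the most prominent topics
```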
The choice of the prior parameters α and β is rarely discussed in current studies. Ghosh and Guha
(2013) apply default values that are set in the R topicmodels package by Grün and Hornik (2011). Biel
and Gatica-Perez (2014) refer to standard values proposed by Blei et al. (2003). Evans (2014) uses an
optimization procedure offered by the MALLET software package (McCallum, 2002) to iteratively
optimize the Dirichlet parameter for each topic at regular intervals.
Jacobi et al., 2015). Some studies also checked whether the temporal patterns of topics corresponded
with events that occurred in the study’s time frame (e.g., Evans, 2014; Newman, Chemudugunta,
Smyth, & Steyvers, 2006).
Summarizing our review, we agree with Koltsova and Koltcov (2013, p. 214) that “the evaluation
of topic models is a new and still underdeveloped area of inquiry.” While in the past few years, a
range of strategies for testing the validity of topic models has been established, a standard metho-
dology for ensuring the reliability of the topics has yet to be developed in communication research.
threshold of .95. For each duplicate, a reference to the first occurrence of that document was stored
to allow for queries, including or excluding duplicates in the resulting set. Altogether, 87,692
documents were marked as being unique.
Generally speaking, we deem rigorous data cleaning to be necessary and suggest that text
documents should be relieved of boilerplate content, such as ads, side bars, and links to related
content. If boilerplate content either is not randomly distributed across all the documents in the
corpus—which would be a naive assumption for most empirical corpora—or the documents are not
cleaned extensively enough, the output of the LDA algorithm could be distorted, and uninterpretable,
messy topics could emerge.
Corpus cleaning is only the first step. Automated content-analysis procedures, such as topic
modeling, need further specific preprocessing of textual data. “Preprocessing text strips out informa-
tion, in addition to reducing complexity, but experience in this literature is that the trade-off is well
worth it” (Hopkins & King, 2010, p. 223). As we pointed out in the literature review, many LDA
studies have reported using a range of seemingly standard pre-processing rules. However, most
studies fail to emphasize that these consecutively applied rules depend on each other, which implies
that their ordering matters (see also Denny & Spirling, 2017). Although a single correct pre-
processing chain cannot be defined, the literature provides reasons for proceeding in a specific order.
Thus, we suggest that after data cleaning, the documents should be divided into units, usually
word units, called tokens. Hence, this step is called tokenization (Manning & Schütze, 2003, p. 124).
After tokenization, all capital letters should be converted to lowercase for the purpose of term
unification. After that, punctuation and special characters (e.g., periods,
commas, exclamation points, ampersands, white-space, etc.) should be deleted. While punctuation
may bear important semantic information for human readers of a text, it is usually regarded as
undesirable and uninformative in automatic text analyses based on the bag-of-words approach (e.g.,
Scott & Matwin, 1999, p. 379). However, following Denny and Spirling (2017, p. 6), some special
characters, such as the hashtag character, might be informative in specific contexts, e.g., modeling a
corpus of tweets, and should be kept in such cases. The next step is to remove stop-words, which are
usually functional words such as prepositions or articles. Their removal is reasonable because they
appear frequently and are “insufficiently specific to represent document content” (Salton, 1991, p.
976). While lowercasing and removal of punctuation and special characters can be done in any order
after tokenization, they should be done before the removal of stop-words to reduce the risk that
stop-word dictionaries may be unable to detect stop-words in the corpus vocabulary. Unification
procedures, such as lemmatization and stemming, should be used only after stop-word removal. As
mentioned above, both techniques are used for the purpose of reducing inflected forms and “some-
times derivationally related forms of a word to a common base form” (Manning, Raghavan, &
Schütze, 2009, p. 32). However, we prefer lemmatization over stemming because stemming “com-
monly collapses derivationally related words, whereas lemmatization commonly only collapses the
different inflectional forms of a lemma” (Manning et al., 2009, p. 32). Thus, interpreting word stems
correctly can be tough, or even impossible. For example, while the word organized is reduced to its
stem, organ, its lemma is organize.
In the very last step, relative pruning should be applied. Due to language-distribution character-
istics, we can expect a vast share of very infrequent words in the vocabulary of a collection. In fact,
roughly half of the terms of the vocabulary occur only once (Zipf’s Law, e.g., Manning & Schütze,
2003, pp. 23–29). Thus, relative pruning is recommended to strip very rare and extremely frequent
word occurrences from the observed data. Moreover, relative pruning reduces the size of the corpus
vocabulary, which will enhance the algorithm’s performance remarkably (Denny & Spirling, 2017)
and will stabilize LDA’s stochastic inference. In our empirical study, relative pruning was applied,
removing all terms that occurred in more than 99% or less than .5% of all documents (Denny &
Spirling, 2017; Grimmer, 2010; Grimmer & Stewart, 2013).
If the unification of inflected words is not applied before relative pruning, chances are high that
semantically similar terms such as genetic and genetically will both remain in the vocabulary. If, by contrast, a user
complies with the suggested ordering, the corpus vocabulary will be reduced, while still maintaining
a great diversity of substantively different words. In our empirical case, we followed the proposed
ordering of the pre-processing steps.
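The suggested ordering can be sketched in Python as follows; `raw_docs` is assumed to hold the already cleaned document strings, NLTK is used for stop words and lemmatization (its resources must be downloaded once), and relative pruning is done with gensim’s Dictionary.filter_extremes:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = text.split()                                       # 1. tokenization
    tokens = [t.lower() for t in tokens]                        # 2. lowercasing
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]      # 3. remove punctuation/special characters
                                                                #    (adapt if, e.g., hashtags should be kept)
    tokens = [t for t in tokens if t and t not in stop_words]   # 4. stop-word removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]          # 5. unification (lemmatization)
    return tokens

texts = [preprocess(doc) for doc in raw_docs]

# 6. relative pruning: drop terms occurring in less than .5% or more than 99% of documents
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=max(1, int(0.005 * len(texts))),
                           no_above=0.99, keep_n=None)
corpus = [dictionary.doc2bow(doc) for doc in texts]
```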
In response to these findings, topic-coherence measures were proposed based on the assumption
that the more frequently top words of a single topic co-occur in documents, the more coherent the
topic. Studies have shown that coherence measured with respect to data that is external (Newman,
Lau, Grieser, & Baldwin, 2010) or internal to the corpus (Mimno et al., 2011) correlates with human
judgment on topic interpretability. The latter is also referred to as intrinsic coherence.5
For both interpretability and reliability, different regularization techniques have been tested. In
this regard, regularization of topic models describes a process that helps mitigate ill-posed mathe-
matical problems and guides them toward a more favorable solution.
test for three different corpora (food-safety-related content from Germany, the U.K., and the U.S.,
which is the focal corpus of the empirical study), with different topic numbers K.6
As a baseline strategy, we tested the standard random initialization of LDA. As a second strategy,
we fixed the random initialization with a specific seed value, but afterward, we reset the random-
number generator. We ran this experiment to test the influence of random sampling during the
inference algorithm, independent of initialization. As a third strategy, we proposed our own
modification of the clustered initialization from Lancichinetti et al. (2015): like the original approach,
we initialized the topics based on term co-occurrence networks. In contrast to the original approach, which was tested on
two highly artificial text collections, we observed that their proposed combination of significance
measure (Poisson) and clustering algorithm (Infomap) does not perform well on real-world data to
identify coherent semantic clusters. Thus, we selected alternatives to achieve a better pre-clustering
of terms. For determining co-occurrence significance, we relied on Dunning’s Log-Likelihood Ratio
Test (LL) (Bordag, 2008). Subsequent semantic community detection is performed by applying the
Partitioning-Around-Medoids (PAM) algorithm (see Kaufman & Rousseeuw, 1990).
Each experiment was repeated n = 10 times. Figure 2 displays the average reliability of the
experiments over the progress of Gibbs sampling iterations. Confidence intervals for reliability are
provided on the basis of n(n − 1)/2 = 45 possible pairs for comparing models i and j. The results
indicate that our cluster-initialization strategy significantly improves the reliability of the inference
for all three corpora and leads to levels of reproducibility above 85% for the German and U.K.
corpus, and above 75% for the U.S. corpus. The seeded initialization also outperforms the random
standard initialization, but does not reach the performance of an initialization by semantic network
clustering. From this result, we conclude that the stability of the inference algorithm itself actually
can be quite high once it starts from the same position. We further conclude that providing semantic
clusters of terms as a starting position leads to even more stable results in the inference process,
making it the preferred strategy for improving reliability.
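A simplified version of such a reliability check can be sketched as follows: the topics of two repeated runs are matched one-to-one (Hungarian algorithm on a Jaccard-similarity matrix over their top words), and the mean similarity of the matched pairs, averaged over all n(n − 1)/2 run pairs, serves as a reproducibility score. This is an approximation in the spirit of Niekler and Jähnichen (2012), not their exact procedure; `models` is assumed to be a list of repeated gensim LdaModel runs:

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linear_sum_assignment

def top_words(lda, topn=20):
    # set of top words for every topic of one model run
    return [set(w for w, _ in lda.show_topic(k, topn=topn)) for k in range(lda.num_topics)]

def match_score(words_a, words_b):
    # Jaccard similarity between all topic pairs of two runs
    sim = np.array([[len(a & b) / len(a | b) for b in words_b] for a in words_a])
    rows, cols = linear_sum_assignment(-sim)      # one-to-one matching, maximizing similarity
    return sim[rows, cols].mean()

def reliability(models, topn=20):
    tops = [top_words(m, topn) for m in models]
    scores = [match_score(a, b) for a, b in combinations(tops, 2)]
    return np.mean(scores), np.std(scores)

# mean_score, sd_score = reliability(models)
```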
Figure 2. Reliability of topic models for three corpora (DE = Germany; UK = United Kingdom; US = United States) according to
different initialization techniques (random = default random initialization; seed = fixed seed initialization; and cluster = semantic
co-occurrence network initialization) and varying number of inference iterations; K = number of topics.
Figure 3. Mean coherence of topic models for three corpora (DE = Germany, UK = United Kingdom, US = United States) according
to different initialization techniques (random = default random initialization; seed = fixed seed initialization; and
cluster = semantic co-occurrence network initialization) and varying number of inference iterations; K = number of topics.
Figure 3 displays the average topic-model coherences of the ten repeated runs of our experiment,
including their confidence intervals. Compared with the reliability check, the results are rather
mixed. Although the cluster initialization usually performs very well, differences between all the
strategies are not very pronounced. The most important finding from this part of the experiment is
that topic coherence is drastically lowered if sampling runs for only one iteration. Although it
guarantees perfect reliability, the results of such an early stopped process cannot be used in a
practical scenario. We conclude that to further improve interpretability, the process also needs to
run for some time until the topic composition stabilizes. We recommend at least 1,000 iterations.
Running only one iteration, as proposed in Lancichinetti et al. (2015), trades interpretability for
reliability and appears to be a bad choice in practical scenarios.
number of parameters included, we fixed the value of β at 1/K, the default value as proposed by the
widely used topic model library gensim (Řehůřek & Sojka, 2010). The prior for the topic-document
matrix α was found to be of greater importance for the quality of the topic model (e.g., Wallach,
Mimno, & McCallum, 2009), which was the reason to fix β and let α vary. The model was run with
1,000 iterations. We calculated six different models (i.e., one per candidate value of α) for each of the
three values of K (resulting in 18 models; see Appendix B) and chose the single best model for each
K regarding the mean intrinsic topic coherence for further investigation. We refer to these three
models as our candidate models.
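This first selection step can be sketched with gensim as follows; `corpus_sample` and `dictionary` denote the pre-processed document sample, and the grid of α values shown here is hypothetical:

```python
from gensim.models import LdaModel, CoherenceModel

candidate_models = {}
for K in (30, 50, 70):
    best = None
    for alpha in (0.01, 0.05, 0.1, 0.5, 1.0, 5.0):          # hypothetical alpha grid
        lda = LdaModel(corpus_sample, id2word=dictionary, num_topics=K,
                       alpha=alpha, eta=1.0 / K,            # beta fixed at 1/K
                       iterations=1000, random_state=1)
        score = CoherenceModel(model=lda, corpus=corpus_sample,
                               dictionary=dictionary, coherence="u_mass").get_coherence()
        if best is None or score > best[0]:                 # keep the most coherent model per K
            best = (score, alpha, lda)
    candidate_models[K] = best
```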
Instead of using the whole corpus for the model creation, in this phase, we took a random sample
of 10,000 non-duplicate documents (out of 87,692 unique documents) to calculate these models.
Whether a document sample is representative “depends on the extent to which it includes the range
of linguistic distributions in the population” (Biber, 1993, p. 243). Thus, for topic-modeling
purposes, a valid sample must catch the variety of word co-occurrence structures in the document
population. Random sampling can be regarded a valid procedure for topic modeling of very large
document collections. Due to the characteristic distribution of language data, we can expect a huge
share of very infrequent words in the vocabulary of a collection. This is also the reason why the
pruning of infrequent vocabulary is a recommended and valid pre-processing step. In other words,
applying relative pruning to the full corpus yields a very similar vocabulary, as would applying
relative pruning to a random sample of 10% of the corpus. In both cases, document content is
reduced to a very similar vocabulary. Thus, it is reasonable to expect that co-occurrence structures of
these terms in a large-enough random sample would be very similar to those in the entire corpus.
Still, the size of the sample must be big enough to draw valid conclusions about which parameter
configurations yielded solid models. Scholars from corpus linguistics (e.g., Hanks, 2012) argue that
sample size is the most important criterion to consider in covering the thematic diversity of the
corpus. As a rule of thumb for domain-specific corpora, we recommend using at least a two-digit
fraction (10% minimum) of the overall corpus size. In our empirical case, we drew a random sample
of 10,000 documents, or 11.4% (10,000/87,692) of the document total. However, it is important to
note that it cannot be guaranteed that this technique will work well for corpora containing
significantly smaller sized and/or more heterogeneous documents. In our view, the validity of this
technique crucially depends on whether the sample size is big enough to capture the heterogeneity of
the corpus vocabulary. In this regard, future research needs to figure out valid guidelines for
sampling strategies and sample sizes.
The 10,000 sampled documents are used only for purposes of model creation and selection.
Inference is conducted for the complete corpus. The separation of model creation and inference
enables us to directly use the model that we created on the basis of the random sample and
successively infer the topic composition of the remaining documents.
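A sketch of this separation, assuming `lda` is the selected model from the sample-based step and `all_texts` holds all pre-processed documents of the full corpus:

```python
import numpy as np

full_corpus = [dictionary.doc2bow(doc) for doc in all_texts]

# infer the topic composition (theta) of every document with the fixed model
theta = np.zeros((len(full_corpus), lda.num_topics))
for d, bow in enumerate(full_corpus):
    for k, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        theta[d, k] = prob                                  # D x K document-topic matrix
```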
A group of four researchers discussed the three best topic models in terms of their mean
coherence metric, one for each value in K. For the collaborative investigation of the three models,
the LDA visualization software LDAvis was used (Sievert & Shirley, 2014). The question that was
guiding the qualitative investigation of the group was: Which topic model most suitably represents the
contentious matters of dispute, i.e., the “issues,” of the food-safety discourse in civil society on the Web?
The discussion and interpretation were based on the models’ ϕ matrices, i.e., word-topic distribu-
tions, and also considered varying orders of the top words using Sievert and Shirley’s (2014) relevance
metric (explained in the next section). The group discussion led to a consensus within the research
group. The model with K = 50 offered the most reasonable topic solution to interpret the theoretical
concept of “political issues,” which was the focus of our research. While setting K = 70 led to too
many topics that could easily be traced back to arguments put forth by single websites, minor events,
or remaining boilerplate, K = 30 obfuscated and blurred issues that would otherwise be treated
separately by the research group. We decided in favor of the model with the parameters K = 50,
α = .5, and β = 1/K = .02. This solution deserved further investigation in validity checks.
Summarizing topics
To summarize the topics, we used several auxiliary metrics to better understand the semantics of the
topics’ word distributions. Specifically, we used the following four metrics.
(1) Rank-1: The Rank-1 metric (see Evans, 2014) counts how many times a topic is the most
prevalent in a document. Thus, the metric can help identify so-called background topics,
which usually contribute much to the whole model, but their word distribution is not very
specific. In the case of a high topic share in the entire collection being accompanied by a low
Rank-1 value, we can make a reasonable guess that a topic occurs in many documents, but
rarely can be found as the dominant topic of a document. The empirical example presented
below contains several background topics, such as economy, politics, and health care, all of
which constitute the setting in which the food-safety debate among civil-society actors takes
place.
(2) Coherence: This metric, developed by Mimno et al. (2011, p. 264), already was used for
model-selection purposes. However, applied to single topics, it also helps guide intuition and
may help identify true topics in which a researcher might not see a coherent concept at first
glance.
(3) Relevance: The word distributions within any topic of the model are based on the word
probabilities conditioned on topics. However, provided that a given word, e.g., food, occurs
frequently in many documents, it is likely to have a high conditional probability in many
topics and thereby occurs frequently as a top-word. In this case, such a word does not
contribute much to the specific semantics of a given topic. Sievert and Shirley (2014, pp.
66–67) developed the so-called relevance metric, which is used to reorder the top words of a
topic by considering their overall corpus frequency. The researchers can decide how much
weight should be ascribed to corpus frequencies of words by manipulating the weighting
parameter λ, which can have values ranging from 0 to 1. For λ = 1, the ordering of the top
words is equal to the ordering of the standard conditional word probabilities. For λ close to
zero, the most specific words of the topic will lead the list of top words. In their case study,
Sievert and Shirley (2014, p. 67) found the best interpretability of topics using a λ-value
close to .6, which we adopted for our own case.
(4) Sources and concentration: In our empirical dataset, we examined the sources of topics by asking
which websites were promoting certain topics and how strongly a topic was concentrated in
the potential sources. To this end, we assessed the average source distribution of topics by
computing the Hirschman-Herfindahl Index (HHI) as a concentration measure. The HHI
ranges from 1/number of sources to 1. An HHI = 1 signifies maximum concentration, i.e.,
the topic is pronounced by only one source. A very low HHI value, conversely, indicates that
a topic can be found in many sources.
For the interpretation of our topics, we summarized the aforementioned metrics on a single
overview sheet, one for each topic in the model (see Appendix C for an example topic).
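Assuming the document-topic matrix `theta` and corpus `full_corpus` from the inference sketch above, plus a parallel list `sources` giving the website of each document, Rank-1, the HHI, and the relevance reordering can be computed along the following lines:

```python
import numpy as np
from collections import Counter

# Rank-1: how often each topic is the most prevalent topic of a document
rank1 = Counter(int(k) for k in theta.argmax(axis=1))

# Source concentration: Hirschman-Herfindahl Index over each topic's source shares
def topic_hhi(theta, sources, k):
    shares = Counter()
    for doc_idx, src in enumerate(sources):
        shares[src] += theta[doc_idx, k]
    total = sum(shares.values())
    return sum((s / total) ** 2 for s in shares.values())

hhi = [topic_hhi(theta, sources, k) for k in range(theta.shape[1])]

# Relevance (Sievert & Shirley, 2014) with lambda = .6: reorder each topic's top words
phi = lda.get_topics()                                  # K x V word-topic matrix
counts = np.zeros(phi.shape[1])
for bow in full_corpus:                                 # empirical corpus word frequencies
    for term_id, freq in bow:
        counts[term_id] += freq
p_w = counts / counts.sum()
lam = 0.6
relevance = lam * np.log(phi) + (1 - lam) * np.log(phi / p_w)
top_terms = relevance.argsort(axis=1)[:, ::-1][:, :10]  # ten most relevant term ids per topic
```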
Exclusion of topics
After summarizing the topics in this manner, two researchers reviewed all the topic sheets indepen-
dently from each other. By relying on both the metrics and their expert knowledge about food safety,
they (independently) judged whether the topics should still be included for further investigation or
not. More specifically, topics whose top-word lists were hard to interpret and which came with low
values in Rank-1 and coherence while showing low prevalence and high concentration were
excluded. If one author had judged that a topic deserved in-depth investigation, the topic was
kept. In the case that both authors came to the conclusion that a topic should be discarded, it was
discarded. In other words, we kept a topic if there was at least one indication that it contained a
meaningful, coherent concept.
Another peculiarity of topic models is boilerplate topics. Although we extensively cleaned the
corpus (see the Building and preprocessing the corpus section), boilerplate content still showed up in
some topics. Boilerplate topics are common phenomena in topic models (Mimno & Blei, 2011). They
have no substantive meaning, but their emergence sharpens other meaningful topics “by segregating
boilerplate terms in a distinct location” (DiMaggio et al., 2013, p. 586). Most often, the identified
boilerplate topics coincide with the most unreliable and least-salient topics (see also Mimno et al.,
2011).
After discussing the results of the separate investigations, we made a consensual decision using the
aforementioned criteria. The authors decided that 13 topics should be removed because they showed
no indication of being either meaningful or coherent. The remaining 37 topics were subject to the
final validation and labeling step.
Table 1. Validated topic model for the online text corpus about food safety in the U.S.
K Label Share % M (SD) HHI M (SD) Top-5 Words
Agriculture
25 GM Food 3.94 (0.90) 0.04 (0.01) food, label, genetically, monsanto, gmo
9 Organic Farming 2.58 (0.37) 0.02 (0.00) organic, food, farm, farmer, agriculture
20 Livestock 2.55 (0.18) 0.03 (0.00) meat, food, animal, beef, milk
10 Antibiotics 2.21 (0.46) 0.10 (0.02) antibiotic, animal, health, drug, human
Consumption and Protection
22 Foodborne Diseases 4.06 (1.34) 0.06 (0.02) food, outbreak, salmonella, illness, report
8 FS Regulation 3.48 (0.40) 0.04 (0.01) food, fda, safety, product, consumer
7 Contaminated Food 2.77 (0.63) 0.04 (0.01) safety, recall, produce, fda, outbreak
29 Food Consumption 2.26 (0.14) 0.03 (0.01) product, company, consumer, store, sell
27 Restaurant Inspection 2.14 (0.98) 0.09 (0.04) food, restaurant, safety, health, inspection
16 Tap Water 1.53 (1.03) 0.22 (0.23) water, food, public, protect, watch
39 BPA-packaging 1.50 (0.83) 0.15 (0.11) chemical, bpa, safe, toxic, health
Science and Technology
6 Health Reports 3.48 (0.25) 0.02 (0.00) health, report, public, risk, datum
19 Chemicals 2.28 (0.28) 0.02 (0.00) study, chemical, level, health, human
37 GM Technology 1.84 (0.12) 0.02 (0.00) research, test, science, article, study
Environment
44 Bees and Pesticides 3.14 (1.90) 0.41 (0.28) bee, pesticide, epa, food, center
43 Environment 1.41 (0.28) 0.05 (0.02) read, fish, salmon, environment, specie
50 Fracking 1.37 (0.30) 0.04 (0.02) energy, gas, oil, water, environmental
31 Climate Change 1.34 (0.22) 0.03 (0.01) climate, change, report, world, warm
Personal Health and Wellbeing
21 (Un)healthy Diet 2.32 (0.44) 0.04 (0.01) food, fat, sugar, diet, health
35 Health and Nutrition 2.31 (0.24) 0.01 (0.01) program, community, work, education, child
38 Recipes 2.26 (0.41) 0.03 (0.01) cook, eat, meat, make, recipe
1 School Food 2.00 (0.52) 0.17 (0.08) food, school, pew, safety, project
12 Dietary Therapy/Prevention 1.42 (0.18) 0.03 (0.01) cancer, disease, woman, blood, child
42 Medical Information 1.29 (0.39) 0.07 (0.08) doctor, medicine, take, day, skin
Background Topics
14 Politics 2.65 (0.28) 0.03 (0.01) bill, state, obama, law, house
11 Economy 2.50 (0.29) 0.02 (0.01) company, market, country, million, u.s.
24 Law and Order 2.20 (0.34) 0.02 (0.00) report, year, police, official, court
2 Infectious Diseases 2.03 (0.62) 0.06 (0.02) health, coli, pet, animal, case
48 Health Care 1.07 (0.46) 0.13 (0.11) drug, health, care, medical, patient
Note. HHI = Hirschman-Herfindahl-Index; GM = genetically modified; BPA = Bisphenol A; FS = food safety; K = index of the topic.
In our view, a comprehensive presentation of a topic model also should encompass some of the
most important measures, such as the salience of a topic and a selection of the top words (see
Table 1). Top-word presentation is important to give readers insight into topics.
Pre-processing
LDA does not just work for “nice” and “easy” data. As our technically challenging case exemplifies,
elaborate data cleaning is necessary, especially for unstructured text collections. Additionally, researchers
should not merely rely on a seemingly standard procedure for successively applied pre-processing steps.
Instead, it is important to consider the specifics of the text corpus, including theoretical implications, as
well as the proper ordering of pre-processing steps. For instance, the removal of some special characters,
such as hashtag-symbols, might be reasonable for the analysis of newspaper article-collections, but not
for tweet collections. Regarding proper ordering, we suggest proceeding in the following order: 1.
tokenization; 2. transforming all characters to lowercase; 3. removing punctuation and special characters;
4. removing stop-words; 5. term unification (lemmatizing or stemming); and 6. relative pruning. We
prefer lemmatizing over stemming, because a word’s lemma is usually easier to interpret than its stem.
Model selection
The proposed model-selection process, too, can be costly and time-consuming, but it will yield more
reliable topic models with enhanced interpretability. We propose three considerations.
First, our approach suggests a two-step procedure for model selection that aims to optimize the
human interpretation of topic models. In our view, interpretability should be the prime criterion in
selecting candidate models. Communication researchers working with content data aim to gain knowl-
edge about content characteristics and the substantive meaning of the text collection. Thus, the success of
LDA applications for both objectives depends on how well the resulting model can be interpreted by
human researchers. Therefore, we suggest first calculating candidate models with varying granularity
levels (i.e., different values for K) and different combinations of prior parameters α and β. Then, choose
one model for each K, in which the parameter configuration yields the best results regarding the intrinsic
coherence metric. The chosen candidate models need to be further investigated in the second step with a
substantive search in coherence-optimized candidate models. The purpose of the substantive search
should be to select one of the candidates that matches the granularity level with the theoretical concept
under study, such as political issues or frames. Substantive searches also may include qualitative
techniques, such as group discussions, to ensure intersubjectivity. Software tools, such as LDAvis
(Sievert & Shirley, 2014), proved to be extremely helpful to accomplish this task.
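In Python, an LDAvis-style visualization of a gensim model can be prepared with the pyLDAvis package; the sketch assumes the candidate model, sample corpus, and dictionary from above, and that the gensim adapter is available as pyLDAvis.gensim_models (the module name in recent releases):

```python
import pyLDAvis
import pyLDAvis.gensim_models

vis = pyLDAvis.gensim_models.prepare(lda, corpus_sample, dictionary)
pyLDAvis.save_html(vis, "ldavis.html")   # open in a browser for interactive exploration
```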
Second, if the size of a corpus is very extensive (e.g., n > 50,000 documents), large-enough samples
(e.g., > 10% of the documents) can be used instead of the whole corpus to calculate the candidate models.
It is clearly an intricate process to test various combinations of parameter settings, but using a
significantly smaller random subset of the corpus turned out to be a viable approach for mastering
this challenge. Using random samples will boost the algorithm’s performance and enable researchers to
test various parameter settings much faster. The separation of model creation and inference enabled us to
directly use the model that we created on the basis of the random sample and successively infer the topic
composition of the remaining documents. However, the validity of the sampling technique crucially
depends on whether the sample size is big enough to capture the heterogeneity of the corpus vocabulary.
Thus, we cannot guarantee that a sample of roughly 10% of the documents will work equally well for
more heterogeneous corpora or for corpora containing significantly smaller documents (e.g., a
corpus of tweets). Future research needs to address the question of valid guidelines regardless of corpus
characteristics.
Third, a well-fitted model with meaningful interpretation is worthless if the results cannot be reproduced.
To tackle this issue, we advanced the regularization technique of Lancichinetti et al. (2015) using a semantic-
network initialization approach. The literature, as well as our experiments, which included multiple corpora,
provided evidence that available regularization techniques, such as ours, significantly enhance the reliability
of topic models. However, because reliability cannot be guaranteed for topic models generally, we believe
that reliability reporting for LDA models should become a disciplinary standard in communication
research. We suggest using the metric proposed by Niekler (2016) for this purpose.
Validation
The sequential validation procedure approximates validity from different angles. The available
metrics, which have different interpretations, are not treated as objective indicators for how well
the model works or how good a topic is. Instead, our approach focuses on inter-individual inter-
pretability using the metrics as a basis. Each step in the process involves deliberation among several
researchers. Two criteria of validity were checked: intra-topic and inter-topic semantic validity
(Quinn et al., 2010). Our case study teaches us that intra-topic semantic validity cannot be derived
merely from a topic’s word distribution. Several easy-to-calculate metrics definitely should be
considered to sharpen the understanding of whether or not a topic refers to a coherent semantic
concept. The most time-consuming, but indispensable, step is the manual check of documents with a
high probability of containing a specific topic. This practice allows us to compare and check whether
the notion that we sketch from the ϕ distribution matches the interpretation of several information-
rich text documents. Labeling topics on the basis of broader context knowledge seems only fair.
Final thoughts
We emphasize that we do not propose a whole new method for topic modeling. Instead, we develop
an approach to dealing with the methodological decisions one has to make for applying LDA topic
modeling reliably and validly in communication research. With the exception of the regularization
technique, which we demonstrated to work significantly better across multiple corpora, we used only a
single corpus as a showcase for our explications. However, we deem our approach generalizable to
other cases because every single component of it is based on substantial existing
studies, a theoretical rationale, or both.
All in all, LDA topic modeling has proven to be a most promising method for communication
research. At the same time, it does not work well with non-deliberate, arbitrary choices in model
selection and validation. Our study proposes methods and measures to approximate and improve
validity and reliability when using LDA. After all, we aim to provide a “good practice” example,
bringing LDA into the spotlight as a method that advances innovation in communication research.
Notes
1. The Dirichlet distribution is a continuous multivariate probability distribution which is frequently used in
Bayesian statistics.
2. EBSCO communication source (search in title OR abstract OR keywords; apply related words): “topic model”,
“topic modeling”, “topic modelling”, “latent Dirichlet allocation”. Web of Science (only communication-related
Acknowledgement
The first author claims single authorship for subsection Topic validity and labeling and section Presentation and
interpretation of the selected topic model including Table 1 and Appendix C.
Funding
This publication was created in the context of the Research Unit “Political Communication in the Online World”
(1381), subproject 7, which was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research
Foundation). The subproject was also funded by the Swiss National Science Foundation (SNF).
References
Baum, D. (2012). Recognising speakers from the topics they talk about. Speech Communication, 54(10), 1132–1142.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Biel, J.-I., & Gatica-Perez, D. (2014). Mining crowdsourced first impressions in online social video. IEEE Transactions
on Multimedia, 16(7), 2062–2074.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. Paper presented at the International Conference on Machine
Learning, Pittsburgh, PA.
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3
(4/5), 993–1022.
Bonilla, T., & Grimmer, J. (2013). Elevated threat levels and decreased expectations: How democracy handles terrorist
threats. Poetics, 41, 650–669.
Bordag, S. (2008). A comparison of co-occurrence and similarity measures as simulations of context. Proceedings of the
9th international conference on computational linguistics and intelligent text processing, 52–63. doi:10.1007/978-3-540-78135-6_5
Brown, G., & Yule, G. (1983). Discourse analysis. Cambridge, UK: Cambridge University Press.
Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., & Blei, D. (2009). Reading tea leaves: How humans interpret topic
models. Paper presented at the Neural Information Processing System 2009.
Cobb, R. W., & Elder, C. D. (1983). Participation in American politics: The dynamics of agenda-building. Baltimore,
MD: Johns Hopkins University Press.
Denny, M. J., & Spirling, A. (2017). Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to
do about it. New York University. Retrieved from http://www.nyu.edu/projects/spirling/documents/preprocessing.pdf
DiMaggio, P., Nag, M., & Blei, D. M. (2013). Exploiting affinities between topic modeling and the sociological
perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics, 41, 570–606.
Elgesem, D., Feinerer, I., & Steskal, L. (2016). Bloggers’ responses to the Snowden affair: Combining automated and
manual methods in the analysis of news blogging. Computer Supported Cooperative Work (CSCW), 25, 167–191.
Elgesem, D., Steskal, L., & Diakopoulos, N. (2015). Structure and content of the discourse on climate change in the
blogosphere: The big picture. Environmental Communication, 9(2), 169–188.
Evans, M. S. (2014). A computational approach to qualitative analysis in large textual datasets. PLoS One, 9(2), 1–10.
Ghosh, D. D., & Guha, R. (2013). What are we ‘tweeting’ about obesity? Mapping tweets with topic modeling and
Geographic Information System. Cartography and Geographic Information Science, 40(2), 90–102.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101
(1), 5228–5235.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114
(2), 211–244.
Grimmer, J. (2010). A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate
press releases. Political Analysis, 18(1), 1–35.
Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for
political texts. Political Analysis, 1–31. doi:10.1093/pan/mps028
Grün, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40
(13), 1–30.
Günther, E., & Domahidi, E. (2017). What communication scholars write about: An analysis of 80 years of research in
high-impact journals. International Journal of Communication, 11, 3051–3071.
Günther, E., & Quandt, T. (2016). Word counts and topic models. Digital Journalism, 4(1), 75–88.
Guo, L., Vargo, C. J., Pan, Z., Ding, W., & Ishwar, P. (2016). Big social data analytics in journalism and mass
communication: Comparing dictionary-based text analysis and unsupervised topic modeling. Journalism & Mass
Communication Quarterly, 1–28. doi:10.1177/1077699016639231
Hanks, P. (2012). The corpus revolution in lexicography. International Journal of Lexicography, 25(4), 398–436.
Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American
Journal of Political Science, 54(1), 229–247.
Jacobi, C., Van Atteveldt, W., & Welbers, K. (2015). Quantitative analysis of large amounts of journalistic texts using
topic modelling. Digital Journalism, 1–18. doi:10.1080/21670811.2015.1093271
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. Hoboken, NJ: Wiley.
Koltcov, S., Nikolenko, S. I., Koltsova, O., Filippov, V., & Bodrunova, S. (2016). Stable topic modeling with local
density regularization. In F. Bagnoli, A. Satsiou, I. Stavrakakis, P. Nesi, G. Pacini, Y. Welp, & D. DiFranzo (Eds.),
Internet science: Third international conference, INSCI 2016, Florence, Italy, September 12–14, 2016, Proceedings (pp.
176–188). Cham, Switzerland: Springer International Publishing.
Koltsova, O., & Koltcov, S. (2013). Mapping the public agenda with topic modeling: The case of the Russian
LiveJournal. Policy & Internet, 5(2), 207–227.
Koltsova, O., & Shcherbak, A. (2015). ‘LiveJournal Libra!’: The political blogosphere and voting preferences in Russia
in 2011–2012. New Media & Society, 17(10), 1715–1732.
Lancichinetti, A., Sirer, M. I., Wang, J. X., Acuna, D., Körding, K., & Amaral, L. A. N. (2015). High-reproducibility and
high-accuracy method for automated topic classification. Physical Review X, 5(1). doi:10.1103/PhysRevX.5.011007
Lenci, A. (2008). Distributional semantics in linguistic and cognitive research. Rivista Di Linguistica, 20(1), 1–31.
Levy, K. E. C., & Franklin, M. (2014). Driving regulation: Using topic models to examine political contention in the U.
S. trucking industry. Social Science Computer Review, 32(2), 182–194.
Lötscher, A. (1987). Text und Thema. Studien zur thematischen Konstituenz von Texten. [Text and topic. Studies
concerning thematical constituency of texts]. Berlin, Germany: De Gruyter.
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11
(1–2), 22–31.
Maier, D., Waldherr, A., Miltner, P., Jähnichen, P., & Pfetsch, B. (2017). Exploring issues in a networked public sphere:
Combining hyperlink network analysis and topic modeling. Social Science Computer Review, Advance online
publication. doi:10.1177/0894439317690337
Manning, C. D., Raghavan, P., & Schütze, H. (2009). An introduction to information retrieval. Cambridge, UK:
Cambridge University Press.
Manning, C. D., & Schütze, H. (2003). Foundations of statistical natural language processing (6. print with corr.).
Cambridge, MA: MIT Press.
Marshall, E. A. (2013). Defining population problems: Using topic models for cross-national comparison of dis-
ciplinary development. Poetics, 41(6), 701–724.
McCallum, A. K. (2002). MALLET: A machine learning for language toolkit. Retrieved from http://mallet.cs.umass.edu
Miller, M. M., & Riechert, B. P. (2001). The spiral of opportunity and frame resonance: Mapping the issue cycle in
news and public discourse. In S. D. Reese, O. H. Gandy Jr., & A. E. Grant (Eds.), Framing public life: Perspectives on
media and our understanding of the social world (pp. 107–121). Mahwah, NJ: Lawrence Erlbaum Associates.
Mimno, D., & Blei, D. M. (2011). Bayesian checking for topic models. Proceedings of the 2011 Conference on Empirical
Methods in Natural Language Processing, 227–237.
Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic
models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 262–272.
Neuendorf, K. A. (2017). The content analysis guidebook (2nd ed.). Los Angeles, CA: Sage.
Newman, D., Bonilla, E. V., & Buntine, W. (2011). Improving topic coherence with regularized topic models.
Proceedings of the 24th International Conference on Neural Information Processing Systems, 496–504. Retrieved
from http://dl.acm.org/citation.cfm?id=2986459.2986515
Newman, D., Chemudugunta, C., Smyth, P., & Steyvers, M. (2006). Analyzing entities and topics in news articles using
statistical topic models. In S. Mehrotra, D. D. Zeng, H. Chen, B. Thuraisingham, & F.-Y. Wang (Eds.), Intelligence
and security informatics (Vol. 3975, pp. 93–104). Berlin, Germany: Springer.
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. Proceedings of the
2010 Annual Conference of the North American Chapter of the ACL, 100–108.
Niekler, A. (2016). Automatisierte Verfahren für die Themenanalyse nachrichtenorientierter Textquellen. [Automated
approaches for the analysis of topics in news sources]. (PhD dissertation). University of Leipzig, Leipzig, Germany.
Retrieved from http://www.qucosa.de/fileadmin/data/qucosa/documents/19509/main.pdf
Niekler, A., & Jähnichen, P. (2012). Matching results of latent dirichlet allocation for text. Proceedings of the 11th
International Conference on Cognitive Modeling (ICCM), 317–322.
Parra, D., Trattner, C., Gómez, D., Hurtado, M., Wen, X. D., & Lin, Y.-R. (2016). Twitter in academic events: A study
of temporal usage, communication, sentimental and topical patterns in 16 computer science conferences. Computer
Communications, 73, 301–314.
Puschmann, C., & Scheffler, T. (2016) Topic modeling for media and communication research: A short primer. HIIG
Discussion Paper Series (No. 2016-05): Alexander von Humboldt Institut für Internet und Gesellschaft.
Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., & Radev, D. R. (2010). How to analyze political attention
with minimal assumptions and costs. American Journal of Political Science, 54(1), 209–228.
Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. New York, NY: Cambridge University Press.
Rauchfleisch, A. (2017). The public sphere as an essentially contested concept: A co-citation analysis of the last 20
years of public sphere research. Communication and the Public, 2(1), 3–18.
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC
2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: ELRA.
Roberts, M. E., Stewart, B. M., & Tingley, D. (2016). Navigating the local modes of big data: The case of topic models.
In R. M. Alvarez (Ed.), Analytical methods for social research. Computational social science: Discovery and prediction
(pp. 51–97). New York, NY: Cambridge University Press. doi:10.1017/CBO9781316257340.004
Salton, G. (1991). Developments in automatic text retrieval. Science, 253(5023), 974–980.
Scott, S., & Matwin, S. (1999). Feature engineering for text classification. Proceedings of ICML-99, 379–388. Bled,
Slovenia.
Sievert, C., & Shirley, K. E. (2014). LDAvis: A method for visualizing and interpreting topics. Paper presented at the
Workshop on Interactive Language Learning, Visualization, and Interfaces, Baltimore, MD.
Sokolov, E., & Bogolubsky, L. (2015). Topic models regularization and initialization for regression problems.
Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications, 21–27. doi:10.1145/
2809936.2809940
Steyvers, M., & Griffiths, T. L. (2007). Probabilistic approaches to semantic representation. In T. K. Landauer,
D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 424–440). Mahwah, NJ:
Lawrence Erlbaum.
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American
Statistical Association, 101(476), 1566–1581.
Tsur, O., Calacci, D., & Lazer, D. (2015). A frame of mind: Using statistical models for detection of framing and agenda
setting campaigns. Paper presented at the 53rd Annual Meeting of the Association for Computational Linguistics
and the 7th International Joint Conference on Natural Language Processing, Beijing, China.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial
Intelligence Research, 37, 141–188.
Van Atteveldt, W., Welbers, K., Jacobi, C., & Vliegenthart, R. (2014). LDA models topics… But what are ‘topics’?
Retrieved from http://vanatteveldt.com/wp-content/uploads/2014_vanatteveldt_glasgowbigdata_topics.pdf
Waldherr, A., Maier, D., Miltner, P., & Günther, E. (2017). Big data, big noise: The challenge of finding issue networks
on the web. Social Science Computer Review, 35(4), 427–443.
Wallach, H. M. (2006). Topic modeling: Beyond bag-of-words. Paper presented at the 23rd International Conference on
Machine Learning, Pittsburgh, PA.
Wallach, H. M., Mimno, D., & McCallum, A. (2009). Rethinking LDA: Why priors matter. In Y. Bengio, D.
Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing
systems 22 (pp. 1973–1981). New York, NY: Curran Associates.
Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., & Li, X. (2011). Comparing Twitter and traditional media
using topic models. Paper presented at the 33rd European Conference on IR Research, Dublin, Ireland.
Appendix A
Systematic review of studies in communication research that use LDA topic modeling
Each study is summarized along five dimensions: type of data, preprocessing, parameter selection, interpretability & validity, and reliability.

Studies with methodological focus

Baum (2012). Data: political speeches. Preprocessing: stemming; removing stop words; no specific sequence. Parameter selection: K (chosen after validation). Interpretability & validity: review top words; review top documents; manual labeling; external validation. Reliability: __.

Biel and Gatica-Perez (2014). Data: YouTube videos and comments. Preprocessing: removing punctuation and repeated letters; stemming; no specific sequence. Parameter selection: K (qualitative exploration); prior parameters (standard values). Interpretability & validity: review top words; manual labeling; validation of topics via word-intrusion and topic-intrusion tasks. Reliability: split-sample test.

DiMaggio et al. (2013). Data: newspaper articles. Preprocessing: removing stop words; no specific sequence. Parameter selection: K (qualitative exploration). Interpretability & validity: review top words; review top documents; categorizing topics; statistical validation with the mutual information (MI) criterion; internal validation via hand coding of sample texts; external validation of topics with news events. Reliability: replication with variations of corpus, seeds, and parameters.

Evans (2014). Data: newspaper articles. Preprocessing: __. Parameter selection: K (chosen after validation); prior parameters (optimization). Interpretability & validity: review top words; manual labeling; quantitative metrics (topic coherence, etc.); external validation through qualitative domain knowledge. Reliability: __.

Ghosh and Guha (2013). Data: tweets. Preprocessing: 1. removing URLs and HTML entities; 2. removing punctuation and conversion to lowercase; 3. removing stop words; 4. stemming; 5. tokenization. Parameter selection: K (quantitative metrics: perplexity); prior parameters (standard values). Interpretability & validity: review top words; manual labeling; external validation with political events. Reliability: __.

Guo et al. (2016). Data: tweets. Preprocessing: stemming; removing punctuation, stop words, etc.; no specific sequence. Parameter selection: K (trial and error). Interpretability & validity: review top words; manual labeling; comparison with manual coding. Reliability: __.

Jacobi et al. (2015). Data: news articles. Preprocessing: 1. lemmatizing; 2. part-of-speech tagging; removing frequent and infrequent words; removing terms with numbers/non-alphanumeric letters. Parameter selection: K (qualitative exploration and quantitative metrics: perplexity). Interpretability & validity: review top words; review top documents; review of co-occurrence of top words (topic coherence); manual labeling; comparison with manual coding. Reliability: __.

Newman et al. (2006). Data: news articles. Preprocessing: 1. tokenization; removing stop words; 2. removing infrequent terms. Parameter selection: K (no explanation). Interpretability & validity: review top words and entities; manual labeling; external validation of topics with news events. Reliability: __.

Puschmann and Scheffler (2016). Data: newspaper articles. Preprocessing: 1. removing numbers and punctuation, conversion to lowercase; 2. removing stop words; 3. removing infrequent terms. Parameter selection: K (quantitative metrics: perplexity and Euclidean distance). Interpretability & validity: review top words; quantitative metrics (Euclidean distance); manual evaluation; inter-topic semantic validation. Reliability: __.

Tsur, Calacci, and Lazer (2015). Data: press releases and statements. Preprocessing: __. Parameter selection: K (qualitative exploration). Interpretability & validity: review top words; manual labeling; external validation by domain experts. Reliability: __.

Van Atteveldt et al. (2014). Data: news articles. Preprocessing: lemmatizing; removing frequent and infrequent words; no specific sequence. Parameter selection: K (high resolution). Interpretability & validity: review top words; quantitative metrics (topic prevalence); comparison with manual coding. Reliability: replication with different parameters.

Zhao et al. (2011). Data: tweets and newspaper articles. Preprocessing: 1. removing stop words; 2. removing frequent and infrequent words; 3. removing tweets with fewer than three words and users with fewer than eight tweets. Parameter selection: K (qualitative exploration). Interpretability & validity: review top words; semi-automated topic categorization; manual labeling; manual judgement of interpretability. Reliability: __.

Studies with thematic research focus

Bonilla and Grimmer (2013). Data: newspaper articles and transcripts of newscasts. Preprocessing: stemming; removing punctuation and stop words; no specific sequence. Parameter selection: K (application of a non-parametric topic model; qualitative exploration). Interpretability & validity: review documents (random sample); manual labeling; automated labeling (using mutual information). Reliability: replication with varying number of topics.

Elgesem et al. (2016). Data: blog posts. Preprocessing: __. Parameter selection: K (qualitative exploration). Interpretability & validity: review top words; review top documents; manual labeling. Reliability: __.

Elgesem et al. (2015). Data: blog posts. Preprocessing: __. Parameter selection: K (qualitative exploration). Interpretability & validity: review top words; review documents; manual labeling; quantitative metrics (mutual information, etc.). Reliability: __.

Koltsova and Koltcov (2013). Data: blog posts. Preprocessing: removing HTML tags, punctuation, etc.; lemmatization; no specific sequence. Parameter selection: K (quantitative metrics: perplexity). Interpretability & validity: review top words; review top documents; manual labeling. Reliability: __.

Koltsova and Shcherbak (2015). Data: blog posts. Preprocessing: __. Parameter selection: K (no explanation). Interpretability & validity: review documents; manual labeling and evaluation. Reliability: __.

Levy and Franklin (2014). Data: public comments. Preprocessing: 1. stemming; 2. removing stop words; 3. removing terms with only single letters or numbers; 4. removing infrequent words. Parameter selection: K (qualitative exploration). Interpretability & validity: review top words; external validation with expert evaluation. Reliability: replication with variations of corpus, seeds, and parameters.

Parra et al. (2016). Data: tweets. Preprocessing: language filtering; removing stop words, special characters, URLs, and words with fewer than three characters; no specific sequence. Parameter selection: K (qualitative exploration). Interpretability & validity: __. Reliability: __.

Rauchfleisch (2017). Data: research articles. Preprocessing: removing stop words; removing numbers, replacing hyphens with space characters, conversion to lowercase; stemming; no specific sequence. Parameter selection: K (no explanation); parameters set according to Steyvers and Griffiths (2007). Interpretability & validity: review top words; manual classification; external validation. Reliability: __.

Note. K = number of topics. Pre-processing steps are numbered only where their ordering was explicitly mentioned in the source; "__" indicates that no information was given.
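To make the pre-processing steps catalogued above more concrete, the following minimal sketch chains the operations most frequently reported in the table: tokenization, lowercasing, removal of punctuation and numbers, stop-word removal, stemming, and pruning of very frequent and very infrequent terms. It is an illustrative example only, not the pipeline of any single study; the library choices (NLTK and gensim; Řehůřek & Sojka, 2010) and the pruning thresholds (no_below, no_above) are assumptions made for the sake of the example.

import re
from nltk.corpus import stopwords              # requires nltk.download("stopwords") once
from nltk.stem.snowball import SnowballStemmer
from gensim.corpora import Dictionary

def preprocess(documents, no_below=5, no_above=0.5):
    """Turn raw documents into a pruned dictionary and a bag-of-words corpus (illustrative defaults)."""
    stops = set(stopwords.words("english"))
    stemmer = SnowballStemmer("english")
    tokenized = []
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())                   # tokenize, lowercase, drop numbers/punctuation
        tokens = [stemmer.stem(t) for t in tokens if t not in stops]  # remove stop words, then stem
        tokenized.append(tokens)
    dictionary = Dictionary(tokenized)
    dictionary.filter_extremes(no_below=no_below, no_above=no_above)  # prune very rare and very frequent terms
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]     # bag-of-words representation per document
    return dictionary, corpus, tokenized

Which of these steps are appropriate, and in which order, depends on the research question and language of the corpus, as the variation across the studies in the table illustrates.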
Appendix B
Choice of candidate models from topic models with varying parameter sets
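As a hedged illustration of how such a comparison of candidate models across varying parameter sets might be implemented, the sketch below fits one LDA model per value of K with gensim (Řehůřek & Sojka, 2010) and records a coherence score for each. The grid of K values, the number of passes, and the use of the u_mass coherence measure are assumptions for illustration and do not reproduce the article's own selection procedure; `dictionary` and `corpus` are assumed to come from a pre-processing step such as the sketch in Appendix A.

from gensim.models import LdaModel, CoherenceModel

def candidate_models(corpus, dictionary, k_values=(20, 40, 60, 80, 100), seed=1):
    """Fit one LDA model per K and report a coherence score to help shortlist candidate models."""
    results = []
    for k in k_values:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       passes=10, random_state=seed)                      # one candidate model per K
        coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                                   coherence="u_mass").get_coherence()    # corpus-based coherence score
        results.append((k, coherence, lda))
    return results  # inspect scores and top words before choosing candidate models

Quantitative scores of this kind can narrow the field, but the final choice among candidate models still requires the qualitative inspection of top words and documents described in the main text.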
Appendix C
Note. The figure depicts a divided table and two time-series plots. The left side of the table shows the sources in which the topic was, on average, most prevalent, while the right side lists the top words according to two different relevance values (λ = 1 and λ = .6). Below the table, the topic's ranks on the Rank-1 and coherence metrics are given. The left time series shows the salience of the topic over time, while the right plot indicates how concentrated the topic was over the period of investigation.
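The relevance values λ referred to in the note correspond, to our reading, to the term-relevance measure proposed by Sievert and Shirley (2014) for LDAvis. As a sketch, using their notation, the relevance of term w for topic k is

relevance(w | k, λ) = λ · log(φ_kw) + (1 − λ) · log(φ_kw / p_w),

where φ_kw is the probability of term w in topic k and p_w is the marginal probability of term w in the corpus. Setting λ = 1 ranks terms purely by their probability within the topic, whereas λ < 1 (here, λ = .6) gives more weight to terms that are distinctive for the topic relative to the corpus as a whole.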