Comparative Discourse Analysis Using Topic Models
SMSociety ’20, July 22–24, 2020, Toronto, ON, Canada Zachary Stine and Nitin Agarwal
Third, r/China and r/Sino produce their respective constructions of China with documented awareness of the opposition between them. Notably, the discourse of r/Sino can be read as a reaction to the discourse of r/China. One of the earliest submissions in r/Sino serves as a welcome and explanation of the subreddit by one of the subreddit’s moderators. In the submission text, the author explains that other China-related subreddits exist, but that they are “hateful” and “spread misinformation.” The submission text itself does not explicitly mention r/China, but r/China is explicitly referenced 18 times within the submission’s comments. References to each other occur throughout both subreddits, further validating the view that the two discourses are in competition with each other. r/China is the older subreddit, created in 2008 compared to 2015 for r/Sino. It also has wider exposure on Reddit with almost four times as many subscribers as r/Sino (at the time of this writing). Therefore, we can view the construction of China provided by r/Sino as an intended corrective to the alleged inaccuracies of the representation provided by the more dominant r/China.
Our goal in this study is to not only characterize the two competing discourses, but to characterize what makes each stand out most from the other. We accomplish this by first identifying the discursive features underlying the discussions of the two communities by training topic models on the discussion text, which represent each discussion thread as a mixture of latent word-usage patterns, or topics [3]. Second, we map each discussion thread to a categorical feature representation that includes individual topics or combinations of topics. To our knowledge, the process for mapping documents to categorical combinations of topics is a novel methodological contribution of this study. We then calculate how frequent these features are within each community to obtain community-specific feature distributions. Using an information theoretic quantity, we then calculate how conspicuous each feature is in one community when juxtaposed with the other. Finally, we provide qualitative interpretations of the discursive features that emerge as salient and what discursive frames and strategies they indicate.
Our theoretical framework for interpreting results is informed by the computational approach we take. Our use of topic models to learn the primary discursive features that we analyze constitutes a “distant reading” of the discourses that trades fine-grained, nuanced interpretations for access to large-scale patterns that would not be otherwise observable. In other words, we are interested in the broad tendencies of the discourses, which necessarily obscures more specific aspects of the discourses that might also be illuminating. While we do conduct manual readings of documents from each subreddit, we do so in a way that is guided and constrained by the topic models we obtain. The theoretical framework within which we compare the two discourses is motivated by our desire to make the comparisons objective (at least, as much as possible). Therefore, rather than qualitatively compare quantitative features, we rely on information theory to quantitatively interrogate the relationships between the two discourses and to quantify how salient each feature is for distinguishing between each discourse. This approach assumes that what is most interesting about these competing discourses is what most juxtaposes them.
We apply this methodology within two cases. The first consists of a general comparison between all discussions from the two subreddits over a roughly four-year period during which both subreddits were active. In the second case, we focus specifically on submissions from the two subreddits that are from 2019 and discuss Hong Kong. While the first case provides an overarching comparison of how the two communities conceive of China, the second allows us to see how these two ways of understanding China lead to different ways of understanding the protests occurring in Hong Kong throughout much of 2019. In analyzing discussions about the protests, the two discourses are brought into sharp contrast, providing a glimpse of how real-world events are constructed and interpreted differently by r/China and r/Sino.
We find that the most frequently observed features are similar across the two communities, but that r/Sino can be best distinguished from r/China by negative discourse around other countries (most often the United States), while more general stylistic features and discussions about living and working in China most differentiate r/China. In the case of the 2019 Hong Kong protests, we find that discussions about the politics underlying the protests are salient in r/China, while discussions about violence in the protests alongside criticism of the United States are salient in r/Sino. We also find that our method for constructing community-level feature distributions with combinations of topics allows us to look beyond isolated topics and better appreciate important interactions between multiple topics within discussions.
2 BACKGROUND
In order to better understand our analyses, we provide some background and review selections of relevant work covering Reddit scholarship, topic modeling, and the use of information theoretic measures for comparative purposes.
2.1 Reddit
Data collected from Reddit have been analyzed within a variety of research contexts. Network approaches have been used to characterize common user roles in subreddits [5] and user loyalty to subreddits [14]. Prior work on Reddit has also focused on specific behaviors of users including the usage of hate speech [6] and norm violations [7].
Reddit is an especially useful source for data given that each community has a well-defined focus. For example, the subreddit r/ChangeMyView has been analyzed in order to better understand persuasion [25] and to characterize user susceptibility [17]. Additionally, birth narratives from the subreddit r/BabyBumps have been computationally analyzed to better understand the experiences of people who have given birth [1].
A contribution of the present study to Reddit scholarship is the comparative approach taken to understand the relationship between the language used in two subreddits.
2.2 Latent Dirichlet allocation
In order to identify the word-usage patterns underlying discussions from r/China and r/Sino, we train probabilistic topic models via latent Dirichlet allocation (LDA) [3]. LDA results in two kinds of distributions being inferred: some number of distributions over the vocabulary present in the corpus and a mixture of those distributions that best represent each document. The distributions over vocabulary are referred to as topics, though this is not always the
Comparative Discourse Analysis Using Topic Models: Contrasting Perspectives on China from Reddit SMSociety ’20, July 22–24, 2020, Toronto, ON, Canada
most appropriate way to think about what they represent. Topics comprise a pattern of word-usage and may correspond to certain rhetorical styles as well as actual topics. Additionally, topics learned through LDA arguably reflect certain concepts from the sociology of culture, including framing, polysemy, heteroglossia, and a relational approach to meaning [9].
LDA requires the number of learned topics, k, to be specified. Because it is unlikely that a uniquely “true” number of topics exists underlying any non-trivial corpus, the selection of k is often based on more pragmatic grounds, primarily how useful the resulting topics are for the researchers making sense of the corpus. While several quantitative methods exist for evaluating topic models [8, 18, 29], qualitative evaluation is necessary [23]. Different selections of k may result in slightly different though potentially equally plausible sets of topics with k influencing the specificity of the topics [20].
Topic models have been used in a variety of contexts including comparative philosophy [21], literary scholarship [11], cultural evolution [2], and in comparing Twitter data from different sources [19]. While Nichols et al. [21] and Morstatter et al. [19] both use topic models within a comparative context, their approaches differ from ours in key ways. In the case of Morstatter et al. [19], two document collections are compared by training a separate topic model for each collection and calculating the similarity between matched topics from each model. The two models reflect two distinct feature spaces. Their approach discovers whether a feature from one model has an analogous feature in the other model and, if so, how similar the two features are. Thus, their interest is in finding the extent to which two separate data sets produce similar features as a proxy for understanding how well a sample of data represents a much larger data set. While this is an appropriate method for the questions being asked in that study, having two distinct sets of features for the two subreddits we are comparing would complicate our ability to calculate how distinguishing a feature is in either of the subreddits. In other words, we are not interested in comparing the overall similarity of two distinct feature spaces, but rather the characteristics in a common set of features that are strong signals of one subreddit relative to the other.
The comparative approach taken by Nichols et al. [21] uses a single topic model to compare documents within a shared feature space, which the present study more closely mirrors. However, in that study, the authors compare three philosophical works by treating the ten highest probability topics within each of the three texts as sets. The texts are then compared based on the topics within the intersections of each set. By comparing the sets of each text’s top ten topics, useful information from the probabilities of these topics within a text is mostly discarded (outside of determining which topics should be included in a text’s topic set). Information theory provides a set of tools for making more rigorous comparisons between probability distributions, which we use as the basis of our quantitative comparisons.
Prior work exists which attempts to use LDA within the context of discourse analysis. LDA was combined with the theoretical framework of critical discourse analysis in order to examine how Muslims and Islam are discursively constructed within Swedish social media [27] as well as the discursive relationship between Islamophobia and anti-feminism [28]. In both studies, the topics produced by LDA were found to be useful in representing the discursive context under analysis. Brooks and McEnery [4] provide a less favorable view of LDA within discourse analysis, criticizing it on the grounds that it lacks linguistic theoretical grounding, and that the topics produced by LDA from their data were difficult to interpret from lists of high-probability words and often lacked thematic coherence across documents. In the present study, we did not find similar problems with our topic models after a combined analysis of the high-probability words for each topic along with manual readings of exemplar documents for each topic.
2.3 Information theory and measures of divergence
Because the features of interest from topic models are in the form of probability distributions, they lend themselves to the use of information theoretic measures for rigorously interrogating relationships between objects within the inferred topic space. In this study, we are specifically interested in the relationship between distributions of topics among two collections of documents, each representing a Reddit community. We follow the usage of the partial Kullback-Leibler divergence by Klingenstein et al. [16] who use the measure to identify features that were most salient for distinguishing between violent and non-violent trials in England over time. Here, we are interested in how well a given feature acts as a signal of one community over the other.
Other relevant uses of divergence measures include comparisons of hashtag usage between protestors and counterprotestors on Twitter [10] and comparisons of proceedings from natural language processing conferences [13]. Notably, Hall et al. [13] also use LDA to represent the documents being compared, but the method used for creating collection-level topic distributions amounts to calculating an average topic distribution to represent each collection. While this is a reasonable approach, it results in the loss of document-level topic interactions, which we preserve in this study.
3 METHODS
In the following sections we describe the data collected and the methods used to analyze them. After data collection, we trained topic models on the combined data from the two subreddits. We then constructed feature representations for the two document collections in two ways: first, by counting the dominant topic for each document within a collection and second, by counting combinations of topics for each document within a collection, using a threshold value to determine which topics to combine. Information theoretic measures of divergence were then used to identify the most distinguishing topics or combinations of topics between the two communities’ collection-level topic distributions. While these methods are used in the context of Reddit discussions, they are likely to be useful in any context in which collections of text are compared and where the size of these data sets is too large to feasibly make sense of them through manual reading alone.
3.1 Data
For each community, we collected all submission identifiers from the community’s date of creation up to December of 2019 using
Figure 1: Monthly submission frequency of r/China and r/Sino from June 2015 through November 2019. Month labels are
formatted as YYYY-MM.
the service PushShift.io. We then used Reddit’s application pro- require a Reddit account to access. While these data are considered
gramming interface (API) to collect the text of each submission public, we avoid linking to specific submissions, direct quoting,
along with all comments from the submission’s discussion thread. and mentioning any user names in order to avoid bringing any
Submissions are available for r/China going back to January of 2008, unwanted attention to the individuals whose comments we analyze.
while submissions are only available for r/Sino as early as June 2015,
with only five available for r/Sino in its first two months of existence
3.2 Collection-level topic distributions
(Figure 1). Given the more recent creation of r/Sino, submissions
considered when training topic models for either community were In order to get a range of topic specificity, we trained LDA models
posted no earlier than August 2015. We consider a document to with 30, 90, and 150 topics, which we refer to throughout the paper
be the text of a submission and the comments from its discussion as models A, B, and C, respectively. With these models, each docu-
thread. ment can be represented as a distribution over 30, 90, or 150 topics,
From these submissions, we performed basic preprocessing. Af- where each topic is a distribution over the vocabulary of 65,176
ter tokenizing, only tokens consisting of at least three characters word types. When referring to a topic, we include the model name
were kept. Common words that occur in over 25% of all submissions to distinguish between two different topics that happen to share
were removed. Rare words that occur in fewer than five submissions the same topic number (e.g., A.10 and B.10 are two different topic
were also removed. While both r/China and r/Sino are predomi- features from models A and B respectively). We used the Gensim
nantly English-language communities, Chinese characters (hanzi) Python package for LDA model training [22].
are sometimes used. All Chinese characters were identified based We considered multiple methods for constructing topic distribu-
on their Unicode values and removed. We did not stem tokens, as tions that reflect a collection of documents. While an LDA model
this has been shown to have minimal or even negative effects on provides topic distributions for each document, we would like to
topic modeling [24]. Prior to preprocessing, there were 261,555 construct a topic distribution that reflects all documents within a
unique word types, which were reduced to 65,176 word types af- collection. Existing methods for combining document-level topic
ter processing. Documents were discarded if they contained fewer distributions into a collection-level topic distribution include cal-
than 20 post-processing tokens, resulting in 97,619 total documents culating an average topic distribution from the document-level
(down from 147,681 documents). topic distributions for all documents in a collection, such as in [13].
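The preprocessing steps described in Section 3.1 can be sketched roughly as follows. This is an illustrative sketch and not the authors' code: the regular-expression tokenizer, the lowercasing step, the single Unicode block used to detect Chinese characters, and the function names `tokenize` and `preprocess` are all assumptions, since the paper does not specify them. Only the numeric settings (three characters, 25%, five submissions, 20 tokens) come from the text.

```python
import re
from collections import Counter

# Assumption: tokens are runs of letters; the paper does not name its tokenizer.
TOKEN_RE = re.compile(r"[^\W\d_]+")
# Assumption: hanzi are detected via the CJK Unified Ideographs block only.
HANZI_RE = re.compile(r"[\u4e00-\u9fff]")


def tokenize(text, min_chars=3):
    """Lowercase, strip Chinese characters, keep tokens of >= min_chars letters."""
    text = HANZI_RE.sub(" ", text.lower())
    return [tok for tok in TOKEN_RE.findall(text) if len(tok) >= min_chars]


def preprocess(raw_docs, max_doc_share=0.25, min_doc_count=5, min_tokens=20):
    """Drop overly common and overly rare words, then drop short documents."""
    tokenized = [tokenize(doc) for doc in raw_docs]

    # Document frequency: in how many documents does each word type occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    keep = {
        word
        for word, df in doc_freq.items()
        if df >= min_doc_count and df / n_docs <= max_doc_share
    }

    filtered = [[tok for tok in tokens if tok in keep] for tokens in tokenized]
    # Discard documents with too few tokens left after filtering.
    return [tokens for tokens in filtered if len(tokens) >= min_tokens]
```

The defaults mirror the thresholds reported above; on a small toy corpus they would need to be loosened, because almost every word is either too rare or too common.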
All of the data we collected are public and did not require IRB Another possible method would be to assign each word in a doc-
approval. Both r/China and r/Sino are publicly accessible and do not ument to a topic based on the document’s topic distribution and
then count these word-topic assignments within each document
in the collection to make a collection-level topic distribution, such of elements in the tuple is based on a threshold parameter and
as in [26]. This method also incorporates the length of each docu- is therefore flexible. To the best of our knowledge, this is a novel
ment in the collection-level topic distribution—longer documents method for representing documents as categorical topic features
will have greater influence over the resulting collection-level topic based on their topic distributions as inferred through LDA.
distribution. To construct a topic tuple distribution, we first define a threshold
Both of these methods for constructing collection-level topic parameter, t, within a range from 0 to 1. For each document be-
distributions result in the loss of information about potential in- longing to a community, that document’s topic tuple consists of the
terdependencies between topics that are salient within the same ordered topic indices that, when their corresponding proportions
document and thus provide important context. For example, a docu- in the topic distribution are summed together, equal or exceed the
ment that contains language primarily about both the United States specified threshold. A document’s topic tuple must have at least
and the Hong Kong protests is better represented by a combination one element and only the minimum number of elements necessary
of these two topics that is lost if we consider each of the two topics to meet the threshold condition.
in isolation. Additionally, a feature may be prevalent in both collec- As an example, consider the following topic probabilities for
tions, but may be combined with other features differently in the some document with four topics: 0.01, 0.49, 0.41, and 0.09 corre-
two collections. sponding to the proportions of topics T.0, T.1, T.2, and T.3, respec-
The loss of potential topic relationships in the methods just dis- tively, and which sum to one. If we define the threshold to be 0,
cussed results from both methods being ways of calculating the then the document’s topic tuple only includes the dominant topic,
frequency of each topic within a collection, whether by counting T.1, which has the largest probability of 0.49. However, if we define
topic proportions in each document and then normalizing or by the threshold to be 0.5, then topic T.1 is no longer sufficient to meet
counting word-topic assignments and then normalizing. In order the threshold. Instead, the topic with the next highest proportion,
to capture topic interdependencies within documents, we propose T.2, must be combined with T.1 to form the topic tuple, (T.1, T.2).
expanding the feature space of topics to also include combinations The summed probability of these two topics in the document is 0.9,
of topics. By mapping each document’s topic distribution to a sin- which satisfies the threshold of 0.5.
gle categorical feature consisting of either an individual topic or This example illustrates the kind of interdependencies between
a combination of topics, we can construct collection-level topic topics that can be preserved using this method. Topics T.1 and T.2
distributions that preserve topic relationships from the document are both similarly salient in the document (based on their similar
level within the broader collection-level distribution. Below, we de- proportions in the document’s topic distribution), which is reflected
scribe two kinds of collection-level topic distributions constructed by the topic tuple containing both. If instead, the proportion of T.1
from mapping each document’s topic distribution to a categori- was 0.85 and the proportion of T.2 0.05, then only T.1 would be
cal representation: dominant topic distributions and topic tuple needed to meet the threshold of 0.5. In this case, T.1 is uniquely
distributions. salient within the topic distribution and so additional topics are not
needed in the document’s topic tuple to meet the threshold.
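The worked example above can be reproduced with a short sketch. The function name `topic_tuple` and the use of zero-based list positions as topic indices are assumptions made for illustration, and the tuple is ordered by descending proportion, which is how we read "ordered" here given that tuples such as (T.1, T.2) list the dominant topic first.

```python
def topic_tuple(topic_probs, threshold):
    """Map a document's topic distribution to its topic tuple.

    Topics are added in order of decreasing proportion until their summed
    proportion equals or exceeds the threshold. At least one topic is
    always included, so a threshold of 0 yields the dominant topic alone.
    """
    ranked = sorted(
        range(len(topic_probs)), key=lambda i: topic_probs[i], reverse=True
    )
    chosen = []
    total = 0.0
    for idx in ranked:
        chosen.append(idx)
        total += topic_probs[idx]
        if total >= threshold:
            break
    return tuple(chosen)


# The four-topic example from the text: T.0, T.1, T.2, T.3.
example = [0.01, 0.49, 0.41, 0.09]
print(topic_tuple(example, 0))    # (1,)
print(topic_tuple(example, 0.5))  # (1, 2)
```

Counting how often each tuple occurs across a community's documents and normalizing the counts would then give that community's topic tuple distribution.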
3.2.1 Dominant topic distributions. As a baseline, we first calcu-
It should be noted that the same threshold will make higher
lated collection-level distributions by assigning each document to
demands when used on topic distributions with a larger number of
the topic with the highest probability in the document’s topic dis-
topics. For example, a threshold of 0.5 may result in a topic tuple of
tribution. After assigning each document to a topic in this way, a
2 elements if k is 90, but may result in a topic tuple of 3 elements for
community’s collection-level topic distribution can be formed from
the same document when represented in a distribution where k is
the relative frequencies of these document-topic assignments. We
150. This is simply due to the probability mass having to be spread
refer to collection-level topic distributions created in this way as
out over a larger number of elements in the case of 150 topics versus
dominant topic distributions.
90 topics. Additionally, increasing the threshold value may result
Dominant topic distributions can be thought of as a special case
in a larger number of features that constitute the collection-level
of the topic tuple distributions described below where the threshold
topic distribution (see Table 1). Arbitrarily increasing the number
value is zero. This method treats each document as equally impor-
of features in this way may have undesirable effects by decreasing
tant regardless of length, in contrast to the word-topic assignment
the ability to meaningfully discriminate between two topic tuple
method discussed above. However, relationships between topics are
distributions.
necessarily lost in dominant topic distributions, since documents
We constructed topic tuple distributions using topics from each
will always be assigned to an individual topic. Therefore, we use
of the three topic models described above for threshold values of
dominant topic distributions as a baseline with which to compare
0.1, 0.3, 0.5, and 0.7. We limited our qualitative analysis to threshold
the results found using topic tuple distributions in order to see
values of 0.3 and 0.5 in order to avoid the potential problems of
what, if anything, is gained from combining topics.
having too many features.
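The dominant topic baseline of Section 3.2.1 can be sketched in a few lines. The function name `dominant_topic_distribution` and the representation of each document as a plain list of topic proportions are assumptions for illustration, not part of the original method description.

```python
from collections import Counter


def dominant_topic_distribution(doc_topic_probs):
    """Relative frequency of each topic as the dominant topic of a document.

    doc_topic_probs: one list of topic proportions per document.
    Every document counts once, regardless of its length.
    """
    dominant = [
        max(range(len(probs)), key=probs.__getitem__)
        for probs in doc_topic_probs
    ]
    counts = Counter(dominant)
    n_docs = len(dominant)
    return {topic: count / n_docs for topic, count in sorted(counts.items())}


# Four toy documents over three topics; topic 0 dominates three of them.
collection = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.60, 0.30, 0.10],
    [0.50, 0.25, 0.25],
]
print(dominant_topic_distribution(collection))  # {0: 0.75, 1: 0.25}
```

Topics that are never dominant (topic 2 in the toy data) simply do not appear as features, which is consistent with Table 1 listing fewer dominant topic features than topics for models B and C.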
3.2.2 Topic tuple distributions. As we will see, interesting findings
can be made from analyzing dominant topic distributions. How-
ever, dominant topic distributions necessarily obscure potentially
interesting interdependencies between topics. In order to preserve 3.3 Levels of analysis
potential interdependencies between multiple topics within a sin- We compared collection-level topic distributions representing
gle document, we propose a method for mapping a document’s r/China and r/Sino at two different levels of analysis. First, we
topic distribution to an ordered tuple of topics where the number compared distributions reflecting all documents used in training
our topic models, from August 2015 through November 2019. Second, we narrowed our focus to documents from 2019 that contain language about Hong Kong as a more focused case study of the different perspectives on the Hong Kong protests. These documents were selected after identifying six topics relating to Hong Kong: one from model A (A.11), two from model B (B.30 and B.44), and three from model C (C.6, C.70, and C.118). Any document which has one of these six topics as its most dominant or second-most dominant topic and that has a submission date between January and November of 2019 was included in this set.
For both levels of analysis, we filtered out all documents with dominant topic A.18, as they correspond to submissions that are dominated by one of several boilerplate moderation comments, typically due to the submission not following the community rules. This filtering resulted in 5,114 documents being discarded in the subsequent analysis.
Table 1: Number of features present in collection-level topic distributions.
Model       Dominant topic   Topic tuple (t=0.3)   Topic tuple (t=0.5)
A (k=30)    30               1,232                 10,515
B (k=90)    89               6,786                 32,231
C (k=150)   138              12,087                54,618
3.4 Feature comparisons
For each level of analysis, we calculated the Jensen-Shannon divergence between a given collection-level topic distribution representing r/China and a distribution over the same features representing r/Sino. In order to measure how strongly each feature of the distribution functions as a signal of a community, we calculated the partial Kullback-Leibler divergence, $\mathrm{KL}_i$, for each element of the distribution, which reflects how much each feature individually contributes to the Jensen-Shannon divergence [16]. The Jensen-Shannon divergence can be formulated in the following manner as the symmetrized version of the asymmetric Kullback-Leibler divergence:
\[ \mathrm{JSD}(p, q) = \frac{1}{2} \left[ \mathrm{KLD}(p, m) + \mathrm{KLD}(q, m) \right] \tag{1} \]
where p and q are the distributions being compared and $m = \frac{1}{2}(p + q)$. The Kullback-Leibler divergence with an expectation based on p is given by
\[ \mathrm{KLD}(p, m) = \sum_i p_i \log_2 \frac{p_i}{m_i} \tag{2} \]
from which the partial Kullback-Leibler divergence for the ith feature in the distribution is simply
\[ \mathrm{KL}_i(p, m) = p_i \log_2 \frac{p_i}{m_i} \tag{3} \]
The partial Kullback-Leibler divergence measures how strongly feature i acts as a signal of the expectation distribution (p as written in equation 3) [16]. Thus, knowing the partial Kullback-Leibler divergence for each feature with an expectation based on a distribution representing r/China tells us how conspicuous that feature is for r/China against a background based on a distribution representing r/Sino. We calculated the partial KL for each feature using r/China as the expectation distribution to rank features in order of relative salience in r/China and then did the same with r/Sino as the expectation distribution.
For each comparison, we then examined the ten most frequent features in each community as well as the ten most distinguishing features of each community (based on the partial KL values). We conducted these comparisons for each combination of LDA model, feature type, and level of analysis described above. In order to understand the significance of each feature in context, we manually read multiple documents that possess the feature. This is necessary since interpreting topic features based only on some number of highly probable terms in a topic can be problematic.
4 RESULTS
In this section, we describe our findings from the broader level of analysis followed by findings from documents discussing Hong Kong during 2019. At each level of analysis, we first examine dominant topic features followed by topic tuple features. We report both highly frequent and highly distinguishing features in the tables below. Highly frequent features are reported with their relative frequency within a community’s document collection (e.g., Table 2). While highly frequent features within a community provide a general characterization of that community’s discourse, the features which are most distinguishing reflect which aspects of one discourse are comparatively salient in that discourse relative to the other. The most distinguishing features of each community’s discourse are reported with their partial KL values, given in bits (e.g., Table 3).
4.1 Broad comparison of discourse
In order to get a broad sense of how the discursive constructions of China differ between the two communities, we first consider findings that arise when comparing documents from August 2015 through November 2019. In each of the three topic models, a topic emerges that is prevalent in both communities. These topics (A.24, B.31, and C.123) reflect a general rhetorical style that tends to be negative and critical, based on our manual reading of exemplar documents that feature these topics with high probability. These stylistic topics are the most frequently observed dominant topics in both communities at this level of analysis. Several other features exist within each of the topic models, so reporting them within each model would be redundant. We limit reporting results at this level of analysis to the features from model B (k=90), as they are interpretable, but not overly specific.
4.1.1 Results from dominant topic distributions. When examining the most frequent features in r/China from the dominant topic distribution, the features that appear most frequently (aside from the stylistics topic B.31) concern practical matters: asking questions and seeking advice (e.g., how to ship a package to China) (B.76), personal aspects of life in China as a foreign national (B.21), discussions about jobs and working in China (B.87), the use of VPNs for accessing websites (B.18), etc. In addition to these more practical topics, discussions about trade with the United States (B.75) occur as the fifth most frequent feature.
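The divergence calculations of Section 3.4 (equations 1 to 3), which underlie the partial KL rankings reported in the results, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: distributions are represented as dictionaries from feature to probability, the function names `partial_kl` and `jsd` are hypothetical, and a feature with zero probability under the expectation distribution is assigned a partial KL of zero (the usual convention that 0 log 0 = 0).

```python
import math


def partial_kl(p, q):
    """Partial KL of each feature, in bits, with p as the expectation.

    Each value is p_i * log2(p_i / m_i), where m = (p + q) / 2 (equation 3).
    """
    result = {}
    for feature in set(p) | set(q):
        p_i = p.get(feature, 0.0)
        q_i = q.get(feature, 0.0)
        m_i = 0.5 * (p_i + q_i)
        # Convention: a feature absent from p contributes nothing.
        result[feature] = p_i * math.log2(p_i / m_i) if p_i > 0 else 0.0
    return result


def jsd(p, q):
    """Jensen-Shannon divergence in bits (equations 1 and 2)."""
    kld_p = sum(partial_kl(p, q).values())
    kld_q = sum(partial_kl(q, p).values())
    return 0.5 * (kld_p + kld_q)


# Toy feature distributions for two communities (illustrative values only).
community_a = {"x": 0.5, "y": 0.5}
community_b = {"x": 0.5, "z": 0.5}

scores = partial_kl(community_a, community_b)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores[ranking[0]])  # y 0.5
```

In the toy data, feature "x" is equally frequent in both communities and so carries no signal, while "y" occurs only in the first community and ranks as its most distinguishing feature. Swapping the arguments ranks features by salience in the second community instead.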
After the stylistics topic B.31, the most frequent dominant topics experiences as foreign nationals living in China (B.21), their work-
observed in r/Sino include discussions around China establishing ing lives (B.87), and practical concerns (e.g., B.18, B.72, and B.33). In
partnerships with other countries (B.24), trade with the United juxtaposition to this, the discourses that emerge from r/Sino focus
States (B.75), military and engineering innovation in China (B.19), on China at the state level with respect to its relationships and
political ideology and systems (B.9), technological innovation and relative standings with other countries (B.24, B.75, and B.86) as well
growth (B.37), China’s international economic standing (B.86), etc. as its own internal growth, development, and power (B.19, B.37,
An overview of these features for both communities can be seen in B.25, and B.54).
Table 2.
While we can see some interesting differences between the most
4.1.2 Results from topic tuple distributions. While the analysis of
frequent dominant topic features of the two communities, calculat-
dominant topic distributions has yielded interesting results, the
ing the partial KL values of each feature provides a rigorous way
most frequent feature in either community is the somewhat vague
of ranking the comparative salience of each feature. For r/China,
stylistic topic B.31, which accounts for over 60 percent of the r/China
the ordering of distinguishing dominant topics resembles its most
collection and over 45 percent of the r/Sino collection, and which is
frequent dominant topics with a few changes. After topics B.31,
the strongest signal of r/China relative to r/Sino. When we compare
B.76, B.21, and B.87, the next most distinguishing topic for r/China
collection-level distributions of topic tuples, we find that several
is B.18 (VPNs), followed by several less frequent topics reflecting
interesting features emerge in which B.31 is dominant but interde-
discussions about needing help with communication applications
pendent with an additional topic.
(most often, the messaging application, WeChat) (B.72), purchasing
Using a threshold value of 0.3, we find that the most frequent
products (B.33), and water sanitation (B.64).
topic tuples in r/China involve the same topics seen in Table 2,
The most distinguishing dominant topic features for r/Sino sim-
but now in more contextually informative combinations, includ-
ilarly reflect several of its highest-frequency features with some
ing (B.31, B.21) reflecting discourse that features critical stylistic
changes. Topic B.24 (international partnerships) is the most distin-
elements in combination with life as a foreign national in China,
guishing topic, followed by B.19 (military and engineering innova-
(B.21, B.31) representing the same combination but with life as a
tion), B.75 (trade relations with the United States), B.37 (technologi-
foreign national given primacy, and (B.31, B.76) combining critical
cal innovation and growth), B.25 (financial reporting), B.9 (political
stylistics with questions and advice. An interesting picture begins
ideology), B.86 (economic standing), and B.54 (scientific research).
to emerge from these co-salient topic tuples that is borne out when
See Table 3 for an overview of the most distinguishing dominant
reading the source documents with these features—while r/China
topics.
may often invoke critical language that is untethered from more
Notably, the discourses that emerge from r/China over this broad
specific discursive foci, the emergence of (B.31, B.21) and (B.21,
period of time tend to reflect the experiences of individuals—their
B.31) as relatively frequent reflects the tendency of r/China users
to discuss their lives in China as foreign nationals in ways that are often negative.

Likewise, at a threshold of 0.3, several of the features we saw as dominant topics again occur with high frequency in r/Sino, now joined by the topic tuple (B.31, B.35), which represents the combination of the critical stylistics feature with discussions about trade with the United States. Many of these documents include discussions that heavily criticize the United States in relation to the so-called trade war between the two countries that began in 2018. See Table 4 for an overview of the five most frequent topic tuples for this threshold value.

For r/China, the same five most frequent features are also the five most distinguishing features relative to r/Sino. However, a few interesting changes are present within the five most distinguishing features of r/Sino relative to r/China. The first three most distinguishing features correspond to those described for the dominant topic features from Table 3. Additionally, we see the topic tuple (B.31, B.88), corresponding to a combination of the critical stylistics topic with B.88, which represents discourse about the “West” (typically used to refer to the United States), most often as accusations of hypocrisy (e.g., perceived double standards regarding the state’s treatment of Uyghurs in light of the United States’ treatment of those seeking refugee status there) and more general charges of propaganda and anti-Chinese bias in Western media. This feature demonstrates the usefulness of examining topic tuples in this manner: When only analyzing dominant topics, B.88 is obscured by the prevalence of highly critical language that often accompanies discussions of the West by r/Sino. By allowing for the possibility that more than one topic is needed to adequately represent a document, we can see the different ways in which r/Sino uses the critical stylistics topic, whether in criticism of perceived Western hypocrisy or in criticism of the China-US trade war. See Table 5 for an overview of the most distinguishing topic tuples from each community at a threshold of 0.3.

From analyzing the feature distributions that characterize the collection of documents from r/China and that of r/Sino over a period of over four years, we see that both communities employ a generally similar way of using language that involves being highly critical (topic B.31). From an analysis of the dominant topic distributions of each community, we see that (aside from B.31) r/China submissions are often concerned with the experiences of individuals, most often as foreign nationals navigating their lives in China. When we look more deeply into the topic relationships that may occur within documents (by constructing distributions of topic tuples rather than dominant topics), we find that the critical stylistics topic frequently pairs with these other topics, providing greater context for understanding the discourses.

4.2 Comparison of discourse concerning Hong Kong in 2019

When we compare discourse surrounding Hong Kong during 2019, we again see that the critical stylistics topic B.31 is the most frequently occurring feature in each community, both when analyzing dominant topics and topic tuples with a threshold of 0.3.

4.2.1 Results from dominant topic distributions concerning Hong Kong in 2019. The top five most frequent dominant topics from the two communities share much in common in terms of the features’ rankings (see Table 6). Notably, the topic B.35 may represent two kinds of language. On the one hand, B.35 appears in submissions which include a string of phrases intended to provoke censorship. These phrases typically include references to Tibet, the Tiananmen Square massacre of 1989, “democratization,” “independence,” and “freedom” among others. Among the r/Sino collection of documents concerning Hong Kong, the appearance of B.35 almost always indicates this usage within a submission title, which is flagged as violating the subreddit rules (it is likely considered a form of trolling). However, the appearance of B.35 within r/China may also include language that shares some words
in common with the trolling usage just described. Submissions that feature actual discussions invoking Tiananmen Square, Tibet, or democracy may also have this dominant topic.

Table 6: Most frequent dominant topic features concerning Hong Kong in 2019.

Examining which dominant topics most distinguish each community yields interesting differences (see Table 7). Topic B.30 is highly conspicuous in r/China and represents general discussions around the protests, typically framed as political tensions between mainland China and Hong Kong. The most distinguishing dominant topic of r/Sino, B.44, also reflects language about the Hong Kong protests, but more specifically concerns violence occurring during protests. Such submissions from r/Sino tend to focus on violence alleged to have been committed by the protestors (though in r/China this topic may reflect violence carried out against protestors by police in addition to violence committed by protestors). This marks an interesting change in the discursive strategies we previously described for r/Sino: While r/Sino broadly tends to focus on states as actors, rather than individuals, as described in section 4.1 above, in discourse around the Hong Kong protests, the community emphasizes the negative actions of individuals.

The other dominant topics that distinguish r/China at this level of analysis include the previously described sensitive phrases topic (B.30), question asking (B.76), language about reports or announcements from the CPC (B.83), and the enforcing of subreddit rules (B.49). The critical stylistics topic is more conspicuous within r/Sino at this level of analysis whereas, at the broad level of analysis described in section 4.1, this topic served as a stronger signal of r/China. While B.65 reflects submissions that include automatically constructed summaries by a self-declared bot account, we also see the appearance of B.88, denoting accusations of Western hypocrisy, and B.45, criticizing media outlets for reporting alleged falsehoods.

4.2.2 Results from topic tuple distributions concerning Hong Kong in 2019. When we analyze the topic tuples representing each community’s collection of documents, we again see that the critical stylistics topic is often co-salient with other relevant features, which are obscured when only considering the dominant topic of each document. Here, those features include language about the protests in relation to their political underpinnings (B.30) and to protest-related violence (B.44). See Table 8 for an overview of the most frequent topic tuples.

Notably, language about protest-related violence does not occur within any of the five most distinguishing topic tuples for r/China. However, three of the five most distinguishing features for r/Sino feature language about violence, almost always as carried out by protestors. Instead, r/China’s distinguishing features concerning Hong Kong deal more with the underlying politics, both as reflected by B.30 and some of the discussions related to B.35 that sometimes invoke language about democracy. See Table 9 for an overview of the most distinguishing topic tuples for each community at a threshold of 0.3.

Interestingly, the topic tuple (B.31, B.88) appears as the third most distinguishing feature for r/Sino, despite these documents being required to have a Hong Kong-related topic as their first or second most probable topic. This is a case where using more specific topics can be helpful, as these documents have a Hong Kong-related topic in the 150-topic model as the first or second most dominant topic that does not appear in the 90-topic model.

If we increase the threshold to 0.5, we do see that there is a connection between r/Sino’s usage of critical stylistics (B.31), charges of Western hypocrisy (B.88), and Hong Kong-related topics within the 90-topic model. See Table 10 for the distinguishing features of each community when examining topic tuples with a threshold of 0.5. At this threshold, we see that distinguishing discussions on r/Sino often combine discussions of the protests with charges of Western hypocrisy (B.88). Importantly, the connection that r/Sino forges between the Hong Kong protests and Western hypocrisy becomes clear when topic tuples are examined. These results suggest two dominant discursive strategies employed by users of r/Sino when discussing the protests: to foreground alleged violence committed by protestors and to shift discursive focus onto the hypocrisy of the West.
Table 8: Most frequent topic tuples (t=0.3) concerning Hong Kong in 2019.
Table 9: Most distinguishing topic tuples (t=0.3) concerning Hong Kong in 2019.
Table 10: Most distinguishing topic tuples (t=0.5) concerning Hong Kong in 2019.
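The topic-tuple features summarized in these tables can be illustrated with a short sketch. It assumes that a document's tuple lists its topics in descending order of probability, stopping once their cumulative probability reaches the threshold t. That reading is consistent with ordered tuples such as (B.31, B.21) versus (B.21, B.31) and with longer tuples at t = 0.5, but the exact construction is defined earlier in the paper and may differ. The function names and toy document-topic rows are hypothetical.

```python
from collections import Counter


def topic_tuple(doc_topics, threshold):
    """Ordered tuple of topic indices, most probable first.

    Assumed rule: keep adding topics until their cumulative probability
    reaches the threshold.
    """
    order = sorted(
        range(len(doc_topics)), key=lambda k: doc_topics[k], reverse=True
    )
    chosen, mass = [], 0.0
    for k in order:
        chosen.append(k)
        mass += doc_topics[k]
        if mass >= threshold:
            break
    return tuple(chosen)


def tuple_distribution(docs, threshold):
    """Relative frequency of each topic tuple across a collection."""
    counts = Counter(topic_tuple(d, threshold) for d in docs)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


# Toy rows over 4 topics; invented, not taken from the paper.
docs = [
    [0.25, 0.20, 0.40, 0.15],  # one topic alone exceeds t = 0.3
    [0.28, 0.27, 0.25, 0.20],  # needs two topics to reach t = 0.3
    [0.10, 0.10, 0.70, 0.10],
]
for t in (0.3, 0.5):
    print(t, tuple_distribution(docs, t))
```

Under this assumption, raising the threshold can only lengthen a document's tuple, which is why more topic combinations become visible at t = 0.5 than at t = 0.3.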
is discussed at a more abstract, and therefore idealized, level. Aspects of this discursive strategy can also be seen in r/China through the focus on concrete, negative experiences of individuals living in China, while discussing the Hong Kong protests primarily in terms of more abstract, idealized entities. These discursive strategies echo the “Fallacy of the Misguided Comparison” as described by Hall and Ames [12] within the context of cross-cultural communication between the West and China. The authors describe this fallacy as the comparison of “the ideals of one society or culture with the practices of another” [12]. The implications of these findings are that popular conceptions of China from Reddit are likely to reflect such misguided comparisons, by either privileging the ideals of China (as in r/Sino) or its flawed realities (as in r/China), leaving a gap where more even-handed cross-cultural understanding between the West and China might exist.

Many of these findings come into clearer view when analyzing topic tuples as document features rather than single dominant topics. This method permits us to see topic combinations that provide important context. For example, it is not just the case that r/Sino uses critical language stylistics, but rather, it pairs critical language stylistics with features like protest violence and Western hypocrisy, whereas r/China uses the same topic in combination with describing experiences as foreign nationals.

While this analysis has yielded interesting insights, there are limitations present in the current study. Our analyses focus on the frequency of certain features at the document level, treating each document equally. However, various kinds of metadata are available from Reddit that are likely to be of interest when combined with these features. One kind of potentially interesting metadata is a document’s score, which is derived from the number of positive and negative votes it received (known as upvotes and downvotes, respectively). Correlating document scores with discursive features might provide additional information on which features are not only frequent, but broadly endorsed by the community. Additionally, our focus has been on two important China-related subreddits, but there are other communities whose analysis would contribute to a larger understanding of the various China-related discourses that are active within the English-speaking world of Reddit, but with the caveats we noted in section 1.

6 CONCLUSIONS

The subreddits r/China and r/Sino represent two popular and distinct sets of English-language discursive constructions of China. For a number of reasons, understanding popular modes of discourse around China is important owing to China’s international importance more broadly. Using latent word-usage patterns underlying discussions from both communities, we have examined the word-usage patterns that are most frequent in each community and that most distinguish them against a backdrop informed by the other. We find that r/China is broadly distinguished by a focus on the (often negative) experiences of individuals, whereas r/Sino is broadly distinguished by a focus on states. When we focus our analysis on discussions related to Hong Kong during 2019, we find that r/China is distinguished by discussions of the political underpinnings of the protests deriving from tensions between China and Hong Kong as abstract primary characters, while r/Sino is distinguished by a focus on violence committed by protestors (a reversal from the lack of focus on individuals more broadly) and by the tendency for discussions about the protests to foreground accusations of Western hypocrisy. These findings contribute to a broader understanding of the popular Western perspectives surrounding China.

ACKNOWLEDGMENTS

This research is funded in part by the U.S. National Science Foundation (OIA-1920920, IIS-1636933, ACI-1429160, and IIS-1110868), U.S. Office of Naval Research (N00014-10-1-0091, N00014-14-1-0489, N00014-15-P-1187, N00014-16-1-2016, N00014-16-1-2412, N00014-17-1-2605, N00014-17-1-2675, N00014-19-1-2336), U.S. Air Force Research Lab, U.S. Army Research Office (W911NF-16-1-0189), U.S. Defense Advanced Research Projects Agency (W31P4Q-17-C-0059), Arkansas Research Alliance, the Jerry L. Maulden/Entergy Endowment at the University of Arkansas at Little Rock, and the Australian Department of Defense Strategic Policy Grants Program (SPGP) (award number: 2020-106-094). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations. The researchers gratefully acknowledge the support.

REFERENCES

[1] Maria Antoniak, David Mimno, and Karen Levy. 2019. Narrative paths and negotiation of power in birth stories. In Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 88 (November 2019), 27 pages. DOI: https://doi.org/10.1145/3359190
[2] Alexander T. J. Barron, Jenny Huang, Rebecca L. Spang, and Simon DeDeo. 2018. Individuals, institutions, and innovation in the debates of the French Revolution. PNAS 115, 18 (May 2018), 4607–4612. DOI: https://doi.org/10.1073/pnas.1717729115
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
[4] Gavin Brooks and Tony McEnery. 2019. The utility of topic modelling for discourse studies: A critical evaluation. Discourse Studies 21, 1 (Feb. 2019), 3–21. DOI: https://doi.org/10.1177/1461445618814032
[5] Cody Buntain and Jennifer Golbeck. 2014. Identifying social roles in reddit using network structure. In Proceedings of the 23rd International Conference on World Wide Web (WWW ’14 Companion). Association for Computing Machinery, New York, NY, USA, 615–620. https://doi.org/10.1145/2567948.2579231
[6] Eshwar Chandrasekharan, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. 2017. You Can’t Stay Here: The Efficacy of Reddit’s 2015 Ban Examined Through Hate Speech. Proc. ACM Hum.-Comput. Interact. 1, CSCW, Article 31 (December 2017), 22 pages. DOI: https://doi.org/10.1145/3134666
[7] Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proc. ACM Hum.-Comput. Interact. 2, CSCW, Article 32 (November 2018), 25 pages. DOI: https://doi.org/10.1145/3274301
[8] Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems (NIPS ‘09). Vancouver, 288–296.
[9] Paul DiMaggio, Manish Nag, and David Blei. 2013. Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics 41, 6, 570–606. DOI: https://doi.org/10.1016/j.poetic.2013.08.004
[10] Ryan Gallagher, Andrew J. Reagan, Christopher M. Danforth, and Peter Sheridan Dodds. 2018. Divergent discourse between protests and counter-protests: #BlackLivesMatter and #AllLivesMatter. PLoS ONE 13, 4 (Apr. 2018). DOI: https://doi.org/10.1371/journal.pone.0195644
[11] Andrew Goldstone and Ted Underwood. 2012. What can topic models teach us about the history of literary scholarship? Journal of Digital Humanities 2, 1 (Dec. 2012).
[12] David L. Hall and Roger T. Ames. 1999. The Democracy of the Dead: Dewey, Confucius, and the Hope for Democracy in China (1st. ed.). Open Court, Chicago and Lasalle, IL.
[13] David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08). Association for Computational Linguistics, USA, 363–371.
[14] William L. Hamilton, Justine Zhang, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky, and Jure Leskovec. 2017. Loyalty in online communities. In Proceedings of the 11th International AAAI Conference on Web and Social Media (ICWSM ‘17). AAAI Press, 540–543. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/15710/14848
[15] Marianne Jørgensen and Louise J. Phillips. 2002. Discourse Analysis as Theory and Method (1st. ed.). Sage, London.
[16] Sara Klingenstein, Tim Hitchcock, and Simon DeDeo. 2014. The civilizing process in London’s Old Bailey. PNAS 111, 26 (Jul. 2014), 9419–9424. DOI: https://doi.org/10.1073/pnas.1405984111
[17] Humphrey Mensah, Lu Xiao, and Sucheta Soundarajan. 2019. Characterizing susceptible users on Reddit’s ChangeMyView. In Proceedings of the 10th International Conference on Social Media and Society (SMSociety ’19). Association for Computing Machinery, New York, NY, USA, 102–107. https://doi.org/10.1145/3328529.3328550
[18] David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11). Association for Computational Linguistics, USA, 262–272.
[19] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M. Carley. 2013. Is the sample good enough? Comparing data from Twitter’s streaming API with Twitter’s firehose. In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media. AAAI Press, Cambridge, MA, 400–408.
[20] Dong Nguyen, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, and Jane Winters. 2019. How we do things with words: Analyzing text as social and cultural data. arXiv:1907.01468. Retrieved from https://arxiv.org/abs/1907.01468
[21] Ryan Nichols, Edward Slingerland, Kristoffer Nielbo, Uffe Bergeton, Carson Logan, and Scott Kleinman. 2018. Modeling the contested relationship between Analects, Mencius, and Xunzi: Preliminary evidence from a machine-learning approach. The Journal of Asian Studies 77, 1 (Feb. 2018), 19–57. DOI: https://doi.org/10.1017/S0021911817000973
[22] Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Malta, 45–50.
[23] Margaret E. Roberts, Brandon M. Stewart, and Dustin Tingley. 2016. Navigating the Local Modes of Big Data: The Case of Topic Models. In Computational Social Science: Discovery and Prediction. Cambridge University Press, New York, NY, 51–97.
[24] Alexandra Schofield and David Mimno. 2016. Comparing apples to apple: The effects of stemmers on topic models. In Transactions of the Association for Computational Linguistics 4, 287–300. DOI: https://doi.org/10.1162/tacl_a_00099
[25] Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International Conference on World Wide Web (WWW ’16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 613–624. https://doi.org/10.1145/2872427.2883081
[26] Laure Thompson and David Mimno. 2018. Authorless topic models: Biasing models away from known structure. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, USA, 3903–3914.
[27] Anton Törnberg and Petter Törnberg. 2016. Muslims in social media discourse: Combining topic modeling and critical discourse analysis. Discourse, Context and Media 13, 132–142. DOI: https://doi.org/10.1016/j.dcm.2016.04.003
[28] Anton Törnberg and Petter Törnberg. 2016. Combining CDA and topic modeling: Analyzing discursive connections between Islamophobia and anti-feminism on an online forum. Discourse & Society 27, 4, 401–422. DOI: https://doi.org/10.1177/0957926516634546
[29] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09). Association for Computing Machinery, New York, NY, USA, 1105–1112. DOI: https://doi.org/10.1145/1553374.1553515