Comparative Discourse Analysis Using Topic Models

This study analyzes contrasting perspectives on China from two Reddit communities, r/China and r/Sino, using probabilistic topic modeling to identify linguistic features that differentiate their discourses. The findings reveal that r/China is generally critical of the Communist Party of China, while r/Sino defends it, particularly in discussions surrounding the Hong Kong protests of 2019. The research contributes a novel method for representing document collections that maintains interdependencies between topics, enhancing the understanding of how these communities construct their narratives about China.

Comparative Discourse Analysis Using Topic Models: Contrasting Perspectives on China from Reddit

Zachary Kimo Stine
Department of Information Science, University of Arkansas at Little Rock, AR, United States
[email protected]

Nitin Agarwal
Department of Information Science, University of Arkansas at Little Rock, AR, United States
[email protected]

ABSTRACT
In this study, we conduct a comparative analysis of the linguistic features that differentiate two China-focused discussion communities with contrasting perspectives from Reddit. We utilize probabilistic topic modeling to represent submissions from both communities as distributions of latent patterns of word-usage. Using information theoretic measures, we conduct a series of quantitative comparisons between the language patterns of each community and identify salient features that distinguish the two communities relative to each other. We describe the rhetorical techniques and discursive frames implied by these features and how they are utilized by each community in discussions surrounding the Hong Kong protests during 2019. Additionally, we contribute a novel method for representing collections of documents that preserves interdependencies between topics at the document level.

CCS CONCEPTS
• Human-centered computing → Collaborative and social computing.

KEYWORDS
Comparative discourse analysis, topic models, computational social science, information theory

ACM Reference Format:
Zachary Kimo Stine and Nitin Agarwal. 2020. Comparative Discourse Analysis Using Topic Models: Contrasting Perspectives on China from Reddit. In International Conference on Social Media and Society (SMSociety ’20), July 22–24, 2020, Toronto, ON, Canada. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3400806.3400816

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SMSociety ’20, July 22–24, 2020, Toronto, ON, Canada
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7688-4/20/07. . . $15.00
https://doi.org/10.1145/3400806.3400816

1 INTRODUCTION
The notion that a given discourse constitutes a particular way of understanding some aspect of the world, whose meaning is contingent and socially constructed, is fundamental to prevailing theories and methods of discourse analysis [15]. Viewed this way, competing discourses surrounding the same entity can be understood as competing constructions of that entity. In this study, we analyze two competing English-language discourses surrounding China. Understanding the range of popular Western perspectives on China is critical given China’s importance on the global stage and the tensions that sometimes exist between China and Western countries such as the United States.

The competing discourses we analyze are produced by two communities from the discussion platform Reddit: r/China and r/Sino.¹ While specific discussions in either community may reflect a range of perspectives on China, the general discursive constructions of China produced by each community are fundamentally at odds with each other, especially in their views of the Communist Party of China, or CPC.² Whereas r/China generally tends to be highly critical of the CPC, r/Sino tends to defend the CPC against criticism and engages in much more positive discourse around the party. Understanding these conflicting discursive constructions of China is important to understand the perspectives and representations of China that English-language Reddit users may encounter. Given Reddit’s popularity within the broader social media ecology, it is also reasonable to think that these discourses have some degree of salience for understanding popular perspectives on China within the English-speaking world more broadly.

While English-language discussions about China may occur within a variety of Reddit communities, called subreddits, we limit our analysis to r/China and r/Sino for three reasons. First, r/China and r/Sino have an ideal discursive scope that is neither overly broad nor narrow when compared with many other subreddits that may include discussions about China. For example, the subreddit r/worldnews includes discussions relevant to China, but the breadth of its discursive scope extends far beyond China. In order to isolate the discourses pertinent to China, it would be necessary to sample only the appropriate submissions and to ensure that the particularities of this sampling process did not introduce undesirable biases into the analysis.

Second, r/China and r/Sino represent concerted discourses about China that do not occur within some other primary context. For example, the subreddit r/taiwan may include discussions about China, but those discussions are likely to be too contextually specific, focusing only on aspects of China that are relevant within a Taiwan-specific context. Since r/China and r/Sino are directly focused on China, the users active in these subreddits are primarily there to discuss China, independent of some other primary context.

¹ We follow the convention of using the prefix “r/” to denote the names of Reddit communities, or subreddits. Additionally, we write subreddit names as they are stylized by the subreddit itself, hence some subreddit names are not capitalized even when they refer to proper nouns.
² The Communist Party of China is also commonly referred to in English as the Chinese Communist Party, abbreviated as CCP. We follow the party’s own English-language convention of using the Communist Party of China and the abbreviation CPC.

Third, r/China and r/Sino produce their respective constructions of China with documented awareness of the opposition between them. Notably, the discourse of r/Sino can be read as a reaction to the discourse of r/China. One of the earliest submissions in r/Sino serves as a welcome and explanation of the subreddit by one of the subreddit’s moderators. In the submission text, the author explains that other China-related subreddits exist, but that they are “hateful” and “spread misinformation.” The submission text itself does not explicitly mention r/China, but r/China is explicitly referenced 18 times within the submission’s comments. References to each other occur throughout both subreddits, further validating the view that the two discourses are in competition with each other. r/China is the older subreddit, created in 2008 compared to 2015 for r/Sino. It also has wider exposure on Reddit with almost four times as many subscribers as r/Sino (at the time of this writing). Therefore, we can view the construction of China provided by r/Sino as an intended corrective to the alleged inaccuracies of the representation provided by the more dominant r/China.

Our goal in this study is to not only characterize the two competing discourses, but to characterize what makes each stand out most from the other. We accomplish this by first identifying the discursive features underlying the discussions of the two communities by training topic models on the discussion text, which represent each discussion thread as a mixture of latent word-usage patterns, or topics [3]. Second, we map each discussion thread to a categorical feature representation that includes individual topics or combinations of topics. To our knowledge, the process for mapping documents to categorical combinations of topics is a novel methodological contribution of this study. We then calculate how frequent these features are within each community to obtain community-specific feature distributions. Using an information theoretic quantity, we then calculate how conspicuous each feature is in one community when juxtaposed with the other. Finally, we provide qualitative interpretations of the discursive features that emerge as salient and what discursive frames and strategies they indicate.

Our theoretical framework for interpreting results is informed by the computational approach we take. Our use of topic models to learn the primary discursive features that we analyze constitutes a “distant reading” of the discourses that trades fine-grained, nuanced interpretations for access to large-scale patterns that would not be otherwise observable. In other words, we are interested in the broad tendencies of the discourses, which necessarily obscures more specific aspects of the discourses that might also be illuminating. While we do conduct manual readings of documents from each subreddit, we do so in a way that is guided and constrained by the topic models we obtain. The theoretical framework within which we compare the two discourses is motivated by our desire to make the comparisons objective (at least, as much as possible). Therefore, rather than qualitatively compare quantitative features, we rely on information theory to quantitatively interrogate the relationships between the two discourses and to quantify how salient each feature is for distinguishing between each discourse. This approach assumes that what is most interesting about these competing discourses is what most juxtaposes them.

We apply this methodology within two cases. The first consists of a general comparison between all discussions from the two subreddits over a roughly four-year period during which both subreddits were active. In the second case, we focus specifically on submissions from the two subreddits that are from 2019 and discuss Hong Kong. While the first case provides an overarching comparison of how the two communities conceive of China, the second allows us to see how these two ways of understanding China lead to different ways of understanding the protests occurring in Hong Kong throughout much of 2019. In analyzing discussions about the protests, the two discourses are brought into sharp contrast, providing a glimpse of how real-world events are constructed and interpreted differently by r/China and r/Sino.

We find that the most frequently observed features are similar across the two communities, but that r/Sino can be best distinguished from r/China by negative discourse around other countries (most often the United States), while more general stylistic features and discussions about living and working in China most differentiate r/China. In the case of the 2019 Hong Kong protests, we find that discussions about the politics underlying the protests are salient in r/China, while discussions about violence in the protests alongside criticism of the United States are salient in r/Sino. We also find that our method for constructing community-level feature distributions with combinations of topics allows us to look beyond isolated topics and better appreciate important interactions between multiple topics within discussions.

2 BACKGROUND
In order to better understand our analyses, we provide some background and review selections of relevant work covering Reddit scholarship, topic modeling, and the use of information theoretic measures for comparative purposes.

2.1 Reddit
Data collected from Reddit have been analyzed within a variety of research contexts. Network approaches have been used to characterize common user roles in subreddits [5] and user loyalty to subreddits [14]. Prior work on Reddit has also focused on specific behaviors of users including the usage of hate speech [6] and norm violations [7].

Reddit is an especially useful source for data given that each community has a well-defined focus. For example, the subreddit r/ChangeMyView has been analyzed in order to better understand persuasion [25] and to characterize user susceptibility [17]. Additionally, birth narratives from the subreddit r/BabyBumps have been computationally analyzed to better understand the experiences of people who have given birth [1].

A contribution of the present study to Reddit scholarship is the comparative approach taken to understand the relationship between the language used in two subreddits.

2.2 Latent Dirichlet allocation
In order to identify the word-usage patterns underlying discussions from r/China and r/Sino, we train probabilistic topic models via latent Dirichlet allocation (LDA) [3]. LDA results in two kinds of distributions being inferred: some number of distributions over the vocabulary present in the corpus and a mixture of those distributions that best represent each document. The distributions over vocabulary are referred to as topics, though this is not always the
most appropriate way to think about what they represent. Topics comprise a pattern of word-usage and may correspond to certain rhetorical styles as well as actual topics. Additionally, topics learned through LDA arguably reflect certain concepts from the sociology of culture, including framing, polysemy, heteroglossia, and a relational approach to meaning [9].

LDA requires the number of learned topics, k, to be specified. Because it is unlikely that a uniquely “true” number of topics exists underlying any non-trivial corpus, the selection of k is often based on more pragmatic grounds, primarily how useful the resulting topics are for the researchers making sense of the corpus. While several quantitative methods exist for evaluating topic models [8, 18, 29], qualitative evaluation is necessary [23]. Different selections of k may result in slightly different though potentially equally plausible sets of topics with k influencing the specificity of the topics [20].

Topic models have been used in a variety of contexts including comparative philosophy [21], literary scholarship [11], cultural evolution [2], and in comparing Twitter data from different sources [19]. While Nichols et al. [21] and Morstatter et al. [19] both use topic models within a comparative context, their approaches differ from ours in key ways. In the case of Morstatter et al. [19], two document collections are compared by training a separate topic model for each collection and calculating the similarity between matched topics from each model. The two models reflect two distinct feature spaces. Their approach discovers whether a feature from one model has an analogous feature in the other model and, if so, how similar the two features are. Thus, their interest is in finding the extent to which two separate data sets produce similar features as a proxy for understanding how well a sample of data represents a much larger data set. While this is an appropriate method for the questions being asked in that study, having two distinct sets of features for the two subreddits we are comparing would complicate our ability to calculate how distinguishing a feature is in either of the subreddits. In other words, we are not interested in comparing the overall similarity of two distinct feature spaces, but rather the characteristics in a common set of features that are strong signals of one subreddit relative to the other.

The comparative approach taken by Nichols et al. [21] uses a single topic model to compare documents within a shared feature space, which the present study more closely mirrors. However, in that study, the authors compare three philosophical works by treating the ten highest probability topics within each of the three texts as sets. The texts are then compared based on the topics within the intersections of each set. By comparing the sets of each text’s top ten topics, useful information from the probabilities of these topics within a text is mostly discarded (outside of determining which topics should be included in a text’s topic set). Information theory provides a set of tools for making more rigorous comparisons between probability distributions, which we use as the basis of our quantitative comparisons.

Prior work exists which attempts to use LDA within the context of discourse analysis. LDA was combined with the theoretical framework of critical discourse analysis in order to examine how Muslims and Islam are discursively constructed within Swedish social media [27] as well as the discursive relationship between Islamophobia and anti-feminism [28]. In both studies, the topics produced by LDA were found to be useful in representing the discursive context under analysis. Brooks and McEnery [4] provide a less favorable view of LDA within discourse analysis, criticizing it on the grounds that it lacks linguistic theoretical grounding, and that the topics produced by LDA from their data were difficult to interpret from lists of high-probability words and often lacked thematic coherence across documents. In the present study, we did not find similar problems with our topic models after a combined analysis of the high-probability words for each topic along with manual readings of exemplar documents for each topic.

2.3 Information theory and measures of divergence
Because the features of interest from topic models are in the form of probability distributions, they lend themselves to the use of information theoretic measures for rigorously interrogating relationships between objects within the inferred topic space. In this study, we are specifically interested in the relationship between distributions of topics among two collections of documents, each representing a Reddit community. We follow the usage of the partial Kullback-Leibler divergence by Klingenstein et al. [16] who use the measure to identify features that were most salient for distinguishing between violent and non-violent trials in England over time. Here, we are interested in how well a given feature acts as a signal of one community over the other.

Other relevant uses of divergence measures include comparisons of hashtag usage between protestors and counterprotestors on Twitter [10] and comparisons of proceedings from natural language processing conferences [13]. Notably, Hall et al. [13] also use LDA to represent the documents being compared, but the method used for creating collection-level topic distributions amounts to calculating an average topic distribution to represent each collection. While this is a reasonable approach, it results in the loss of document-level topic interactions, which we preserve in this study.

3 METHODS
In the following sections we describe the data collected and the methods used to analyze them. After data collection, we trained topic models on the combined data from the two subreddits. We then constructed feature representations for the two document collections in two ways: first, by counting the dominant topic for each document within a collection and second, by counting combinations of topics for each document within a collection, using a threshold value to determine which topics to combine. Information theoretic measures of divergence were then used to identify the most distinguishing topics or combinations of topics between the two communities’ collection-level topic distributions. While these methods are used in the context of Reddit discussions, they are likely to be useful in any context in which collections of text are compared and where the size of these data sets is too large to feasibly make sense of them through manual reading alone.

3.1 Data
For each community, we collected all submission identifiers from the community’s date of creation up to December of 2019 using


Figure 1: Monthly submission frequency of r/China and r/Sino from June 2015 through November 2019. Month labels are
formatted as YYYY-MM.
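
The preprocessing steps described in Section 3.1 can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' code: the tokenizer (lowercase alphabetic runs), the Unicode ranges used to detect hanzi, and the names `tokenize` and `preprocess` are all assumptions, since the paper specifies only the filtering thresholds.

```python
import re
from collections import Counter

# Assumed hanzi ranges: CJK Unified Ideographs plus Extension A. The paper
# only states that Chinese characters were identified by Unicode value.
HANZI = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")
# Assumed tokenizer: runs of lowercase ASCII letters.
TOKEN = re.compile(r"[a-z]+")


def tokenize(text, min_len=3):
    """Remove hanzi, lowercase, and keep tokens of at least min_len characters."""
    text = HANZI.sub(" ", text.lower())
    return [tok for tok in TOKEN.findall(text) if len(tok) >= min_len]


def preprocess(raw_docs, max_doc_frac=0.25, min_doc_count=5, min_tokens=20):
    """Filter common and rare word types, then drop documents that are too short.

    Defaults follow the thresholds reported in the paper: words in over 25%
    of documents or in fewer than five documents are removed, and documents
    with fewer than 20 remaining tokens are discarded.
    """
    docs = [tokenize(text) for text in raw_docs]
    # Document frequency: number of documents in which each word type occurs.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    keep = {
        tok
        for tok, df in doc_freq.items()
        if df >= min_doc_count and df / n_docs <= max_doc_frac
    }
    filtered = [[tok for tok in doc if tok in keep] for doc in docs]
    return [doc for doc in filtered if len(doc) >= min_tokens]
```

No stemming is applied, matching the paper's stated choice. On a real corpus the resulting token lists would then be passed to an LDA implementation such as Gensim, which the authors report using.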

the service PushShift.io. We then used Reddit’s application programming interface (API) to collect the text of each submission along with all comments from the submission’s discussion thread. Submissions are available for r/China going back to January of 2008, while submissions are only available for r/Sino as early as June 2015, with only five available for r/Sino in its first two months of existence (Figure 1). Given the more recent creation of r/Sino, submissions considered when training topic models for either community were posted no earlier than August 2015. We consider a document to be the text of a submission and the comments from its discussion thread.

From these submissions, we performed basic preprocessing. After tokenizing, only tokens consisting of at least three characters were kept. Common words that occur in over 25% of all submissions were removed. Rare words that occur in fewer than five submissions were also removed. While both r/China and r/Sino are predominantly English-language communities, Chinese characters (hanzi) are sometimes used. All Chinese characters were identified based on their Unicode values and removed. We did not stem tokens, as this has been shown to have minimal or even negative effects on topic modeling [24]. Prior to preprocessing, there were 261,555 unique word types, which were reduced to 65,176 word types after processing. Documents were discarded if they contained fewer than 20 post-processing tokens, resulting in 97,619 total documents (down from 147,681 documents).

All of the data we collected are public and did not require IRB approval. Both r/China and r/Sino are publicly accessible and do not require a Reddit account to access. While these data are considered public, we avoid linking to specific submissions, direct quoting, and mentioning any user names in order to avoid bringing any unwanted attention to the individuals whose comments we analyze.

3.2 Collection-level topic distributions
In order to get a range of topic specificity, we trained LDA models with 30, 90, and 150 topics, which we refer to throughout the paper as models A, B, and C, respectively. With these models, each document can be represented as a distribution over 30, 90, or 150 topics, where each topic is a distribution over the vocabulary of 65,176 word types. When referring to a topic, we include the model name to distinguish between two different topics that happen to share the same topic number (e.g., A.10 and B.10 are two different topic features from models A and B respectively). We used the Gensim Python package for LDA model training [22].

We considered multiple methods for constructing topic distributions that reflect a collection of documents. While an LDA model provides topic distributions for each document, we would like to construct a topic distribution that reflects all documents within a collection. Existing methods for combining document-level topic distributions into a collection-level topic distribution include calculating an average topic distribution from the document-level topic distributions for all documents in a collection, such as in [13]. Another possible method would be to assign each word in a document to a topic based on the document’s topic distribution and then count these word-topic assignments within each document
in the collection to make a collection-level topic distribution, such as in [26]. This method also incorporates the length of each document in the collection-level topic distribution: longer documents will have greater influence over the resulting collection-level topic distribution.

Both of these methods for constructing collection-level topic distributions result in the loss of information about potential interdependencies between topics that are salient within the same document and thus provide important context. For example, a document that contains language primarily about both the United States and the Hong Kong protests is better represented by a combination of these two topics that is lost if we consider each of the two topics in isolation. Additionally, a feature may be prevalent in both collections, but may be combined with other features differently in the two collections.

The loss of potential topic relationships in the methods just discussed results from both methods being ways of calculating the frequency of each topic within a collection, whether by counting topic proportions in each document and then normalizing or by counting word-topic assignments and then normalizing. In order to capture topic interdependencies within documents, we propose expanding the feature space of topics to also include combinations of topics. By mapping each document’s topic distribution to a single categorical feature consisting of either an individual topic or a combination of topics, we can construct collection-level topic distributions that preserve topic relationships from the document level within the broader collection-level distribution. Below, we describe two kinds of collection-level topic distributions constructed from mapping each document’s topic distribution to a categorical representation: dominant topic distributions and topic tuple distributions.

3.2.1 Dominant topic distributions. As a baseline, we first calculated collection-level distributions by assigning each document to the topic with the highest probability in the document’s topic distribution. After assigning each document to a topic in this way, a community’s collection-level topic distribution can be formed from the relative frequencies of these document-topic assignments. We refer to collection-level topic distributions created in this way as dominant topic distributions.

Dominant topic distributions can be thought of as a special case of the topic tuple distributions described below where the threshold value is zero. This method treats each document as equally important regardless of length, in contrast to the word-topic assignment method discussed above. However, relationships between topics are necessarily lost in dominant topic distributions, since documents will always be assigned to an individual topic. Therefore, we use dominant topic distributions as a baseline with which to compare the results found using topic tuple distributions in order to see what, if anything, is gained from combining topics.

3.2.2 Topic tuple distributions. As we will see, interesting findings can be made from analyzing dominant topic distributions. However, dominant topic distributions necessarily obscure potentially interesting interdependencies between topics. In order to preserve potential interdependencies between multiple topics within a single document, we propose a method for mapping a document’s topic distribution to an ordered tuple of topics where the number of elements in the tuple is based on a threshold parameter and is therefore flexible. To the best of our knowledge, this is a novel method for representing documents as categorical topic features based on their topic distributions as inferred through LDA.

To construct a topic tuple distribution, we first define a threshold parameter, t, within a range from 0 to 1. For each document belonging to a community, that document’s topic tuple consists of the ordered topic indices that, when their corresponding proportions in the topic distribution are summed together, equal or exceed the specified threshold. A document’s topic tuple must have at least one element and only the minimum number of elements necessary to meet the threshold condition.

As an example, consider the following topic probabilities for some document with four topics: 0.01, 0.49, 0.41, and 0.09 corresponding to the proportions of topics T.0, T.1, T.2, and T.3, respectively, and which sum to one. If we define the threshold to be 0, then the document’s topic tuple only includes the dominant topic, T.1, which has the largest probability of 0.49. However, if we define the threshold to be 0.5, then topic T.1 is no longer sufficient to meet the threshold. Instead, the topic with the next highest proportion, T.2, must be combined with T.1 to form the topic tuple, (T.1, T.2). The summed probability of these two topics in the document is 0.9, which satisfies the threshold of 0.5.

This example illustrates the kind of interdependencies between topics that can be preserved using this method. Topics T.1 and T.2 are both similarly salient in the document (based on their similar proportions in the document’s topic distribution), which is reflected by the topic tuple containing both. If instead, the proportion of T.1 was 0.85 and the proportion of T.2 0.05, then only T.1 would be needed to meet the threshold of 0.5. In this case, T.1 is uniquely salient within the topic distribution and so additional topics are not needed in the document’s topic tuple to meet the threshold.

It should be noted that the same threshold will make higher demands when used on topic distributions with a larger number of topics. For example, a threshold of 0.5 may result in a topic tuple of 2 elements if k is 90, but may result in a topic tuple of 3 elements for the same document when represented in a distribution where k is 150. This is simply due to the probability mass having to be spread out over a larger number of elements in the case of 150 topics versus 90 topics. Additionally, increasing the threshold value may result in a larger number of features that constitute the collection-level topic distribution (see Table 1). Arbitrarily increasing the number of features in this way may have undesirable effects by decreasing the ability to meaningfully discriminate between two topic tuple distributions.

We constructed topic tuple distributions using topics from each of the three topic models described above for threshold values of 0.1, 0.3, 0.5, and 0.7. We limited our qualitative analysis to threshold values of 0.3 and 0.5 in order to avoid the potential problems of having too many features.

3.3 Levels of analysis
We compared collection-level topic distributions representing r/China and r/Sino at two different levels of analysis. First, we compared distributions reflecting all documents used in training
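
The topic tuple construction of Section 3.2.2 can be illustrated with a short Python sketch. It is a minimal reading of the description rather than the authors' implementation: the function names are hypothetical, and ordering the tuple by descending topic proportion is an assumption, since the paper says only that the tuple is ordered. The printed values reproduce the worked example from the text.

```python
from collections import Counter


def topic_tuple(topic_probs, threshold):
    """Smallest tuple of top-ranked topic indices whose summed probability
    equals or exceeds the threshold. Always contains at least one topic."""
    # Assumption: topics are ranked (and the tuple ordered) by descending proportion.
    ranked = sorted(range(len(topic_probs)), key=lambda i: topic_probs[i], reverse=True)
    chosen, total = [], 0.0
    for idx in ranked:
        chosen.append(idx)
        total += topic_probs[idx]
        if total >= threshold:
            break
    return tuple(chosen)


def tuple_distribution(doc_topic_probs, threshold):
    """Relative frequency of each topic tuple across a collection of documents."""
    counts = Counter(topic_tuple(probs, threshold) for probs in doc_topic_probs)
    n_docs = sum(counts.values())
    return {feature: count / n_docs for feature, count in counts.items()}


# Worked example from the text: proportions of topics T.0, T.1, T.2, T.3.
doc = [0.01, 0.49, 0.41, 0.09]
print(topic_tuple(doc, 0.0))                       # (1,)
print(topic_tuple(doc, 0.5))                       # (1, 2)
print(topic_tuple([0.01, 0.85, 0.05, 0.09], 0.5))  # (1,)
```

With a threshold of 0 the first function reduces to the dominant topic assignment of Section 3.2.1, so `tuple_distribution(docs, 0.0)` yields a dominant topic distribution. Because the comparison uses floating-point sums, a threshold that exactly equals a cumulative proportion may behave unpredictably at the boundary.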


our topic models, from August 2015 through November 2019. Second, we narrowed our focus to documents from 2019 that contain language about Hong Kong as a more focused case study of the different perspectives on the Hong Kong protests. These documents were selected after identifying six topics relating to Hong Kong: one from model A (A.11), two from model B (B.30 and B.44), and three from model C (C.6, C.70, and C.118). Any document that has one of these six topics as its most dominant or second-most dominant topic and that has a submission date between January and November of 2019 was included in this set.

For both levels of analysis, we filtered out all documents with dominant topic A.18, as they correspond to submissions that are dominated by one of several boilerplate moderation comments, typically due to the submission not following the community rules. This filtering resulted in 5,114 documents being discarded in the subsequent analysis.

Table 1: Number of features present in collection-level topic distributions.

Model      Dominant topic   Topic tuple (t=0.3)   Topic tuple (t=0.5)
A (k=30)   30               1,232                 10,515
B (k=90)   89               6,786                 32,231
C (k=150)  138              12,087                54,618

3.4 Feature comparisons
For each level of analysis, we calculated the Jensen-Shannon divergence between a given collection-level topic distribution representing r/China and a distribution over the same features representing r/Sino. In order to measure how strongly each feature of the distribution functions as a signal of a community, we calculated the partial Kullback-Leibler divergence, KL_i, for each element of the distribution, which reflects how much each feature individually contributes to the Jensen-Shannon divergence [16]. The Jensen-Shannon divergence can be formulated in the following manner as the symmetrized version of the asymmetric Kullback-Leibler divergence:

$$\mathrm{JSD}(p, q) = \frac{1}{2}\left[\mathrm{KLD}(p, m) + \mathrm{KLD}(q, m)\right] \tag{1}$$

where p and q are the distributions being compared and m = 1/2 (p + q). The Kullback-Leibler divergence with an expectation based on p is given by

$$\mathrm{KLD}(p, m) = \sum_i p_i \log_2 \frac{p_i}{m_i} \tag{2}$$

from which the partial Kullback-Leibler divergence for the ith feature in the distribution is simply

$$\mathrm{KL}_i(p, m) = p_i \log_2 \frac{p_i}{m_i} \tag{3}$$

The partial Kullback-Leibler divergence measures how strongly feature i acts as a signal of the expectation distribution (p as written in equation 3) [16]. Thus, knowing the partial Kullback-Leibler divergence for each feature with an expectation based on a distribution representing r/China tells us how conspicuous that feature is for r/China against a background based on a distribution representing r/Sino. We calculated the partial KL for each feature using r/China as the expectation distribution to rank features in order of relative salience in r/China and then did the same with r/Sino as the expectation distribution.

For each comparison, we then examined the ten most frequent features in each community as well as the ten most distinguishing features of each community (based on the partial KL values). We conducted these comparisons for each combination of LDA model, feature type, and level of analysis described above. In order to understand the significance of each feature in context, we manually read multiple documents that possess the feature. This is necessary since interpreting topic features based only on some number of highly probable terms in a topic can be problematic.

4 RESULTS
In this section, we describe our findings from the broader level of analysis followed by findings from documents discussing Hong Kong during 2019. At each level of analysis, we first examine dominant topic features followed by topic tuple features. We report both highly frequent and highly distinguishing features in the tables below. Highly frequent features are reported with their relative frequency within a community’s document collection (e.g., Table 2). While highly frequent features within a community provide a general characterization of that community’s discourse, the features which are most distinguishing reflect which aspects of one discourse are comparatively salient in that discourse relative to the other. The most distinguishing features of each community’s discourse are reported with their partial KL values, given in bits (e.g., Table 3).

4.1 Broad comparison of discourse
In order to get a broad sense of how the discursive constructions of China differ between the two communities, we first consider findings that arise when comparing documents from August 2015 through November 2019. In each of the three topic models, a topic emerges that is prevalent in both communities. These topics (A.24, B.31, and C.123) reflect a general rhetorical style that tends to be negative and critical, based on our manual reading of exemplar documents that feature these topics with high probability. These stylistic topics are the most frequently observed dominant topics in both communities at this level of analysis. Several other features exist within each of the topic models, so reporting them within each model would be redundant. We limit reporting results at this level of analysis to the features from model B (k=90), as they are interpretable, but not overly specific.

4.1.1 Results from dominant topic distributions. When examining the most frequent features in r/China from the dominant topic distribution, the features that appear most frequently (aside from the stylistics topic B.31) concern practical matters: asking questions and seeking advice (e.g., how to ship a package to China) (B.76), personal aspects of life in China as a foreign national (B.21), discussions about jobs and working in China (B.87), the use of VPNs for accessing websites (B.18), etc. In addition to these more practical topics, discussions about trade with the United States (B.75) occur as the fifth most frequent feature.
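The feature comparison of section 3.4 (equations 1 to 3) can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names, the dictionary representation of a distribution, and the toy values are all assumptions made here, and features missing from one distribution are treated as having probability zero.

```python
import math


def partial_kl(p, q):
    """Partial KL (in bits) of each feature, with p as the expectation distribution.

    p and q map feature labels to probabilities; m is their equal-weight mixture.
    """
    scores = {}
    for feature in set(p) | set(q):
        p_i = p.get(feature, 0.0)
        m_i = 0.5 * (p_i + q.get(feature, 0.0))
        # A feature with p_i = 0 contributes nothing (0 * log 0 is taken as 0).
        scores[feature] = p_i * math.log2(p_i / m_i) if p_i > 0 else 0.0
    return scores


def jsd(p, q):
    """Jensen-Shannon divergence (bits): mean of the two summed partial KLs."""
    return 0.5 * (sum(partial_kl(p, q).values()) + sum(partial_kl(q, p).values()))


def most_distinguishing(p, q, n=10):
    """Top-n features ranked by partial KL, strongest signal of p first."""
    ranked = sorted(partial_kl(p, q).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Toy distributions (invented for illustration, not taken from the paper).
community_a = {"x": 0.6, "y": 0.3, "z": 0.1}
community_b = {"x": 0.4, "y": 0.2, "w": 0.4}
print(most_distinguishing(community_a, community_b, n=3))
print(round(jsd(community_a, community_b), 4))
```

Swapping the argument order of `most_distinguishing` ranks the features that signal the second community instead, which mirrors the two-directional ranking described in section 3.4.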


Table 2: Most frequent dominant topic features.

r/China features  Proportion of collection  r/Sino features  Proportion of collection
B.31 Critical stylistics 0.612 B.31 Critical stylistics 0.458
B.76 Questions 0.064 B.24 International partnerships 0.072
B.21 Being a foreign national 0.041 B.75 Trade with the US 0.052
B.87 Jobs and working life 0.024 B.19 Military/engineering 0.045
B.75 Trade with the US 0.019 B.9 Political ideology 0.029
B.18 VPNs/website access 0.017 B.37 Technological growth 0.029
B.9 Political ideology 0.016 B.86 Economic standing 0.023
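The "proportion of collection" values in Table 2 are relative frequencies of a feature over a community's documents. A minimal sketch of that computation follows, assuming each document has already been reduced to a single hashable feature (a dominant topic label or a topic tuple). The function names and the five-document toy collection are hypothetical.

```python
from collections import Counter


def collection_distribution(doc_features):
    """Relative frequency of each feature, with every document counted once."""
    counts = Counter(doc_features)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from an empty collection")
    return {feature: n / total for feature, n in counts.items()}


def top_features(distribution, n=10):
    """The n most frequent features, highest proportion first."""
    return sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy collection of five documents (invented; one dominant topic label each).
docs = ["B.31", "B.31", "B.76", "B.31", "B.21"]
dist = collection_distribution(docs)
print(top_features(dist, n=2))
```

Because each document contributes exactly one count, long and short documents weigh the same, which is the property the dominant topic method is described as having.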

Table 3: Most distinguishing dominant topic features.

r/China features Partial KL (bits) r/Sino features Partial KL (bits)

B.31 Critical stylistics 0.1186 B.24 International partnerships 0.0611
B.76 Questions 0.0583 B.19 Military/engineering 0.0418
B.21 Being a foreign national 0.0329 B.75 Trade with the US 0.0283
B.87 Jobs and working life 0.0182 B.37 Technological growth 0.0195
B.18 VPNs/website access 0.0159 B.25 Financial reporting 0.0133
B.72 Comm. Applications 0.0088 B.9 Political ideology 0.0112
B.33 Purchasing products 0.0088 B.86 Economic standing 0.0109
B.64 Water sanitation 0.0066 B.54 Scientific research 0.0107
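As a rough consistency check, equation 3 can be applied by hand to the rounded proportions in Table 2. This is only a sketch: the inputs are rounded to three decimals, so small differences from the values reported in Table 3 (0.1186 for B.31 in r/China and 0.0283 for B.75 in r/Sino) are expected.

```latex
% B.31 with r/China as the expectation distribution: p_i = 0.612, q_i = 0.458
m_i = \tfrac{1}{2}(0.612 + 0.458) = 0.535
\mathrm{KL}_i(p, m) = 0.612 \log_2 \frac{0.612}{0.535} \approx 0.612 \times 0.1940 \approx 0.1187

% B.75 with r/Sino as the expectation distribution: p_i = 0.052, q_i = 0.019
m_i = \tfrac{1}{2}(0.052 + 0.019) = 0.0355
\mathrm{KL}_i(p, m) = 0.052 \log_2 \frac{0.052}{0.0355} \approx 0.052 \times 0.5507 \approx 0.0286
```

Both results agree with Table 3 to within the precision allowed by the rounded inputs.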

After the stylistics topic B.31, the most frequent dominant topics observed in r/Sino include discussions around China establishing partnerships with other countries (B.24), trade with the United States (B.75), military and engineering innovation in China (B.19), political ideology and systems (B.9), technological innovation and growth (B.37), China’s international economic standing (B.86), etc. An overview of these features for both communities can be seen in Table 2.

While we can see some interesting differences between the most frequent dominant topic features of the two communities, calculating the partial KL values of each feature provides a rigorous way of ranking the comparative salience of each feature. For r/China, the ordering of distinguishing dominant topics resembles its most frequent dominant topics with a few changes. After topics B.31, B.76, B.21, and B.87, the next most distinguishing topic for r/China is B.18 (VPNs), followed by several less frequent topics reflecting discussions about needing help with communication applications (most often, the messaging application, WeChat) (B.72), purchasing products (B.33), and water sanitation (B.64).

The most distinguishing dominant topic features for r/Sino similarly reflect several of its highest-frequency features with some changes. Topic B.24 (international partnerships) is the most distinguishing topic, followed by B.19 (military and engineering innovation), B.75 (trade relations with the United States), B.37 (technological innovation and growth), B.25 (financial reporting), B.9 (political ideology), B.86 (economic standing), and B.54 (scientific research). See Table 3 for an overview of the most distinguishing dominant topics.

Notably, the discourses that emerge from r/China over this broad period of time tend to reflect the experiences of individuals: their experiences as foreign nationals living in China (B.21), their working lives (B.87), and practical concerns (e.g., B.18, B.72, and B.33). In juxtaposition to this, the discourses that emerge from r/Sino focus on China at the state level with respect to its relationships and relative standings with other countries (B.24, B.75, and B.86) as well as its own internal growth, development, and power (B.19, B.37, B.25, and B.54).

4.1.2 Results from topic tuple distributions. While the analysis of dominant topic distributions has yielded interesting results, the most frequent feature in either community is the somewhat vague stylistic topic B.31, which accounts for over 60 percent of the r/China collection and over 45 percent of the r/Sino collection, and which is the strongest signal of r/China relative to r/Sino. When we compare collection-level distributions of topic tuples, we find that several interesting features emerge in which B.31 is dominant but interdependent with an additional topic.

Using a threshold value of 0.3, we find that the most frequent topic tuples in r/China involve the same topics seen in Table 2, but now in more contextually informative combinations, including (B.31, B.21) reflecting discourse that features critical stylistic elements in combination with life as a foreign national in China, (B.21, B.31) representing the same combination but with life as a foreign national given primacy, and (B.31, B.76) combining critical stylistics with questions and advice. An interesting picture begins to emerge from these co-salient topic tuples that is borne out when reading the source documents with these features: while r/China may often invoke critical language that is untethered from more specific discursive foci, the emergence of (B.31, B.21) and (B.21, B.31) as relatively frequent reflects the tendency of r/China users


Table 4: Most frequent topic tuple features (t=0.3).

r/China features  Proportion of collection  r/Sino features  Proportion of collection
(B.31) Critical stylistics 0.341 (B.31) Critical stylistics 0.219
(B.31, B.21) Stylistics & foreigner 0.033 (B.24) International partnerships 0.032
(B.76) Questions 0.023 (B.75) Trade with the US 0.025
(B.21, B.31) Foreigner & stylistics 0.017 (B.31, B.75) Stylistics & US trade 0.025
(B.31, B.76) Stylistics & questions 0.017 (B.19) Military/engineering 0.024
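Tuple features such as (B.31, B.21) in Table 4 come from mapping each document's topic distribution to an ordered tuple under a threshold t. The exact rule is not restated in this part of the paper, so the sketch below rests on an assumption: topics are added in descending order of probability until their cumulative probability reaches t. That reading is consistent with two statements in the text (a threshold of zero reduces to the dominant topic, and higher thresholds yield more distinct features), but it should be checked against the authors' full definition. The function name and toy distribution are hypothetical.

```python
def topic_tuple(theta, threshold):
    """Map a document's topic distribution to an ordered tuple of topics.

    ASSUMED rule (not confirmed by the source text): take topics in descending
    probability until the cumulative probability reaches the threshold.
    theta maps topic labels to probabilities.
    """
    ranked = sorted(theta.items(), key=lambda kv: kv[1], reverse=True)
    selected, mass = [], 0.0
    for topic, prob in ranked:
        selected.append(topic)
        mass += prob
        if mass >= threshold:
            break
    return tuple(selected)


# Toy document (invented probabilities, real topic labels used only as names).
theta = {"B.31": 0.29, "B.21": 0.27, "B.76": 0.24, "B.18": 0.20}
for t in (0.0, 0.3, 0.7):
    print(t, topic_tuple(theta, t))
```

Under this rule the order of the tuple carries information, so (B.31, B.21) and (B.21, B.31) are distinct features, as they are in Tables 4 and 5.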

Table 5: Most distinguishing topic tuple features (t=0.3).

r/China features Partial KL (bits) r/Sino features Partial KL (bits)

(B.31) Critical stylistics 0.0968 (B.24) International partnerships 0.0298
(B.31, B.21) Stylistics & foreigner 0.0259 (B.19) Military/engineering 0.0235
(B.76) Questions 0.0218 (B.75) Trade with the US 0.0146
(B.21, B.31) Foreigner & stylistics 0.0141 (B.31, B.88) Stylistics & Western hypocrisy 0.0128
(B.31, B.76) Stylistics & questions 0.0132 (B.31, B.75) Stylistics & US trade 0.0111

to discuss their lives in China as foreign nationals in ways that are often negative.

Likewise, we see several of the same features occur with high frequency in r/Sino at a threshold of 0.3 which we saw as dominant topics, but now including the topic tuple (B.31, B.75) representing the combination of the critical stylistics feature with discussions about trade with the United States. Many of these documents include discussions that heavily criticize the United States in relation to the so-called trade war between the two countries that began in 2018. See Table 4 for an overview of the five most frequent topic tuples for this threshold value.

For r/China, the same five most frequent features are also the five most distinguishing features relative to r/Sino. However, a few interesting changes are present within the five most distinguishing features of r/Sino relative to r/China. The first three most distinguishing features correspond to those described for the dominant topic features from Table 3. Additionally, we see the topic tuple (B.31, B.88), corresponding to a combination of the critical stylistics topic with B.88, which represents discourse about the “West” (typically used to refer to the United States), most often as accusations of hypocrisy (e.g., perceived double standards regarding the state’s treatment of Uyghurs in light of the United States’ treatment of those seeking refugee status there) and more general charges of propaganda and anti-Chinese bias in Western media. This feature demonstrates the usefulness of examining topic tuples in this manner: When only analyzing dominant topics, B.88 is obscured by the prevalence of highly critical language that often accompanies discussions of the West by r/Sino. By allowing for the possibility that more than one topic is needed to adequately represent a document, we can see the different ways in which r/Sino uses the critical stylistics topic, whether in criticism of perceived Western hypocrisy or in criticism of the China-US trade war. See Table 5 for an overview of the most distinguishing topic tuples from each community at a threshold of 0.3.

From analyzing the feature distributions that characterize the collection of documents from r/China and that of r/Sino over a period of over four years, we see that both communities employ a generally similar way of using language that involves being highly critical (topic B.31). From an analysis of the dominant topic distributions of each community, we see that (aside from B.31) r/China submissions are often concerned with the experiences of individuals, most often as foreign nationals navigating their lives in China. When we look more deeply into the topic relationships that may occur within documents (by constructing distributions of topic tuples rather than dominant topics), we find that the critical stylistics topic frequently pairs with these other topics, providing greater context for understanding the discourses.

4.2 Comparison of discourse concerning Hong Kong in 2019
When we compare discourse surrounding Hong Kong during 2019, we again see that the critical stylistics topic B.31 is the most frequently occurring feature in each community, both when analyzing dominant topics and topic tuples with a threshold of 0.3.

4.2.1 Results from dominant topic distributions concerning Hong Kong in 2019. The top five most frequent dominant topics from the two communities share much in common in terms of the features’ rankings (see Table 6). Notably, the topic B.35 may represent two kinds of language. On the one hand, B.35 appears in submissions which include a string of phrases intended to provoke censorship. These phrases typically include references to Tibet, the Tiananmen Square massacre of 1989, “democratization,” “independence,” and “freedom” among others. Among the r/Sino collection of documents concerning Hong Kong, the appearance of B.35 almost always indicates this usage within a submission title; such submissions are flagged as violating the subreddit rules (this usage is likely considered a form of trolling). However, the appearance of B.35 within r/China may also include language that shares some words


Table 6: Most frequent dominant topic features concerning Hong Kong in 2019.

r/China features  Proportion of collection  r/Sino features  Proportion of collection
B.31 Critical stylistics 0.649 B.31 Critical stylistics 0.677
B.44 Protest violence 0.109 B.44 Protest violence 0.142
B.30 Protest politics 0.102 B.30 Protest politics 0.052
B.35 Sensitive phrases 0.054 B.35 Sensitive phrases 0.041
B.75 Trade with the US 0.011 B.88 Western hypocrisy 0.012
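The Hong Kong subset summarized in Table 6 follows the selection rule of section 3.3: a document qualifies if one of the six Hong Kong topics is its most dominant or second-most dominant topic and it was submitted between January and November 2019, after documents dominated by the moderation topic A.18 are dropped. The sketch below is one way to express that rule. The data layout (a submission date plus one topic distribution per model) and all function names are assumptions made for illustration.

```python
from datetime import date

# Topic labels taken from section 3.3 of the paper.
HK_TOPICS = {"A.11", "B.30", "B.44", "C.6", "C.70", "C.118"}
MODERATION_TOPIC = "A.18"


def top_two(theta):
    """Labels of the two most probable topics, most probable first."""
    ranked = sorted(theta.items(), key=lambda kv: kv[1], reverse=True)
    return [topic for topic, _ in ranked[:2]]


def in_hong_kong_subset(submitted, thetas):
    """Apply the subset rule to one document.

    submitted: datetime.date of the submission.
    thetas: list of topic distributions, one per model (hypothetical layout).
    """
    if not (date(2019, 1, 1) <= submitted <= date(2019, 11, 30)):
        return False
    tops = [top_two(theta) for theta in thetas]
    # Drop boilerplate moderation submissions (dominant topic A.18).
    if any(t and t[0] == MODERATION_TOPIC for t in tops):
        return False
    return any(topic in HK_TOPICS for t in tops for topic in t)


# Toy document (invented probabilities).
doc_thetas = [{"B.31": 0.5, "B.44": 0.3, "B.9": 0.2}]
print(in_hong_kong_subset(date(2019, 8, 15), doc_thetas))
```

Checking every model's top two topics means a document can qualify through model C even when no Hong Kong topic ranks highly in model B.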

Table 7: Most distinguishing dominant topics concerning Hong Kong in 2019.

r/China features Partial KL (bits) r/Sino features Partial KL (bits)

B.30 Protest politics 0.0411 B.44 Protest violence 0.0257
B.35 Sensitive phrases 0.0106 B.31 Critical stylistics 0.0206
B.76 Questions 0.0054 B.65 Automatic summary bot 0.0098
B.83 CPC policy 0.0033 B.88 Western hypocrisy 0.0091
B.49 Subreddit rules 0.0023 B.45 Media criticism 0.0027

in common with the trolling usage just described. Submissions that feature actual discussions invoking Tiananmen Square, Tibet, or democracy may also have this dominant topic.

Examining which dominant topics most distinguish each community yields interesting differences (see Table 7). Topic B.30 is highly conspicuous in r/China and represents general discussions around the protests, typically framed as political tensions between mainland China and Hong Kong. The most distinguishing dominant topic of r/Sino, B.44, also reflects language about the Hong Kong protests, but more specifically concerns violence occurring during protests. Such submissions from r/Sino tend to focus on violence alleged to have been committed by the protestors (though in r/China this topic may reflect violence carried out against protestors by police in addition to violence committed by protestors). This marks an interesting change in the discursive strategies we previously described for r/Sino: While r/Sino broadly tends to focus on states as actors, rather than individuals, as described in section 4.1 above, in discourse around the Hong Kong protests, the community emphasizes the negative actions of individuals.

The other dominant topics that distinguish r/China at this level of analysis include the previously described sensitive phrases topic (B.35), question asking (B.76), language about reports or announcements from the CPC (B.83), and the enforcing of subreddit rules (B.49). The critical stylistics topic is more conspicuous within r/Sino at this level of analysis whereas, at the broad level of analysis described in section 4.1, this topic served as a stronger signal of r/China. While B.65 reflects submissions that include automatically constructed summaries by a self-declared bot account, we also see the appearance of B.88, denoting accusations of Western hypocrisy, and B.45, criticizing media outlets for reporting alleged falsehoods.

4.2.2 Results from topic tuple distributions concerning Hong Kong in 2019. When we analyze the topic tuples representing each community’s collection of documents, we again see that the critical stylistics topic is often co-salient with other relevant features, which are obscured when only considering the dominant topic of each document. Here, those features include language about the protests in relation to their political underpinnings (B.30) and to protest-related violence (B.44). See Table 8 for an overview of the most frequent topic tuples.

Notably, language about protest-related violence does not occur within any of the five most distinguishing topic tuples for r/China. However, three of the five most distinguishing features for r/Sino feature language about violence, almost always as carried out by protestors. Instead, r/China’s distinguishing features concerning Hong Kong deal more with the underlying politics, both as reflected by B.30 and some of the discussions related to B.35 that sometimes invoke language about democracy. See Table 9 for an overview of the most distinguishing topic tuples for each community at a threshold of 0.3.

Interestingly, the topic tuple (B.31, B.88) appears as the third most distinguishing feature for r/Sino, despite these documents being required to have a Hong Kong-related topic as their first or second most probable topic. This is a case where using more specific topics can be helpful, as these documents have a Hong Kong-related topic in the 150-topic model as the first or second most dominant topic that does not appear in the 90-topic model.

If we increase the threshold to 0.5, we do see that there is a connection between r/Sino’s usage of critical stylistics (B.31), charges of Western hypocrisy (B.88), and Hong Kong-related topics within the 90-topic model. See Table 10 for the distinguishing features of each community when examining topic tuples with a threshold of 0.5. At this threshold, we see that distinguishing discussions on r/Sino often combine discussions of the protests with charges of Western hypocrisy (B.88). Importantly, the connection that r/Sino forges between the Hong Kong protests and Western hypocrisy becomes clear when topic tuples are examined. These results suggest two dominant discursive strategies employed by users of r/Sino when discussing the protests: to foreground alleged violence committed by protestors and to shift discursive focus onto the hypocrisy of the West.


Table 8: Most frequent topic tuples (t=0.3) concerning Hong Kong in 2019.

r/China features  Proportion of collection  r/Sino features  Proportion of collection
(B.31) Critical stylistics 0.401 (B.31) Critical stylistics 0.393
(B.31, B.30) Stylistics & protests 0.087 (B.31, B.44) Stylistics & violence 0.115
(B.31, B.44) Stylistics & violence 0.073 (B.31, B.30) Stylistics & protests 0.080
(B.44) Protest violence 0.047 (B.44) Protest violence 0.066
(B.35) Sensitive phrases 0.037 (B.44, B.31) Violence & stylistics 0.048

Table 9: Most distinguishing topic tuples (t=0.3) concerning Hong Kong in 2019.

r/China features Partial KL (bits) r/Sino features Partial KL (bits)

(B.31, B.35) Stylistics & sensitive phrases 0.0116 (B.31, B.44) Stylistics & violence 0.0331
(B.30) Protest politics 0.0112 (B.44) Protest violence 0.0148
(B.30, B.31) Protests & stylistics 0.0092 (B.31, B.88) Stylistics & Western hypocrisy 0.0109
(B.35, B.31) Sensitive phrases & stylistics 0.0073 (B.44, B.31) Violence & stylistics 0.0104
(B.31) Critical stylistics 0.0056 (B.65, B.31) Summary bot & stylistics 0.0025

Table 10: Most distinguishing topic tuples (t=0.5) concerning Hong Kong in 2019.

r/China features Partial KL (bits) r/Sino features Partial KL (bits)

(B.31, B.30) Protests & stylistics 0.0248 (B.31, B.44) Stylistics & violence 0.0238
(B.31) Critical stylistics 0.0151 (B.31, B.44, B.88) Stylistics & violence & Western hypocrisy 0.0226
(B.31, B.30, B.35) Stylistics & protests & sensitive phrases 0.0111 (B.44, B.31) Violence & stylistics 0.0179
(B.31, B.44, B.35) Stylistics & violence & sensitive phrases 0.0099 (B.31, B.88) Stylistics & Western hypocrisy 0.0099
(B.31, B.35) Stylistics & sensitive phrases 0.0094 (B.31, B.88, B.30) Stylistics & Western hypocrisy & protests 0.0093

5 DISCUSSION
The results described in section 4 reflect the discursive tendencies that are both prevalent in r/China and r/Sino and that best differentiate the two communities. The most frequently observed features in the two communities tend to overlap. By calculating which features serve as the strongest signals of one community relative to the other, we can see beyond the features they have in common and identify the frames and discursive strategies that are conspicuous in one community in light of the other.

At the broad level of analysis that includes documents from August 2015 through November 2019, we see that r/China is most distinguished from r/Sino by its focus on individual concerns and experiences, often on the part of foreign nationals working or studying in China. These more practical, mundane, and individual-focused discourses exist in contrast to the discussions on r/Sino that distinguish it from r/China. The primary actors in r/Sino discussions are not the individual users, but rather, states. Rather than describing life in China, r/Sino describes China in terms of accomplishments: international partnerships, economic standing, technological innovation, etc. The United States also appears as an important character in r/Sino discourse, serving as a foil to China.

The discursive tendencies of the two communities concerning the Hong Kong protests during 2019 show that, while r/China is most distinguished by discussions of the political underpinnings of the protests, r/Sino is most distinguished by its focus on violence committed by protestors. Here we see an interesting reversal from the discursive foci that distinguished the communities more broadly. The actions of individuals are here salient in r/Sino, while political tensions between Hong Kong and the rest of China are salient in r/China. In r/Sino, the West (typically the US) continues to be an important part of distinguishing discourse wherein discussions about the Hong Kong protests are often nested within discourse about the hypocrisy of the West. In other words, when r/Sino discusses the Hong Kong protests, its users often end up discussing the West, pointing a finger back to perceived critics. This forms an important discursive strategy of r/Sino alongside the focus on violence committed by protestors: the flaws of the US and the Hong Kong protests are emphasized, often through concrete language about the experiences of individuals (e.g., refugee-seekers in the US and innocent victims of protestor violence in Hong Kong), while such concrete language is less prevalent when discussions are about China, which


is discussed at a more abstract, and therefore idealized, level. Aspects of this discursive strategy can also be seen in r/China through the focus on concrete, negative experiences of individuals living in China, while discussing the Hong Kong protests primarily in terms of more abstract, idealized entities. These discursive strategies echo the “Fallacy of the Misguided Comparison” as described by Hall and Ames [12] within the context of cross-cultural communication between the West and China. The authors describe this fallacy as the comparison of “the ideals of one society or culture with the practices of another” [12]. The implications of these findings are that popular conceptions of China from Reddit are likely to reflect such misguided comparisons, by either privileging the ideals of China (as in r/Sino) or its flawed realities (as in r/China), leaving a gap where more even-handed cross-cultural understanding between the West and China might exist.

Many of these findings come into clearer view when analyzing topic tuples as document features rather than single dominant topics. This method permits us to see topic combinations that provide important context. For example, it is not just the case that r/Sino uses critical language stylistics, but rather, it pairs critical language stylistics with features like protest violence and Western hypocrisy, whereas r/China uses the same topic in combination with describing experiences as foreign nationals.

While this analysis has yielded interesting insights, there are limitations present in the current study. Our analyses focus on the frequency of certain features at the document level, treating each document equally. However, various kinds of metadata are available from Reddit that are likely to be of interest when combined with these features. One kind of potentially interesting metadata is a document’s score, which is derived from the number of positive and negative votes it received (known as upvotes and downvotes, respectively). Correlating document scores with discursive features might provide additional information on which features are not only frequent, but broadly endorsed by the community. Additionally, our focus has been on two important China-related subreddits, but there are other communities whose analysis would contribute to a larger understanding of the various China-related discourses that are active within the English-speaking world of Reddit, but with the caveats we noted in section 1.

6 CONCLUSIONS
The subreddits r/China and r/Sino represent two popular and distinct sets of English-language discursive constructions of China. For a number of reasons, understanding popular modes of discourse around China is important owing to China’s international importance more broadly. Using latent word-usage patterns underlying discussions from both communities, we have examined the word-usage patterns that are most frequent in each community and that most distinguish them against a backdrop informed by the other. We find that r/China is broadly distinguished by a focus on the (often negative) experiences of individuals, whereas r/Sino is broadly distinguished by a focus on states. When we focus our analysis on discussions related to Hong Kong during 2019, we find that r/China is distinguished by discussions of the political underpinnings of
focus on violence committed by protestors (a reversal from the lack of focus on individuals more broadly) and by the tendency for discussions about the protests to foreground accusations of Western hypocrisy. These findings contribute to a broader understanding of the popular Western perspectives surrounding China.

ACKNOWLEDGMENTS
This research is funded in part by the U.S. National Science Foundation (OIA-1920920, IIS-1636933, ACI-1429160, and IIS-1110868), U.S. Office of Naval Research (N00014-10-1-0091, N00014-14-1-0489, N00014-15-P-1187, N00014-16-1-2016, N00014-16-1-2412, N00014-17-1-2605, N00014-17-1-2675, N00014-19-1-2336), U.S. Air Force Research Lab, U.S. Army Research Office (W911NF-16-1-0189), U.S. Defense Advanced Research Projects Agency (W31P4Q-17-C-0059), Arkansas Research Alliance, the Jerry L. Maulden/Entergy Endowment at the University of Arkansas at Little Rock, and the Australian Department of Defense Strategic Policy Grants Program (SPGP) (award number: 2020-106-094). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations. The researchers gratefully acknowledge the support.

REFERENCES
[1] Maria Antoniak, David Mimno, and Karen Levy. 2019. Narrative paths and negotiation of power in birth stories. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 88 (November 2019), 27 pages. DOI: https://doi.org/10.1145/3359190
[2] Alexander T. J. Barron, Jenny Huang, Rebecca L. Spang, and Simon DeDeo. 2018. Individuals, institutions, and innovation in the debates of the French Revolution. PNAS 115, 18 (May 2018), 4607–4612. DOI: https://doi.org/10.1073/pnas.1717729115
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
[4] Gavin Brooks and Tony McEnery. 2019. The utility of topic modelling for discourse studies: A critical evaluation. Discourse Studies 21, 1 (Feb. 2019), 3–21. DOI: https://doi.org/10.1177/1461445618814032
[5] Cody Buntain and Jennifer Golbeck. 2014. Identifying social roles in reddit using network structure. In Proceedings of the 23rd International Conference on World Wide Web (WWW ’14 Companion). Association for Computing Machinery, New York, NY, USA, 615–620. DOI: https://doi.org/10.1145/2567948.2579231
[6] Eshwar Chandrasekharan, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. 2017. You Can’t Stay Here: The Efficacy of Reddit’s 2015 Ban Examined Through Hate Speech. Proc. ACM Hum.-Comput. Interact. 1, CSCW, Article 31 (December 2017), 22 pages. DOI: https://doi.org/10.1145/3134666
[7] Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proc. ACM Hum.-Comput. Interact. 2, CSCW, Article 32 (November 2018), 25 pages. DOI: https://doi.org/10.1145/3274301
[8] Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems (NIPS ’09). Vancouver, 288–296.
[9] Paul DiMaggio, Manish Nag, and David Blei. 2013. Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics 41, 6, 570–606. DOI: https://doi.org/10.1016/j.poetic.2013.08.004
[10] Ryan Gallagher, Andrew J. Reagan, Christopher M. Danforth, and Peter Sheridan Dodds. 2018. Divergent discourse between protests and counter-protests: #BlackLivesMatter and #AllLivesMatter. PLoS ONE 13, 4 (Apr. 2018). DOI: https://doi.org/10.1371/journal.pone.0195644
[11] Andrew Goldstone and Ted Underwood. 2012. What can topic models teach us about the history of literary scholarship? Journal of Digital Humanities 2, 1 (Dec. 2012).
[12] David L. Hall and Roger T. Ames. 1999. The Democracy of the Dead: Dewey, Confucius, and the Hope for Democracy in China (1st. ed.). Open Court, Chicago and Lasalle, IL.
[13] David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the
the protests deriving from tensions between China and Hong Kong history of ideas using topic models. In Proceedings of the Conference on Em-
as abstract primary characters, while r/Sino is distinguished by a pirical Methods in Natural Language Processing (EMNLP ’08). Association for

SMSociety ’20, July 22–24, 2020, Toronto, ON, Canada Zachary Stine and Nitin Agarwal

Computational Linguistics, USA, 363–371.
[14] William L. Hamilton, Justine Zhang, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky, and Jure Leskovec. 2017. Loyalty in online communities. In Proceedings of the 11th International AAAI Conference on Web and Social Media (ICWSM ’17). AAAI Press, 540–543. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/15710/14848
[15] Marianne Jørgensen and Louise J. Phillips. 2002. Discourse Analysis as Theory and Method (1st. ed.). Sage, London.
[16] Sara Klingenstein, Tim Hitchcock, and Simon DeDeo. 2014. The civilizing process in London’s Old Bailey. PNAS 111, 26 (Jul. 2014), 9419–9424. DOI: https://doi.org/10.1073/pnas.1405984111
[17] Humphrey Mensah, Lu Xiao, and Sucheta Soundarajan. 2019. Characterizing susceptible users on Reddit’s ChangeMyView. In Proceedings of the 10th International Conference on Social Media and Society (SMSociety ’19). Association for Computing Machinery, New York, NY, USA, 102–107. https://doi.org/10.1145/3328529.3328550
[18] David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11). Association for Computational Linguistics, USA, 262–272.
[19] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M. Carley. 2013. Is the sample good enough? Comparing data from Twitter’s streaming API with Twitter’s firehose. In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media. AAAI Press, Cambridge, MA, 400–408.
[20] Dong Nguyen, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, and Jane Winters. 2019. How we do things with words: Analyzing text as social and cultural data. arXiv:1907.01468. Retrieved from https://arxiv.org/abs/1907.01468
[21] Ryan Nichols, Edward Slingerland, Kristoffer Nielbo, Uffe Bergeton, Carson Logan, and Scott Kleinman. 2018. Modeling the contested relationship between Analects, Mencius, and Xunzi: Preliminary evidence from a machine-learning approach. The Journal of Asian Studies 77, 1 (Feb. 2018), 19–57. DOI: https://doi.org/10.1017/S0021911817000973
[22] Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Malta, 45–50.
[23] Margaret E. Roberts, Brandon M. Stewart, and Dustin Tingley. 2016. Navigating the Local Modes of Big Data: The Case of Topic Models. In Computational Social Science: Discovery and Prediction. Cambridge University Press, New York, NY, 51–97.
[24] Alexandra Schofield and David Mimno. 2016. Comparing apples to apple: The effects of stemmers on topic models. In Transactions of the Association for Computational Linguistics 4, 287–300. DOI: https://doi.org/10.1162/tacl_a_00099
[25] Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International Conference on World Wide Web (WWW ’16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 613–624. https://doi.org/10.1145/2872427.2883081
[26] Laure Thompson and David Mimno. 2018. Authorless topic models: Biasing models away from known structure. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, USA, 3903–3914.
[27] Anton Törnberg and Petter Törnberg. 2016. Muslims in social media discourse: Combining topic modeling and critical discourse analysis. Discourse, Context and Media 13, 132–142. DOI: https://doi.org/10.1016/j.dcm.2016.04.003
[28] Anton Törnberg and Petter Törnberg. 2016. Combining CDA and topic modeling: Analyzing discursive connections between Islamophobia and anti-feminism on an online forum. Discourse & Society 27, 4, 401–422. DOI: https://doi.org/10.1177/0957926516634546
[29] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09). Association for Computing Machinery, New York, NY, USA, 1105–1112. DOI: https://doi.org/10.1145/1553374.1553515

