Social Media Narratives
A Dissertation
Doctor of Philosophy
by
Nicholas Botzer
Abstract
by
Nicholas Botzer
CONTENTS
Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Moral Judgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Conversational Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Intent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Conversation Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Conversation Traversals . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Spreading Activation . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
FIGURES
2.1 Example of (a) a post title and (b) a comment in the /r/AmItheAsshole
subreddit. The NTA prefix and comment-score (not shown) indicates
that the commenter judged the poster as “Not the Asshole”. . . . . . 10
2.2 A post (in blue) made by a user along with the top response comment
(white). The comment is then fed to our Judge-BERT classifier (green)
to determine the moral valence of the post. . . . . . . . . . . . . . . . 20
2.3 A screenshot of the human annotation system. . . . . . . . . . . . . . 21
2.4 Plots showing the prediction of Judge-BERT versus the annotation
agreement by humans for both positive and negative classes. X-axis
displays the annotator agreement against the prediction made by Judge-
BERT. 5/0 represents all annotators agreeing with the model's prediction,
while 0/5 shows all annotators agreeing on the opposite class,
meaning a wrong prediction by the model. . . . . . . . . . . . . . . . 22
2.5 Judge-BERT analysis at various levels of annotator agreement. . . . . 23
2.6 Posts judged to have positive valence as a function of post score.
Higher indicates more positive valence. Higher post scores are as-
sociated with more positive valence (Mann-Whitney τ ∈ [0.40, 0.47],
p < 0.001 two-tailed, Bonferroni corrected) . . . . . . . . . . . . . . . 25
2.7 Lorenz curve depicting the judgement inequality among users; Gini
coefficient = 0.515 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Number of comments (normalized) as the negativity threshold is
raised. As the negativity threshold is raised the fraction of comments
revealed tends towards 1. Higher lines indicate a higher concentration
of negative users and vice versa. . . . . . . . . . . . . . . . . . . . . . 29
2.9 A diagram showing the posting habits of a Returner. Posts are in the
light blue boxes, with blue arrows representing the order of posts. An
example of a post response is shown in the white box with the red
arrow representing the post it came from. Each post is prefaced with
the overarching title, “Me and my partner are having a baby.” followed
by the current update on the situation. The response comments have
also been condensed from their full length. . . . . . . . . . . . . . . . 31
2.10 Example of a post title from /r/relationships. Subreddit rules require
the poster to indicate their age and gender, as well as any other
individual's gender and age. . . . . . . . . . . . . . . . . . . . . . . 34
3.8 Entity graph example of spreading activation on /r/news when Barack Obama
is selected as the starting entity. The x-axis represents the (threaded)
depth at which each entity was mentioned within conversations rooted
at Barack Obama. The y-axis represents the semantic space of each
entity, i.e., similar entities are closer than dissimilar entities on the y-
axis. Node colors represent equivalent entity sets. In this example, we
observe that conversations starting from Barack Obama tend to center
around the United States, political figures such as Donald Trump, and
discussion around whether his religion is Islam. . . . . . . . . . . . . . 56
3.9 Illustration of an entity graph created from threaded conversations
from /r/news (red-edges) and /r/worldnews (blue-edges). The x-axis
represents the (threaded) depth at which each entity set was mentioned
within conversations rooted at White House. The y-axis represents the
semantic space of each entity, i.e., similar entities are closer than dis-
similar entity sets on the y-axis. Node colors represent equivalent en-
tity sets. Conversations in /r/news tend to coalesce to United States,
while conversations in /r/worldnews tend to scatter into various other
countries (unlabeled black nodes connected by thin blue lines) . . . . 61
3.10 Comparison between the first 6 months of /r/Coronavirus from 2020
to 2021. Illustration of an entity graph created from threaded conver-
sations from /r/Coronavirus in Jan–June of 2020 (red-edges) and from
Jan–June of 2021 (blue-edges). The x-axis represents the (threaded)
depth at which each entity set was mentioned within conversations
rooted at United States. The y-axis represents the semantic space of
each entity set, i.e., similar entity sets are closer than dissimilar entity
sets on the y-axis. Node colors represent equivalent entity sets. Con-
versations tended to focus on China and Italy early in the pandemic,
but turned towards a broader topic space later in the pandemic. . . . 63
4.1 Example of pseudo label selection when using a threshold (top) ver-
sus the top-k sampling strategy (bottom). In this toy scenario, we
chose k = 2, where each class is represented by a unique shape. As
the threshold selection strategy pseudo-labels data elements (shown as
yellow) that exceed the confidence level, the model tends to become
biased towards classes that are easier to predict. This bias causes a
cascade of mis-labels that leads to even more bias towards the majority
class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 TK-KNN overview. The model is (1) trained on the small portion of
labeled data. Then, this model is used to predict (2) pseudo labels
on the unlabeled data. Then the cosine similarity (3) is calculated for
each unlabeled data point with respect to the labeled data points in
each class. Yellow shapes represent unlabeled data and green represent
labeled data. Similarities are computed and unlabeled examples are
ranked (4) based on a combination of their predicted probabilities and
cosine similarities. Then, the top-k (k = 2) examples are selected (5)
for each class. These examples are finally added (6) to the labeled
dataset to continue the iterative learning process. . . . . . . . . . . . 73
4.3 Convergence analysis of pseudo-labelling strategies on CLINC150 at
1% labeled data. TK-KNN clearly outperforms the other pseudo-
labelling strategies by balancing class pseudo labels after each training
cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Ablation results for each dataset using 1% labeled data. . . . . . . . 83
4.5 A comparison of TK-KNN on HWU64 with 1% labeled data as β varies. 85
4.6 A comparison of TK-KNN on HWU64 with 1% labeled data as k varies. 86
TABLES
ACKNOWLEDGMENTS
Throughout graduate school I have had a plethora of people that have supported
me with my work and outside of it as well. I would first like to thank my advisor,
Tim Weninger, for tolerating me and guiding me throughout the process. I could not
have asked for a better mentor, one who gave me the flexibility to work on the problems
I found interesting. I would also like to thank my committee members Meng Jiang,
Kevin Bowyer, and Yonatan Bisk. They all provided fantastic feedback and advice
on my proposal and were great members of my committee. While I had numerous
people guide me through my research I would also like to acknowledge the people that
helped me outside of the lab. I want to thank my wife Xiaoqing, whom I was lucky
enough to meet during my PhD. You have supported me since we first met and have
made my life better since coming into it. To Joey Heyl, for helping me out anytime
I needed it with literally anything. I couldn’t ask for a more dependable friend and
you became an amazing gym partner for me when I moved back home. To all of
my family, for supporting and encouraging me the entire way through. Having soup
nights with you all throughout gave me a chance to relax and enjoy our time together.
All of you were always happy when I was making progress and there for me whenever
I would struggle or be stressed. To Zach Petrusch, I couldn’t have asked for a more
compassionate friend. You’ve always picked up the phone and talked to me whenever
I’m going through tough situations and brought so much fun to our D&D group. To
Scott Mulvihill, you've been my friend for so long and I can always count on you
to invite me out. Every time I come home I always know I'll be able to hang out with
you and get out and enjoy the world. To Scott Ramey and Matt Wagenhoffer, for
being the best remote friends I could have had. The two of you helped keep me sane
by playing games with me throughout my entire time in grad school but especially
during Covid. Finally, I want to thank the rest of the Big Dawgs in House Alan. I
had conversations with all of you and received nothing but support and compassion
for my graduate school work. Without so many people supporting me I would not
have been able to succeed in graduate school and I am extremely thankful to have
all of you in my life.
CHAPTER 1
INTRODUCTION
Social media has become an integral part of modern society, providing a platform
for individuals to express their opinions and engage with others. However, the prolif-
eration of digital communication has turned online spaces into breeding grounds for
polarizing conversations [136], personal attacks [143], and moral outrage [37]. There-
fore, it is essential to understand how people engage with one another in these spaces
as they often create overarching narratives based on different groups. The way that
people share their experiences on social media to create these narratives holds impor-
tant implications for various fields such as psychology, sociology, and communication
studies.
What constitutes a narrative, especially on social media, is poorly defined, but
previous works generally accept it as the passage of time or process [67, 130], often
referred to as a “change of state” [90]. But these works are often focused on un-
derstanding narratives from stories or poems, not those that are shaped in everyday
life on social media. The narratives that come from these stories are often linear in
nature, with a clear beginning, middle, and end. This naturally arises as the narrative is being told from the
perspective of one individual. In contrast, social media narratives are very fractured
and non-linear in how they develop. With input and opinions coming from a wide
variety of people the trajectory of the narrative can quickly shift and be hard to
follow. Because so many different perspectives contribute to social media narratives
a variety of different techniques are necessary to understand them.
A significant challenge in understanding these evolving narratives is the massive
volume of data being generated on social media. This information overload [52] makes
it difficult for individuals to sift through sources and comprehend the complex dy-
namics at play, often leading to confusion regarding the messages they receive [113].
Social media platforms also utilize recommendation systems to present users with
relevant information about current events and their interests. While these systems
are considered useful, particularly from a business standpoint, they frequently create
feedback loops and echo chambers on social media [73]. These feedback loops create
communities that often support different messaging and hold varying narratives
related to different topics or current events. With all these dynamics at play, automated
methods to aggregate and understand social media narratives have become increasingly
important.
The resurgence of deep learning and more powerful models has made new meth-
ods for exploring and understanding these trends available to researchers. Natural
language processing, in particular, has seen an explosion of use cases in recent years
due to the transformer architecture [138] and pre-trained large language models [41].
These pre-trained models can be fine-tuned and modified for a wide variety of prob-
lem domains and achieve good performance on these tasks. This has allowed for
larger scale and broader analysis to be conducted on social media.
In this dissertation, I explore user narratives from three distinct perspectives: an-
alyzing moral judgements, conversational flow, and user intent. These three methods
help shed light on users' values, beliefs, and intentions on social media. Each approach
offers unique insights into how narratives shift over time and influence people's atti-
tudes and beliefs.
1.1 Moral Judgement
As long as society has existed, cultures have formed their own moral norms [64]
that are accepted and followed. These moral norms help ensure that a group of
people can cooperate together to ultimately help the group survive [15]. When an
individual within the group breaks one of these norms they will typically be punished
by others within the group. Other people coming out against a person’s actions
helps to inform the perpetrator, along with bystanders, that a particular action is
unacceptable. These moments help to define the overarching moral beliefs that are
inherent to any given society.
With the advent of the internet, moral judgements and norms have undergone
some interesting developments. The use of social media has opened up two main
avenues to alter moral norms: exposure to other cultures, and a larger global com-
munity to cast judgements. This is the first time in history that so many people
have been able to share a variety of cultures and viewpoints so easily; thus enabling
others to examine and judge these differences. The increased sharing of experiences
often leads to heated debates on social media covering a wide variety of topics [147].
On social media, taking a moral stance has quickly become a predominant method to
attack other individuals or support one's own reasoning [16], as it often incites anger
in people. These moral stances are often used to form a narrative on social media
surrounding real world events. They are often calls to action based on a perceived
wrongdoing in the world.
Recent findings emphasize that positive feedback on expressions of moral outrage acts
as a reinforcement mechanism [18]. This feedback reinforces the original user's stance
and increases the likelihood of further expressions of moral outrage. This often causes more moral
messaging that is used to construct narratives surrounding a given issue or topic.
To study morality on social media, researchers have often turned towards moral
foundations theory (MFT) [61] to guide their methods. MFT utilizes a lexicon of
words to classify text along five different axes of morality. While a generic dictionary
exists for MFT [117], more recent targeted dictionaries and corpora have been created
for specific social media platforms such as Twitter [68] and Reddit [135]. While
this method has led to a number of findings, others have found it falls short when
considering shifting social media topics and may not accurately capture realistic moral
dynamics. This makes the MFT difficult to apply broadly to social media to capture
the moral messaging that is being used to build a narrative. More recent works look
to leverage large pre-trained models and a variety of psycholinguistic features, often
including MFT, to better understand moral judgements [144, 145, 63].
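To make the lexicon-based approach concrete, the following toy sketch counts how many words of a text fall under each foundation. The word lists here are invented placeholders, not the actual MFT dictionary [117], and real implementations weight and normalize these counts rather than using raw tallies.

    # Toy illustration of lexicon-based moral foundation scoring (Python).
    # The word lists are placeholders, not the real MFT dictionary.
    from collections import Counter

    TOY_MFT_LEXICON = {
        "care":      {"hurt", "protect", "suffer", "kindness"},
        "fairness":  {"fair", "cheat", "justice", "equal"},
        "loyalty":   {"betray", "loyal", "traitor", "ally"},
        "authority": {"obey", "defy", "tradition", "respect"},
        "purity":    {"pure", "disgust", "sacred", "filthy"},
    }

    def foundation_counts(text):
        """Count how many tokens fall under each moral foundation."""
        tokens = text.lower().split()
        counts = Counter()
        for foundation, words in TOY_MFT_LEXICON.items():
            counts[foundation] = sum(tok in words for tok in tokens)
        return counts

    print(foundation_counts("it is not fair to betray and hurt your ally"))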
I discuss work for analyzing moral judgements on the social media website Reddit.
Specifically, I extract data from the subreddit /r/AmITheAsshole and train a moral
judgement classifier based on the comments. This classifier is then applied to infer a
variety of social factors regarding moral judgements. Analyzing user patterns allows
us to understand posting habits and the types of moral messaging often used to build
narratives.
a conversation by using entity linking [12]. Looking at the entities of a conversation
is attractive for understanding the flow, as it focuses on the key subjects being ref-
erenced. For my purposes, the action these entities are taking or their sentiment is
not important. The focus instead is to understand the patterns that cause groups
to discuss a topic and end up at a specific topic from a starting point. As repeated
messaging and echo chambers are a driving factor of narratives in online discourse,
my method will allow for the observation of these patterns in a broad manner.
One could look at the rise of generative large language models (LLMs) as a sort
of proxy for this investigation. LLMs are trained to predict the next token based
on the previous tokens generated [106] and have therefore learned a sense of how
a conversation should flow given a specific prompt. Although I do not analyze the
generations of these LLMs, it could prove an interesting avenue for further analysis
to understand conversational flow. Specifically, one could give a prompt and generate
a variety of responses from the model and then analyze the flow of them similar to
my method discussed later. I also anticipate that these models will see widespread
use in creating different automatic posts to push forward narratives from groups that
wish to exert influence over others.
1.3 Intent
by showing their support or disdain for particular topics. With the ever-changing na-
ture of the world, it can be very challenging to capture what intents are important and
understand the broader narrative happening. Applying machine learning methods to
understand these intents is challenging due to the cost of annotating large datasets
and the broad variety of intents that users will message about. I address this issue
by using a semi-supervised learning method to achieve high performance with very
limited labeled data. Semi-supervised learning is useful in problem domains where a
plethora of unlabeled data is available but only a small amount of data can be labeled.
As previously discussed, social media offers a vast number of examples, so a method
that can leverage the unlabeled examples is important to improving performance.
A semi-supervised method also allows for training on targeted domains as new
situations occur in the world. Without the need to annotate a large number of
examples, important classes can be quickly chosen and a few examples annotated.
This will allow for broader downstream analysis once the model has been trained.
By having an accurate classifier for intents, popular messages can be found and
examined for other attributes such as named entities, sentiment, and emotion. All of
these attributes can help drive insight into the social factors that influence popular
narratives and messaging on social media.
An interesting problem when dealing with intent for social media is the domain
you wish to apply the model to. As intent can be considered the actions an individual
wants to take, you can classify a variety of issues into this domain. For instance, I
discuss how moral judgements are an important aspect of narrative analysis but can
also be viewed as a form of intent. Whether a user wishes to cast a moral judgement
in their messaging or not can be interpreted as a binary classification problem with
intent. Similarly, it is possible to look at political elections throughout the world
and capture the positions of candidates, supporters, and detractors via their intents. Messages
could capture individuals asking for support of a candidate, positioning for or against
a given topic, or capturing the emotional response of the poster. Intent allows for a
broad net to be cast, one that depends on the target demographic and the domain to which it
is applied. As such, the semi-supervised method discussed allows an analyst to
quickly position a model into their domain of interest and narrow down the specific
intents they are interested in understanding.
Narratives are pervasive on social media yet scarcely understood. To under-
stand narratives is to understand how different actors in a society wish to construct
and manipulate messages to the public. This motivates my thesis:
This dissertation consists of the three main focuses previously discussed: moral
judgements, conversational flow, and user intent. Chapter 2 focuses on understand-
ing moral judgements on Reddit and looking at different social aspects that influence
them. Chapter 3 proposes a graph-based model and visualization to interpret conver-
sational flow from different communities. Findings are presented comparing different
communities on Reddit and the key entities they discuss.
For the remainder of the dissertation I focus on classifying user intents. Chapter
4 proposes a new semi-supervised learning method to classify intents in low data
scenarios. Results demonstrated strong performance, particularly when labeled data
is extremely scarce (1 to 2 examples per class). Finally, in Chapter 5 I summarize the main
findings of the work and propose some future directions.
CHAPTER 2
The work presented in this chapter is a collaboration with Shawn Gu and Tim
Weninger and was published in the IEEE Transactions on Computational Social Systems
in 2022 [13].
2.1 Introduction
How do people render moral judgements of others? This question has been pon-
dered for millennia. Aristotle [1], for example, considered morality in relation to
the end or purpose for which a thing exists. Kant [77] insisted that one’s duty was
paramount in determining what course of action might be good. Consequentialists
[128] argue that actions must be evaluated in relation to their effectiveness in bringing
about a perceived good. Regardless of the particular ethical frame that one ascribes
to, the common practice of evaluating others’ behavior in moral terms is widely re-
garded as important for the well-being of a community. Indeed, ethnographers and
sociologists have documented how these kinds of moral judgements actually increase
cooperation within a community by punishing those who commit wrongdoings and
informing them of what they did wrong [14].
The process of rendering moral judgement has taken an interesting turn in the
current era where online social systems enable people to encounter and consider
the lives and perspectives of others from around the world. At no other time in
history have so many people been able to examine (and judge) such a variety of
cultures and viewpoints so readily. This increased sharing and mixing of viewpoints
inevitably leads to fierce online debates about various topics [148], while the content
and outcomes of these debates provide researchers with the opportunity to ask spe-
cific questions about argument, disagreement, moral evaluation, and judgement with
the aid of new computational tools.
To that end, recent work has resulted in the creation of statistical models that
can capture moral sentiment in text [117]. However, these models rely heavily on a
gazetteer of words and topics as well as their alignment on moral axes. The central
motivation for these works is grounded in moral foundation theory [61] where studies
also tend to investigate the use of morality as related to current events in the news.
Despite their usefulness in understanding the moral valence of specific current events,
the goal of the current work is to study moral judgements rendered on social media
that apply to more common personal situations.
We focus on Reddit in particular, where users can create posts and have discus-
sions in threaded comment-sections. Although the details are complicated, users also
perform curation of posts and comments through upvotes and downvotes based on
their preference [54, 56]. This assigns each post and comment a score reflecting how
others feel about the content. Within Reddit there are a large number of subreddits,
which are small communities typically dedicated to a particular topic. One subreddit
in particular is centered on questions of moral judgement. This subreddit is named
/r/AmItheAsshole. Posters to /r/AmItheAsshole are typically looking to hear from
other Reddit users about whether or not they handled their personal situation in an
ethically appropriate manner. The community works like this. First, users post a de-
scription of a situation in which they were involved, and they are also encouraged to
explain details of other people involved as well as the final outcome of the situation.
Next, other users respond to the initial post with a moral judgement as to whether
the original user was an asshole or not the asshole. Figure 2.1 shows an example of a
typical post and one of its top responses. One important rule of /r/AmItheAsshole
Figure 2.1. Example of (a) a post title and (b) a comment in the
/r/AmItheAsshole subreddit. The NTA prefix and comment-score (not
shown) indicates that the commenter judged the poster as “Not the
Asshole”.
is that top-level responses must categorize the behavior described in the original post
to one of four categories: Not the Asshole (NTA), You’re the Asshole (YTA), No ass-
holes here (NAH), Everyone sucks here (ESH). In addition to providing a categorical
moral judgement, the responding user must also provide an explanation as to why
they selected that choice. Reddit’s integrated voting system then allows other users
to individually rate the judgements with which they most agree (upvote) or disagree
(downvote). After some time has passed the competition among different judgements
will settle, and one of the judgements will be rated highest. This top comment is
then accepted as the judgement of the community. This process of passing and rating
moral judgement provides a unique view into our original question about how people
make judgements of morality, ethics, and behavior.
Compared to other methodologies of computational evaluation of moral senti-
ments, collecting judgements from /r/AmItheAsshole (AITA) has some important
benefits. First, because posters and commenters are anonymous on Reddit, they
are more likely to share their sensitive stories and frank judgements without fear of
reprisal [99, 74]. Second, the voting mechanism of Reddit allows a large number of
users to engage in an aggregated judgement in response to the original post [57].
However, the breadth and variety of this data does pose additional challenges. For
instance, judgements are provided without an explicit moral-framing, and, similarly,
Reddit-votes are susceptible to path dependency effects [58].
The study of how social norms and morals are reasoned about on social media has
only recently become a topic of interest [50, 47]. In these works, large scale annotation
studies were performed using data collected from moral situations; one data source
used in both studies was our subreddit of interest /r/AmItheAsshole. The annotation
efforts from Forbes et al. resulted in a dataset that contains heuristics for various
actions and how acceptable they were found [50]. Likewise, Emelin et al. curated a
new dataset by asking people to create diverse narrative scenarios using the previous
heuristics as writing prompts [47]. Although these works investigated computational
models of social norms and actions, they did not consider how people cast moral
judgement upon others. A recent study looked at online shaming on Twitter [5] –
an increasingly common way to cast moral judgement. In this study, 1000 shaming
tweets were collected and placed into categories based on the type of shaming, which
are further analyzed to create an anti-shaming system. Yet the creation of a model
based on these datasets would be difficult due to the lack of positive moral examples.
In the present work we use data from AITA to investigate how users provide
moral judgements of others. Specifically, we extracted representative judgement-
labels from each comment and used these labels and comments to train a classifier.
Human annotators were then used to verify the performance of this classifier on
other subreddits. This classifier was then broadly applied to infer the moral valence of
comments from other communities in order to answer the following research questions:
RQ1: Is moral valence correlated with the score of a post?
Recent research into morality has found that immoral acts posted online trigger
stronger moral outrage responses than if the act was witnessed in person [37, 9].
These strong responses can be helpful in platform moderation as the viral nature
of these posts may drive user engagement. Others lament the rise of cancel culture
as a kind of piling-on effect that can have severe negative consequences for the one
judged to be immoral [5, 97]. Because content on Reddit is rated by the community,
we would expect that posts with immoral acts generate higher scores. With this in
mind we investigated various Reddit communities to observe the behavior in each.
RQ3: Are self-reported gender and age descriptions associated with posi-
tive or negative moral judgements?
The role that gender plays on social media has been analyzed from a variety
of viewpoints. One angle that has been analyzed is how users receive support on
social media based on their gender [139]. The findings from three social platforms,
including Reddit, show that women receive higher rates of support and disparagement
in comparison to men. De Choudhury et al. also sought to understand how gender
plays a role in mental health disclosure on social media, finding that men desire social
support less often than women [38]. Findings of differences in the topics of interest
between genders has also been found on social media and Reddit in particular [134].
Studies that analyze gender and age on social media have investigated how language
use aligns with various personality types along these dimensions [121]. These findings
can be useful when investigating gender inequalities that exist in the world. In the
present work, we chose to focus on gender and age as it relates to moral judgements
on social media.
To answer these research questions we first tried several modern text classification
systems and evaluated their ability to predict moral judgement on a held-out test
set of AITA comments. We dubbed the best classifier Judge-BERT, because it was a
fine-tuned version of the BERT language model for text classification [41]. We then
applied Judge-BERT to comments from several other communities.
In summary, we found that posts that were judged to have positive moral valence
(i.e., NTA label) typically scored higher than posts with negative moral valence.
We also found that certain subreddit-communities where users confess to something
immoral (i.e., such as /r/confessions) tended to attract users whose posts were more-
negative. Among these negative-users we found that their posting habits tended
towards three different types. Finally, we showed that self-described male users were
more likely to be judged with negative moral valence than female users.
We retrieved moral judgements by collecting posts and comments from the sub-
reddit /r/AmItheAsshole, taken from the Pushshift data repository [7].
We restricted our data collections to posts submitted between January 1, 2017,
TABLE 2.1
and August 31, 2019. In order to ensure that labels reflected the result of robust
discussion, we excluded those posts containing fewer than 50 comments. Subreddit
rules required that top-level comments begin with one of four possible prefix-labels
indicated in Table 2.1. Because of this rule, we further restricted our data collection
to contain only top-level comments and their prefix-label. Comments with the INFO
prefix, which indicates a request for more information, and comments with no prefix
were also removed from consideration. This methodology resulted in a collection
of 7,500 posts and 1,260,818 comments with explicit moral judgements. Posters
and commenters appeared to put a lot of thought and effort into these discussions.
Each post contained 381 words on average and each comment contained 57 words
on average. For each of the comments we removed the prefix labels from the text.
This was done to ensure the models we trained could not simply learn based on the
labels that were extracted from the original text. We also truncated longer comments
down to a max-length of 128 tokens. This was determined based on the distribution of
comment sizes and performance across multiple tests of the model at varying lengths.
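A minimal sketch of this label extraction and cleaning step is shown below. The prefix set and the 128-token limit come from the description above; the regular expression is illustrative, and the truncation here is a rough word-level stand-in for tokenizer-level truncation.

    import re

    MAX_TOKENS = 128  # truncation length chosen from the comment-length distribution

    def extract_label_and_clean(comment):
        """Return (label, cleaned_text), or None if the comment should be discarded."""
        match = re.match(r"^\s*(NTA|YTA|NAH|ESH|INFO)\b[\s.,:\-]*", comment)
        if match is None:
            return None                # comments without a prefix label are dropped
        label = match.group(1)
        if label == "INFO":
            return None                # requests for more information are dropped
        tokens = comment[match.end():].split()
        return label, " ".join(tokens[:MAX_TOKENS])

    # Example:
    # extract_label_and_clean("NTA - you handled this as well as anyone could.")
    # -> ("NTA", "you handled this as well as anyone could.")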
Given that all of the comments we have extracted come from one community, there
are most likely survey biases included in our dataset. This can, in part, be attributed
to the self-referential nature of Reddit [127]. Although we did not consider all sub-
reddits, we did consider all comments within a specific timeframe from each selected
subreddit. Differences between these subreddits provided the ability to compare and
contrast community-behavior and their users. Temporal and path dependency biases
certainly affected some of the measures in the present work (e.g., comment score),
however, we remain confident that our dependent-variable, i.e., moral valence, is not
correlated with temporal and ordinal effects. In other words, the timing of a comment
almost certainly did not affect its moral valence. Likewise, any ordinal effects
would necessarily occur after the posting, so the causality arrow, if it exists, can only
point in the opposite direction.
Given the dataset with textual posts and textual comments labeled with positive
or negative moral judgements, our goal is to predict whether an unlabeled comment
assigns a positive (NTA or NAH) or negative (YTA or ESH) moral judgement to the
user of the post. It is important to note that this classifier predicts the judgement of
the comment(er), not the morality of the poster.
We define our problem formally as follows.
Problem Definition Given a top-level comment C with moral judgement A ∈ {+, −}
that responded to post P we aim to find a predictive function f such that
f : C → A (2.1)
Formally, this takes the form of a text classification task where class inference denotes
the valence of a moral judgement. To train such a classifier we utilized the comments
we have extracted with their respective class labels from /r/AmItheAsshole. We note
again that the text representing each class label (e.g., NTA, YTA) has been removed
from the comment.
The choice of classification model f is important, and we aim to train a model
that performs well and generalizes to other datasets. We test a variety of models to
ensure we can find one that performs the best on this task. Our choices are based
on recent advances in NLP along with prior methods that have demonstrated strong
performance on short text classification. We selected four text classification models
for use in the current work along with one sentiment analysis method:
• Multinomial Naïve Bayes [81]: Uses word counts to learn a text classification
model and has shown success in a wide variety of text classification problems.
• Doc2Vec [87]: We created a document embedding for each comment label pair in
our dataset. The document embeddings are then input into a logistic regression
classifier that calculates the class margin.
• BERT Embeddings [41]: We extracted word embeddings from BERT for each
token in the comments. These are averaged together and input into a logistic
regression classifier that calculates the class margin.
• Judge-BERT: We fine-tuned the BERT-base model using the class labels that
we extracted from each comment. Specifically, we added a single dropout layer
after BERT's final layer, followed by a final output layer that consists of our two
classes. The model was trained using the Adam optimizer and a cross entropy
loss function over three epochs, as recommended by Devlin et al. [41] (see the sketch below).
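The sketch below shows one way to realize this setup with the Hugging Face transformers library; the original code is not reproduced here, so the learning rate, batch size, and data layout are assumptions, while the two-class head, Adam optimizer, cross-entropy loss, 128-token truncation, and three epochs follow the description above.

    import torch
    from torch.utils.data import DataLoader
    from transformers import BertTokenizerFast, BertForSequenceClassification

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)      # dropout + two-way output head

    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # lr is an assumption

    def collate(batch):
        texts = [text for text, _ in batch]
        labels = torch.tensor([label for _, label in batch])
        return texts, labels

    # train_pairs: list of (comment_text, label) with 0 = positive (NTA/NAH) and
    # 1 = negative (YTA/ESH) judgement; the prefix label is already stripped.
    def train(train_pairs, epochs=3, batch_size=32):
        loader = DataLoader(train_pairs, batch_size=batch_size,
                            shuffle=True, collate_fn=collate)
        model.train()
        for _ in range(epochs):
            for texts, labels in loader:
                inputs = tokenizer(texts, truncation=True, max_length=128,
                                   padding=True, return_tensors="pt")
                out = model(**inputs, labels=labels)  # cross-entropy computed internally
                out.loss.backward()
                optimizer.step()
                optimizer.zero_grad()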
TABLE 2.2

Model                       Accuracy       Precision      Recall         F1

VADER (Sentiment)           54.55 ± 0.00   38.63 ± 0.00   45.49 ± 0.00   41.78 ± 0.00
Doc2Vec Embeddings          65.92 ± 0.04   61.22 ± 0.15   13.5 ± 0.09    22.1 ± 0.09
BERT Embeddings             70.10 ± 0.07   64.28 ± 0.2    36.96 ± 0.14   46.96 ± 0.08
Multinomial Naïve Bayes     72.12 ± 0.17   62.58 ± 0.17   55.22 ± 0.07   58.66 ± 0.05
Judge-BERT                  89.03 ± 0.13   85.57 ± 0.18   83.48 ± 0.27   84.51 ± 0.17
Results are mean-averages and standard deviations over five-fold cross-validation. Judge-
BERT performed the best on within-distribution testing.
We evaluated our four classifiers using accuracy, precision, recall, and F1 metrics.
In this context a false positive is the instance when the classifier improperly assigns a
negative (i.e., asshole) label to a positive judgement. A false negative is the instance
when the classifier improperly assigns a positive (i.e., non-asshole) label to a negative
judgement. We performed 5-fold cross-validation and, for each metric, report the
mean-average and standard deviation over the 5 folds.
The results in Table 2.2 indicate that the Doc2Vec, BERT, and Multinomial Naïve
Bayes classifiers do not perform particularly well at this task. Fortunately, the fine-
tuned Judge-BERT classifier performs relatively well, with an accuracy near 90% and
where type 1 and type 2 errors are relatively similar. Overall, these results indicate
that the Judge-BERT classifier is able to accurately classify moral judgements.
2.2.3 Sentiment Analysis
We used VADER to see how much similarity existed between the sentiment of a
person's judgement and the morality of such a judgement. Our assumption
was that sentiment would be correlated with positive and negative moral judgement.
However, results from VADER on our dataset showed a very different outcome.
VADER performed much worse in comparison to our text classification methods.
The interpretation of these findings is not that VADER is bad at sentiment anal-
ysis. Rather, the moral judgements being cast in /r/AmItheAsshole did not conform
to the same distribution of words that typical sentiment analysis tools, like VADER,
capture.
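For reference, the VADER baseline amounts to thresholding a compound sentiment score; a sketch is below. The mapping from the compound score to NTA/YTA (with a threshold of 0) is our assumption about how such a baseline would be wired up, not a detail reported in the text.

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    def vader_judgement(comment):
        """Map VADER's compound sentiment score to a moral-judgement label."""
        compound = analyzer.polarity_scores(comment)["compound"]
        return "NTA" if compound >= 0 else "YTA"

    print(vader_judgement("You did nothing wrong, and your family should apologize."))
    print(vader_judgement("You were selfish and cruel about the whole thing."))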
With our trained Judge-BERT classifier, our goal was to better understand moral
judgement across a variety of online social contexts and to analyze various trends
in moral judgement. In order to minimize the transfer-error rate it was important
to select subreddit-communities that were similar to the training dataset. These
subreddits were all based on posts and comments that were purely textual in nature.
This highlights the conversational similarities that we found with /r/AmItheAsshole
and enabled smooth transference from one community to another. In total we chose
ten subreddits to explore in our initial analysis. These subreddits can be broken into
three main stylistic groups and are briefly described in Table 2.3. These subreddits are
some of the more popular subreddits that generally had users initiating conversations
about a situation in their life.
We applied the Judge-BERT classifier to the comments and posts of these ten
subreddits. Specifically, given a post and its comment tree we identified the top-level
comment with the highest score. This top-rated comment, which had received the
TABLE 2.3
Group            Subreddits                  Description

Advice           /r/relationship advice      Users pose questions in a scenario like the
                 /r/relationships            AITA dataset and receive advice or feedback
                 /r/dating advice            on their situation.
                 /r/legaladvice
                 /r/dating

Confessionals    /r/offmychest               Users confess to something that they have
                 /r/TrueOffMyChest           been keeping to themselves. Typically,
                 /r/confessions              confessions are about something immoral the
                                             poster has done.

Conversational   /r/CasualConversation       Users engage in conversations with others to
                 /r/changemyview             have a simple conversation or to hear other
                                             opinions in order to change their worldview.
most upvotes from the community, is considered to be the one passing judgement on
the original poster. This design follows how /r/AmItheAsshole passes judgements as
a community. As illustrated in Fig. 2.2, this top-rated comment was then fed to our
classifier and the resulting prediction was used to label the moral valence of the post
and poster. It is important to be clear here: we did not predict the moral valence
of the comment itself, but rather the top-rated comment was used to predict the
commenter’s judgement on the post.
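A minimal sketch of this labelling procedure, assuming each post is stored as a dictionary of its top-level comments and their scores; judge_bert_predict stands in for the classifier trained above and is assumed to return "+" for a positive and "-" for a negative judgement.

    def post_valence(post, judge_bert_predict):
        """Label a post via its highest-scoring top-level comment."""
        top_level = [c for c in post["comments"] if c["is_top_level"]]
        if not top_level:
            return None
        top_comment = max(top_level, key=lambda c: c["score"])
        # The prediction describes the commenter's judgement of the poster,
        # not the moral standing of the comment itself.
        return judge_bert_predict(top_comment["body"])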
2.3.1 Transferability
[Figure 2.2 contents. Post: "TL;DR: Married, slept with another man, and regretted it immediately. Husband found out, I am not sure if he wants to leave me or not, but I am willing to do anything to fix it. Need advice." Top response comment: "If you were so unsatisfied why not try and fix things before you destroy someones life. You don't really deserve a second chance. You're actually terrible and I hope you learn your lesson." Judge-BERT output: moral valence (+ or −).]
Figure 2.2. A post (in blue) made by a user along with the top response
comment (white). The comment is then fed to our Judge-BERT classifier
(green) to determine the moral valence of the post.
label. Instead, we randomly selected 100 comments from each of the ten subreddits
for our analysis; 50 comments were labeled as NTA and 50 were labeled as YTA by
Judge-BERT. We displayed each comment to five different annotators on Mechanical
Turk. We told each annotator that comments came from /r/AmItheAsshole and
then asked them to label each comment as YTA or NTA. Each worker must have
completed a practice question before starting and have correctly answered a clear,
randomly-inserted control question for their results to count. This task was reviewed
and approved by the University of Notre Dame’s Internal Review Board (#20-01-
5751). An example screenshot of the questionnaire can be found in Fig 2.3.
In addition to the 10 other subreddits, we also included /r/AmItheAsshole in
this experiment as a baseline to compare how human annotators performed on the
actual data in comparison to the subreddits of interest. Because the comments
from /r/AmItheAsshole actually have labels, which we analyzed earlier, we com-
pare the actual labels to the human annotations. Therefore, we can consider the
/r/AmItheAsshole annotations a kind of upper bound to identify the limit of human
performance and agreement on this task.
Annotators labeled 2,961 comments as NTA and 2,039 comments as YTA; about a
Figure 2.3. A screenshot of the human annotation system.
3:2 imbalance in favor of NTA. If we consider each human label to be the ground truth,
then we can calculate the transference performance of Judge-BERT. Because of the
class imbalance of the human labels, a random classifier achieves an F1-score of 45%.
Overall, we found that Judge-BERT obtained an F1 score of 53% with a precision
of 59% and recall of 48%. Recall that the /r/AmItheAsshole subreddit used the
actual labels, not Judge-BERT labels, yet still only obtained an F1-score of 64%. By
comparing these in-domain human results with the in-domain F1-scores in Table 2.2,
we found that Judge-BERT far outperformed humans on this task. Other subreddits
varied in performance: /r/relationships performed the best with an F1-score of 56%;
/r/CasualConversation performed the worst with an F1-score of 42%, which dipped
mostly because of a very low recall score (34%) indicating that humans were far less
likely than Judge-BERT to rate a /r/CasualConversation comment as YTA.
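One reading consistent with the 45% figure above: if F1 is computed for the YTA class and the baseline guesses the two labels uniformly at random, then with the 2,961/2,039 label split its precision is the YTA base rate and its recall is one half.

    # Random-guess baseline for the YTA class (one possible reading of the 45%).
    nta, yta = 2961, 2039
    precision = yta / (nta + yta)                       # ~0.408
    recall = 0.5                                        # half of true YTA guessed YTA
    f1 = 2 * precision * recall / (precision + recall)  # ~0.449, i.e., about 45%
    print(round(precision, 3), round(f1, 3))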
Human annotators were not always in agreement with each other; and when
they agreed (fully or partially), their agreement did not always match the label
of Judge-BERT. There is much to unpack from these results, but Fig. 2.4 shows
the agreement rates for comments labeled YTA and NTA by Judge-BERT (except
in the case of /r/AmItheAsshole, which is labelled by the comment itself).
[Figure 2.4 contents: counts of comments at each level of annotator agreement (5/0 through 0/5), shown separately for comments labelled YTA and NTA and broken down by subreddit.]
[ROC curve: AUC = 0.484.]
If we discard the sensitivity analysis and just look at the majority label from the
human annotators we find that the average majority-label accuracy is 62%. But this
masks a large discrepancy: when the Judge-BERT label is NTA, then the accuracy is
71.8%; when the Judge-BERT label is YTA, then the accuracy drops to worse than
random (44.4%). Given the 50/50 breakdown of NTA/YTA labels, we can deduce
that human annotators were far less likely than Judge-BERT to label a comment as
YTA.
Here we can begin to answer RQ1: Is moral valence correlated with the score of
a post? In other words, do posts with positive moral valence score higher or lower
than posts with negative moral valence? To answer this question, we extracted all
posts and their highest scoring top-level comment from 2018 from each subreddit
in Table 2.3. The counts of positive and negative posts found for each subreddit are
listed in Table 2.4.

TABLE 2.4

Positive   Negative
Popularity scores on Reddit exhibit a power-law distribution, so the mean-scores
and their differences will certainly be misleading. Instead, we plot the ratio of com-
ments judged to be positive against all comments as a function of the post score
cumulatively in Fig. 2.6. Higher values in the plot indicate more positive valence.
The results here are clear: post popularity was associated with positive moral va-
lence. Most of the subreddits appeared to have similar characteristics except for
/r/CasualConversation, which had a much higher positive valence (on average) than
the other subreddits. Mann-Whitney Tests for statistical significance on individual
subreddits, as well as the aggregation of these tests with Bonferroni correction, found that
posts with positive valence had significantly higher scores than posts with negative
valence (τ ∈ [0.40, 0.47], p < 0.001 two-tailed).
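A minimal sketch of the per-subreddit test, using SciPy; the rank-based effect size (the τ reported above) is not reproduced here, and the Bonferroni correction is applied by simply scaling each p-value by the number of subreddits tested.

    from scipy.stats import mannwhitneyu

    def compare_scores(pos_scores, neg_scores, n_tests=10):
        """Two-sided Mann-Whitney U test on post scores, Bonferroni-corrected."""
        stat, p = mannwhitneyu(pos_scores, neg_scores, alternative="two-sided")
        return stat, min(1.0, p * n_tests)

    # Toy example; real post scores are heavy-tailed.
    print(compare_scores([120, 540, 3300, 95, 410], [15, 60, 8, 230, 42]))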
The correlation between posts with positive moral valence and higher scores can
be explained based on the process by which posts are made. Because posts are made
before votes are cast and the text of a post is (typically) unchanged, the votes
received for a post are likely related to the moral valence of that post. Posts are
created on a regular basis and will become more popular based on user votes. With
this in mind it is reasonable to assume that posts with positive moral valence are
upvoted sooner and more often than others, leading to higher total scores.
These findings appear to conflict with studies we previously mentioned that
showed that negative posts elicit anger and encourage a negative feedback loop on
social media [9, 37]. As these studies have shown, posts that elicit moral outrage end
up being shared more often on other social media sites. Our expectation for Reddit
was the same.
A further inspection of the posts indicated that posts classified as having posi-
tive moral valence often found users expressing that a moral norm had indeed been
breached. However, the difference in our results compared to others may be ex-
plained by perceived intent, that is, whether or not the moral violation occurred
from an intentional agent towards a vulnerable agent, c.f., dyadic morality [119].
Our inspection of comments expressing negative moral judgement confirmed that the
perceived intent of the poster was critical to the judgement rendered. These negative
judgements typically highlighted what the poster did wrong and advised the poster
to reflect on their actions (or sometimes simply insulted the poster). Conversely,
we found that many posts judged to be positive clearly show that the poster was
the vulnerable agent in the situation to some other intentional agent. Responses to
these posts often displayed sympathy towards the poster and also outrage towards
the other party in the scenario. These instances are perhaps best classified as exam-
ples of empathetic anger [66], which is anger expressed over the harm that has been
done to another. We also note that some of the content labelled to have positive
moral valence is simply devoid of a moral scenario. Simply put, comments without a
moral impetus tended to be labeled as NTA. Examples of this can be primarily seen
in /r/CasualConversation where the majority of posts are about innocuous topics.
Another possible explanation for our findings is that users on other online so-
cial media sites like Facebook and Twitter are more likely to like and share news
headlines that elicit moral outrage; these social signals are then used by the site’s
algorithms to spread the headline further throughout the site [17, 56]. Furthermore,
the content of the articles triggering these moral responses often covers current news
events throughout the world. Our Reddit dataset, on the other hand, typically deals
with personal stories and therefore tend to not have the same in-group/out-group
reactions as those found on viral Facebook or Twitter posts.
Figure 2.7. Lorenz curve depicting the judgement inequality among users (x-axis:
fraction of population; y-axis: cumulative negative judgements); Gini coefficient = 0.515.
question: are some users judged more positively or negatively than others? What
does that distribution look like? To understand this breakdown we first plot a Lorenz
curve in Fig. 2.7. We found that the distribution of moral valence is highly unequal:
about 10% of users receive almost 40% of the observed negative judgements (Gini
coefficient = 0.515).
This clearly indicated that there were a handful of users that received the vast
majority of negative judgements. To identify those users which receive a statistically
significant proportion of negative judgements we performed a one-sided binomial test
on each user. Simply put, this test emits a negativity probability, i.e., the probability
(p-value) that the negativity of a user was not due to chance.
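The two measurements in this analysis can be sketched as follows: a Gini coefficient computed from per-user counts of negative judgements, and a one-sided binomial test per user. The choice of the overall fraction of negative judgements as the null rate is our assumption.

    import numpy as np
    from scipy.stats import binomtest

    def gini(counts):
        """Gini coefficient of per-user negative-judgement counts."""
        x = np.sort(np.asarray(counts, dtype=float))
        n = x.size
        cum = np.cumsum(x)
        return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n   # Lorenz-curve based formula

    def negativity_pvalue(n_negative, n_total, base_rate):
        """Probability of seeing at least n_negative negative judgements by chance."""
        return binomtest(n_negative, n_total, base_rate, alternative="greater").pvalue

    print(round(gini([0, 0, 1, 1, 2, 3, 10, 25]), 3))
    print(negativity_pvalue(9, 10, 0.4))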
Finally, we illustrate the membership of each subreddit as a function of a user's
negativity probability. As expected, Fig. 2.8 shows that as we increased the nega-
tivity threshold from almost certainly negative to uncertainty (from left to right) we
also increased the fraction of comments observed. These curves therefore indicate the
density of comments that are made from negative users (for varying levels of negativ-
ity); higher lines (especially on the left) indicate higher concentration of negativity.
We found that /r/confessions, /r/changemyview, and /r/TrueOffMyChest contain a
higher concentration of comments from more-negative users. On the opposite side
of the spectrum, we found that /r/CasualConversation and /r/legaladvice have deep
curves, which implies that these communities have fewer negative users than others.
1. Explainer: These users argued that what they did was not that wrong.
2. Stubborn Opinion: Users that did not acquiesce to the prevailing opinion of
the responders.
3. Returner: Users that repeatedly posted the same situation hoping to elicit
more-favorable responses.
The first type of user that we observed was the Explainer. The explainer typically
made a post and received comments that condemned their immoral actions. In
response to this judgement, the explainer would reply to many of the comments
in an attempt to convince others that what they did was in fact moral. Often, this
only served to exacerbate the judgements made against them. This then led to further
negative judgements. In fact, we found that many of these users had only made a
handful of posts that each resulted in a large number of comments in self-defense. The
large number of users that responded to these comments with negative judgements
is similar to the effect of online firestorms [114] but at a scale contained to only an
[Figure 2.9 contents. Example post: "How noticeable is her belly bump?" Example response: "You really are insecure about her losing her hot body."]
Figure 2.9. A diagram showing the posting habits of a Returner. Posts are
in the light blue boxes with blue arrows represent the order of posts. An
example of a post response is shown in the white box with the red arrow
representing the post it came from. Each post is prefaced with the
overarching title, “Me and my partner are having a baby.” followed by the
current update on the situation. The response comments have also been
condensed from their full length.
individual post. For these types of posts we also note that some people did come to
the defense of the poster, which follows similar findings that people show sympathy
after a person has experienced a large amount of outrage [118].
The second type of user we observed is the Stubborn Opinion user. These users
are similar to but opposite from the Explainers. Rather than trying to change their
perspective, the Stubborn Opinion user would refuse to acquiesce to the prevailing
opinion of the comment thread. For example, users posting to /r/changemyview
that do not express a change of opinion despite the efforts and agreement of the
commenting users often incur comments casting negative judgement. This back-
and-forth sometimes becomes hostile. Many of these conversations end in personal
attacks from one of the participants, which has also been shown in previous work on
conversations in /r/changemyview [27].
The third type of user is the Returner. The returner sought repeated feedback
from Reddit on the same subject. For example, when returners made posts seeking
moral judgement, they often engaged in some of the discussion and may even agree
with some of the critical responses. Some time later, the user returned to edit their
original post or to make another post providing an update about their situation. An
example of a Returner is illustrated in Fig 2.9. In this case, a user continued to request
advice after recently impregnating their partner. In these situations responding users
often found previous posts on the same topic made by the same user and then used
this information to build a stronger case against the user or to argue that the new
post was nothing but a thinly-veiled attempt to shine a more-favorable light on their
original situation. These attempts usually backfired and resulted in more negative
judgments being cast against the user.
Our final task investigates RQ3: Are self-reported gender and age descriptions
associated with positive or negative moral judgements? Recent studies on this topic
have found that gender and moral judgements have a strong association [111]. Specif-
ically, women are perceived to be victims more often than men and harsher punish-
ments are sought for men. The rates at which men commit crimes tends to be higher
than the rates of female crime and society generally views crimes as a moral viola-
tion [32]. If we apply these recent findings to our current research question we expect
to find that male users were judged negatively more often than females.
Many studies have analyzed the relationship between morality and age. Most
of these studies have followed groups of people through their teenage years as they
develop into adulthood [34, 3]. In these studies people were presented with moral
dilemmas and interviewed to assess their moral reasoning. This is similar to how we
expect people to respond to the scenarios presented on Reddit. One major deviation
from these studies is that our posts encompassed only a small portion of the scope
of possible moral scenarios.

TABLE 2.5

CONTINGENCY TABLES

/r/relationship advice
Positive   Negative

/r/relationships
Positive   Negative
This kind of analysis is not usually possible on public social media services because gender
and age are not typically revealed, particularly where anonymous posting is allowed. Fortu-
nately, the posting guidelines of /r/relationships and /r/relationship advice required
posters to indicate their age and gender in a structured manner directly in the post
title. An example of this can be seen here:
where the poster uses [M27] to indicate that they identified as male aged 27 years
and that their partner [F25] was identified as female aged 25 years. Using these
conventions we were able to reliably extract users’ self-reported age and gender.
We again applied our Judge-BERT model to assign a moral judgement to each post.
Figure 2.10. Example of a post title from /r/relationships. Subreddit rules require the poster to indicate their age and gender, as well as any other individual's gender and age.
TABLE 2.6. Logistic regression results relating gender and age to moral judgement for /r/relationship advice and /r/relationships.
Our second task was to determine if gender and age were associated with moral
judgement. In other words, were young females, for instance, judged more positively
than, say, old males? To answer this question, we fit a two-variable logistic regression
model where the binary-variable gender is encoded as 0 for female and 1 for male.
We report the findings from the logistic regressor for each subreddit in Table 2.6.
These results indicate that males were judged more negatively than females. Specif-
ically, in /r/relationship advice being male was associated with a 35% increase in
receiving a negative judgement. Similarly, in /r/relationships being male was associ-
ated with a 46% increase in receiving a negative judgement.
We also found that age had a relatively small effect on moral judgement: increased
age is slightly correlated with negative judgement. Specifically, in /r/relationship advice
an increase in age by one year was associated with a 0.59% increase in receiving a
negative judgement. In /r/relationships an increase in age by one year was associated
with a 0.34% increase in receiving a negative judgement.
Simply put, those who were older and those who were male (independently) were statistically more likely to receive negative judgements from Reddit than those who were younger and female, although gender was a much larger contributing factor than age and neither association was particularly strong.
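To make the computation concrete, the snippet below sketches this kind of two-variable logistic regression on synthetic data and converts the fitted coefficients into percent changes in the odds of a negative judgement; the data, column names, and choice of statsmodels are illustrative assumptions rather than the analysis code used in this chapter.

```python
# Minimal sketch (synthetic data, not the analysis code used here): a two-variable
# logistic regression of judgement valence on gender and age, with coefficients
# converted into percent changes in the odds of a negative judgement.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
gender = rng.integers(0, 2, n)                            # 0 = female, 1 = male
age = rng.integers(18, 60, n)
true_logit = -1.0 + 0.30 * gender + 0.005 * (age - 30)    # synthetic effect sizes
negative = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

df = pd.DataFrame({"gender": gender, "age": age, "negative": negative})
X = sm.add_constant(df[["gender", "age"]])
fit = sm.Logit(df["negative"], X).fit(disp=0)

# exp(beta) - 1 is the relative change in the odds per unit increase; a value of
# 0.35 for gender would correspond to the reported 35% increase for male posters.
print((np.exp(fit.params[["gender", "age"]]) - 1) * 100)
```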
2.5 Conclusion
In this study, we showed that it is possible to learn the language of moral judge-
ments from text taken from /r/AmITheAsshole. We demonstrated that by extracting
the labels and fine-tuning a BERT language model we could achieve good performance
at predicting whether a user rendered a positive or negative moral judgement. This
performance was verified by human annotators on our subreddits of interest. Using
our trained classifier we then analyzed a group of subreddits that are thematically
similar to /r/AmITheAsshole for underlying trends. Our results showed that users
prefer posts that have a positive moral valence rather than a negative moral valence.
Another analysis revealed that a small portion of users are judged to have substan-
tially more negative moral valence than others and they tend towards subreddits such
as /r/confessions. We also showed that these highly negative moral valence users fall
into three different types based on their posting habits. Lastly, we demonstrated that
age and gender have a minimal effect on whether a user was judged to have positive
or negative moral valence.
Although the Judge-BERT classifier enabled us to perform a variety of analyses, it does pose some limitations. The test subreddits deviate from the types of moral judgement observed in the training data. While we did show that Judge-BERT generalizes to other subreddits, this does not change the fact that moral judgement is not the focus of /r/CasualConversation, for example. Another important limitation is that these claims may not generalize to all of Reddit. Previous work has shown that different subreddit communities can have unique norms [26]. For instance, our finding that men were judged more negatively than women may only hold in the two subreddits of analysis, /r/relationships and /r/relationship advice. Whether this pattern holds for all of Reddit cannot be confirmed by our study. This could also affect the classifier itself if the norms learned from /r/AmItheAsshole do not align with those of the other subreddits we analyzed.
In the future we hope to implement argument mining in order to gain a better understanding of the reasons for these judgements by extracting the underlying arguments given by users. As previously mentioned, studies such as Forbes et al. [50] have done this using a large-scale annotation effort. Creating an automated method that performs well at this task would allow for a more in-depth analysis of the judgements being cast on Reddit. Argument mining has already seen success on Reddit at extracting persuasive arguments from subreddits like /r/changemyview [45], and it would enable us to get a better understanding of moral judgements on social media. It would also allow us to aggregate the underlying themes from these judgements for further analysis. Finally, an argument mining system would give us a clearer picture of our current findings; for example, we could revisit RQ3 to explain why gender is associated with differences in judgement.
CHAPTER 3
The work presented in this chapter is a collaboration with Tim Weninger and was
published in the Journal of Knowledge and Information Systems in 2023 [11].
3.1 Introduction
In any conversation, members continuously track the topics and concepts that are
being discussed. The colloquialism “train-of-thought” is often used to describe the
path that a discussion takes, where a conversation may “derail,” or “come-full-circle,”
etc. An interesting untapped perspective of these ideas exists within the realm of
the Web and Social Media, where a train-of-thought could be analogous to a trail
over a graph of concepts. With this perspective, an individual’s ideas as expressed
through language can be mapped to explicit entities or concepts, and, therefore,
a single argument or train-of-thought can be treated as a path over the graph of
concepts. Within a group discussion, the entities, concepts, arguments, and stories
can be expressed as a set of distinct paths over a shared concept space, what we call
an entity graph.
Scholars have long studied discourse and the flow of narrative in group conversa-
tions, especially in relation to debates around social media [102] and intelligence [92].
The study of language and discourse is rooted in psychology [24] and consciousness
[23].
Indeed, the linguist Wallace Chafe considered “...conversation as a way separate minds are connected into networks of other minds.” [24] Looking at online conversations from this angle, a natural hypothesis arises: If we think of group discussion as a graph of interconnected ideas, then can we learn patterns that are descriptive and predictive of the discussion?
Fortunately, recent developments in natural language processing, graph mining, and the analysis of discourse now permit the algorithmic modelling of human discussion in interesting ways by piecing these techniques together. This is a broad goal, but in the present work we provide a first step towards graph mining over human discourse.
Another outcome of the digital age is that much of human discourse has shifted to online social systems. Interpersonal communication is now observable at a massive scale. Digital traces of emails, chat rooms, Twitter, and other threaded conversations that approximate in-person communication are commonly available. A newer form of digital group discussion can be seen in the dynamics of Internet fora, where individuals (usually strangers) discuss and debate a myriad of issues.
Technology that can parse and extract information from these conversations cur-
rently exists and operates with reasonable accuracy. From this large body of work,
the study of entity linking has emerged as a way to ground conversational statements
to well-defined entities, such as those that constitute knowledge bases and knowledge
graphs [125]. Wikification, i.e., linking entity mentions in prose to their Wikipedia entries as if the text had been written for Wikipedia, is one example of entity linking [30]. The Information Cartography project is another example that uses these NLP tools to create visualizations that help users understand how related news stories are connected in a simple, yet meaningful manner [124, 123, 78]. But because entity linking techniques have typically been trained on Wikipedia or long-form Web text, they have a difficult time accurately processing conversational narratives, especially from social media [40]. Fortunately, Social-NLP has made considerable strides in recent years [108], providing the ability to extract grounded information from informal, threaded online discourse [82].
Figure 3.1. Illustration of an entity graph created from threaded
conversations from r/politics (blue-edges) and r/conservative (red-edges).
The x-axis represents the (threaded) depth at which each entity was
mentioned within conversations, extracted from Reddit, rooted at
Joe Biden. The y-axis represents the semantic space of each entity, i.e.,
similar entities are closer than dissimilar entities on the y-axis. Edge colors
denote whether the transition from one entity set to another occurs more
often from one groups conversations than another. Node colors represent
equivalent entity sets along the x-axis. In this visualization we observe a
pattern of affective polarization as comments coming from /r/Conservative
are more likely to drive the conversation towards topics related to the
opposing political party.
Taking this perspective, the present work studies and explores the flow of entities
in online discourse through the lens of entity graphs. We focus our attention on
discussion threads from Reddit, but these techniques should generalize to online
discussions on similar platforms so long as the entity linking system can accurately
link the text to the correct entities. The threaded conversations provide a clear
indication of the reply-pattern, which allows us to chart and visualize conversation-
paths over entities.
To be clear, this perspective is the opposite of the conventional social networks
approach, where information and ideas traverse over user-nodes; on the contrary, we
consider discourse to be humans traversing over a graph of entities. The conventional
approach to social networks is important for areas such as influence maximization
[79] and the spread of behaviors [22]. Instead, the goal of our alternative perspective
is to discover this network of minds and uncover patterns of how they think over
topics. This alternative perspective is motivated by the large number of influence
campaigns [120], information operations [140], and the effectiveness of disinformation
[59]. These campaigns often operate by seeding conversations in order to exploit
conversation patterns and incite a particular group. Another motivation for our
proposed methodology is humans' attraction towards homophily and the large number
of echo chambers that have been created online [33, 53]. Prior works [53] looking at
echo chambers in political discourse rely on this notion of the ideas spreading between
user-nodes. Other works looking at morality [17] also follow this notion of how moral
text spreads throughout a user network. We stress here that our entity graph will
allow for a flipped perspective of having users move across the graph of entities in
various types of conversations. This position allows for a different form of analysis
into how different groups or communities think as a whole.
Our way of thinking is illustrated in Fig. 3.1, which shows a subset of path traver-
sals, which we describe in detail later, from thousands of conversations in /r/politics
and /r/conservative that start from the entity Joe Biden. As a brief preview, we
find that conversations starting with Joe Biden tend to lead towards United States in
conversations from the /r/conservative subreddit (indicated by a red edge), but com-
monly lead towards mentions of the Republican Party in conversations from /r/politics
(indicated by blue-purple edge). From there the conversations move onward to vari-
ous other entities and topics that are cropped from Fig 3.1 to maintain clarity.
In the present work, we describe how to create entity graphs and use them to
answer questions about the nature of online, threaded discourse. Specifically, we ask
three research questions:
RQ2 What do entity graphs of Reddit look like? In general, does online discourse
tend to splinter, narrow, or coalesce? Do conversations tend to deviate or stay
on topic?
We find that entity graphs provide a detailed yet holistic illustration of online
discourse in aggregate that allow us to address our proposed research questions.
Conversations have an enormous, visually random, empirical possibility space, but
attention tends to coalesce towards a handful of common topics as the depth increases.
Prediction is difficult, and gets more difficult the longer a conversation goes on.
Finally, we show that entity graphs present a particularly compelling tool by which
to perform comparative analysis. For example, we find, especially in recent years,
that conservatives and liberals both tend to focus their conversations on the out-group
– a notion known as affective polarization [72]. We also find that users tend to
stick to the enforced topics of a subreddit as shown by how r/news tends towards
entities from the United States and r/worldnews tends towards non-US topics.
3.2 Methodology
Of all the possible choices from which to collect online discourse, we find that
Reddit provides exactly the kind of data that can be used for this task. It is freely
and abundantly available [8], and it has a large number of users and a variety of
topics. Reddit has become a central source of data for many different works [93].
For example, recent studies on the linguistic analysis of Schizophrenia [161], hate
speech [25], misogyny [48], and detecting depression related posts [132] all make
substantial use of Reddit data.
The threading system that is built into Reddit comment-pages is important for
our analysis. Each comment thread begins with a high level topic (the post title),
that is often viewed as the start of a conversation around a specific topic. Users
often respond to the post with their own comments. These can be viewed as direct
responses to the initial post, and then each of these comments can have replies. This
threading system generates a large tree structure where the root is the post title. Of
course, such a threading system is only one possible realization of digital discussion,
but this system provides the ability to understand how conversations move as users
respond to each other in turn. Twitter, Facebook, and YouTube also have discussion
sections, but it is very difficult to untangle who is replying to whom in these (mostly)
unthreaded systems.
Reddit contains a large variety of subreddits, which are small communities focused
on a specific topic. We limit our analysis to only a small number of them, but for
each selection we obtain their complete comment history from January 2017 to June
2021. In total we selected five subreddits: /r/news, /r/worldnews, /r/Conservative,
/r/Coronavirus and /r/politics. We selected these subreddits because they are large
and attract a lot of discussion related to current events, albeit with their own per-
spectives and guidelines. These subreddits also contain a large number of entities,
which we plan to extract and analyze.
Like most social sites, Reddit post-engagement follows the 90-9-1 rule of Internet
engagement. Simply put, most users don’t post or comment, and most posts receive
almost no attention [93]. Because of this we limit our data to include only those
threads that are in the top 20% in terms of number of comments per post. Doing so ensures that we mostly collect larger discussion threads that have an established back and forth. We also ignore posts from well-known bot accounts (e.g., AutoMod).
Figure 3.2. (Left) Example comment thread with the post title as the root,
two immediate child comments, one of which has two additional child
comments. Entity mentions are highlighted in yellow. (Right) The resulting
entity tree where each comment is replaced by their entity set. Note the
case where the mention-text Trump in the comment thread is represented
by the standardized entity-label Donald Trump in the entity tree.
We use entity linking tools to extract the entities from each post title and comment
in the dataset (c.f. [125]). Entity linking tools seek to determine parts of free form
text that represent an entity (a mention) and then map that mention to the appropri-
ate entity-listing in a knowledge base (disambiguation), such as Wikipedia. Existing
models and algorithms rely heavily on character matching between the mention-text
and the entity label, but more-recent models have employed deep representation
learning to make this task more robust [122].
An example of entity linking on a comment thread is illustrated in Fig. 3.2. Each comment thread T contains a post title, which serves as the root of the tree cr ∈ T, and comments cx ∈ T, where the subscripts r and x index the post title and a specific comment. Each comment can reply to the root (cx → cr) or to another comment (cx → cy), thereby determining a comment's depth ` ∈ [0, . . . , L]. Comments and post titles may or may not contain one or more entities S(c). These entity sets are likewise threaded, such that S(cx) → S(cy) means that the entities in cx were responded to with the entities in cy, i.e., cx is the parent of cy. With this formalism, the entity linking task transforms a comment thread into an entity tree as seen in Fig. 3.2.
Specifically, we utilize the End-to-End (E2E) neural model created by Kolitsas et al. [82] to perform entity linking on our selected subreddits. Previous work
has shown that entity linking on Reddit can be quite challenging due to the wide
variety of mentions used [12]. The E2E model we use has been shown to have a high
level of precision on Reddit but lacks a high recall [12]. We find using this model
appropriate as we want to ensure that the entities we find are correct and reliable,
but acknowledge that it may miss a portion of the less well-known entities, as well as
missing any new entities that arise from entity drift. The choice of this entity linker
also influenced our decision to analyze the selected subreddits as the performance
is better in these selected subreddits. We also experimented with the popular REL
entity linker [137]. Although it did retrieve many more entities from the comments,
we found a large number of the entities to be incorrect.
Using the E2E model we extract entities from each post title and comment in-
dividually and construct the entity tree as illustrated in Fig 3.2. Table 3.1 shows a
breakdown of the post, comment, and entity statistics for each subreddit considered
in the present work.
TABLE 3.1
Given an entity tree, our next task is to construct a model that can be used to
make predictions about the shape and future of the conversation, but also can be
used as a visual, exploratory tool. Although entity trees may provide a good picture
for a single conversation, we want to investigate patterns in a broader manner. To
do this we consider conversations coming from a large number of entity trees in
aggregate. This model takes the form of a weighted directed graph G = (V, E, w) where each vertex v ∈ V is a tuple of an entity set S(c) and its associated depth ` in the comment tree, v = (S(c), `). Each directed edge in the graph e ∈ E connects two vertices e = (v1, v2) such that the depth ` of v1 must be one less than the depth of v2. Each edge in the graph e ∈ E also carries a weight w : E → R that represents the frequency of the transition from one entity set to another. This directed graph
captures not only the specific concepts and ideas mentioned within the discourse, but
also the conversational flow over those concepts.
Continuing the example from above, Fig. 3.3 shows three individual paths P representing the entity tree from Fig. 3.2(b). Each entity set moves from one depth to the next, representing the progression of the discussion.
Figure 3.3. Paths extracted from the entity tree in Fig. 3.2(b), represented by directed edges over entity sets.
During the construction of the entity paths, we remove comments that do not have any replies. Short paths, those with a length less than three, do not offer much information in terms of how the conversation will progress, because the conversation empirically did not progress. It may be useful to analyze why some topics resulted
in no follow-on correspondence, but we leave this as a matter for future work.
Because we wish to explore online discourse in aggregate, this is the point where we aggregate across many comment threads T ∈ 𝒯, where 𝒯 represents an entire subreddit or an intentional mixture of subreddits depending on the task. We extract all of the conversation paths from our comment threads 𝒯 to obtain a group of conversation paths P. To generate our graph we iterate over our group of paths P and aggregate them together to construct our entity graph. For every instance of an entity set transition in a conversation path we increment the weight w of its respective edge in our entity graph. One key aspect of this is that we count each transition only once per comment thread T. This ensures that entity transitions do not get over-counted by virtue of a thread being larger and containing more conversation paths overall.
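The following sketch illustrates this construction under simplified assumptions: each entity tree is a nested dictionary, each vertex is a (depth, entity set) tuple, and edge weights live in a plain dictionary. The names and data layout are illustrative and not the implementation used in this work.

```python
# Simplified sketch of the entity-graph construction (illustrative names, not the
# original implementation). An entity tree is a nested dict whose nodes carry
# their entity sets; vertices of the graph are (depth, entity set) tuples.
from collections import defaultdict

def extract_paths(node, depth=0, prefix=()):
    """Enumerate root-to-leaf conversation paths as tuples of (depth, entity_set)."""
    step = prefix + ((depth, frozenset(node["entities"])),)
    children = node.get("children", [])
    if not children:
        yield step
    else:
        for child in children:
            yield from extract_paths(child, depth + 1, step)

def build_entity_graph(threads, min_len=3):
    """Aggregate paths from many threads into weighted directed edges, counting
    each entity-set transition at most once per comment thread."""
    weights = defaultdict(int)
    for thread in threads:
        seen = set()                      # transitions already counted for this thread
        for path in extract_paths(thread):
            if len(path) < min_len:       # short paths carry little information
                continue
            for parent, child in zip(path, path[1:]):
                if (parent, child) not in seen:
                    seen.add((parent, child))
                    weights[(parent, child)] += 1
    return dict(weights)

# Toy thread: the post title mentions Germany and Russia, replies mention Russia.
thread = {"entities": ["Germany", "Russia"],
          "children": [{"entities": ["Russia"],
                        "children": [{"entities": ["Russia"], "children": []}]}]}
print(build_entity_graph([thread]))
```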
One of the limitations of the current graph structure is that the graph does not
capture conversation similarities if some of the entities overlap between two different
vertices. For instance, another entity tree may result in having an entity set S(cr )
that contains a subset of the entities in a given vertex. This new entity set may have
a similar conversational flow but will not be captured in our current entity graph
because the model does not allow for any entity overlap.
To help alleviate this issue we borrow from the notion of a hypergraph and perform
a star-expansion on our graph G [160]. A hypergraph is defined as H = (X, E)
where X is the set of vertices and E is a set of non-empty subsets of X called
hyperedges. The star expansion process turns a hypergraph into a simple, bipartite
graph. It works by generating a new vertex in the graph for each hyperedge present
in the hypergraph and then connects each vertex to each new hyperedge-vertex. This
generates a new graph G(V , E) from H by introducing a new vertex and edge for
each hyperedge such that V = E ∪ P.
While our model is a graph we can treat each entity set S(c) as a hyperedge in our
case to perform this star expansion. This will give us new vertices to represent each
individual entity and allow us to capture transitions from one entity set to another
if they share a subset of entities. An example of the resulting graph after performing
a star-expansion can be seen in Fig. 3.4. This helps to provide valid transition paths
that would otherwise not exist without the star expansion. When the star expansion operation is performed, the edge weights between the new individual entity vertices and their respective entity sets are set to the number of times that entity set occurred at a given depth `. Although the star expansion process will generate a much larger
graph due to the large number of vertices, it proves to be useful for prediction and
aligning entity set vertices in a visual space.
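A minimal sketch of this step, assuming we already know how often each (depth, entity set) vertex occurred, only needs to emit the bipartite entity-to-entity-set edges introduced by the star expansion:

```python
# Minimal sketch of the star expansion: each (depth, entity set) vertex is linked
# to one individual-entity vertex per entity it contains, so entity sets that
# share entities become connected through those entity vertices.
def star_expand(set_counts):
    """set_counts: {(depth, frozenset_of_entities): occurrence_count}. Returns
    bipartite edges weighted by how often the entity set occurred at that depth."""
    star_edges = {}
    for (depth, entities), count in set_counts.items():
        for entity in entities:
            star_edges[((depth, entity), (depth, entities))] = count
    return star_edges

print(star_expand({(1, frozenset({"Germany", "Russia"})): 3}))
```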
This graph-model therefore represents the entities, their frequent combinations,
and the paths frequently used in their invocation over a set of threaded conversations.
Figure 3.4. The entity graph from Fig. 3.3 after star expansion: individual entity vertices (e.g., Germany, Russia, Europe, Donald_Trump) connect the entity-set vertices that contain them across depths 1–3.
Having generated these entity graphs we turn our attention to the three research
questions. RQ1 first asks if these entity graphs can be used to predict where a conver-
sation may lead. Clearly this is a difficult task, but recent advances in deep learning
and language models have led to major improvements and interest in conversational
AI [107], which has further led to the development of a number of models that utilize
entities and knowledge graphs [149] from various sources including Reddit [154]. The
main motivation of these tools is to use the topological structure of the knowledge
graphs (entities and their relationships) to improve a conversational agents’ ability to
more-naturally select the next entity in the conversation. The typical methodology
in related machine learning papers seeks to predict the next entity in some conversa-
tion [96]. In these cases, a dataset of paths through a knowledge graph is constructed
from actual human conversations as well as one or more AI models. Then a human
annotator picks the entity that they feel is most natural [96, 75].
Figure 3.5. Percent of the predictions made on the testing set that, on average, exist in the training set over 5 folds, as a function of conversation depth (`) for /r/news, /r/worldnews, /r/politics, /r/coronavirus, and /r/conservative. Higher is better.
Our methodology varies from these as we are not focused on building a machine learning model to predict these entities precisely. Our goal is to demonstrate broader patterns of people conversing over and through the topics. To
this end, we do not evaluate with a standard machine learning paradigm aiming to
optimize for metrics such as accuracy, precision, recall, etc. To demonstrate that
our entity graph captures broad patterns that can be further explored we perform
two tasks: (1) the generalization task and (2) a similarity prediction task. Each task
uses 5-fold cross validation where we split the entity graph into 80/20 splits for Htrain
and Htest respectively. We perform this cross validation in a disjoint manner with
the Reddit threads that we have extracted. This creates 5 different entity graphs,
one for each split, and validates the model’s generalization to unseen Reddit threads.
Although this disjoint split ensures the threads are separate, we do not consider the
temporal aspect of these threads.
The first task, generalization, gets at the heart of our broader question on the predictability of conversation paths. In this task we simply calculate the number of entity sets at each level in Htest that also appear in the same level in Htrain of our entity graph. Formally, we measure generalization as
$$1 - \frac{\lVert S_\ell \in H_{test} \setminus S_\ell \in H_{train} \rVert}{\lVert S_\ell \in H_{test} \rVert}$$
for each depth `.
In simple terms, generalization tells us, given an unseen conversation comment, if
the model can make a prediction from the given comment by matching at least one
entity in our entity graph model. This task therefore validates how well the model
captures general conversation patterns by matching at the entity level.
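The measure itself is simple to compute; the sketch below implements it for entity graphs stored as per-depth collections of entity sets (here matching entire entity sets, a simplification of the entity-level matching described above):

```python
# Sketch of the generalization measure: the fraction of test entity sets at each
# depth that also appear at the same depth in the training split of the graph.
def generalization(test_sets_by_depth, train_sets_by_depth):
    """Both arguments map depth -> iterable of entity sets (frozensets)."""
    scores = {}
    for depth, test_sets in test_sets_by_depth.items():
        train = set(train_sets_by_depth.get(depth, []))
        missing = sum(1 for s in test_sets if s not in train)
        scores[depth] = 1 - missing / len(test_sets)   # assumes a non-empty test level
    return scores

test = {1: [frozenset({"Germany", "Russia"}), frozenset({"Russia"})]}
train = {1: [frozenset({"Russia"})]}
print(generalization(test, train))   # {1: 0.5}
```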
Results of this analysis are shown in Fig. 3.5 where color and shape combinations
indicate the subreddit and ` is represented along the x-axis. Error bars represent
the 95% confidence interval of the mean across the 5 folds. We find that the entity
graph captures much more of the information early in conversations. As the depth
increases to three and beyond, we note a sharp drop in the overlap between the test
and training sets. The widening confidence interval also indicates that the amount
of information varies based on the test set. From these results, we conclude that analyzing the flow of an unseen conversation early on is reasonable, but making predictions deeper in the conversation may be difficult because key entities may be missing from the entity graph.
The second task, similarity prediction, measures the similarity between a predicted entity set and the actual entity set. This methodology uses the entity embeddings from the E2E entity linking model to represent the entities in a vector space. For each root in Htest we find its matching root in Htrain; if a match does not exist, we discard that thread and move on. Then we make the Markovian assumption and perform probabilistic prediction for each path in the training set via Pr(S`+1(cy) | S`(cx)), i.e., the empirical probability of a conversation moving to S`+1(cy) given the conversation is currently at S`(cx) in Htrain. The probability for each transition is based on the edge weights that we captured during the graph construction step. As determined in the previous experiment, entity sets are increasingly unlikely to match exactly as
the depth increases; so rather than a 0/1 loss, we measure the Word Mover's Distance (WMD) between the predicted entities and the actual entities [84].
Figure 3.6. Box plots of Word Mover's Distance (WMD) as a function of the conversation depth (`) for /r/news, /r/worldnews, and /r/politics. Lower is better. Box plots represent the WMD-error of entity representations predicted by the narrative hypergraph over all entities, over all depths, over five folds.
Results for this comparison are shown in Fig. 3.6 for three of the larger subreddits.
We again find that as the depth of the conversation increases the distance between
our predicted tree and the ground truth entities rises. These results indicate that as a conversation continues, the variety of topics discussed tends to increase. Therefore, predictions are unlikely to align very well with those of the true conversation. This is
most clearly seen in the /r/politics plot in Fig. 3.6, where we note a sharp increase
in the later parts of the conversation. If the variety of topics was consistent, then we
would expect the WMD to stay relatively flat throughout the conversation depth.
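As a rough illustration of the prediction step (names are illustrative, not the original code), the next entity set can be drawn from the empirical transition distribution stored on the entity graph's edges; the entities of the predicted set are then compared to those of the true next set using WMD over the E2E entity embeddings.

```python
# Illustrative Markovian prediction over the entity graph: the empirical
# distribution Pr(S_{l+1} | S_l) comes directly from the aggregated edge weights.
def next_set_distribution(entity_graph, current_vertex):
    """entity_graph: {(parent_vertex, child_vertex): weight}."""
    successors = {child: w for (parent, child), w in entity_graph.items()
                  if parent == current_vertex}
    total = sum(successors.values())
    return {child: w / total for child, w in successors.items()} if total else {}

def predict_next_set(entity_graph, current_vertex):
    """Most probable next (depth, entity set) vertex, or None if unseen."""
    dist = next_set_distribution(entity_graph, current_vertex)
    return max(dist, key=dist.get) if dist else None
```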
Next, we investigate RQ2 through a visualization of the entity graph. Recall that
the entity graph contains entity sets over the depths of the conversation. Specifically,
we seek to understand what conversations on Reddit look like. Do they splinter,
narrow, or behave in some other way? We call the set of visual paths conversation
traversals because they indicate how users traverse the entity graph.
We generate these visual conversation traversals using a slightly modified force di-
rected layout [51]. Graph layout algorithms operate like graph embedding algorithms
LINE, node2vec, etc, but rather than embedding graphs into a high dimension space,
visual graph layout tools embed nodes and edges into a 2D space. In our setting
we do make some restrictions to the algorithm in order to force topics to coalesce
into a visually meaningful and standardized space. Specifically, we fix the position
of each vertex in our graph on the x-axis according to `. As in Fig. 3.4, individual
entity vertices always occur to the left of entity set vertices, making the visualization
illustrate how conversations flow from the start to finish in a left to right fashion.
This restriction forces the embedding algorithm to adjust the position only on
the y-coordinate, and this is necessary to allow the individual entity to entity set
edges from the star-expansion to pull entity set vertices close together if and only if
they share many common entities. Loosely connected or disconnected entities will
therefore not be pulled together. As a result, the y-axis tends to cluster entities and
entity-sets together in a semantically meaningful way.
Embedding algorithms are typically parameterized with a learning rate parameter
that determines how much change can happen to the learned representation at each
iteration. Because we want entities to be consistent horizontally, we modify the
learning rate function to increasingly dampen embedding updates over 100 iterations
per depth. For example, given an entity graph of depth L = 10, we would expect 1,000
iterations total. We initially allow all entities and entity sets to update according to
the default learning rate, but as the iterations increase to 100 the learning rate of the
entities and entity sets at ` = 1 will slowly dampen and eventually lock into place
at iteration 100. When these entities and entity sets lock we also lock those same
entities and entity sets at all other depths. This ensures that each of these entities
Figure 3.7. Entity graph showing the visual conversation traversals from
/r/news. This illustration shows the paths of conversations over entity sets.
The x-axis represents the depth of the conversation; entity sets are
clustered into a semantically meaningful space along the y-axis. Inset
graph highlights five example entity sets and their connecting conversation
paths. Node colors represent equivalent entity sets. In this example we
highlight how entity sets are placed in meaningful semantic positions in
relation to one another.
and entity sets will be drawn as a horizontal line at the given y position.
Then, from iterations 100-200, the learning rate of the entities and entity sets at
` = 2 will slowly dampen and eventually lock into place at iteration 200. Meanwhile
the entities and entity sets at deep levels will continue to be refined. In this way,
the semantically meaningful y-coordinates tend to propagate from left to right as the
node embedding algorithm iterates.
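A rough sketch of this damping schedule, under the assumption of exactly 100 layout iterations per depth and a linear decay, is shown below; the exact decay curve used by the modified layout may differ.

```python
# Rough sketch of the per-depth damping schedule (assuming 100 layout iterations
# are allotted to each depth and that the decay within a window is linear).
def layout_learning_rate(iteration, depth, base_lr=1.0, iters_per_depth=100):
    lock_at = depth * iters_per_depth           # iteration at which this depth locks
    window_start = lock_at - iters_per_depth    # damping begins here
    if iteration >= lock_at:
        return 0.0                               # locked into place
    if iteration < window_start:
        return base_lr                           # not yet damping
    # linear decay from base_lr down to 0 across the damping window
    return base_lr * (lock_at - iteration) / iters_per_depth

# Example: vertices at depth 2 still move freely at iteration 50,
# are half-damped at iteration 150, and are frozen from iteration 200 onward.
print(layout_learning_rate(50, 2), layout_learning_rate(150, 2), layout_learning_rate(200, 2))
```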
One complication is that the sheer number of entities and the conversation paths
over the entities is too large to be meaningful to an observer. So we do not draw the
entity-nodes generated by the star-expansion and instead opt to rewire entity sets
based on the possible paths through the individual entity nodes. We also tune the
edge opacity based on the edge weights.
We draw the resulting graph with D3 to provide an interactive visualization [10].
Conversation traversals of the entity graph generated from /r/news are illustrated in Fig. 3.7. This illustration is cropped to remove the four deepest vertical axes (on the right) and is also cropped to show the middle half of the illustration. A zoomed-in version highlights some interesting entity sets present in the /r/news conversation. Recall that the entity sets are consistent horizontally, so the red circles on the left and the right of the inset plot both indicate the entity set with Donald Trump; likewise, the blue circles on the left and the right of the inset both represent Barack Obama. Edges moving visually left to right indicate topical paths found in online discourse. In the /r/news subreddit, which tracks only US news, Donald Trump and Barack Obama are frequently visited, but so too are national entities like United States (not highlighted), Iraq, and others. It is difficult to see from this illustration, but the expanded interactive visualization shows a common coalescing pattern where large sets of entities and unique combinations of ideas typically coalesce into more simple singleton entities like Barack Obama or United States.
Figure 3.8. Entity graph example of spreading activation on /r/news when
Barack Obama is selected as the starting entity. The x-axis represents the
(threaded) depth at which each entity was mentioned within conversations
rooted at Barack Obama. The y-axis represents the semantic space of each
entity, i.e., similar entities are closer than dissimilar entities on the y-axis.
Node colors represent equivalent entity sets. In this example, we observe
that conversations starting from Barack Obama tend to center around the United States, political figures such as Donald Trump, and discussion around whether his religion is Islam.
Spreading activation is a technique, rooted in cognitive science, for modeling the relatedness between different concepts.
Formally, spreading activation works by specifying two parameters: (1) a firing
threshold F ∈ [0, . . . , 1] and (2) a decay factor D ∈ [0, . . . , 1]. The vertex/entity set
selected by a user will be given an initial activation Ai of 1. This is then propagated
to each connected vertex as Ai ×wj ×D where wj is the weight of each edge connection
to the corresponding vertex. Each vertex will then acquire its own activation value Ai
based on the total amount of signal received from all incoming edges. If a vertex has
acquired enough activation to exceed the firing threshold F , it too will fire further
propagating forward through the graph. In the common setting, vertices are only
allowed to fire once and the spreading will end once there is no more vertices to
activate.
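A compact sketch of this procedure over a weighted, directed entity graph (stored here, for illustration, as a dictionary of edges with weights normalized to [0, 1]) might look as follows.

```python
# Sketch of spreading activation over the entity graph, with edges given as
# {(source_vertex, target_vertex): weight} and weights normalized to [0, 1].
def spread_activation(edges, start, firing_threshold=0.2, decay=0.8):
    activation = {start: 1.0}    # user-selected vertex gets initial activation of 1
    fired = set()
    frontier = [start]
    while frontier:
        vertex = frontier.pop()
        if vertex in fired:
            continue             # vertices are only allowed to fire once
        fired.add(vertex)
        for (src, dst), weight in edges.items():
            if src != vertex:
                continue
            # propagate A_i * w_j * D along each outgoing edge
            activation[dst] = activation.get(dst, 0.0) + activation[vertex] * weight * decay
            if activation[dst] >= firing_threshold and dst not in fired:
                frontier.append(dst)   # enough activation acquired: this vertex fires too
    return activation
```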
In our work we use spreading activation as a method for a user to select a starting
topic/entity set within the illustration of conversation traversals. The spreading ac-
tivation function will then propagate the activation of entities along the conversation
paths to highlight those that are most likely to activate from a given starting point.
Because we permit the entity graph to be constructed (and labeled) from multiple
subreddits, we can also use the spreading activation function to compare and contrast
how users from different subreddits activate in response to a topic.
After spreading activation has been calculated, our interactive visualization tool
removes all vertices and links that are not part of the activated portion of the graph.
All of the vertices involved in spreading activation will have their size scaled based
on how much activation they received. An example of this is cropped and illus-
trated in Fig. 3.8, which shows how spreading activation occurs when the entity set
Barack Obama is activated within /r/news. Here we see that conversations starting
with (only) Barack Obama tend to move towards discussions about the United States.
We also note that the Islam entity is semantically far away from Barack Obama and Donald Trump, as indicated by its placement on the y-axis. The results from using spreading activation allow for a much more granular investigation of conversational flow.
These granular levels of conversational flow demonstrate that an individual can search
for patterns related to influence campaigns, echo chambers and other social media
maladies across a number of topics.
Although a full analysis of this difficult topic is not within the purview of the current work, we do perform a comparative analysis of /r/Conservative and /r/politics as proxies for comparing conservative and liberal groups, respectively. We pay partic-
ular attention to determining the particular topics and entities that each group tends
to go towards later (deeper) in the conversation. Such a comparative analysis may be
key to understanding how coordinated influence campaigns orient the conversation
of certain groups or de-rail them.
The comparative illustration using spreading activation was used at the beginning
of the paper in Fig. 3.1 and is not re-illustrated in this section. The illustration yields
some interesting findings. While one might expect /r/Conservative to discuss members or individuals related to the Republican Party, we instead find that conversations tend to migrate toward mentions of liberal politicians (e.g., Joe Biden), indicated by red lines in Fig. 3.1. The reverse holds true as well: mentions of Joe Biden lead towards mentions of the Republican Party by the liberal group, as indicated by the blue line connecting the two. A brief inspection of the underlying comments reveals that users in each subreddit tend to talk in a negative manner towards the other party's politicians. This is a clear example of affective polarization [72] being captured by
our visualization tool. Affective polarization is where individuals organize around
principles of dislike and distrust towards the out-group (the other political party)
even moreso than trust in their in-group.
Another finding we observe is the more pronounced usage of the United States entity by conservatives than by liberals. This observation could be explained by the finding that conservatives show a much larger degree of overt patriotism than liberal individuals [69], which has more recently led to a renewed interest in populism and nationalism [39].
Scenario 2: US news and Worldnews In our second scenario, we compare the con-
versations from /r/news (red) and /r/worldnews (blue), which are geared towards
US-only news and non-US news respectively.
The comparison between these subreddits reveals unsurprising findings. A much
larger portion of the entity sets come from /r/worldnews as they discuss a much
broader range of topics. Many of the entity transitions that are dominated by
/r/worldnews come from discussions of other countries, events, and people outside
of the United States. The aspects that are shown to come primarily from /r/news
are topics surrounding the United States, China, and major political figures from the
United States. An example of this can be seen in Fig. 3.9 which illustrates spreading
activation starting from White House. Here, the dominating red lines, which reflect transitions from within conversations on /r/news, converge to United States, even after topics like Russia or Islam are discussed. An interesting side note is that many of the unlabeled entities entering the conversation via blue lines (/r/worldnews) at ` = 5 and ` = 6 represent other countries such as Canada, Japan, Mexico, and Germany. The findings from this comparative analysis are not particularly surprising, but they do show that the entity graph captures the patterns one would expect to find when comparing these two subreddits of interest.
Scenario 3: COVID and Vaccines Our final analysis focuses on comparing a single
subreddit, /r/Coronavirus, but during two different time periods. A large amount of work has analyzed COVID-19 discourse online, looking at partisanship [109], user reaction to misinformation [76], and differences in geographic concerns [62]. The first segment (highlighted in red) comes from the period of January through
June in 2020, which was during the emergence of the novel Coronavirus. Although
the /r/Coronavirus subreddit had existed for many years prior, it became extremely
active during this time. The second segment was from the following year, January through June 2021. This time period corresponded to the development, approval, and early adoption of vaccines.
Figure 3.9. Illustration of an entity graph created from threaded conversations from /r/news (red-edges) and /r/worldnews (blue-edges). The x-axis represents the (threaded) depth at which each entity set was mentioned within conversations rooted at White House. The y-axis represents the semantic space of each entity, i.e., similar entities are closer than dissimilar entity sets on the y-axis. Node colors represent equivalent entity sets. Conversations in /r/news tend to coalesce to United States, while conversations in /r/worldnews tend to scatter into various other countries (unlabeled black nodes connected by thin blue lines).
Our analysis of this visualization yielded some interesting findings related to the
coronavirus pandemic that we illustrate in Fig. 3.10. If we begin spreading activation
from the perspective of United States we find that most of the discussion leads to
China and Italy in 2020, which appears reasonable because of China and Italy’s early
struggles with virus outbreaks. In comparison, the 2021 data appeared more likely
to mention Sweden, India, and Germany, which had severe outbreaks during those
months. Our findings from spreading activation allow us to capture the shift in the countries of interest from 2020 to 2021 as the pandemic progressed.
3.6 Conclusion
In the current work we presented a new perspective by which to view and think
about online discourse. Rather than taking the traditional social networks view where
information flows over the human participants, our view is to consider human con-
versations as stepping over a graph of concepts and entities. We call these discourse
maps entity graphs and we show that they present a fundamentally different view of
online human communication.
Taking this perspective we set out to answer three research questions about (1)
discourse prediction, (2) illustration, and (3) behavior comparisons between groups.
We found that discourse remains difficult to predict, and this prediction gets harder
the deeper into the conversation we attempt predictions. We demonstrate that the
visual conversation traversals provide a view of group discourse, and we find that
online discourse tends to coalesce into narrow, simple topics as the conversation
deepens – although those topics could be wildly different from the starting topic. Finally,
we show that the spreading activation function is able to focus the visualization to
provide a comparative analysis of competing group dynamics.
Figure 3.10. Comparison between the first 6 months of /r/Coronavirus
from 2020 to 2021. Illustration of an entity graph created from threaded
conversations from /r/Coronavirus in Jan–June of 2020 (red-edges) and
from Jan–June of 2021 (blue-edges). The x-axis represents the (threaded)
depth at which each entity set was mentioned within conversations rooted
at United States. The y-axis represents the semantic space of each entity
set, i.e., similar entity sets are closer than dissimilar entity sets on the
y-axis. Node colors represent equivalent entity sets. Conversations tended
to focus on China and Italy early in the pandemic, but turned towards a broader topic space later in the pandemic.
3.6.1 Limitations
While the work in its current state is helpful for better understanding conver-
sations, it is not without its limitations. Foremost, in the present work we only
considered conversations on Reddit. Another limitation is that the entity linking
method we chose is geared towards high-precision at the cost of low-recall. This
means that we can be confident that the entities extracted in the conversations are
mostly correct, but we have missed some portion of entities. The recall limitation
does inhibit the total number of entities we were able to collect; a better system
would provide for better insights in our downstream analysis. This issue can also be
highlighted with the long tail distribution of entities and the challenges this poses to
current methods [71]. An entity linking model that focuses on recall may still result
in useful graphs as prior works have found that many of the entities are considered
“close enough” even when they are not a perfect match to ground truth data [43].
Using a different entity linking model could lead to different patterns being extracted by our method. A model that optimizes for higher recall could create a much larger entity graph, though it would likely contain a fair amount of noise due to the precision-recall trade-off.
Another limitation inherent to the present work is the consideration of conver-
sations as threaded trees. This is an imperfect representation of natural, in-person
conversation, and still different from unthreaded conversations like those found on
Twitter and Facebook, which may require a vastly different entity graph construction
method. Finally, the interactive visualization tool is limited in its ability to process
enormous amounts of conversation data because of its reliance on JavaScript libraries
and interactive browser rendering.
3.6.2 Future Work
These limitations leave open avenues for further exploration in future work. Our
immediate goals are to use the entity graphs to better understand how narratives are
crafted and shaped across communities. Improvements in the entity linking process
and addition of concept vertices, pronoun anaphora resolution, threaded information
extraction and other advances in SocialNLP will serve to improve the technology
substantially. We also plan to ingest other threaded conversational domains such
as Hackernews, 4chan, and even anonymized email data. Extensions of this work
could also include capturing more information between entity transitions, such as the sentiment overlaid on a given entity or group of entities. This extra information
could allow us to create entity graphs that not only show the transition but also how
various groups speak and feel about those specific entities.
CHAPTER 4
Large language models like BERT [41] have significantly pushed the boundaries
of Natural Language Understanding (NLU) and created interesting applications such
as automatic ticket resolution [91]. A key component of such systems is a virtual
agent’s ability to understand a user’s intent to respond appropriately. Successful
implementation and deployment of models for these systems requires a large amount of labeled data. Although deployment of these systems often generates a large amount of data that could be used for fine-tuning, the cost of labeling this data is too high. Semi-supervised learning methodologies are an obvious solution because they can significantly reduce the amount of human effort required to train these kinds of models [85, 158], especially in image classification tasks [150, 101, 146]. However, as we shall see, applying these models to NLU and intent classification is difficult because of the label distribution.
Indeed, the research most closely related to the present work is the Slot-List model by Basu et al. [6], which focuses on the meta-learning aspect of semi-supervised learning rather than using unlabeled data. In a similar vein, the GAN-BERT [36] model shows that an adversarial learning regime can be devised to ensure that the extracted BERT features are similar amongst the unlabeled and the labeled data sets, substantially boosting classification performance. Other methods have investigated how data augmentation can be applied to the NLP domain to enforce consistency in the
models [28], and several other methods have been proposed from the computer vision community. However, a recent empirical study found that many of these methods do not provide the same benefit to NLU tasks as they provide to computer vision tasks [29] and can even hinder performance in certain instances.
Figure 4.1. Example of pseudo label selection when using a threshold (top) versus the top-k sampling strategy (bottom). In this toy scenario, we chose k = 2, where each class is represented by a unique shape. As the threshold selection strategy pseudo-labels data elements (shown as yellow) that exceed the confidence level, the model tends to become biased towards classes that are easier to predict. This bias causes a cascade of mis-labels that leads to even more bias towards the majority class.
Intent classification remains a challenging problem for multiple reasons. Generally, the number of intents a system must consider is relatively large, with sixty classes or more. On top of that, most queries consist of only a short sentence or two. This means models need many examples in order to learn the nuances between different
intents within the same domain. In the semi-supervised setting, many methods set a
confidence threshold for the model and assign pseudo-labels to the unlabeled data if
their confidence is above the threshold [129]. This strategy permits high-confidence
pseudo-labeled data elements to be included in the training set, which typically re-
sults in performance gains. Unfortunately, this approach also causes the model to
become overconfident for classes that are easier to predict. The issue is more pro-
nounced for intent classification because of feedback loops that can quickly cause the
model to become biased towards a small number of classes.
In the present work, we describe the Top-K K-Nearest Neighbor (TK-KNN)
method for training semi-supervised models. The main idea of this method is illus-
trated in Figure 4.1. TK-KNN makes two improvements over other pseudo-labeling
approaches. First, to address the model overconfidence problem, we use a top-k sam-
pling strategy when assigning pseudo-labels. Second, we enforce a balanced set of
classes by taking the top-k predictions per class, not simply the top-k overall predic-
tions. Furthermore, when selecting the top-k examples the sampling strategy does not
simply rely on the model’s predictions, which tend to be noisy. Instead we leverage
the embedding space of the labeled and unlabeled examples to find those with simi-
lar embeddings and combine them with the models’ predictions. Experiments using
standard performance metrics of intent classification are performed on three datasets:
CLINC150 [86], Banking77 [20], and Hwu64 [89]. We find that the TK-KNN method
outperforms existing methods in most scenarios and performs exceptionally well in
the low-data scenarios.
Intent Classification The task of intent classification has attracted much atten-
tion in recent years due to the increasing use of virtual customer service agents.
Recent research into intent classification systems has mainly focused on learning out
of distribution data [151, 155, 31, 157]. These techniques configure their experiments
to learn from a reduced number of the classes and treat the remaining classes as
out-of-distribution during testing. Although this research is indeed important in its
own regard, it deviates from the present work’s focus on semi-supervised learning.
Another major drawback of this selection method is that the model can become
very biased towards the easy classes in the early iterations of learning [2]. Recent
methods, such as FlexMatch [152], have discussed this problem at length and at-
tempted to address this issue with a curriculum learning paradigm that allows each
class to have its own threshold. These thresholds tend to be higher for majority classes and lower for less-common classes. However, this only serves to exacerbate the problem because the less-common classes will have less-confident labels. A previous work by Zou et al. [162] proposes a similar class-balancing parameter to be learned per class, but it is applied to the task of unsupervised domain adaptation. The closest previous work to ours is co-training [98], which iteratively adds a single example from each class throughout the self-training process.
The TK-KNN strategy described in the present work addresses these issues by
learning the decision boundaries for all classes in a balanced way while still giving
preference to accurate labels by considering the proximity between the labeled and
the unlabeled examples in the embedding space.
the label, or not perturb the data enough to help regularize the model. While our
work does not focus on leveraging consistency to improve model performance, we
labeled example per class. This scarcity of labeled data makes forming reliable clus-
ters quite challenging. Therefore, the TK-KNN model described in the present work
adapted the K-Nearest Neighbors search strategy to help guide our pseudo-labeling
process.
As described above, we first employ pseudo labeling to iteratively train (and re-
train) a model based on its most confident-past predictions. In the first training cycle,
72
Labeled Examples Unlabeled Examples
Text Intent Text Predicted Intent
1. Train 2. Predict
Is there a carry on item weight Carry On How do you setup direct Direct Deposit
Model Unlabeled deposit? (Prob 0.78)
limit?
BERT
… … …
Text Pseudo Setup direct deposit 0.62 Intent: Direct Deposit Intent: Carry On
5. Select for me. 4. Rank the
Label
top-k examples
How do you setup direct Direct
deposit? Deposit Intent: Carry On
Text Cosine Score
I would like to setup direct Direct
deposit. Deposit What are the carry on 0.99
rules?
What are the carry on rules? Carry On
Tell me the carry on 0.89
restrictions for United.
Tell me the carry on restrictions Carry On
for United. Labeled Examples Selected
Carry on restrictions for 0.74 Closest
Air Emirates. Unlabeled Examples Neighbor
Figure 4.2: TK-KNN overview. The model is (1) trained on the small portion of
labeled data. Then, this model is used to predict (2) pseudo labels on the unlabeled
data. Then the cosine similarity (3) is calculated for each unlabeled data point with
respect to the labeled data points in each class. Yellow shapes represent unlabeled
data and green represent labeled data. Similarities are computed and unlabeled
examples are ranked (4) based on a combination of their predicted probabilities and
cosine similarities. Then, the top-k (k = 2) examples are selected (5) for each class.
These examples are finally added (6) to the labeled dataset to continue the iterative
learning process.
the model is trained on only the small portion of labeled data X. In the subsequent
cycles, the model is trained on the union of X and a subset of the unlabeled data
U that has been pseudo-labeled by the model in the previous cycle. Figure 4.2
illustrates an example of this training regime with the TK-KNN method.
We use the BERT-base [41] model with a classification head added on top.
The classification head consists of a dropout layer, a linear layer followed by
dropout, and an output layer over the dataset's class set C.
Other BERT-like models should also work in this framework; we select the
BERT-base model for a fair comparison with other methods.
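To make this concrete, the following is a minimal sketch using the Hugging Face transformers library. The width of the intermediate linear layer, the dropout rate, and the use of the mean-pooled sentence representation as the classifier input are illustrative assumptions rather than the exact configuration used here.

import torch
import torch.nn as nn
from transformers import AutoModel

class IntentClassifier(nn.Module):
    """BERT-base encoder with a small classification head on top (a sketch)."""

    def __init__(self, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size  # 768 for BERT-base
        # Head: dropout -> linear -> dropout -> output layer over the class set C.
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden),
            nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool the final layer over non-padding tokens to obtain the
        # sentence representation z used later for the KNN search.
        mask = attention_mask.unsqueeze(-1).float()
        z = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.head(z), z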
When applying pseudo-labeling, it is often observed that some classes are easier
to predict than others. In practice, this causes the model to become biased towards
the easier classes [2] and perform poorly on the more difficult ones. The Top-K
sampling process within the TK-KNN system seeks to alleviate this issue by growing
the pseudo-label set across all labels together.
When we perform pseudo-labeling, we select the top-k predictions per class from
the unlabeled data. This selection neither uses nor requires a confidence threshold;
instead, each class is limited to the predictions with the highest confidence. We rank
each predicted data element with a score based on the model's predicted probability.
After each training cycle, the number of pseudo labels in the dataset will have
increased by k times the number of classes. This process continues until all examples
are labeled or some number of pre-defined cycles has been reached. We employ
standard early stopping criteria [104] during each training cycle to determine whether
or not to stop training.
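A minimal sketch of this balanced selection step is given below; the function and variable names are hypothetical, and the score used here is the model's predicted probability alone (the KNN-weighted score is introduced in the next subsection).

import numpy as np

def select_top_k_per_class(probs: np.ndarray, k: int):
    """probs: (num_unlabeled, num_classes) softmax outputs.
    Returns indices and pseudo-labels of, for every class, the k unlabeled
    examples predicted for that class with the highest confidence.
    No confidence threshold is applied."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    chosen_idx, chosen_lbl = [], []
    for c in range(probs.shape[1]):
        members = np.where(preds == c)[0]
        # Rank this class's candidates by confidence and keep at most k of them.
        top = members[np.argsort(-conf[members])][:k]
        chosen_idx.extend(top.tolist())
        chosen_lbl.extend([c] * len(top))
    return np.array(chosen_idx), np.array(chosen_lbl)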
4.2.4 KNN-Alignment
Although our top-k selection strategy helps alleviate the model's bias, it still relies
entirely on the model predictions. To enhance our top-k selection strategy, we utilize
a KNN search to modify the scoring function that is used to rank which pseudo-
labeled examples should be included in the next training iteration. The intuition for
the use of the KNN search comes from the findings in [159], where “closer” instances
are more likely to share the same label based on the neighborhood information when
some labels are corrupted, as often occurs in semi-supervised learning under the
pseudo-labeling strategy.
Specifically, we extract a latent representation from each example in our training
dataset, both the labeled and unlabeled examples. We formulate this latent repre-
sentation in the same way as Sentence-BERT [110] to construct a robust sentence
representation. This representation is defined as the mean-pooled representation of
the final BERT layer, which we formally define as:

\[
z = \frac{1}{M + 1}\left(\mathrm{CLS} + \sum_{i=1}^{M} T_i\right) \tag{4.2}
\]

where CLS is the class token, T_i is the representation of each token in the sequence, M is the sequence
length, and z is the extracted latent representation. When we perform our pseudo
labeling process we extract the latent representation for all of our labeled data X as
well as our unlabeled data U .
For each unlabeled example, we calculate the cosine similarity between its latent
representation and the latent representations of the labeled counterparts belonging
to the predicted class.
The highest cosine similarity score between the unlabeled example and its labeled
neighbors is used in calculating the score of that unlabeled example. An additional
hyperparameter, β, controls the weighting between the model's prediction and the cosine
similarity in the final scoring function.
With these scores we then follow the previously discussed top-k selection strategy
to ensure balanced classes. The addition of the K-nearest-neighbor search helps us
select more accurate labels early in the learning process. We provide pseudocode
for our pseudo-labeling strategy in Algorithm 1.
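A sketch of this scoring step is shown below, reusing the mean-pooled embeddings from the architecture sketch above. The exact form of the scoring function, in particular whether β weights the predicted probability or the cosine similarity, is an assumption of this sketch.

import numpy as np

def knn_weighted_scores(unlab_emb, unlab_probs, lab_emb, lab_labels, beta=0.75):
    """Combine each unlabeled example's predicted probability with its highest
    cosine similarity to a labeled example of the predicted class."""
    # Normalize so that dot products become cosine similarities.
    unlab_emb = unlab_emb / np.linalg.norm(unlab_emb, axis=1, keepdims=True)
    lab_emb = lab_emb / np.linalg.norm(lab_emb, axis=1, keepdims=True)
    preds = unlab_probs.argmax(axis=1)
    conf = unlab_probs.max(axis=1)
    scores = np.zeros(len(unlab_emb))
    for i, (pred, z) in enumerate(zip(preds, unlab_emb)):
        neighbors = lab_emb[lab_labels == pred]  # labeled examples of the predicted class
        best_sim = float((neighbors @ z).max()) if len(neighbors) else 0.0
        # Assumed convention: beta weights the probability, (1 - beta) the similarity.
        scores[i] = beta * conf[i] + (1.0 - beta) * best_sim
    return scores, preds

The resulting scores would then be passed to the same per-class top-k selection sketched earlier.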
As we use the cosine similarity to help our ranking method, we want to ensure
that similar examples are grouped together in the latent space. While the cross-entropy
loss is an ideal choice for classification, as it incentivizes the model to produce accurate
predictions, it does not guarantee that the discriminative features our pseudo-labeling
relies on will be learned. To address this issue, we supplemented the
cross-entropy loss with a supervised contrastive loss [80] and a differential entropy
regularization loss [116], and trained the model using all three losses jointly.
\[
\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) \tag{4.4}
\]
We select the supervised contrastive loss [80] because it ensures our model
learns discriminative features by maximizing inter-class separation and minimizing
intra-class distance. This helps the model learn representations in the latent space
that separate examples belonging to different classes. The supervised contrastive
loss relies on augmentations of the original examples; to obtain these augmentations
we simply apply dropout to the representations that we extract from the model.
As is standard for the supervised contrastive loss, we add a separate projection
layer to our model to align the representations. The representation fed to the
projection layer is the mean-pooled BERT representation shown in Eq. 4.2.
This ensures that our model will learn good sentence representations that can be
used to select similar examples.
\[
\mathcal{L}_{SCL} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_p)/\tau\big)}{\sum_{a \in A(i)} \exp\!\big(\mathrm{sim}(z_i, z_a)/\tau\big)} \tag{4.5}
\]
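For reference, a compact sketch of this loss over a batch of projected representations is given below. It follows the general formulation of [80] rather than reproducing the exact implementation used here, and the temperature value is arbitrary.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, tau=0.1):
    """z: (N, d) projected representations, with the dropout-augmented views of
    each example concatenated into the same batch; labels: (N,) class ids."""
    z = F.normalize(z, dim=1)
    sim = (z @ z.T) / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Average the log-probabilities over each anchor's positive set P(i).
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_mask.sum(1).clamp(min=1)
    valid = pos_mask.any(dim=1)  # only anchors with at least one positive
    return per_anchor[valid].mean()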
When adopting the contrastive loss, previous works [46] have discussed how the
model's representations can collapse in some dimensions as a result of the loss. We
follow this work in adopting a differential entropy regularizer in order to spread the
representations out more uniformly. The method we use is based on the Kozachenko
and Leonenko [83] differential entropy estimator:
\[
\mathcal{L}_{KoLeo} = -\frac{1}{N} \sum_{i=1}^{N} \log(p_i) \tag{4.6}
\]
where $p_i = \min_{i \neq j} \|f(x_i) - f(x_j)\|$. This regularization helps to maximize the
distance between each point and its neighbors. By doing so, it helps to alleviate
the collapse issue. We combine this term with the cross-entropy and contrastive
objectives, weighting it using a coefficient γ.
The joint training of these individual components leads our model to learn more
robust discriminative features, which results in improved generalization and
prediction accuracy.
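A sketch of the regularizer in Eq. 4.6 and of the joint objective is given below. It reuses the contrastive loss sketch above; the normalization of the representations, the placeholder value of γ, and the simple summation of the three terms are assumptions.

import torch
import torch.nn.functional as F

def koleo_regularizer(z, eps=1e-8):
    """Kozachenko-Leonenko style term: -1/N * sum_i log(p_i), where p_i is the
    distance from z_i to its nearest neighbor in the batch."""
    z = F.normalize(z, dim=1)
    dists = torch.cdist(z, z)              # pairwise Euclidean distances
    dists.fill_diagonal_(float("inf"))     # ignore each point's distance to itself
    nearest = dists.min(dim=1).values
    return -torch.log(nearest + eps).mean()

def joint_loss(logits, labels, z_proj, gamma=0.1, tau=0.1):
    # Cross-entropy (Eq. 4.4) + supervised contrastive (Eq. 4.5) + gamma * KoLeo (Eq. 4.6).
    return (F.cross_entropy(logits, labels)
            + supervised_contrastive_loss(z_proj, labels, tau)
            + gamma * koleo_regularizer(z_proj))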
TABLE 4.1
4.3 Experiments
Datasets We use three well-known benchmark datasets to test and compare the
TK-KNN model against other models on the intent classification task. Our intent
classification datasets are CLINC150 [86], which contains 150 in-domain intent classes
from ten different domains plus one out-of-domain class; BANKING77 [20], which
contains 77 intents, all related to the banking domain; and HWU64 [89], which includes
64 intents from 21 different domains. BANKING77 and HWU64 do not provide
validation sets, so we created our own from the original training sets. All datasets
are in English. A breakdown of each dataset is shown in Table 4.1.
We conducted our experiments with varying amounts of labeled data for each
dataset. All methods are run with five random seeds and the mean accuracy across
runs is reported [44]. This methodology permits tests of statistical significance, so
reported results are accompanied by 95% confidence intervals.
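For concreteness, the reported interval for one configuration can be computed from the per-seed accuracies as sketched below; the use of a t-based interval is an assumption about the exact procedure.

import numpy as np
from scipy import stats

def mean_and_ci(acc_per_seed, confidence=0.95):
    """acc_per_seed: test accuracies from the runs with different random seeds."""
    acc = np.asarray(acc_per_seed, dtype=float)
    # Half-width of the t-based confidence interval (n - 1 degrees of freedom).
    half_width = stats.t.ppf((1 + confidence) / 2, df=len(acc) - 1) * stats.sem(acc)
    return acc.mean(), half_width

# e.g. mean_and_ci([0.52, 0.55, 0.53, 0.54, 0.55]) -> (0.538, ~0.016)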
4.3.2 Baselines
• Supervised: Uses only the labeled portion of the dataset to train the model, without
any semi-supervised training. This model constitutes a competitive lower bound
on performance because of the limited amount of labeled data.
• Pseudo Labeling (PL) [88]: This strategy trains the model to convergence
then makes predictions on all of the unlabeled data examples. These examples
are then combined with the labeled data and used to re-train the model in an
iterative manner.
• Pseudo Labeling with Threshold (PL-T) [129]: This process follows the
pseudo labeling strategy but only selects unlabeled data elements which are
predicted above a threshold τ . We use a τ of 0.95 based on the findings from
previous work.
• Pseudo Labeling with Flexmatch (PL-Flex) [152]: Rather than using a
static threshold across all classes, a dynamic threshold is used for each class
based on a curriculum learning framework.
• GAN-BERT [36]: This method applies generative adversarial networks [60]
to a pre-trained BERT model. The generator is an MLP that takes in a noise
vector. The output head added to the BERT model acts as the discriminator
and includes an extra class for predicting whether a given data element is real
or not.
• MixText [28]: This method extends the MixUp [153] framework to NLP by
mixing together the hidden representations of BERT. The method also
takes advantage of consistency regularization in the form of back-translated
examples.
• TK-KNN: The method described in the present work, using top-k sampling
with a weighted selection based on model predictions and cosine similarity to
the labeled samples.
• Top-k Upper: The top-k sampling method, but one that always selects the correct
pseudo-label. This model serves as an upper bound.
Each method uses the BERT-base model with a classification head attached.
We use the base BERT implementation provided by Huggingface [142], which contains
a total of 110M parameters. All models are trained for 30 cycles of self-training.
The models are optimized with the AdamW optimizer with a learning rate of 5e-5.
Each model is trained until convergence by early stopping applied according to the
validation set. We use a batch size of 256 across experiments and limit the sequence
length to 64 tokens. For TK-KNN, we set k = 6 and β = 0.75 and report the results
for these settings. An ablation study of these two hyperparameters is presented later.
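A brief sketch of this setup, reusing the hypothetical IntentClassifier from the architecture sketch earlier, is shown below; the early-stopping patience is an assumed value.

import torch
from transformers import AutoTokenizer

# Settings reported above; patience is an assumption.
config = dict(lr=5e-5, batch_size=256, max_length=64, cycles=30, k=6, beta=0.75, patience=3)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = IntentClassifier(num_classes=150)   # e.g. the CLINC150 in-domain classes
optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])

batch = tokenizer(["how do i set up direct deposit?"], padding="max_length",
                  truncation=True, max_length=config["max_length"], return_tensors="pt")
logits, z = model(batch["input_ids"], batch["attention_mask"])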
Computational Use. In total, we estimate that we used around 18,000 GPU hours
for this project. For the final experiments and ablation studies, we estimate that the
TK-KNN model used 4,400 GPU hours. Experiments were carried out on Nvidia
Tesla P100 GPUs with either 12GB or 16GB of memory.
4.4 Results
Results from these experiments are shown in Table 4.2. These quantitative results
demonstrate that TK-KNN yielded the best performance on the benchmark datasets.
We observed the most significant performance gains for CLINC150 and BANKING77,
where these datasets have more classes. For instance, on the CLINC150 dataset
with 1% labeled data, our method performs 10.92% better than the second best
strategy, FlexMatch. As the portion of labeled data used increases, we notice that
the effectiveness of TK-KNN diminishes.
Another observation from these results is that the GAN-BERT model tends to
be unstable when the labeled data is limited. This causes the model to have a much
larger confidence interval than the other methods. However, GAN-BERT does improve
as the proportion of labeled data increases. We also find that, while the MixText
method shows improvements, the benefits of consistency regularization are not as
strong as those reported in the computer vision domain.
These results demonstrate the benefits of TK-KNN’s balanced sampling strategy
TABLE 4.2
Percent Labeled
Method 1% 2% 5% 10%
CLINC150
Supervised 27.35 ±1.71 49.15 ±1.99 67.96 ±0.85 75.05 ±1.57
PL 24.51 ±3.92 48.58 ±1.79 69.19 ±0.54 76.92 ±1.05
PL-T 39.05 ±3.26 56.65 ±1.53 71.25 ±0.5 79.29 ±1.62
PL-Flex 42.81 ±4.39 60.07 ±1.42 73.42 ±1.62 78.86 ±1.01
GAN-BERT 18.18 ±0.0 23.29 ±11.42 44.89 ±24.39 63.02 ±25.1
MixText 12.86 ±6.39 37.93 ±16.8 61.39 ±0.77 74.29 ±0.37
TK-KNN (Ours) 53.73 ±1.72 65.87 ±1.18 74.31 ±0.96 79.45 ±1.01
BANKING77
Supervised 34.73 ±1.5 47.51 ±2.89 70.27 ±1.08 80.82 ±0.41
PL 29.09 ±3.83 45.16 ±2.71 69.69 ±2.16 80.26 ±0.49
PL-T 35.12 ±3.86 51.67 ±3.14 71.16 ±1.98 81.88 ±0.43
PL-Flex 40.04 ±3.4 54.18 ±3.31 73.43 ±1.55 82.54 ±0.84
GAN-BERT 5.4 ±9.16 16.98 ±21.73 54.09 ±29.56 79.64 ±1.39
MixText 32.73 ±6.02 54.75 ±3.15 76.59 ±1.05 82.34 ±0.94
TK-KNN (Ours) 54.16 ±4.56 62.71 ±2.30 76.73 ±1.46 84.45 ±0.52
HWU64
Supervised 48.87 ±1.55 63.88 ±1.6 74.67 ±1.91 82.21 ±1.72
PL 48.46 ±1.86 64.39 ±1.66 75.76 ±1.69 82.49 ±0.94
PL-T 56.9 ±1.64 68.29 ±1.79 76.9 ±1.1 82.96 ±1.69
PL-Flex 60.15 ±3.27 69.87 ±0.93 77.99 ±1.4 83.83 ±1.2
GAN-BERT 33.36 ±16.55 32.9 ±29.07 72.32 ±1.41 81.78 ±1.64
MixText 33.3 ±8.98 56.46 ±11.08 66.65 ±7.28 79.72 ±1.27
TK-KNN (Ours) 65.33 ±2.29 73.03 ±1.31 79.63 ±0.56 84.59 ±0.58
Mean test accuracy results and their 95% confidence intervals across 5 repetitions
with different random seeds. All experiments used k = 6 and β = 0.75. TK-KNN
outperformed existing state-of-the-art models, especially when the label set is small.
[Figure: test accuracy versus self-training cycle for PL-T, PL-Flex, and TK-KNN.]
[Panels for CLINC150, BANKING77, and HWU64: test accuracy versus self-training cycle.]
Figure 4.4. Ablation results for each dataset using 1% labeled data.
Because TK-KNN differs from existing methods in two distinct ways, (1) top-k
balanced sampling and (2) KNN ranking, we perform a set of ablation experiments
to better understand how each of these affects performance. Specifically, we test
TK-KNN under three scenarios: top-k sampling without class balancing, top-k
sampling with balanced classes, and top-k KNN without class balancing. When
we perform top-k sampling in an unbalanced manner, we ensure that the total data
sampled is still equal to k ∗ C, where C is the number of classes.
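A brief sketch of this unbalanced variant, written against the probs array from the earlier selection sketch:

import numpy as np

def select_top_kc_unbalanced(probs: np.ndarray, k: int):
    """Ablation variant: take the k * C most confident pseudo-labels overall,
    without balancing across the predicted classes."""
    num_classes = probs.shape[1]
    order = np.argsort(-probs.max(axis=1))   # most confident first
    chosen = order[: k * num_classes]
    return chosen, probs.argmax(axis=1)[chosen]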
The results from the ablation study demonstrate the effectiveness of both top-k
sampling and KNN ranking. A comparison between the unbalanced and balanced
versions of top-k sampling shows a drastic difference in performance across all
datasets. We highlight again that the performance difference is greatest in the lowest-
resource setting, with a 12.47% increase in accuracy for CLINC150 in the 1% setting.
Results from the TK-KNN method with unbalanced sampling also show an improvement
over unbalanced sampling alone. This increase in performance is smaller
than the difference between unbalanced and balanced sampling, but it still highlights
the benefits of leveraging the geometry of the embedding space for selective pseudo-labeling.
[Figure: comparison of the β hyperparameter; test accuracy versus self-training cycle.]
From this comparison, we can see that no single setting was always the best, but the
model tended to perform worse when β = 0.0, highlighting the benefits of including our
KNN similarity for ranking. The model reached its best performance when β = 0.75,
which occurs about a third of the way through the training process.
Comparison of values for k shows that TK-KNN is robust to adjustments in this
hyperparameter. We notice slight performance benefits from selecting a higher k of
6 or 8 in comparison to 4. When a higher value of k is used, the model sees an
increase in performance earlier in the self-training process, as it has more examples
to train from. This is only achievable, though, when high-quality correct samples are
selected across the entire class distribution. If too large a value of k is selected,
more bad examples will be included early in the training process, which may result in
poor model performance.
[Figure: comparison of the k hyperparameter (k = 4, 6, 8); test accuracy versus self-training cycle.]
4.5 Conclusions
4.6 Limitations
CHAPTER 5
CONCLUSION
proposed a new method for semi-supervised intent detection. With this method I
showed how it was possible to leverage a very small amount of labeled data paired
with unlabeled data to improve model performance. My method relied on balanced
sampling of the classes and a KNN objective to improve pseudo label selection. Exper-
iments from this work demonstrated the benefit of this methodology against previous
works for semi-supervised learning, especially those catered towards natural language
processing.
Intent detection can be applied to a variety of tasks, but it is particularly important
for studying narratives: detecting manipulative content and aiding in the management
of false narratives that may lead to social harm.
131], due in part to the ever-changing conversations and the high cost of annotation. I
define intent on social media as an action an individual wants to take, or wants to
inform others to take. This form of messaging has become extremely prevalent within
social media narratives, as people urge others to action. For instance, during the
2017 French election numerous topics were fiercely debated between the two
candidates, with both using Twitter as their preferred platform to disseminate their
stances. To this end, intent could be defined in numerous ways and used with the TK-
KNN method. For this particular setup, intent could be viewed as users messaging
about voting for candidates, supporting or opposing issues, or more general emotion
detection. TK-KNN works as a short-text classifier and can be applied to any of
these setups. Furthermore, the intents of interest could be classes associated with
moral judgements, such as fairness, disgust, or pride.
Of particular interest is discovering new intents as they occur. Like many machine
learning methods, the semi-supervised technique operates on a closed-world
assumption that all of the classes are known at inference time. In reality, especially
for social media, new classes will need to be discovered as real-world events happen.
To address this, modifications would be needed to classify intents that have not been
seen before. These examples would need to be held out and then labeled in some
manner before being fed back to the semi-supervised algorithm to learn the new class.
Of particular interest would be to leverage large generative models, such as GPT-3, to
automatically label and discover these new classes. This would further minimize the
need for human annotation and could allow for a system that updates quickly and
entirely on its own.
To tackle this problem, in-context learning has become a burgeoning method for
addressing these situations [19]. A major driver of this shift was the rise of emergent
properties of language models as they are scaled up [141]. While many of these in-
context learning setups rely on just a few labeled examples, work is still being done
to understand how best to format these prompts. Prior methods have focused on
how best to retrieve good examples for in-context learning [115]. Such methods
have shown improved performance on tasks, but much is still not known about
the underlying mechanisms that make in-context learning work well. In particular, an
empirical study [95] has shown that removing accurate labels and replacing them
with random ones impacts overall performance only minimally. These findings
are important to future work in automated intent discovery because they highlight how
random labels can still improve the model's classification performance. Other
methods, such as channel prompting [94], flip the paradigm around by passing the
normally predicted portion to the model to force it to predict what the original input
was. This has been shown to increase performance across a variety of prompts and
demonstrates particularly strong performance when known labels are lacking.
BIBLIOGRAPHY
11. N. Botzer and T. Weninger. Entity graphs for exploring online discourse. arXiv
preprint arXiv:2304.03351, 2023.
12. N. Botzer, Y. Ding, and T. Weninger. Reddit entity linking dataset. Information
Processing & Management, 58(3):102479, 2021.
13. N. Botzer, S. Gu, and T. Weninger. Analysis of moral judgment on reddit.
IEEE Transactions on Computational Social Systems, 2022.
14. R. Boyd and P. J. Richerson. Punishment allows the evolution of cooperation
(or anything else) in sizable groups. Ethology and Sociobiology, 13(3):171–195,
May 1992. ISSN 0162-3095.
15. R. Boyd and P. J. Richerson. Punishment allows the evolution of cooperation
(or anything else) in sizable groups. Ethology and sociobiology, 13(3):171–195,
1992.
16. W. J. Brady and M. J. Crockett. How effective is online outrage? Trends in
cognitive sciences, 23(2), 2019.
17. W. J. Brady, M. Crockett, and J. J. Van Bavel. The mad model of moral con-
tagion: The role of motivation, attention, and design in the spread of moralized
content online. Perspectives on Psychological Science, 15(4):978–1010, 2020.
18. W. J. Brady, K. McLoughlin, T. N. Doan, and M. J. Crockett. How social
learning amplifies moral outrage expression in online social networks. Science
Advances, 7(33):eabe5641, 2021.
19. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Nee-
lakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot
learners. Advances in neural information processing systems, 33:1877–1901,
2020.
20. I. Casanueva, T. Temčinas, D. Gerz, M. Henderson, and I. Vulić. Efficient
intent detection with dual sentence encoders. In Workshop on Natural Language
Processing for Conversational AI, 2020.
21. P. Cascante-Bonilla, F. Tan, Y. Qi, and V. Ordonez. Curriculum labeling:
Revisiting pseudo-labeling for semi-supervised learning. In AAAI, 2021.
22. D. Centola. The spread of behavior in an online social network experiment.
science, 329(5996):1194–1197, 2010.
23. W. Chafe. Discourse, consciousness, and time: The flow and displacement of
conscious experience in speaking and writing. University of Chicago Press, 1994.
24. W. Chafe. Language and the flow of thought. The new psychology of language,
pages 93–111, 2017.
25. E. Chandrasekharan, U. Pavalanathan, A. Srinivasan, A. Glynn, J. Eisenstein,
and E. Gilbert. You can’t stay here: The efficacy of reddit’s 2015 ban examined
through hate speech. Proceedings of the ACM on Human-Computer Interaction,
1(CSCW):1–22, 2017.
26. E. Chandrasekharan, M. Samory, S. Jhaver, H. Charvat, A. Bruckman,
C. Lampe, J. Eisenstein, and E. Gilbert. The internet’s hidden rules: An
empirical study of reddit norm violations at micro, meso, and macro scales.
Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–25,
2018.
29. J. Chen, D. Tam, C. Raffel, M. Bansal, and D. Yang. An empirical survey of data
augmentation for limited data learning in nlp. arXiv preprint arXiv:2106.07499,
2021.
30. X. Cheng and D. Roth. Relational inference for wikification. Empirical Methods
in Natural Language Processing, 2013.
31. Z. Cheng, Z. Jiang, Y. Yin, C. Wang, and Q. Gu. Learning to classify open
intent via soft labeling and manifold mixup. IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 30:635–645, 2022.
37. M. J. Crockett. Moral outrage in the digital age. Nature human behaviour, 1
(11):769–771, 2017.
38. M. De Choudhury, S. S. Sharma, T. Logar, W. Eekhout, and R. C. Nielsen. Gen-
der and cross-cultural differences in social media disclosures of mental illness.
In Proceedings of the 2017 ACM conference on computer supported cooperative
work and social computing, pages 353–369, 2017.
39. B. De Cleen. Populism and nationalism. The Oxford handbook of populism, 1:
342–262, 2017.
40. L. Derczynski, D. Maynard, G. Rizzo, M. Van Erp, G. Gorrell, R. Troncy,
J. Petrak, and K. Bontcheva. Analysis of named entity recognition and linking
for tweets. Information Processing & Management, 51(2):32–49, 2015.
41. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
42. G. Ding, S. Zhang, S. Khan, Z. Tang, J. Zhang, and F. Porikli. Feature
affinity-based pseudo labeling for semi-supervised person re-identification. IEEE
Transactions on Multimedia, 21(11):2891–2902, 2019.
43. Y. Ding, N. Botzer, and T. Weninger. Posthoc verification and the fallibility of
the ground truth. arXiv preprint arXiv:2106.07353, 2021.
44. R. Dror, G. Baumer, S. Shlomov, and R. Reichart. The hitchhiker’s guide to
testing statistical significance in natural language processing. In ACL, 2018.
45. S. Dutta, D. Das, and T. Chakraborty. Changing views: Persuasion modeling
and argument extraction from online discussions. Information Processing &
Management, 57(2):102085, 2020.
46. A. El-Nouby, N. Neverova, I. Laptev, and H. Jégou. Training vision transformers
for image retrieval. arXiv preprint arXiv:2102.05644, 2021.
50. M. Forbes, J. D. Hwang, V. Shwartz, M. Sap, and Y. Choi. Social chemistry 101:
Learning to reason about social and moral norms. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 653–670, 2020.
51. T. M. Fruchterman and E. M. Reingold. Graph drawing by force-directed place-
ment. Software: Practice and experience, 21(11):1129–1164, 1991.
52. S. Fu, H. Li, Y. Liu, H. Pirkkalainen, and M. Salo. Social media overload,
exhaustion, and use discontinuance: Examining the effects of information over-
load, system feature overload, and social overload. Information Processing &
Management, 57(6):102307, 2020.
56. M. Glenski and T. Weninger. Rating effects on social news posts and comments.
TIST, 8(6):1–19, 2017.
63. S. Guo, N. Mokhberian, and K. Lerman. A data fusion framework for multi-
domain morality learning. arXiv preprint arXiv:2304.02144, 2023.
64. J. Haidt. The righteous mind: Why good people are divided by politics and
religion. Vintage, 2012.
65. J. C. Harsanyi. Morality and the theory of rational behavior. Social research,
pages 623–656, 1977.
66. S. Hechler and T. Kessler. On the difference between moral outrage and em-
pathic anger: Anger about wrongful deeds or harmful consequences. Journal of
Experimental Social Psychology, 76:270–282, 2018.
67. D. Herman. Basic elements of narrative. John Wiley & Sons, 2009.
69. L. Huddy and N. Khatib. American patriotism, national identity, and political
involvement. American journal of political science, 51(1):63–77, 2007.
70. C. Hutto and E. Gilbert. Vader: A parsimonious rule-based model for senti-
ment analysis of social media text. In Proceedings of the International AAAI
Conference on Web and Social Media, volume 8, 2014.
71. F. Ilievski, P. Vossen, and S. Schlobach. Systematic study of long tail phe-
nomena in entity linking. Proceedings of the 27th International Conference on
Computational Linguistics, 2018.
75. J. Jung, B. Son, and S. Lyu. Attnio: Knowledge graph exploration with in-and-
out attention flow for knowledge-grounded dialogue. Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP),
2020.
78. B. F. Keith Norambuena and T. Mitra. Narrative maps: An algorithmic ap-
proach to represent and extract information narratives. Proceedings of the ACM
on Human-Computer Interaction, 4(CSCW3):1–33, 2021.
79. D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence
through a social network. Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, 2003.
82. N. Kolitsas, O.-E. Ganea, and T. Hofmann. End-to-end neural entity linking.
arXiv preprint arXiv:1808.07699, 2018.
88. D.-H. Lee et al. Pseudo-label: The simple and efficient semi-supervised learning
method for deep neural networks. In Workshop on challenges in representation
learning, ICML, 2013.
92. M. Mateas and P. Sengers. Narrative intelligence. J. Benjamins Pub., 2003.
98. K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-
training. In Proceedings of the ninth international conference on Information
and knowledge management, pages 86–93, 2000.
102. R. Page. The narrative dimensions of social media storytelling. The handbook
of narrative analysis, pages 329–347, 2015.
104. L. Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade,
pages 55–69. Springer, 1998.
106. A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language
understanding by generative pre-training. 2018.
107. A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn,
B. Hedayatnia, M. Cheng, A. Nagar, et al. Conversational ai: The science
behind the alexa prize. arXiv preprint arXiv:1801.03604, 2018.
108. C. Ran, W. Shen, and J. Wang. An attention factor graph model for tweet
entity linking. Proceedings of the 2018 World Wide Web Conference, 2018.
109. A. Rao, F. Morstatter, M. Hu, E. Chen, K. Burghardt, E. Ferrara, K. Lerman,
et al. Political partisanship and antiscience attitudes in online discussions about
covid-19: Twitter content analysis. Journal of medical Internet research, 23(6):
e26692, 2021.
110. N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese
bert-networks. In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
111. T. Reynolds, C. Howard, H. Sjåstad, L. Zhu, T. G. Okimoto, R. F. Baumeister,
K. Aquino, and J. Kim. Man up and take it: Gender bias in moral typecasting.
Organizational Behavior and Human Decision Processes, 161:120–141, 2020.
112. M. N. Rizve, K. Duarte, Y. S. Rawat, and M. Shah. In defense of pseudo-
labeling: An uncertainty-aware pseudo-label selection framework for semi-
supervised learning. arXiv preprint arXiv:2101.06329, 2021.
113. M. G. Rodriguez, K. Gummadi, and B. Schoelkopf. Quantifying information
overload in social media and its impact on social contagions. Eighth Interna-
tional AAAI Conference on Weblogs and Social Media, 2014.
114. K. Rost, L. Stahel, and B. S. Frey. Digital social norm enforcement: Online
firestorms in social media. PLoS one, 11(6):e0155923, 2016.
115. O. Rubin, J. Herzig, and J. Berant. Learning to retrieve prompts for in-
context learning. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies, pages 2655–2671, 2022.
116. A. Sablayrolles, M. Douze, C. Schmid, and H. Jégou. Spreading vectors for
similarity search. In ICLR 2019-7th International Conference on Learning
Representations, pages 1–13, 2019.
117. E. Sagi and M. Dehghani. Measuring moral rhetoric in text. Social Science
Computer Review, 32(2):132–144, Apr 2014. ISSN 0894-4393.
118. T. Sawaoka and B. Monin. The paradox of viral outrage. Psychological science,
29(10):1665–1678, 2018.
119. C. Schein and K. Gray. The theory of dyadic morality: Reinventing moral
judgment by redefining harm. Personality and Social Psychology Review, 22
(1):32–70, 2018.
123. D. Shahaf and C. Guestrin. Connecting the dots between news articles. Pro-
ceedings of the 16th ACM SIGKDD international conference on Knowledge
discovery and data mining, 2010.
125. W. Shen, J. Wang, and J. Han. Entity linking with a knowledge base: Is-
sues, techniques, and solutions. IEEE Transactions on Knowledge and Data
Engineering, 27(2):443–460, 2014.
131. S. Subramani, H. Q. Vu, and H. Wang. Intent classification using feature sets
for domestic violence discourse on social media. In 2017 4th Asia-Pacific World
Congress on Computer Science and Engineering (APWC on CSE), pages 129–
136. IEEE, 2017.
139. Z. Wang and D. Jurgens. It’s going to be okay: Measuring access to support
in online communities. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages 33–45, 2018.
143. E. Wulczyn, N. Thain, and L. Dixon. Ex machina: Personal attacks seen at
scale. In Proceedings of the 26th international conference on world wide web,
pages 1391–1399, 2017.
147. S. Yardi and D. Boyd. Dynamic debates: An analysis of group polarization over
time on twitter. Bulletin of science, technology & society, 30(5):316–327, 2010.
149. W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, and M. Jiang. A survey of
knowledge-enhanced text generation. arXiv preprint arXiv:2010.04389, 2020.
151. L.-M. Zhan, H. Liang, B. Liu, L. Fan, X.-M. Wu, and A. Lam. Out-of-scope
intent detection with self-supervision and discriminative training. arXiv preprint
arXiv:2106.08616, 2021.
155. H. Zhang, H. Xu, and T.-E. Lin. Deep open intent classification with adaptive
decision boundary. In AAAI, 2021.
156. J. Zhang, J. Chang, C. Danescu-Niculescu-Mizil, L. Dixon, Y. Hua, D. Tara-
borelli, and N. Thain. Conversations gone awry: Detecting early signs of conver-
sational failure. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 1350–1361, 2018.
157. Y. Zhou, P. Liu, and X. Qiu. Knn-contrastive learning for out-of-domain intent
classification. In ACL, 2022.
159. Z. Zhu, Z. Dong, and Y. Liu. Detecting corrupted labels without training a
model to predict. In ICML, 2022.
162. Y. Zou, Z. Yu, B. Kumar, and J. Wang. Unsupervised domain adaptation for
semantic segmentation via class-balanced self-training. In Proceedings of the
European conference on computer vision (ECCV), pages 289–305, 2018.
This document was prepared & typeset with pdfLaTeX, and formatted with the
nddiss2ε class file (v3.2017.2 [2017/05/09]).