
Communication Methods and Measures

Journal homepage: www.tandfonline.com/journals/hcms20

Beyond sentiment: an algorithmic strategy for identifying evaluations within large text corpora

Maximilian Overbeck, Christian Baden, Tali Aharoni, Eedan Amit-Danhi & Keren Tenenboim-Weinblatt

To cite this article: Maximilian Overbeck, Christian Baden, Tali Aharoni, Eedan Amit-Danhi
& Keren Tenenboim-Weinblatt (07 Dec 2023): Beyond sentiment: an algorithmic strategy for
identifying evaluations within large text corpora, Communication Methods and Measures, DOI:
10.1080/19312458.2023.2285783

To link to this article: https://doi.org/10.1080/19312458.2023.2285783

© 2023 The Author(s). Published with license by Taylor & Francis Group, LLC.

Supplemental data for this article can be accessed online at https://doi.org/10.1080/19312458.2023.2285783. This article was originally published with errors, which have now been corrected in the online version; please see the Correction (https://dx.doi.org/10.1080/19312458.2023.2296192).

Published online: 07 Dec 2023.


Beyond sentiment: an algorithmic strategy for identifying evaluations within large text corpora

Maximilian Overbeck, Christian Baden, Tali Aharoni, Eedan Amit-Danhi, and Keren Tenenboim-Weinblatt
Department of Communication and Journalism, The Hebrew University of Jerusalem, Jerusalem, Israel
CONTACT: Maximilian Overbeck, Department of Communication and Journalism, Mount Scopus, Hebrew University of Jerusalem, Jerusalem 91905, Israel

ABSTRACT
In this paper, we propose a new strategy for classifying evaluations in large
text corpora, using supervised machine learning (SML). Departing from
a conceptual and methodological critique of the use of sentiment measures
to recognize object-specific evaluations, we argue that a key challenge
consists in determining whether a semantic relationship exists between
evaluative expressions and evaluated objects. Regarding sentiment terms
as merely potentially evaluative expressions, we thus use a SML classifier to
decide whether recognized terms have an evaluative function in relation to
the evaluated object. We train and test our classifier on a corpus of 10,004
segments of election coverage from 16 major U.S. news outlets and Tweets
by 10 prominent U.S. politicians and journalists. Specifically, we focus on
evaluations of political predictions about the outcomes and implications of
the 2016 and 2020 U.S. presidential elections. We show that our classifier
consistently outperforms both off-the-shelf sentiment tools and a pre-
trained transformer-based sentiment classifier. Critically, our classifier correctly discards numerous non-evaluative uses of common sentiment terms,
whose inclusion in conventional analyses generates large amounts of false
positives. We discuss contributions of our approach to the measurement of
object-specific evaluations and highlight challenges for future research.

Introduction
Evaluations are often the focal point of social science textual research. Textually expressed evaluations are
investigated in various fields, from consumer research (Hu & Li, 2011) to health (Zimmermann et al.,
2019) and political communication (Boomgaarden et al., 2012; Lühiste & Banducci, 2016; van Spanje &
de Vreese, 2014), and they are key components of complex social science constructs such as framing
(Entman, 1993) or attitudes (Voas, 2014). Compared to related constructs such as sentiments (Pang et al.,
2002), stances (Bestvater & Monroe, 2022), and opinions (Liu & Zhang, 2012), evaluations constitute the
most explicit expression of judgment about a given object in text (Alba-Juez & Thompson, 2014, p. 10).
So far, textually expressed evaluations have been mostly captured using manual annotations
(Boomgaarden et al., 2012; Lühiste & Banducci, 2016; van Spanje & de Vreese, 2014). However, such
analyses remain limited in scope, prohibiting researchers from annotating more than a few hundred
texts. Accordingly, computational strategies have recently attracted growing scholarly attention in
communication research (Baden et al., 2022), enabling the processing of evaluative text at scale. Much
existing work (e.g., Liu & Zhang, 2012; Rudkowsky et al., 2018; Young & Soroka, 2012) has focused on
measuring overall textual sentiment (i.e., the tone or polarity of a piece of text) rather than object-specific
evaluations. In this vein, an impressive amount of work has been dedicated in recent years to developing
new and better tools, ranging from improved sentiment dictionaries (Rauh, 2018; Young & Soroka, 2012)
to machine learning classifiers (e.g., Pang et al., 2002; Rudkowsky et al., 2018). Owing to the abundance
and accessibility of tools, the sentiment expressed in texts about specific issues is frequently employed as
a proxy for the detection of object-specific evaluations (Fogel-Dror et al., 2019).
Yet, textual sentiment remains a flawed proxy for such evaluations, owing to two main limitations:
first, as aptly noted by proponents of stance detection or opinion mining strategies (Bestvater &
Monroe, 2022), sentiment tools disregard the critical semantic relation between evaluative expressions
(the “sentiment terms”) and those targets that they evaluate (Fogel-Dror et al., 2019). Instead of
ascertaining that expressed evaluations indeed refer to evaluated objects, most studies merely establish
a statistical connection between topically relevant passages and the presence of sentiment expressions
within these. While such strategies may occasionally yield satisfactory results if evaluated constructs
constitute the primary topic of extracted passages (such as movie reviews), they are bound to generate
numerous false positive measurements (e.g., when specific objects are evaluated in the context of other
focal topics, or if multiple evaluated objects are contrasted). Second, sentiment analysis tends to
confound different variants of charged expressions, subsumed under broad notions such as “tone”
(Young & Soroka, 2012) or the “polarity of emotion in a piece of text” (Aldayel & Magdy, 2021,
p. 4). However, there are many uses of valenced expressions that do not serve to evaluate but rather
express a specific emotional tendency of a text (Chan et al., 2020).
Moreover, we hereunder distinguish object-specific evaluations from broader notions such as
stances or opinions, which focus on the overall evaluative tendency expressed by an actor toward an
object (Alba-Juez & Thompson, 2014, p. 10). Expressed stances can manifest themselves in many ways,
which may or may not contain direct evaluative statements (Aldayel & Magdy, 2021). In addition, they
can comprise multiple object-specific evaluations, which may evaluate the same object against
different evaluative standards (Baden & Springer, 2017; Garten et al., 2016) and express diverse
valence (e.g., one may evaluate a policy as expensive but ultimately necessary, thus expressing an
ambivalent, overall slightly positive stance).
In this article, we propose a shift in the classification of evaluative texts, to enable a more nuanced and
comprehensive analysis of object-specific evaluations. Specifically, we develop a novel computational
strategy that departs from existing sentiment measures, using machine classification to ascertain that
present evaluative expressions are indeed applied in an evaluative function, and in relation to an evaluated
object. While SML has been widely utilized for the study of textual sentiment (Rosenthal et al., 2019;
Rudkowsky et al., 2018), we refocus the use of machine classification from the relatively “easy” task of
detecting sentiment indicators (which can be performed adequately by inclusive sentiment dictionaries) to
the much “harder” task of determining the evaluative usage of present indicators.
In what follows, we define object-specific evaluations and delineate them from related concepts
such as stances, opinions, or emotions. Departing from this conceptual framework, we then analyze
why sentiment measures fail to adequately operationalize evaluations. Instead, we regard sentiment
terms as potentially evaluative expressions and focus on their relationships with specific objects of
interest. Adjusting existing sentiment measures, we thereby propose a valid measure of evaluations,
and set up a measurement pipeline. To empirically validate our approach, we focus on evaluations of
election-related predictions presented in relation to the 2016 and 2020 U.S. presidential elections.
Election news coverage is replete with statements and opinions addressing the desirability of various
possible outcomes and scenarios, offering rich textual material on which to train and test the classifier.
Our corpus comprises election-related news coverage from 16 major U.S. media outlets (including
print, online, radio, and television), as well as tweets by 10 prominent U.S. politicians and journalists.
We train a SML-classifier based on 10,004 manually annotated segments. The classification task
focuses on whether an annotated sentiment term is used to evaluate a given election prediction, and
whether the evaluative use casts these predictions as either desirable or undesirable. We assess the
performance of our approach against four off-the-shelf sentiment dictionaries and a transformer-
based roBERTa sentiment classifier. We conclude by discussing the implications of the study for
developing a valid and reliable measurement of object-specific evaluations.

Defining evaluations
Evaluations are communicative acts that express an author’s viewpoints, stances, or attitudes toward
a discussed entity (Thompson & Hunston, 2000). Evaluative expressions are ubiquitous in spoken and
written discourse, concerning both private and public domains. Journalists evaluate current issues and
policies to inform their audiences about political developments and choices; customers evaluate
products to provide feedback or issue recommendations; individuals perform identities by evaluating
their perceived in- and outgroups (De Cremer & Oosterwegel, 1999). In political communication,
scholars are interested in the various evaluations of covered objects or topics, from specific policies to
political candidates and election campaigns (Boomgaarden et al., 2012; Lühiste & Banducci, 2016) to
international organizations, such as the European Union (van Spanje & de Vreese, 2014).
To identify evaluations in text, we first need to understand their constitutive elements and main
characteristics (Thompson & Hunston, 2000). To begin, evaluations require an evaluated object,
which can be virtually anything, but must be recognizable in a statement. Second, evaluations require
some expression that conveys valuation, which can range from strongly positive to strongly negative
valence (Thompson & Hunston, 2000, pp. 14–15). Evaluations may be expressed explicitly (e.g., good/
bad, right/wrong, beautiful/ugly), by means of connoted expressions (e.g., disaster, miracle), or by
positioning oneself (e.g., as a supporter or opponent) or expressing one’s emotions (e.g., happiness,
frustration) in relation to an object. Finally, there needs to be some form of relation that attributes the
expressed valuation to the object. By themselves, expressions that carry valence do not necessarily
express an evaluation (e.g., “There was a car accident;” “These are beauty products”); they only
assume this function when they are used to evaluate a given object (e.g., “hiring her was an
accident;” “he’s a beauty”).
In addition, evaluations are grounded in a wide variety of valuation standards according to which an
object can be qualified as positive or negative, desirable, or undesirable. For example, evaluations can
refer to different kinds of normative or value-based standards (e.g., justice, efficiency, economic value,
aesthetics; Boltanski & Thévenot, 2006), they may be based on subjective affect (e.g., liking, disgust;
Papacharissi, 2015; Tenenboim-Weinblatt et al., 2022a), or they can refer to particular purposes or
interests (e.g., “good for killing brain cells,” “setback for his campaign;” Müller, 2004).
Evaluations can be expressed in a wide variety of ways. Explicit evaluations can be conveyed via
nouns, verbs, adjectives, adverbs alike, and the same is even more true for implicit ones. Evaluations
may invoke metaphors and analogies (e.g., “a butcher;” Lakoff & Johnson, 2008), rely on cultural or
intertextual references (e.g., “Kafkaesque”), employ irony and figurative speech, and many other
strategies of discursive expression. In fact, Hunston (2010, p. 13) suggested that almost any word can
perform an evaluative function, resulting in “such a large range of lexical and other items that it
would be pointless to try and list them.” In addition, evaluative expressions are highly context
dependent (Hunston, 2010, p. 13). While in one context an “elephant” might just be an animal, in
another sentence it may be used to criticize (“he’s such an elephant”), but also to praise (“she has
a memory like an elephant”).
While evaluations share many of these qualities with other, related constructs, such as sentiment or
tone, opinions, or stances, they are different in conceptually and operationally consequential ways.
Evaluations overlap with tone or sentiment in their reliance on valenced expressions. However, unlike
evaluations, sentiment is conceptualized as object nonspecific and refers to the overall positive,
negative, or neutral “tone” of a text (Aldayel & Magdy, 2021; Bestvater & Monroe, 2022).
Accordingly, sentiment subsumes both evaluative and non-evaluative uses of connoted expressions, including object-specific evaluations alongside positively or negatively connoted events and labels (e.g., “an accident,” “election winner”) and emotional expressions (“I feel confident”). Evaluations, by
contrast, are necessarily tied to an object that is being evaluated and must thus be conceptualized as a relational construct.1

[1] Relational constructs are fairly common in social scientific research and comprise all cases where one object is qualified (e.g., evaluated, or attributed specific qualities: object – qualified link – attribute) or tied to another object (e.g., a cause or effect, parent or child, or alter: object – qualified relationship – object).
By comparison, both opinions and stances refer to object-specific evaluations, and thus include the
same three constitutive elements (evaluative expression, object, relation between these) as discussed
above (Bestvater & Monroe, 2022). However, unlike evaluations, opinions and stances are anchored by
the person expressing these: They additionally require a person whose opinion or stance is expressed
and aim to summarize this person’s overall evaluation of an object, which may comprise multiple
specific evaluations (Aldayel & Magdy, 2021; Küçük & Can, 2021). By contrast, evaluations refer to any
specific evaluative considerations that are expressed toward an object, regardless of whether these are
endorsed by a given (identifiable or unidentified) author. This distinction is critical especially in the
context of journalistic news reporting, where many conveyed evaluations do not purport to reflect the
author’s opinion (Baden & Tenenboim-Weinblatt, 2018).
In addition, evaluations and stances differ in terms of their object-specific relationships. In stance
detection, the relevant object can be “latent” or “external” to the text. As Küçük and Can (2021,
p. 2) explain, stance detection aims to distill the author’s opinion about a target which is determined by
the analyst and “may or may not be explicitly mentioned in the text”. For instance, an author’s
stance toward environmental protection may be inferred also from statements that do not address the
studied object at all, and merely bear on the measured stance indirectly (e.g., “soon, most glaciers will
be gone”). Evaluations, by contrast, require the object of evaluation to be referenced within the text
(Hunston, 2010, p. 17). For instance, the sentence “Some say there are also opportunities in climate
change, but I think that is short-sighted, the effects will be devastating” may simultaneously express
mildly negative sentiment (one positive term: “opportunities”; two negative terms: “short-
sighted” & “devastating”), a supportive stance toward environmental protection, and three
evaluations: two evaluations of climate change (opportunities, devastating; attributed to different
authors) and one evaluation of others’ judgment (short-sighted). In this sense, evaluations are the
most direct and explicit way of expressing an opinion about a targeted object.

Present strategies for the analysis of evaluative text


Given these conceptual differences, it is not surprising that each construct requires distinct measurements and operationalizations. Whereas sentiment analysis takes a survey of potentially evaluative expressions present in a text, stance detection relies on a broad range of textual and non-textual features (well
beyond expressed evaluations) to make inferences on a person’s stance about a given topic or entity
(Aldayel & Magdy, 2021; Mohammad et al., 2016). The measurement of evaluations, in turn, focuses
specifically on those textual segments that express a valenced judgment about a present object
(Thompson & Hunston, 2000).

Sentiment analysis
While much manual research has classified evaluations directly, most computational approaches to
measuring evaluations have relied on textual sentiment as a proxy. For the measurement of textual
sentiment, researchers have either relied on extensive dictionaries of expressly valenced and connoted
expressions or trained supervised machine classifiers on diverse textual corpora (Chan et al., 2020;
Young & Soroka, 2012). Both approaches, however, come with important limitations. Dictionaries
may generally be capable of identifying the most used expressions conveying positive or negative
valence and can also be augmented to distinguish various kinds of evaluative standards, or expressions.
However, given the noted complexity of evaluative expressions, they invariably miss any expressions
whose predominant use is not evaluative, or which only convey valence under specific circumstances.
Moreover, dictionaries are mostly unable to distinguish evaluative from non-evaluative uses of
recognized terms, often resulting in considerable amounts of false positives as positive/negative
referred-to events are confounded with positive/negative evaluations (Chan et al., 2020, p. 6; van
Atteveldt et al., 2021, p. 12).
In the supervised approach, by contrast, some differentiations between evaluative and non-
evaluative uses, as well as some rare expressions may in principle be learned by a well-trained
classifier – although recent validation studies nurture doubt that said classifiers are presently capable
of such nuanced distinctions (van Atteveldt et al., 2021, p. 13). In addition, the reliance on segment- or
document-level labeling inevitably shifts the measurement away from specific evaluations, sacrificing
the capacity to separate multiple evaluations. This is true even for classifiers trained on sentence data,
as the same sentence often contains multiple evaluations of an object. Both approaches, crucially,
neglect the constitutive relationship between such expressions and their evaluated objects.
In practice, many studies reconcile their reliance on sentiment with their primary interest in object-
specific evaluations by applying sentiment measures selectively to pre-filtered text segments that are
selected according to the objects of interest of the study (Oliveira et al., 2017; Pang et al., 2002). Such
procedures enable obtaining a statistical, probabilistic association between an object and prevalent
evaluations – at the risk of also registering numerous unrelated and non-evaluative expressions. To go
beyond such probabilistic measurement, it is critical to not only distinguish non-evaluative from
evaluative uses of recognized expressions, but also ascertain that an evaluatively used expression
indeed pertains to a given object in the text.
Recognizing these limitations, researchers have primarily relied on two broad strategies for bringing sentiment tools closer to a valid measurement of object-specific evaluations. A first approach is to
pre-screen sentiment items from existing off-the-shelf dictionaries (Muddiman et al., 2019), thus
ascertaining that the used indicators make sense as evaluative expressions in each context of use. That
said, the high variability of language use limits our capacity to confidently predict which expressions
will, and will not, be used to evaluate a given object.
A second strategy takes into account the grammatical context of sentiment terms (Fogel-Dror et al.,
2019; van Atteveldt et al., 2017). In this strategy, natural language processing (NLP) tools are leveraged
to recognize the grammatical and syntactic structure of text, including part-of-speech, dependency
parsing, and co-reference resolution. Based on this additional data, it is then possible to determine
which sentiment words are indeed related to a given object (Fogel-Dror et al., 2019). This promising –
though computationally demanding – strategy offers many avenues for further development.
However, it remains limited by its reliance on a full deductive modeling of different possible forms
of association and dissociation in natural language – a daunting task. Another difficulty is that
sufficiently accurate NLP tools are unavailable for many languages and discourse genres, limiting
this strategy to only specific applications.
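To make this grammar-based strategy more concrete, the following Python sketch (not part of the original studies cited above) uses spaCy's dependency parse to test whether a recognized sentiment term is syntactically connected to a target mention. The mini-dictionary, the example sentence, and the simple hop-based connectivity test are illustrative assumptions, not the rule sets used in the cited work.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English pipeline is installed
NEGATIVE = {"devastating"}           # illustrative mini-dictionary, not a real lexicon

def syntactically_linked(doc, a, b, max_hops=4):
    """True if token b can be reached from token a within a few hops along
    dependency arcs (a rough stand-in for a full deductive rule set)."""
    frontier, seen = {a}, set()
    for _ in range(max_hops):
        nxt = set()
        for i in frontier:
            nxt.add(doc[i].head.i)                 # hop up to the head
            nxt.update(c.i for c in doc[i].children)  # and down to dependents
        if b in nxt:
            return True
        seen |= frontier
        frontier = nxt - seen
    return False

doc = nlp("Critics warn that the new policy will be devastating for small farms.")
target = next(t.i for t in doc if t.text == "policy")
for tok in doc:
    if tok.lower_ in NEGATIVE:
        print(tok.text, "linked to target:", syntactically_linked(doc, tok.i, target))
```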

Stance detection
Unlike common sentiment-based measures, recent advances in opinion mining and stance detection are generally aware of this challenge of the missing object relationship (Aldayel & Magdy,
2019, 2021; Bestvater & Monroe, 2022; Küçük & Can, 2021; Mohammad et al., 2016). However,
owing to their focus on inferring authors’ stances toward a given object, present stance-detection
procedures do not limit themselves to expressed evaluations, but rely on a broad range of other
textual and non-textual features that are known to correlate with a given stance (Mohammad et al.,
2016). Accordingly, texts can be classified by stance even if they never expressly evaluate or even
discuss a given object, as long as they contain expressions that are indicative of a specific stance (e.g.,
endorsement of far-right parties can indicate a negative stance toward immigration; Aldayel &
Magdy, 2019, 2021). In addition, stance detection can utilize a broad range of contextual information to make more accurate predictions about a person’s stance, ranging from users’ social media
profile information to their past comments or retweets on related or unrelated topics (Aldayel &
Magdy, 2021; Darwish et al., 2017; Lahoti et al., 2018). In consequence, using stance detection to
classify evaluations in text most likely produces considerable amounts of false positives whenever
a stance is merely inferred from circumstantial evidence, but not based on an expressed evaluation.
Moreover, (especially the better performing) stance detection algorithms should systematically
disregard evaluations that are merely referenced, but not endorsed by the author, potentially adding
substantial amounts of false negatives where such distancing practices are common (notably, in
journalism). While sentiment analysis should generally record a superset of potentially evaluative
expressions, only some of which express actual evaluations, stance detection relates in complicated
ways to the intended measurement of evaluations. Despite the greater conceptual proximity between
evaluation and stance, sentiment analysis thus presents the superior point of departure for our
endeavor. However, given that evaluations are one central way of expressing a person’s stance on
something (Aldayel & Magdy, 2021), developing a supervised classifier for identifying evaluations
could be useful for stance detection tasks.2 Considering the documented limitations of existing strategies for the measurement of object-specific evaluations, we will now continue to propose a novel, somewhat different strategy.

[2] Especially if one expands the relational construct to also consider to whom an evaluation is attributed: author – [object – link – evaluation] (Takens et al., 2013; van Atteveldt et al., 2008).

Going beyond sentiment: a shift of perspectives


As a point of departure, our strategy is based upon three key insights. First, it is not difficult to collate
a list of common, potentially evaluative terms. Of course, any known strategy still misses
a considerable share of expressions that convey evaluative valence. However, this recall problem is
somewhat mitigated by the pronounced tendency to use multiple redundant markers to pass evaluation. Accordingly, most evaluations can still be validly recognized even if some markers are missed. In
fact, overly ambitious attempts to increase recall by casting an extra-wide net of potentially evaluative
terms tend to degrade precision rapidly and dramatically (e.g., Boukes et al., 2020, p. 97). Most
additional indicators are typically such that they are frequently used in a non-evaluative function (e.g.,
“pretty many”) and are often ambiguous with regard to their evaluative tendency (e.g., “funny”
can express either positive or negative evaluation, depending on the context). Instead, we suggest that
a strategy aimed to measure textual, object-specific evaluations may plausibly build upon existing
inventories of potentially evaluative expressions.
Second, to distinguish evaluative from non-evaluative uses, it is usually necessary to identify the
object of evaluation. Even for studies that are not interested in object-specific evaluations but content
themselves with a measure of aggregate tone, it is thus necessary to distinguish between evaluated and
evaluating expressions to avoid confounding topic and evaluation, and to correctly identify the
expressed valence (e.g., “infections [negative] are decreasing [negative]” is arguably positive).
Therefore, there is a need to proceed beyond a mere recognition of evaluative expressions toward
a valid measurement of evaluative judgment. Accordingly, distinguishing, and then linking evaluative
expressions and evaluated objects is critical. For the study of object-specific evaluations, this means
that the object of evaluation should be identified first, and any evaluative expressions must be
appraised in relation to that object.
Third, collocation is an insufficient measure of this association, as it is unable to capture the quality
of the link between referenced objects and expressed evaluations. Already as a probabilistic proxy of
association, collocation raises important questions, as it is unclear how close both expressions need to
be to one another: semantic associations that span entire texts are often implausible, but so is the
assumption that evaluations are generally expressed within the same sentence where an object is
referenced. Any unitization strategy faces considerable uncertainty about the appropriate balance
between precision (which suggests narrower units) and recall (which requires wider ones). Even if an
association is correctly detected, collocation-based strategies remain blind to the specific quality of the
relation. For instance, collocation does not distinguish between evaluated objects and evaluating
authors. However, as Fogel-Dror et al. (2019) compellingly put it, it makes a difference whether
Israel accuses Hamas or Hamas accuses Israel. Moreover, numerous qualifiers and especially negations
(e.g., “not funny,” “what is so bad about . . . ”) alter the evaluative tendency conveyed by
recognized expressions. Accordingly, to confidently establish how an evaluative expression evaluates
a given object, the grammatical embedding of both evaluation and expression must be considered.
Given the complexity of possible associations and the limitations of existing NLP technologies, we
believe that this task can be best addressed by relying on human judgment, which can be scaled up with
the help of supervised machine learning.
Importantly, however, our proposed use of machine learning deviates from existing uses in the
study of textual sentiment. Traditionally, supervised machine learning is used to recognize evaluative
expressions without the need for deductively constructed sentiment dictionaries. Accordingly,
machine classification essentially acts as an alternative to a tailored dictionary, and neither approach
has been found to systematically outperform the other: While machine classifiers may recognize some
expressions missed by a dictionary, their inclusion of any lexical variation between labeled texts also
generates additional false positives (van Atteveldt et al., 2021). By contrast, what we propose is to first
recognize potentially evaluative terms using widely available sentiment dictionaries, and then focus the
machine classification on the more difficult, but much narrower task of determining whether a given
sentiment term is used to evaluate a present object.3 In fact, SML is uniquely suited to this specific task.
SML’s extrapolation of complex regularities exactly excels at recognizing patterns that should affect
the presence and quality of a semantic connection (e.g., negations, recurring phrasings that associate
or dissociate both expressions in relevant or irrelevant ways). In this application, the machine classifier
can be focused on the highly specific task of evaluating the link between a given potentially evaluative
expression and a given object, using nearby context to render a decision. Any content remote from
both object and evaluative expression can be safely disregarded, as it is unlikely to inform this link
(Baden & Stalpouskaya, 2015): The more remote from both object and evaluative expression, the less
likely are other contents to inform this link, enabling us to focus the classification on a relatively
narrow window of likely informative text. In addition, multiple evaluative expressions found near
a relevant object reference can be considered one at a time, breaking down the complex task of
determining overall evaluative tendencies into a sequence of separate, relatively simple classification
tasks (i.e., whether a given word is used to evaluate a given object). Overlapping passages of text are
handed to the classifier multiple times, each time with a different expression singled out for evaluation,
enabling the classifier to recognize subtle variations in the phrasing that link one, but not another
evaluative expression to the evaluated object. To recognize relevant patterns, the classifier needs to
consider any evidence of grammatical and semantic relatedness, including word order, inflectional
and other grammatical information. Accordingly, our approach requires an expression-by-expression
classification using a representation of windowed context, with minimal preprocessing. Given
a sufficiently large, heterogeneous training corpus of manually labeled evaluative language uses, we
build on SML to decide whether the specific use of a known, potentially evaluative term evaluates a given object.

[3] SML has been used in the past in commercial discourse to determine whether specific sentiments are related to specific entities (Ben-Ami et al., 2014). Sentiment terms can, however, relate to entities in various ways without necessarily evaluating them (e.g., by conveying positive or negative tone).

Workflow
To demonstrate the proposed approach, we draw upon election-related political discourse in the US,
focusing on both news coverage and social media communications in the context of the 2016 and 2020
U.S. presidential elections. For the present study, we were interested in the evaluation of political
projections, defined as scenarios about the outcomes and implications of political events (Tenenboim-
Weinblatt et al., 2022b). Our corpus covers two periods of approximately two years each, ranging from
March 2015 (when Ted Cruz declared his candidacy as first Republican contender) until January 2017
(Donald Trump’s inauguration), and from January 2019 (when Tulsi Gabbard announced her candidacy
as first Democratic contender) until January 2021 (Joe Biden’s inauguration). Content was sampled from
16 major U.S. news outlets (varying ideological positioning and professional styles) and 10 Twitter feeds
of prominent political and media actors.4 The rationale behind creating such a diverse sample of election-related U.S. media material was to develop a classifier that could capture the variety of evaluative expressions regarding the predicted outcomes of the 2016 and 2020 presidential elections. While this diverse sample was intended to present a challenging test for our measurement, it also allowed us to evaluate
performance separately for social media and news discourse in this article. To retrieve any election-
related content, we relied on a carefully validated set of context-disambiguated keywords.5 Implementing
our strategy for the measurement of object-specific evaluations, the workflow consisted of seven consecutive steps. Figure 1 presents an overview of the workflow.

Figure 1. The seven-step workflow for detecting evaluations in news and social media corpora. [Figure: Step 1: Sampling and segmentation of relevant instances of evaluation; Step 2: Construction of dictionary of potentially evaluative terms (PET); Step 3: Automated annotation of PET in sampled windowed text segments; Step 4: Coder training & parallel manual annotation: evaluative usage of PET; Step 5: Large-scale individual annotation of news and social media content; Step 6: Pre-processing of annotated segments for supervised machine learning procedure; Step 7: Machine classification + validation: evaluative usage of PET.]

[4] Our news corpus includes transcripts from ABC, CNN, Fox News, MSNBC, NPR, and Rush Limbaugh’s show, as well as news articles from AP, New York Times, New York Post, USA Today, Wall Street Journal, Washington Post, 538, Huffington Post, Breitbart, and Christianity Today. Outlets were sampled from a list of wide-reach outlets to cover the broad ideological spectrum while varying channel type and journalistic style (TV, radio, print, online native; high-brow, popular, specialized). The social media corpus includes Twitter feeds from the main contenders (Joe Biden, Hillary Clinton, Donald Trump) as well as 7 additional widely followed accounts of key actors in the media, varying political leaning, role (anchor, reporter, commentator, data journalist) and media affiliation (Donna Brazile, Thomas Friedman, Paul Krugman, John Levine, Rachel Maddow, Rick Santorum, Nate Silver).
[5] The retrieval keywords included an extensive list of general election-related terms alongside specific references to the 2016 and 2020 U.S. presidential elections, including the major candidates and other important entities (N = 144). The list was validated and refined multiple times by both scrutinizing retrieved documents for relevance and examining relevant coverage for its inclusion of any combinations of keywords satisfying the sampling criteria. For the full list of retrieval keywords and descriptive sample statistics, please refer to supplementary material S1.
In Step 1, it was necessary to recognize any instances of references to the object (or objects) whose
evaluation we intended to measure in the text. For this purpose, we identified any manifest references
to future events, using a second, carefully validated set of context-disambiguated keywords. The
keywords comprised a wide range of grammatical tense markers, references to prediction, and future-
oriented speech, as well as other markers of semantic future-orientation (Neiger & Tenenboim-
Weinblatt, 2016).6 Around each core sentence that included at least one future reference, we constituted a window of up to two preceding and two succeeding sentences. Following Baden and
Stalpouskaya’s (2015) recommendation, we manually validated that including less than two sentences
tended to miss important information, while widening the window inflated false-positive rates at little
added benefit. For each segment, we subsequently applied our first set of election-related sampling
keywords again to remove any future references that were unrelated to the elections. As a result, we
obtained a total of 5,597,270 relevant windowed segments, each centered upon a core sentence
containing a future reference.7

[6] The dictionary of future references is built upon the future-related categories included in the INFOCORE dictionary (Baden et al., 2018). Both dictionaries were carefully constructed and validated using a combination of qualitative discourse analysis, back-and-forth translation, and manual quality control. The future dictionary was further refined by adding and altering keywords and disambiguation criteria based on close readings of relevant contents in a qualitative pilot study (Tenenboim-Weinblatt et al., 2021).
[7] Each core sentence constituted a separate windowed segment, but multiple segments could overlap if multiple successive sentences were recognized as expressing relevant future orientation. For Twitter, the windowing procedure typically returned the entire tweet (tweets rarely exceed three sentences); however, the same tweet could be included multiple times, with different core sentences highlighted.
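The windowing logic of Step 1 might be sketched as follows. This is a minimal Python illustration rather than the authors' pipeline: the keyword patterns and the naive sentence splitter are placeholder assumptions standing in for the validated, context-disambiguated dictionaries described above.

```python
import re

# Illustrative stand-ins for the study's validated keyword lists (not the real dictionaries)
FUTURE_MARKERS = re.compile(r"\b(will|is going to|expects?|predicts?|forecast\w*)\b", re.I)
ELECTION_TERMS = re.compile(r"\b(election|ballot|candidate|Biden|Trump|Clinton)\b", re.I)

def windowed_segments(text, before=2, after=2):
    """Yield windows of up to two preceding and two succeeding sentences
    around each core sentence that contains a future reference."""
    sentences = re.split(r"(?<=[.!?])\s+", text)   # naive sentence splitter for illustration
    for i, sent in enumerate(sentences):
        if not FUTURE_MARKERS.search(sent):
            continue
        window = sentences[max(0, i - before): i + after + 1]
        segment = " ".join(window)
        if ELECTION_TERMS.search(segment):          # re-apply the election filter on the segment
            yield {"core_index": i, "core": sent, "segment": segment}

example = ("Polls opened early. Analysts predict a narrow win for Biden. "
           "Turnout was high in several states.")
for seg in windowed_segments(example):
    print(seg["core"], "->", seg["segment"])
```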
In Step 2, we constituted our dictionary of potentially evaluative terms (PET). For this purpose, we
screened several existing off-the-shelf sentiment dictionaries (for an overview of available tools, see
Chan et al., 2020) as well as the moral foundation dictionary (Garten et al., 2016), assessing their
collections of indicators based on their potential use to evaluate a given target. For the present study,
the Lexicoder (Young & Soroka, 2012) sentiment dictionary was selected as a point of departure due
to its being constructed and validated on U.S. political news; to this, we added Hu and Liu’s (2004)
sentiment dictionary, which was developed to classify consumer reviews and thus contained a wide
range of additional potential evaluative expressions relevant for the study of online discourse.
Merging the indicator sets included in both dictionaries, we found several indicators that were
listed as both positive and negative (e.g., “envious,” “cheap”) and decided for each such
indicator which was the more relevant evaluative tendency in our context of application. Aiming
for a broad inclusion of potentially relevant terms, we only removed the keyword “tasty” from our
dictionary since it appeared highly unlikely to be used to evaluate our object of interest. Based on
a manual validation, we decided to add “nazi” as a negative PET to the combined dictionary. The
resulting dictionary comprised a total of 15,508 potentially evaluative terms (9,102 negative and
6,407 positive indicators; see supplementary material S3 for the full dictionary of positive and
negative expressions).8

[8] From the positive indicators, 4,401 words originated from the Lexicoder dictionary, 1,838 indicators from Hu & Liu, and 168 terms appeared in both dictionaries. From the list of negative indicators, 4,319 terms originated from Lexicoder, 4,523 items originated from Hu & Liu, and 260 entries appeared in both lists.
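A minimal sketch of the dictionary construction in Step 2 is shown below (Python). The word lists, the override table, and the variable names are illustrative assumptions only; they are not the actual Lexicoder and Hu & Liu inventories.

```python
# Illustrative merge of two off-the-shelf sentiment lexicons into one PET dictionary.
lexicoder = {"pos": {"win", "great", "right"}, "neg": {"gaffe", "disaster", "cheap"}}
hu_liu    = {"pos": {"cheap", "nice", "bright"}, "neg": {"scary", "turbulent"}}

# Resolve terms listed with conflicting polarity via a manual decision table
manual_overrides = {"cheap": "neg"}     # e.g., "cheap" treated as negative in this context
removed = {"tasty"}                     # terms judged irrelevant for the object of interest
added = {"nazi": "neg"}                 # terms added after manual validation

pet = {}
for source in (lexicoder, hu_liu):
    for polarity, terms in source.items():
        for term in terms:
            if term in pet and pet[term] != polarity:
                pet[term] = manual_overrides.get(term, pet[term])
            else:
                pet[term] = polarity
pet.update(added)
for term in removed:
    pet.pop(term, None)

print(len(pet), pet.get("cheap"))   # combined dictionary with resolved polarities
```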
In Step 3, we applied this dictionary to recognize any potentially evaluative expressions
included in each windowed text segment. Owing to our deliberately widely inclusive approach
to identifying PETs, virtually every text segment contained at least one recognized term (less than
0.004% of all election- and future-filtered segments did not contain any PET).9 Each instance of
a sentiment term was marked and automatically labeled as carrying positive or negative valence,
based on the term’s classification in the dictionary (e.g., “[invalid][-]”). Next, separate coding
units were created for each recognized sentiment term, such that segments that contained multiple
potentially evaluative terms were entered multiple times, each time with a different sentiment
term highlighted.

[9] We evaluated the exclusion of segments through our selection of PETs on a large subsample of our election- and future-filtered corpus. The sampling procedure showed that the filtering excluded less than 40 segments in one million (less than 0.004%).
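The expansion of a windowed segment into one coding unit per recognized PET (Step 3) can be illustrated with a short Python sketch; the mini-dictionary and the simple whitespace-preserving tokenization are assumptions for demonstration only.

```python
import re

PET = {"right": "+", "articulate": "+", "win": "+", "gaffe": "-"}   # tiny illustrative dictionary

def coding_units(segment):
    """Create one coding unit per recognized PET, with that occurrence marked
    as [term][+/-] while the rest of the segment is left untouched."""
    tokens = re.findall(r"\w+|\W+", segment)      # keep punctuation and whitespace as-is
    units = []
    for idx, tok in enumerate(tokens):
        polarity = PET.get(tok.lower())
        if polarity is None:
            continue
        marked = tokens.copy()
        marked[idx] = f"[{tok}][{polarity}]"
        units.append({"pet": tok, "valence": polarity, "text": "".join(marked)})
    return units

for unit in coding_units("Can that gaffe machine win?"):
    print(unit["text"])
# -> "Can that [gaffe][-] machine win?"  and  "Can that gaffe machine [win][+]?"
```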
Table 1 illustrates the segmentation and highlighting procedure for the subsequent coding task, based on an extract from the Hannity Show shown on U.S. television channel Fox News. The segment displays a window of four sentences,10 including a core sentence with a predicted state (marked by the pointed brackets), in which Sean Hannity predicts that Joe Biden will likely announce his candidacy in the U.S. 2016 presidential primaries. While the pointed brackets mark across all segments the same core sentence that contains the future-oriented scenario, each row highlights a different PET in square brackets. Asking Laura Ingraham whether Biden might be able to win the elections, Hannity uses several sentiment words. While the snippet mostly contains positive sentiment words (“right,” “articulate,” “bright,” “clean,” “nice,” “win,” “win”), it is the only negative word, “gaffe”11 (part of “gaffe machine,” highlighted in the 7th row), which evaluates the predicted state, criticizing Biden for his numerous blunders and many other flaws, and thus revealing Hannity’s negative opinion about Biden’s prospective candidacy.

Table 1. Illustration of the segmentation and highlighting of sentiment words.

Sentiment word 1/8 (evaluative usage: 0): “All [right][+], Laura, next question. >>> Joe Biden I think is going to get into the race. <<< Can the guy that actually said about Barack Obama, he (is) a mainstream African American who is articulate and bright and clean, a nice looking guy, can that guy win? Can that gaffe machine win?”

Sentiment word 2/8 (evaluative usage: 0): “All right, Laura, next question. >>> Joe Biden I think is going to get into the race. <<< Can the guy that actually said about Barack Obama, he (is) a mainstream African American who is [articulate][+] and bright and clean, a nice looking guy, can that guy win? Can that gaffe machine win?”

[Sentiment words 3/8 through 6/8 omitted]

Sentiment word 7/8 (evaluative usage: 1): “All right, Laura, next question. >>> Joe Biden I think is going to get into the race. <<< Can the guy that actually said about Barack Obama, he (is) a mainstream African American who is articulate and bright and clean, a nice looking guy, can that guy win? Can that [gaffe][-] machine win?”

Sentiment word 8/8 (evaluative usage: 0): “All right, Laura, next question. >>> Joe Biden I think is going to get into the race. <<< Can the guy that actually said about Barack Obama, he (is) a mainstream African American who is articulate and bright and clean, a nice looking guy, can that guy win? Can that gaffe machine [win][+]?”

[10] This four-sentence example deviates from the standard five-sentence segment size because our segmentation algorithm did not include sentences that were situated before the beginning of a new paragraph in the segment.
[11] The term “gaffe” refers to an unintentional and embarrassing act or remark (i.e., a blunder). In the context of the U.S. 2020 presidential elections, the term “gaffe-machine” has been extensively used in the U.S. media to describe Biden’s alleged tendency to commit such gaffes.
In Step 4, a group of six coders were thoroughly trained on a random selection of 1,633 such
segments from 59 randomly sampled news articles, along with television and radio transcripts.
For each coding unit, three coding decisions were required. First, coders ascertained that the
highlighted core sentence indeed contained a relevant political projection. Segments that did not
meet this criterion were coded as irrelevant. Second, coders decided for all relevant segments
whether the highlighted sentiment-term was used to evaluate the projection expressed in the core
sentence. Instances could be coded as a) “used to evaluate the projection,” b) “not used to
evaluate the projection,” or c) “unclear,” in rare cases when the semantic connection
between both entities could not be interpreted. Where the text supported multiple readings,
coders were instructed to code potentially evaluative terms as “used to evaluate” whenever
this was at least one possible reading of the text, treating as unrelated only those cases where no
evaluative use could be identified. Third, coders decided for each evaluatively used expression
whether it evaluated the object a) in line with the indicator’s innate evaluative tendency (e.g.,
“pretty” used to evaluate positively); b) in a way that inverted their innate evaluative tendency
(e.g., “pretty” used to evaluate negatively, such as “anything but pretty”); or c) in an
unclear or ambiguous fashion. Taken together, this three-step nested classification task thus
constituted six alternative classifications. Table 2 presents all six categories with two illustrative
examples per category (see supplementary material S4 for the full codebook).
Based on the same 1,633 segments used for the training of the manual coders, the authors
obtained a separate gold standard by deliberatively deciding the true classification of each segment.
During eleven rounds of coder training, intercoder-reliability was consecutively calculated, and
necessary adjustments to the coding instructions were added after every coding round. Coder-
training was completed after agreement scores ceased to rise in relation to previous rounds. For the
consecutive stage of individual coding, we selected the four coders with highest agreement scores
(Krippendorff’s Alpha scores of .65 for identifying election-related projections, and .41 for classifying evaluations).12

Table 2. Overview of the coding instructions for identifying evaluations of political projections.

Code 0 – No political projection present in core sentence. Examples: “They celebrated like there is no tomorrow.” / “Who knows what the future will bring?”
Code 1.0 – Potentially Evaluative Term (PET) not used to evaluate the projection. Examples: “The recent [terror][-] attacks could affect the outcome of the elections.” / “Following a [turbulent][-] discussion it looks like the protests will continue.”
Code 1.1.1 – PET used to evaluate the projection, consistently with marked tendency. Examples: “We will make America [great][+] again.” / “Losing these elections would be [devastating][-] for the whole country.”
Code 1.1.2 – PET used to evaluate the projection, opposite to marked tendency. Examples: “One cannot be seriously [happy][+] about the prospective Trump victory.” / “Every other outcome would be less than [fortunate][-].”
Code 1.1.3 – PET used to evaluate the projection, with unclear/ambiguous tendency. Examples: “Looks like these elections are going to be [interesting][+].” / “It is yet [unclear][-] how this scenario would affect the economy.”
Code 1.2 – Not clear whether the PET evaluates the projection. Examples: “These are [scary][-] times. Well, I don’t expect he will win.” / “She is a [successful][+] businessperson. Recent polls predict her victory.”
Note: Future markers highlighted in italic; projections (evaluated object) in bold; potentially evaluative terms (PET) in [brackets], with labeled innate valence [+]/[-].

[12] For the classification of evaluative uses, intercoder reliability is bounded by the presence of textual ambiguity, which results in valid disagreement among coders (Baden et al., 2023): While conventional alpha levels are based on the assumption that a unique classification exists for every textual unit (Krippendorff, 2018), many coded segments validly permit multiple readings that result in different, equally valid classification decisions. Especially in transcripts of spoken language, coding often depended on whether coders interpreted adjacent sentences as about one another (e.g., consider the classification of “scary” in “These are scary times. Well, I don’t expect he will win,” which may or may not also evaluate the prediction; please see supplementary material S5 for additional detail). After extensive coder training, almost all remaining disagreement was reduced to such cases of valid ambiguity, meaning that we are confident that coded classifications are valid even if competing classifications remain available. The observed degree of disagreement thus considerably overestimates the amount of coding error, suggesting that our reliability is in fact much better than the weak alpha level of .41 indicates. Adding Geiß’s (2021) argument that high alpha levels are required mostly for small sample sizes and weak effects, our large sample (of 10,004 coded segments) should ensure that our data still suffices to draw meaningful conclusions.
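For readers who want to reproduce this kind of reliability check, a minimal sketch using the Python krippendorff package is shown below; the coder-by-unit ratings are invented placeholders, not the study's annotation data.

```python
import numpy as np
import krippendorff

# rows = coders, columns = coding units; np.nan marks units a coder did not rate
ratings = np.array([
    [0, 1, 1, 2, 0, np.nan],
    [0, 1, 2, 2, 0, 1],
    [0, 1, 1, 2, np.nan, 1],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```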
In Step 5, these four thoroughly trained coders manually classified 10,004 randomly sampled
coding units in the individual coding phase. 8,000 segments were randomly sampled from 1,014
news articles and transcripts in 16 major U.S. media outlets (including print, online, radio, and television news). In addition, 2,004 segments were extracted from 568 tweets in Twitter feeds of ten prominent U.S. politicians and journalists (see footnote 4 for the list of media outlets and Twitter
feeds).
We are consequently using two datasets within this paper (see Table 3 for a summary). The first,
carefully curated gold standard dataset contains 1,633 segments coded based on the authors’ consensual decisions, which were then used to train our four coders. Since these segments were used in the
coder-training phase, they were not used as training data for the automated classifier. Given its high
validity, the gold standard data was used to evaluate the performance of our own classifier, and
compare it to the quality of annotations obtained from the respective sentiment dictionaries and
existing classifiers. The second dataset consists of 10,004 segments coded individually by our trained
independent coders to obtain a sufficiently large dataset to train our evaluations classifier. The
evaluations classifier’s internal performance was evaluated on this dataset, using 10-fold cross-validation.

Table 3. Description of the datasets used for training and testing the evaluations classifier.

Gold standard – carefully curated dataset based on consensual classification by the authors, to guarantee a high degree of validity. Size: 1,633 segments. Used for: (1) eleven rounds of coder training; (2) comparative tests of classifier performance.
Individual coding – independent classification by four trained coders. Size: 10,004 segments. Used for: training and evaluation of internal classifier performance via 10-fold cross-validation.
In Step 6, all segments from the individual coding dataset (N = 10,004) were minimally pre-
processed for the machine learning procedure. Considering the key role of grammatical information
for our classification task, we decided to neither stem nor lemmatize our corpus, but removed only
punctuation as well as a short list of stop-words judged to be uninformative toward the classification
task. Next, we created a document feature matrix that included unigrams, bigrams, and trigrams, to
capture the sequential order of words required to extract the grammatical embedding. We then pruned

the 1% most common n-grams, as well as any n-grams with a frequency of less than five cases (which
effectively excludes most trigrams, retaining only common expressions). From the raw corpus, we
created four variants: one version with stop-words excluded and common trigrams included (SE-TI),
one version with stop-words included and trigrams included (SI-TI), one version with stop-words
removed and trigrams excluded (SE-TE) and one version with stop-words included and trigrams
excluded (SI-TE). Segments were passed to the classifier with two important edits: (a) in each coding
unit, the potentially evaluative term (PET) under consideration was augmented by a binary marker
expressing its evaluative valence ([+] or [-]) to focus the classifier on the constructed relationship
between evaluative expressions and evaluated objects; (b) the PET was thus added as a separate feature
to the classifier. In our application, we additionally passed one manually coded feature to the classifier,
which captured whether the automatically recognized future reference indeed expressed a projection –
a step that would be arguably unnecessary if the evaluated object can be directly recognized by an
algorithmic procedure (e.g., for named entities). The target variable was recoded to distinguish only
evaluative uses consistent with (1) and opposite to (2) PETs’ innate valence from all kinds of non-
evaluative usage (0: no or undecidable evaluative use, or no relevant projection present).
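A rough scikit-learn analogue of this preprocessing step might look as follows; the original pipeline was built with the quanteda R package, and the toy coding units, the valence-suffix encoding of the highlighted PET, and the omitted pruning thresholds are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

# Toy coding units: the highlighted PET carries a valence suffix so the n-gram
# features preserve its marked position; labels 0/1 are placeholders.
units = [
    {"text": "can that gaffe_NEG machine win", "pet_valence": -1, "is_projection": 1, "label": 1},
    {"text": "can that gaffe machine win_POS", "pet_valence": +1, "is_projection": 1, "label": 0},
    {"text": "she will win_POS the primary",   "pet_valence": +1, "is_projection": 1, "label": 1},
    {"text": "a great_POS rally yesterday",    "pet_valence": +1, "is_projection": 0, "label": 0},
]

# unigrams to trigrams; the study additionally pruned n-grams occurring fewer than
# five times as well as the 1% most frequent n-grams (omitted here for the toy data)
vec = CountVectorizer(ngram_range=(1, 3), lowercase=True)
X_text = vec.fit_transform(u["text"] for u in units)

# append the highlighted PET's innate valence and the manually coded projection flag
extra = csr_matrix([[u["pet_valence"], u["is_projection"]] for u in units])
X = hstack([X_text, extra]).tocsr()
y = np.array([u["label"] for u in units])

print(X.shape, y)
```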
Finally, in Step 7, we trained and compared the performance of three supervised machine
classification algorithms to determine the evaluative usage of potentially evaluative expressions.
Specifically, we trained a Naïve Bayes (NB) and a SVM classifier, using the quanteda R-package
(Benoit et al., 2018),13 as well as a transformer-based roBERTa classifier using the Simple Transformers
library14 to represent the latest generation of neural network algorithms.15 The NB and SVM classifiers
were applied to each of the four variants of the preprocessed corpus (SE-TI, SI-TI, SE-TE, SI-TE),
while roBERTa includes its own preprocessing pipeline and was therefore applied to the un-
preprocessed data. Classifier performance was evaluated twice: once using 10-fold cross-validation against held-out data from the individual coding dataset (N = 10,004; splits: 90% training data, 10% test data; the test data was additionally broken down to evaluate whether classifier performance differed between news texts and Twitter content), to obtain average precision, recall, and macro F1-scores; and once on the separate gold standard data (N = 1,633).
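The classifiers themselves were trained with the quanteda R package and the Simple Transformers library; as a rough, analogous sketch of the bag-of-n-grams models and the 10-fold cross-validation, the Python/scikit-learn snippet below approximates the SI-TI variant. The input file and column names are assumptions, and LinearSVC/MultinomialNB stand in for the specific implementations used.

```python
# Rough Python/scikit-learn analogue of the bag-of-n-grams classifiers and the
# 10-fold cross-validation described above (the paper itself used quanteda in R).
# The input file and the 'text'/'label' columns are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

data = pd.read_csv("individual_coding.csv")  # 10,004 coded segments (hypothetical file)

# Unigrams to trigrams, dropping n-grams that occur fewer than five times,
# roughly mirroring the SI-TI preprocessing variant (stopwords retained).
features = CountVectorizer(ngram_range=(1, 3), min_df=5)

for name, clf in [("SVM", LinearSVC()), ("Naive Bayes", MultinomialNB())]:
    pipe = make_pipeline(features, clf)
    # Pool the held-out predictions from 10 cross-validation folds.
    pred = cross_val_predict(pipe, data["text"], data["label"], cv=10)
    print(name, "macro F1:", round(f1_score(data["label"], pred, average="macro"), 2))
    print(confusion_matrix(data["label"], pred))
```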
In addition, we compared the performance of our classifier against five freely available, commonly used sentiment tools: the sentiment dictionaries included in Lexicoder (Young & Soroka, 2012, using its "positive" and "negative" categories), Hu and Liu (2004), the General Inquirer (Stone et al., 1966, using its "positive" and "negative" categories), and Loughran and McDonald (2011), all of which are available through the quanteda R-package (Benoit et al., 2018); as well as the neural Twitter-roBERTa-base sentiment classifier developed by Barbieri et al. (2020).16

13 Both Naïve Bayes and SVM belong to the class of linear classifiers that have been shown to perform well for text classification with regard to both speed and accuracy (Wiedemann, 2019, p. 141).
14 https://simpletransformers.ai/
15 roBERTa is a transformer-based classifier which can be deployed as a supervised machine classifier following additional training on user data.
16 The Twitter-roBERTa-base for Sentiment Analysis classifier is a transformer-based language model that comes pre-trained on a corpus of ~124M tweets in English (https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest). The classifier includes a built-in capacity for sentiment classification that predicts a probability score for positive, neutral, and negative tone, as well as an overall classification of prevalent sentiment.

Following a suggestion from an anonymous reviewer,17 we conducted a final validation check to compare our classifier with a more conventional use of our training data. To achieve this, we first converted our PET-level annotations to obtain the overall evaluative tendency for each coded segment (the aggregation process is described in detail below). Subsequently, we fine-tuned a roBERTa classifier on this re-aggregated representation.
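As a hedged sketch of the two transformer-based comparisons, the snippet below applies the pre-trained sentiment model named in footnote 16 via the Hugging Face pipeline API and fine-tunes a roBERTa classifier on segment-level labels with the Simple Transformers library; the file name, column names, and label mapping are assumptions, not the authors' exact setup.

```python
# Sketch only: (a) applies the pre-trained sentiment model named in footnote 16,
# (b) fine-tunes a roBERTa classifier on segment-level labels with Simple Transformers.
# File names, column names, and the label mapping are assumptions.
import pandas as pd
from simpletransformers.classification import ClassificationModel
from transformers import pipeline

# (a) Off-the-shelf Twitter-roBERTa sentiment baseline
sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)
print(sentiment("I am confident the race can still be turned around."))

# (b) Fine-tuning roBERTa on re-aggregated segment-level labels (-1/0/+1 mapped to 0/1/2)
segments = pd.read_csv("gold_standard_segments.csv")
train_df = pd.DataFrame({
    "text": segments["text"],
    "labels": segments["segment_valence"].map({-1: 0, 0: 1, 1: 2}),
})
model = ClassificationModel("roberta", "roberta-base", num_labels=3, use_cuda=False)
model.train_model(train_df)
predictions, raw_outputs = model.predict(segments["text"].tolist())
```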

Results
Classifier performance
In a first step, we compared the performance of the trained SVM, Naïve Bayes (NB), and roBERTa
classifiers based on the individual coding dataset (N = 10,004), using the averaged macro F1 scores
across the 10 cross-validation runs. For both SVM and Naïve Bayes classifiers, the pre-processing
pipeline including both stopwords and trigrams (SI-TI) in the classification resulted in the highest
performance, and it is this variant that we refer to in all further results.18
Table 4 presents the confusion matrices for all three classifiers. The results show that the SVM
classifier clearly outperformed both the NB and the roBERTa classifiers in almost all categories (Macro
F1: 0.56, SD: .06, with slightly higher performance for Twitter data [0.59] than for news data [0.56]). The NB classifier (Macro F1: .45, SD: .03) under-annotated consistent evaluative uses less often than the SVM, which improved recall, but at considerable cost to precision. The predictions of the
roBERTa model (Macro F1: .40, SD: .08) were least accurate in all metrics save for recall for non-
evaluative uses, owing to its tendency to default to the majority class (0).

Additional validation
Next, we evaluated the best-performing model (the SVM classification with trigrams and stopwords included, SI-TI) against our separate gold-standard data set of 1,633 coded segments, in which disagreements were resolved through discussion to ensure the highest quality standards. In addition, we compared the performance of our classifier against the six alternative classifiers presented above: four off-the-shelf sentiment dictionaries, namely Lexicoder (Young & Soroka, 2012), Hu and Liu (2004), the
General Inquirer (Stone et al., 1966), and Loughran and McDonald (2011), the pre-trained Twitter-
roBERTa base sentiment classifier developed by Barbieri et al. (2020), and the segment-level roBERTa
model fine-tuned on our own re-aggregated training data.
To align our classification scheme, which categorizes (non-)evaluative uses of potentially evaluative
expressions, with the more conventional classification into positive, neutral, and negative valence, we
multiplied the valence innate to each PET (+1: positive, −1: negative) with its form of use (0: non-
evaluative use, +1: consistent use, −1: opposite use). For example, if a PET was classified as positive [+] by
our sentiment dictionary, and its use was classified as opposite, the resulting valence would be negative
(−1). Subsequently, we aggregated any recognized evaluative uses per coded segment to return from our
term-specific measurement (which required multiple classifications per segment whenever more than one
PET was included) to a segment-level sentiment score. For each segment, we obtained the mean valence
over all evaluatively used PET (i.e., ignoring all PET not used in an evaluative manner; for example,
a segment with two positive (2× +1), one negative (1× −1) and any number of non-evaluative PET would
obtain a score of 0.33). From the 1,633 coded gold standard segments, this returned a total of 298 segments with overall sentiment scores ranging from −1 to +1. The same segments were subsequently fed into the segment-level roBERTa classifier for fine-tuning, and processed with each of the named sentiment dictionaries and the pre-trained sentiment classifier.
17 We thank the reviewer for this valuable suggestion.
18 Overall, stopwords appear to provide relevant contextual information that helps the algorithm make more accurate predictions, so including them slightly improves performance, especially for the less common categories. By comparison, excluding versus including trigrams has only a marginal impact on performance.

Table 4. Confusion matrix of SVM, Naïve Bayes, and roBERTa classifications of evaluative uses of PET (individual coding dataset, PET-level aggregation, N = 10,004).

SVM
Predicted class       | 0     | 1    | 2    | Recall
True class 0          | 8,945 | 257  | 28   | 0.97
True class 1          | 319   | 299  | 22   | 0.47
True class 2          | 79    | 32   | 23   | 0.17
Precision             | 0.96  | 0.51 | 0.32 |
F1 (Macro F1 = 0.56)  | 0.96  | 0.49 | 0.22 |

Naïve Bayes
Predicted class       | 0     | 1     | 2    | Recall
True class 0          | 7,764 | 1,278 | 188  | 0.84
True class 1          | 235   | 384   | 21   | 0.60
True class 2          | 61    | 50    | 23   | 0.17
Precision             | 0.96  | 0.22  | 0.10 |
F1 (Macro F1 = 0.45)  | 0.90  | 0.33  | 0.13 |

roBERTa
Predicted class       | 0     | 1    | 2    | Recall
True class 0          | 9,130 | 100  | 0    | 0.99
True class 1          | 536   | 104  | 0    | 0.16
True class 2          | 113   | 21   | 0    | 0.00
Precision             | 0.93  | 0.46 | 0.00 |
F1 (Macro F1 = 0.40)  | 0.96  | 0.24 | 0.00 |

Note: 0 – non-evaluative use; 1 – evaluative use, consistent with innate valence; 2 – evaluative use, opposite to innate valence. The absolute numbers in the confusion matrices are the sum of predicted values generated during the 10-fold cross validation.

Table 5 presents the correlation scores between the results obtained by our two SML-based approaches, the manual gold standard, the off-the-shelf sentiment dictionaries, and the pre-trained neural sentiment classifier.
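The alignment and aggregation logic described above reduces to a few lines of arithmetic; the sketch below reproduces the worked example of a segment with two positive, one negative, and one non-evaluative PET (the data structure is illustrative).

```python
# Illustrative sketch of the PET-to-segment aggregation described above.
# Each tuple is (innate_valence, use): valence is +1/-1; use is 0 (non-evaluative),
# +1 (consistent use), or -1 (opposite use).
def segment_score(pets):
    signed = [valence * use for valence, use in pets]   # resulting valence per PET
    evaluative = [v for v in signed if v != 0]          # drop non-evaluative uses
    return sum(evaluative) / len(evaluative) if evaluative else None

# Two positive, one negative, and one non-evaluative PET, as in the worked example.
example = [(+1, +1), (+1, +1), (-1, +1), (+1, 0)]
print(round(segment_score(example), 2))                 # 0.33
```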
As seen in the first row, which assessed the various computational approaches against the manual
gold standard, our two SML-based classifiers (the SVM-model trained on PET-level annotation data
and the roBERTa model fine-tuned on our segment-level re-aggregated data) by far outperformed all
other sentiment measures, including the pre-trained neural classifier. While our two SML-models
show similar performance, the simpler and more transparent SVM-model based on our PET-level
annotation data still outperformed the transformer-based roBERTa model fine-tuned on our segment-
aggregated data.
While we still see only a moderate positive correlation between our SVM classifier and the gold standard (r = .40), the best-performing sentiment measure, Hu and Liu's (2004) dictionary, reaches only about half that value (r = .22). When the classification achieved by each approach is correlated with our SVM-based classifier (second row), the segment-level roBERTa, which draws upon the same data as our classifier, is at least moderately similar; among the off-the-shelf classifiers, the pre-trained roBERTa comes only marginally closer to our own predictions (r = .31) than Loughran and McDonald's (2011) most-similar dictionary (r = .26), which correlates only weakly.
Adding the 95% confidence intervals, Figure 2 shows that our SVM classifier significantly outperforms the pre-trained roBERTa as well as the General Inquirer and the Loughran & McDonald dictionaries.
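For context on Figure 2, a Pearson correlation between two vectors of segment scores and its 95% confidence interval can be obtained as sketched below (Fisher z-transformation; the input vectors here are random placeholders, not our data).

```python
# Sketch: Pearson correlation with a 95% confidence interval via Fisher's z-transform,
# as used for comparisons like those shown in Figure 2. The vectors are placeholders.
import numpy as np
from scipy.stats import norm, pearsonr

def correlation_with_ci(x, y, alpha=0.05):
    r, p = pearsonr(x, y)
    z = np.arctanh(r)                        # Fisher z-transform of r
    se = 1.0 / np.sqrt(len(x) - 3)           # standard error of z
    crit = norm.ppf(1 - alpha / 2)
    return r, p, (np.tanh(z - crit * se), np.tanh(z + crit * se))

rng = np.random.default_rng(0)
gold = rng.choice([-1.0, -0.5, 0.0, 0.5, 1.0], size=298)     # stand-in for gold-standard scores
pred = np.clip(gold + rng.normal(0, 0.6, size=298), -1, 1)   # stand-in for classifier scores
print(correlation_with_ci(gold, pred))
```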
Table 6 presents the confusion matrices for the performance comparisons of all classifiers in
relation to the manual gold standard. Overall, our approach achieved a much higher macro F1 (.58) than any sentiment dictionary or pre-trained sentiment tool. It was outperformed only by the variant
that relied on the same coding process and data but aggregated the annotations to segment level prior
to training the machine classifier (macro F1 = .63). Specifically, the segment-level model shows slightly
better recall for both positive and negative evaluations but lags in its capacity to distinguish non-
evaluative contents – the key weakness of commonly used sentiment tools. The overall superior
performance can likely be attributed to the more balanced class distribution in the training data
compared to the PET-level annotation.
The pre-trained Twitter-roBERTa-base sentiment classifier came in next, but well behind (macro F1
= .43). The sentiment dictionaries performed worst, with Loughran and McDonald's sentiment dictionary about half as accurate as our approach (macro F1 = .33), while Lexicoder was least able to accurately predict evaluations (macro F1 = .24). Our classifier consistently achieved superior precision within each coded category, especially for negative (−1) and positive evaluations (+1), reflecting the sentiment dictionaries' heavily over-inclusive annotation, which recognized many indicators that are not used in an evaluative fashion.

Figure 2. Correlations and confidence intervals for the comparison between automated measures and manual classification (2.1),
and between sentiment measures and our SVM classifier (2.2) (gold standard dataset, re-aggregated on segment-level, N = 298).

Table 5. Correlations of sentiment scores between the two SML-based classifiers, the manual gold standard, four off-the-shelf sentiment dictionaries, and the pre-trained Twitter-roBERTa sentiment classifier (gold standard dataset, re-aggregated on segment level, N = 298).

                                  | Manual Gold Standard | SVM (SI-TI) | roBERTa (segment-level data) | Lexicoder | Hu & Liu | General Inquirer | Loughran & McDonald | Twitter-roBERTa-base sentiment
Manual Gold Standard              | x                    | 0.40**      | 0.39**                       | 0.21**    | 0.22**   | 0.15*            | 0.14*               | 0.14*
SML-based classifier (SVM, SI-TI) | 0.40**               | x           | 0.45**                       | 0.19**    | 0.18**   | 0.05             | 0.26**              | 0.31**

Note: SVM (SI-TI) and roBERTa (segment-level data) are the SML classifiers based on our coding process; Lexicoder, Hu & Liu, the General Inquirer, and Loughran & McDonald are the off-the-shelf sentiment dictionaries; Twitter-roBERTa-base sentiment is the pre-trained sentiment classifier.
*Correlation is significant at the 0.05 level (2-tailed). **Correlation is significant at the .01 level (2-tailed).

For the same reason, of course, the sentiment dictionaries tended to miss fewer positive (+1) and negative (−1) evaluations. The Lexicoder sentiment dictionary correctly identified the largest share of negative evaluations (.69), while the dictionary by Hu & Liu caught positive evaluations best (.80), in both cases outperforming our SML-based classifier. By contrast, our SVM classifier by far outperformed all other tools in correctly identifying non-evaluative segments (.91), followed at some distance by the pre-trained neural Twitter-roBERTa classifier (.65), which also scored notably worse on precision. Of course, with a macro F1 of .58, the performance of our own SVM classifier also continued to lag far behind the quality obtained, albeit at considerably higher cost, by manual coding.
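To make the link between the confusion matrices and the reported scores transparent, the sketch below recomputes per-class precision, recall, and macro F1 from the SVM matrix in Table 6; only the cell counts are taken from the table, and the commented values are approximate.

```python
# Recomputing per-class precision, recall, and macro F1 from the SVM confusion
# matrix reported in Table 6 (rows = true class, columns = predicted class).
import numpy as np

cm = np.array([[11,  16,  2],    # true -1
               [12, 209,  8],    # true  0
               [ 1,  24, 15]])   # true +1

precision = np.diag(cm) / cm.sum(axis=0)            # ~[0.46, 0.84, 0.60]
recall = np.diag(cm) / cm.sum(axis=1)               # ~[0.38, 0.91, 0.38]
f1 = 2 * precision * recall / (precision + recall)  # ~[0.42, 0.87, 0.46]
print(round(f1.mean(), 2))                          # macro F1 = 0.58, as reported above
```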
To summarize, both classifiers trained on our data exhibited significantly higher accuracy in predict­
ing evaluations, compared to all tested sentiment measures, including both dictionaries and neural
classifiers. Particularly for our proposed approach, which focuses on the evaluative usage of potentially
evaluative terms, this superior performance can be attributed to its capability to accurately identify non-
evaluative uses of sentiment words, along with its ability to predict negative and positive evaluations
more precisely. Conversely, sentiment dictionaries demonstrated better performance in identifying
positive and negative evaluations, albeit at the expense of noisy and over-inclusive annotations. The
segment-level model, fine-tuned on our data, offers a viable compromise, maintaining high recall for
evaluative expressions while eliminating much (but not all) of the over-inclusiveness of the dictionaries.

Table 6. Performance of the SML-based classifier and sentiment measures against the manual gold standard (gold standard dataset, re-aggregated on segment level, N = 298).

SML classifier (SVM, SI-TI)
Predicted class       | −1   | 0    | 1    | Recall
True class −1         | 11   | 16   | 2    | 0.38
True class 0          | 12   | 209  | 8    | 0.91
True class 1          | 1    | 24   | 15   | 0.38
Precision             | 0.46 | 0.84 | 0.60 |
F1 (Macro F1 = 0.58)  | 0.42 | 0.87 | 0.46 |

SML classifier (roBERTa, segment-level data)
Predicted class       | −1   | 0    | 1    | Recall
True class −1         | 19   | 6    | 4    | 0.66
True class 0          | 31   | 189  | 9    | 0.83
True class 1          | 7    | 11   | 22   | 0.55
Precision             | 0.33 | 0.92 | 0.63 |
F1 (Macro F1 = 0.63)  | 0.44 | 0.87 | 0.59 |

Lexicoder
Predicted class       | −1   | 0    | 1    | Recall
True class −1         | 20   | 1    | 8    | 0.69
True class 0          | 74   | 21   | 134  | 0.09
True class 1          | 9    | 4    | 27   | 0.68
Precision             | 0.19 | 0.81 | 0.16 |
F1 (Macro F1 = 0.24)  | 0.30 | 0.16 | 0.26 |

Hu & Liu
Predicted class       | −1   | 0    | 1    | Recall
True class −1         | 12   | 7    | 10   | 0.41
True class 0          | 50   | 33   | 146  | 0.14
True class 1          | 4    | 4    | 32   | 0.80
Precision             | 0.18 | 0.75 | 0.17 |
F1 (Macro F1 = 0.26)  | 0.25 | 0.24 | 0.28 |

General Inquirer
Predicted class       | −1   | 0    | 1    | Recall
True class −1         | 15   | 5    | 9    | 0.52
True class 0          | 61   | 39   | 129  | 0.17
True class 1          | 9    | 7    | 24   | 0.60
Precision             | 0.18 | 0.76 | 0.15 |
F1 (Macro F1 = 0.26)  | 0.26 | 0.28 | 0.24 |

Loughran & McDonald
Predicted class       | −1   | 0    | 1    | Recall
True class −1         | 17   | 7    | 5    | 0.59
True class 0          | 97   | 82   | 50   | 0.36
True class 1          | 14   | 10   | 16   | 0.40
Precision             | 0.13 | 0.83 | 0.23 |
F1 (Macro F1 = 0.33)  | 0.22 | 0.50 | 0.29 |

Twitter-roBERTa-base sentiment
Predicted class       | −1   | 0    | 1    | Recall
True class −1         | 10   | 15   | 4    | 0.34
True class 0          | 57   | 148  | 24   | 0.65
True class 1          | 10   | 15   | 15   | 0.38
Precision             | 0.13 | 0.83 | 0.35 |
F1 (Macro F1 = 0.43)  | 0.19 | 0.73 | 0.36 |

Discussion
In this article, we have proposed a consequential shift in perspective on conventional strategies for machine classification, in order to improve performance and measurement validity in the detection of object-
specific evaluations in large political text corpora. Based on a discussion of the conceptual character­
istics of evaluations (Hunston & Thompson, 2000), we have argued that commonly relied-upon
proxies – notably, sentiment and stance/opinion – are unsuitable as valid measures of evaluations,
echoing recent criticism (Boukes et al., 2020; Chan et al., 2020; Rauh, 2018; van Atteveldt et al., 2021).
In their place, we have developed an alternative strategy that identified the presence of a semantic
relation between potentially evaluative terms and an evaluated object as the Achilles heel of valid
measurement. By distinguishing the (relatively easy) task of recognizing potentially evaluative terms
from the (much harder) task of determining whether an expression evaluates a given object, we
developed a novel approach to the measurement of textually expressed evaluations that combined
existing dictionary methods with a recast use of supervised machine learning. While sentiment
dictionaries are generally suitable to extract sentiment-laden terms that might be used to evaluate
specific objects, evaluations require sentiment terms to perform an evaluative function with regards to
a given target concept, which can be recognized in a separate step. By designing a classifier that focused
exactly on this semantic relationship between evaluative expressions and evaluated objects, we not
only markedly improved the performance of measurement beyond commonly used measures, but we
also paved the way toward a more general discussion in the use of supervised machine classification.
Given the wealth of relational constructs measured in social science textual research, we believe that
deploying the power of machine classification specifically to evaluate the presence of a semantic
relationship between constitutive components holds ample promise for future development: In many
such measurement challenges, the presence of required components of a construct – be that a named

entity and a potential attribute (Boomgaarden et al., 2012; Fridkin & Kenney, 2011; van Spanje & de
Vreese, 2014); a grievance and a potentially blamed actor (Busby et al., 2019; Hameleers et al., 2018);
a policy and a potentially evaluative term (Pieri, 2019; Rölle, 2017); or two parties in a potentially
antagonistic confrontation (Fogel-Dror et al., 2019) – is comparatively easy to ascertain. The challenge
lies in determining that the recognized components are actually semantically related to one another,
i.e., that an actor co-present in a text with a grievance is indeed blamed, or that two co-present actors
are indeed presented as in conflict with one another. Where the recognition of components – as in our
case, potentially evaluative terms – is facilitated by a certain predictability of word meanings, language
knows numerous, often subtle ways of expressing a semantic relation, which are extremely difficult to
model (e.g., van Atteveldt et al., 2017), but which may be effectively learned by a well-focused machine
classifier.
Evaluating the potential of our approach for improving the computational measurement of object-
specific evaluations, results showed our classifier to considerably outperform existing sentiment
measures, both regarding the precision of annotated evaluations, and the correct identification of non-
evaluative uses. It is especially the latter finding – our classifier’s far superior capacity to discriminate
potentially, but not presently evaluative expressions (the most common category by far in our
application) – that demonstrates the important validity gain offered by our approach. Where all tested
off-the-shelf sentiment dictionaries, and also the deployed pre-trained neural classifier generated large
numbers of false positives, our multi-step approach appeared overall capable of focusing the analysis
on expressions that were indeed used in an evaluative function.
Furthermore, our classifier did so despite the considerable imbalance in class distribution, raising hopes that further performance gains can be realized by using more extensive and balanced training
sets. By contrast, similar performance boosts are less likely for the similar-performing classifier fine-
tuned on the segment-aggregated data, which was notably less imbalanced. Representing a more
conventional use of SML that focuses on holistic document classification, this variant also proved less
capable of distinguishing non-evaluative uses of potentially evaluative terms, which is the main
weakness of existing sentiment measures.
Despite the significant gain in measurement validity offered by the capacity to distinguish non-evaluative uses, our classifier's performance also remains far from the quality of manual coding. Misclassifying roughly one in five coded segments, the obtained performance may be sufficient for some uses (e.g., the detection of over-time or comparative differences), but raises legitimate concerns in others.
In part, this limitation may be attributed to the relatively difficult classification task of evaluative uses,
which additionally depended on the presence of a valid projection, and thus relied on relatively few
and unevenly distributed entries for the main coded categories – a problem that other applications
may easily avoid by focusing on more common and more easily recognized evaluated objects. Even
more so, our focus on evaluated future scenarios permits an unusually wide range of plausible
evaluative expressions, while for most objects, the range of relevant evaluations might be more
constrained, and more easily predicted.
That said, especially for highly complex evaluations, our strategy of breaking down the classifica­
tion of overall evaluative tendencies into a sequence of much simpler, binary choices may considerably
facilitate the manual annotation task. In our experience from numerous applications of textual
analysis, it is often difficult even for human coders to determine whether a text segment overall
evaluates a given object in a specific way (Baden et al., 2023; Kantner & Overbeck, 2020). By
comparison, deciding whether a pre-identified, potentially evaluative expression is used to evaluate
a given object is possible with much higher confidence and speed, replacing one tough and ambivalent
classification task with several much simpler ones.
While we readily acknowledge that our application leaves many avenues for further improvement,
we believe that our distinctive approach of determining the presence of a specific kind of relationship
between pre-identified expressions offers valuable directions for future development. Future research
should be able to easily adapt our approach to other kinds of evaluative discourse, but also to the

measurement of related evaluative constructs such as opinions, author- and object-specific stances
(Aldayel & Magdy, 2021; Liu & Zhang, 2012), or other relational constructs in a wider sense.
That said, our approach comes with several important limitations. In comparison to existing
resources, and especially to off-the-shelf sentiment dictionaries or pre-trained language models, the
proposed approach requires considerably more resources, given its multi-stage classification process
and reliance on manually annotated training data. It may be worthwhile exploring the use of active
learning strategies, which permit a more efficient usage of annotation resources: smaller amounts of
annotated segments might be sufficient to train an initial classifier, which can then be improved and
enriched iteratively. In addition, such an approach may offer deeper insights into those grammatical
rules that distinguish evaluative from non-evaluative uses of recognized sentiment terms.
Additionally, while our approach already outperforms conventional sentiment-based
approaches and, for some labels, segment-level SML-classifiers, there is still considerable room for
improving classifier performance. We would like to encourage future research to further
explore the opportunities afforded by our approach of focusing on the grammatical and
syntactic relations required to express relational and other complex constructs in text. While
we believe that this strategy facilitates manual annotation (it is easier to code whether one word is about another than to classify overall meanings) and makes better use of the specific strengths
of machine learning (notably, its reliance on distinctive patterns in language use), both
contentions are in need of systematic empirical assessment.
Another limitation results from our reliance on well-validated, inclusive sentiment diction­
aries to pre-recognize potentially evaluative expressions. Adequate resources may be unavail­
able for some languages and applications, and not easily replaced by inferior keyword lists or
machine-translated English dictionaries, which may overlook important indicators. Even where
suitable dictionaries exist, any non-sentiment terms used to express evaluations (e.g., via
figurative speech, analogies, context-sensitive qualifications) will not be considered in the
subsequent classification task. While our classifier effectively tackles the inflation of false
positive annotations inherent to conventional sentiment measures – which we believe is the
dominant validity challenge in the measurement of evaluations – it is unable to address the
challenge presented by false negatives. That said, this limitation might be less pressing at least
for most major, well-resourced languages, as many available sentiment dictionaries already
encompass many thousands of sentiment indicators. In addition, manually augmenting used
dictionaries might be sufficient to include key indicators needed for the study of specific
discourse genres and evaluated objects. By contrast, ever expanding the already large word
lists is unlikely to substantively improve recall, at the expense of further increasing the
number of irrelevant annotations. Moreover, many adaptations of our approach for the
study of other constructs may require much shorter dictionaries to ensure acceptable accuracy
(e.g., there are only so many expressions used to refer to an actor’s attributed competence, or
a policy’s perceived capacity to solve a given problem).
To conclude, the recognition of textually expressed evaluations stands at the core of numer­
ous research fields and agendas in the social sciences, warranting any efforts at advancing the
validity and sensitivity of available measurement strategies. As we have shown, the theoretically
informed, semantically valid operationalization of textually expressed evaluations remains
a major methodological challenge. The computational strategy presented in this article, as well
as its underlying, multi-stage philosophy, offer a valuable avenue for dealing more adequately
with the conceptual exigencies of evaluations, leading toward a more valid measurement of
object-specific evaluations.

Acknowledgments
The study is part of the PROFECI research project (Mediating the Future: The Social Dynamics of Public Projections,
http://profeci.net) funded by the ERC (Starting Grant 802990). We are grateful for the exceptionally committed, valuable

discussions with three anonymous reviewers and the Associate Editor Marko Bachl, which have contributed much to
improving this article. We are indebted to our research assistants Shachar Birotker, Keshet Galili, Dina Michaely, Aviv
Mor, Sonia Olvovsky, and Michal Salamon for their contributions to the development of the codebook, and their work as
manual coders. Finally, we owe a big thank you to Guy Mor for his support in setting up the transformer-based
classifiers.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The work was supported by the European Research Council [802990].

Notes on contributors
Maximilian Overbeck is a Postdoctoral researcher at the Department of Communication and Journalism at the Hebrew
University of Jerusalem. His research is situated at the intersection of Political Science, Communication, and
Computational Social Science, with a special interest in religion, collective identities, and political behavior.
Christian Baden is an Associate Professor in the Department of Communication and Journalism at the Hebrew
University of Jerusalem. His research focuses on the collaborative construction of meaning in controversial public
debates, including the construction, negotiation, and contestation of frames in journalism and political communication.
Tali Aharoni is a PhD candidate at the Department of Communication and Journalism at the Hebrew University of
Jerusalem. Her research interests include journalistic production, news audiences, social media, trust, and the various
intersections of media and psychology.
Eedan Amit-Danhi is a Postdoctoral researcher at the Centre for Media and Journalism Studies at the University of
Groningen. Her research primarily focuses on the role of visual communication in the informational landscape of digital
politics.
Keren Tenenboim-Weinblatt is Full Professor at the Department of Communication and Journalism at the Hebrew
University of Jerusalem and Co-Editor-in-Chief of the Journal of Communication. Her research is in the fields of
journalism and political communication, with a particular interest in the role played by the news media in constructing
and negotiating collective pasts and futures. She is leading the ERC-funded project “Mediating the Future: The Social
Dynamics of Public Projections” (PROFECI).

ORCID
Maximilian Overbeck http://orcid.org/0000-0003-3658-5584
Christian Baden http://orcid.org/0000-0002-3771-3413
Tali Aharoni http://orcid.org/0000-0002-2138-8329
Eedan Amit-Danhi http://orcid.org/0000-0002-7029-218X
Keren Tenenboim-Weinblatt http://orcid.org/0000-0001-9268-3969

References
Alba-Juez, L., & Thompson, G. (2014). The many faces and phases of evaluation. In L. Alba-Juez & G. Thompson (Eds.),
Evaluation in context (Vol. 242, pp. 3–26). John Benjamins.
Aldayel, A., & Magdy, W. (2019). Assessing sentiment of the expressed stance on social media. arXiv:1908.03181 [cs]. https://arxiv.org/abs/1908.03181
Aldayel, A., & Magdy, W. (2021). Stance detection on social media: State of the art and trends. Information Processing &
Management, 58(4), 1–22. https://doi.org/10.1016/j.ipm.2021.102597
Baden, C., Boxman-Shabtai, L., Tenenboim-Weinblatt, K., Overbeck, M., & Aharoni, T. (2023). Meaning multiplicity
and valid disagreement in textual measurement: A plea for a revised notion of reliability. Studies in Communication
and Media, 12(4), 305–326, https://doi.org/10.5771/2192-4007-2023-4-305

Baden, C., Jungblut, M., Micevski, I., Stalpouskaya, K., Tenenboim Weinblatt, K., Berganza Conde, R.,
Dimitrakopoulou, D., & Fröhlich, R. (2018). The INFOCORE Dictionary. A Multilingual Dictionary for
Automatically Analyzing Conflict-Related Discourse. https://osf.io/f5u8h/
Baden, C., Pipal, C., Schoonvelde, M., & van der Velden, M. A. G. (2022). Three gaps in computational text analysis
methods for social sciences: A research agenda. Communication Methods and Measures, 16(1), 1–18. https://doi.org/
10.1080/19312458.2021.2015574
Baden, C., & Springer, N. (2017). Conceptualizing viewpoint diversity in news discourse. Journalism, 18(2), 176–194.
https://doi.org/10.1177/1464884915605028
Baden, C., & Stalpouskaya, K. (2015). Common methodological framework: Content analysis. A mixed-methods strategy
for comparatively, diachronically analyzing conflict discourse. Ludwig Maximilian University Munich: INFOCORE
Working Paper, 2015/10 , 1–53. https://www.infocore.eu/wp-content/uploads/2016/02/Methodological-Paper-
MWG-CA_final.pdf
Baden, C., & Tenenboim-Weinblatt, K. (2018). Viewpoint, testimony, action. Journalism Studies, 19(1), 143–161. https://
doi.org/10.1080/1461670X.2016.1161495
Barbieri, F., Camacho-Collados, J., Neves, L., & Espinosa-Anke, L. (2020). TweetEval: Unified benchmark and compara­
tive evaluation for tweet classification. arXiv. arXiv:2010.12421. https://doi.org/10.18653/v1/2020.findings-emnlp.148
Ben-Ami, Z., Feldman, R., & Rosenfeld, B. (2014). Entities’ sentiment relevance. Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics (Volume 2: Short Papers), 87–92. https://doi.org/10.3115/v1/P14-2015
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). Quanteda: An R package for
the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 1–4. https://doi.org/10.21105/joss.
00774
Bestvater, S. E., & Monroe, B. L. (2022). Sentiment is not stance: Target-aware opinion classification for political text
analysis. Political Analysis, 31(2), 235–256. https://doi.org/10.1017/pan.2022.10
Boltanski, L., & Thévenot, L. (2006). On justification: Economies of worth (Vol. 27). Princeton University Press.
Boomgaarden, H. G., Vliegenthart, R., & de Vreese, C. H. (2012). A worldwide presidential election: The impact of the
media on candidate and campaign evaluations. International Journal of Public Opinion Research, 24(1), 42–61.
https://doi.org/10.1093/ijpor/edr041
Boukes, M., van de Velde, B., Araujo, T., & Vliegenthart, R. (2020). What’s the tone? Easy doesn’t do it: Analyzing
performance and agreement between off-the-shelf sentiment analysis tools. Communication Methods and Measures,
14(2), 83–104. https://doi.org/10.1080/19312458.2019.1671966
Busby, E. C., Gubler, J. R., & Hawkins, K. A. (2019). Framing and blame attribution in populist rhetoric. The Journal of
Politics, 81(2), 616–630. https://doi.org/10.1086/701832
Chan, C., Bajjalieh, J., Auvil, L., Wessler, H., Althaus, S., Welbers, K., van Atteveldt, W., & Jungblut, M. (2020). Four best
practices for measuring news sentiment using ‘off-the-shelf’ dictionaries: A large-scale p-hacking experiment.
SocArXiv, 1–60. https://doi.org/10.31235/osf.io/np5wa
Darwish, K., Magdy, W., & Zanouda, T. (2017). Trump vs. Hillary: What went viral during the 2016 US presidential
election. International Conference on Social Informatics, 143–161. https://doi.org/10.1007/978-3-319-67217-5_10
De Cremer, D., & Oosterwegel, A. (1999). Collective self-esteem, personal self-esteem, and collective efficacy in in-group
and outgroup evaluations. Current Psychology, 18(4), 326–339. https://doi.org/10.1007/s12144-999-1007-1
Entman, R. M. (1993). Framing: Toward clarification of a fractured paradigm. Journal of Communication, 43(4), 51–58.
https://doi.org/10.1111/j.1460-2466.1993.tb01304.x
Fogel-Dror, Y., Shenhav, S. R., Sheafer, T., & Van Atteveldt, W. (2019). Role-based association of verbs, actions, and
sentiments with entities in political discourse. Communication Methods and Measures, 13(2), 69–82. https://doi.org/
10.1080/19312458.2018.1536973
Fridkin, K. L., & Kenney, P. J. (2011). The role of candidate traits in campaigns. The Journal of Politics, 73(1), 61–73.
https://doi.org/10.1017/S0022381610000861
Garten, J., Boghrati, R., Hoover, J., Johnson, K. M., & Dehghani, M. (2016). Morality between the lines: Detecting moral
sentiment in text. Proceedings of IJCAI 2016 Workshop on Computational Modeling of Attitudes.
Geiß, S. (2021). Statistical power in content analysis designs: How effect size, sample size and coding accuracy jointly
affect hypothesis testing ‐ a Monte Carlo simulation approach. Computational Communication Research, 3(1), 61–89.
https://doi.org/10.5117/CCR2021.1.003.GEIS
Hameleers, M., Bos, L., & de Vreese, C. H. (2018). Selective exposure to populist communication: How attitudinal
congruence drives the effects of populist attributions of blame. Journal of Communication, 68(1), 51–74. https://doi.
org/10.1093/joc/jqx001
Hu, Y., & Li, X. (2011). Context-dependent product evaluations: An empirical analysis of internet book reviews. Journal
of Interactive Marketing, 25(3), 123–133. https://doi.org/10.1016/j.intmar.2010.10.001
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the Tenth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 168–177.
Hunston, S. (2010). Corpus approaches to evaluation: Phraseology and evaluative language (Vol. 13). Routledge.
Hunston, S., & Thompson, G. (2000). Evaluation in text: Authorial stance and the construction of discourse. Oxford
University Press.

Kantner, C., & Overbeck, M. (2020). Exploring soft concepts with hard corpus-analytic methods. In N. Reiter, A. Pichler,
& J. Kuhn (Eds.), Reflektierte Algorithmische Textanalyse: Interdisziplinäre Arbeiten in der Creta-Werkstatt (pp.
169–190). De Gruyter.
Krippendorff, K. (2018). Content analysis: An introduction to its methodology (3rd ed.). Sage.
Küçük, D., & Can, F. (2021). Stance detection: a survey. ACM Computing Surveys, 53(1), 1–37. https://doi.org/10.1145/
3369026
Lahoti, P., Garimella, K., & Gionis, A. (2018). Joint non-negative matrix factorization for learning ideological leaning on
twitter. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 351–359.
Lakoff, G., & Johnson, M. (2008). Metaphors we live by. University of Chicago press.
Liu, B., & Zhang, L. (2012). A survey of opinion mining and sentiment analysis. In C. C. Aggarwal & Z. ChengXiang
(Eds.), Mining text data (pp. 415–463). Springer.
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-ks. The
Journal of Finance, 66(1), 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x
Lühiste, M., & Banducci, S. (2016). Invisible women? Comparing candidates’ news coverage in Europe. Politics &
Gender, 12(2), 223–253. https://doi.org/10.1017/S1743923X16000106
Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., & Cherry, C. (2016). Semeval-2016 task 6: detecting stance in
tweets. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 31–41.
Muddiman, A., McGregor, S. C., & Stroud, N. J. (2019). (Re)claiming our expertise: Parsing large text corpora with
manually validated and organic dictionaries. Political Communication, 36(2), 214–226. https://doi.org/10.1080/
10584609.2018.1517843
Müller, H. (2004). Arguing, bargaining and all that: Communicative action, rationalist theory and the logic of appro­
priateness in international relations. European Journal of International Relations, 10(3), 395–435. https://doi.org/10.
1177/1354066104045542
Neiger, M., & Tenenboim-Weinblatt, K. (2016). Understanding journalism through a nuanced deconstruction of temporal layers in news narratives. Journal of Communication, 66(1), 139–160. https://doi.org/10.1111/jcom.12202
Oliveira, D. J. S., Bermejo, P. H. D. S., & dos Santos, P. A. (2017). Can social media reveal the preferences of voters?
A comparison between sentiment analysis and traditional opinion polls. Journal of Information Technology & Politics,
14(1), 34–45. https://doi.org/10.1080/19331681.2016.1214094
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques.
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002),
Philadelphia, (pp. 79–86). Association for Computational Linguistics. https://doi.org/10.3115/1118693.1118704
Papacharissi, Z. (2015). Affective publics: Sentiment, technology, and politics. Oxford University Press.
Pieri, E. (2019). Media framing and the threat of global pandemics: The Ebola crisis in UK media and policy response.
Sociological Research Online, 24(1), 73–92. https://doi.org/10.1177/1360780418811966
Rauh, C. (2018). Validating a sentiment dictionary for German political language—a workbench note. Journal of
Information Technology & Politics, 15(4), 319–343. https://doi.org/10.1080/19331681.2018.1485608
Rölle, D. (2017). Mass media and bureaucracy-bashing: Does the media influence public attitudes towards public
administration? Public Policy and Administration, 32(3), 232–258. https://doi.org/10.1177/0952076716658798
Rosenthal, S., Farra, N., & Nakov, P. (2019). SemEval-2017 Task 4: Sentiment analysis in Twitter. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, August 2017 (pp. 502–518). Association for Computational Linguistics. https://doi.org/10.18653/v1/S17-2088
Rudkowsky, E., Haselmayer, M., Wastian, M., Jenny, M., Emrich, Š., & Sedlmair, M. (2018). More than bags of words:
Sentiment analysis with word embeddings. Communication Methods and Measures, 12(2–3), 140–157. https://doi.
org/10.1080/19312458.2018.1455817
Stone, P. J., Dunphy, D. C., & Smith, M. S. (1966). The general inquirer: A computer approach to content analysis. M.I.T.
Press.
Takens, J., van Atteveldt, W., van Hoof, A., & Kleinnijenhuis, J. (2013). Media logic in election campaign coverage.
European Journal of Communication, 28(3), 277–293. https://doi.org/10.1177/0267323113478522
Tenenboim-Weinblatt, K., Baden, C., Aharoni, T., & Overbeck, M. (2022a). Affective forecasting in elections: A
socio-communicative perspective. Human Communication Research, 48(4), 553–566. https://doi.org/10.1093/hcr/
hqac007
Tenenboim-Weinblatt, K., Baden, C., Aharoni, T., & Overbeck, M. (2022b). Persistent optimism under political
uncertainty. In M. Shamir & G. Rahat (Eds), The elections in Israel, 2019–2021 (pp. 163–189). Routledge. https://
doi.org/10.4324/9781003267911-11
Tenenboim-Weinblatt, K., Baden, C., Aharoni, T., Overbeck, M. & Amit-Danhi, E. R. (2021). PROFECI codebook:
Israeli elections 2019-2021 & US elections 2016/2020. Hebrew University of Jerusalem: Working Paper Series of the
ERC-Funded Project: Mediating the Future: The Social Dynamics of Public Projections (PROFECI), 1–56. http://
profeci.net/PROFECI%20Working%20Paper%2002-2021%20Codebook.pdf
Thompson, G., & Hunston, S. (2000). Evaluation: an introduction. In G. Thompson & S. Hunston (Eds.), Evaluation in
text: Authorial stance and the construction of discourse (pp. 1–28). Oxford University Press.

van Atteveldt, W., Kleinnijenhuis, J., Ruigrok, N., & Schlobach, S. (2008). Good news or bad news? Conducting
sentiment analysis on Dutch text to distinguish between positive and negative relations. Journal of Information
Technology & Politics, 5(1), 73–94. https://doi.org/10.1080/19331680802154145
van Atteveldt, W., Sheafer, T., Shenhav, S. R., & Fogel-Dror, Y. (2017). Clause analysis: Using syntactic information to
automatically extract source, subject, and predicate from texts with an application to the 2008–2009 Gaza War.
Political Analysis, 25(2), 207–222. https://doi.org/10.1017/pan.2016.12
van Atteveldt, W., van der Velden, M. A. C. G., & Boukes, M. (2021). The validity of sentiment analysis: Comparing
manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms. Communication
Methods and Measures, 15(2), 121–140. https://doi.org/10.1080/19312458.2020.1869198
van Spanje, J., & de Vreese, C. (2014). Europhile media and eurosceptic voting: Effects of news media coverage on
eurosceptic voting in the 2009 European parliamentary elections. Political Communication, 31(2), 325–354. https://
doi.org/10.1080/10584609.2013.828137
Voas, D. (2014). Towards a sociology of attitudes. Sociological Research Online, 19(1), 132–144. https://doi.org/10.5153/
sro.3289
Wiedemann, G. (2019). Proportional classification revisited: Automatic content analysis of political manifestos using
active learning. Social Science Computer Review, 37(2), 135–159. https://doi.org/10.1177/0894439318758389
Young, L., & Soroka, S. (2012). Affective news: The automated coding of sentiment in political texts. Political
Communication, 29(2), 205–231. https://doi.org/10.1080/10584609.2012.671234
Zimmermann, B. M., Aebi, N., Kolb, S., Shaw, D., & Elger, B. S. (2019). Content, evaluations and influences in newspaper
coverage of predictive genetic testing: A comparative media content analysis from the United Kingdom and
Switzerland. Public Understanding of Science, 28(3), 256–274. https://doi.org/10.1177/0963662518816014
