Lexicon-Based Methods for Sentiment Analysis
Maite Taboada
Simon Fraser University
Julian Brooke
University of Toronto
Milan Tofiloski
Simon Fraser University
Kimberly Voll
University of British Columbia
Manfred Stede
University of Potsdam
We present a lexicon-based approach to extracting sentiment from text. The Semantic Orientation CALculator (SO-CAL) uses dictionaries of words annotated with their semantic orientation (polarity and strength), and incorporates intensification and negation. SO-CAL is applied to the polarity classification task, the process of assigning a positive or negative label to a text that captures the text's opinion towards its main subject matter. We show that SO-CAL's performance is consistent across domains and on completely unseen data. Additionally, we describe the process of dictionary creation, and our use of Mechanical Turk to check dictionaries for consistency and reliability.
1. Introduction
Semantic orientation (SO) is a measure of subjectivity and opinion in text. It usually
captures an evaluative factor (positive or negative) and potency or strength (degree
to which the word, phrase, sentence, or document in question is positive or negative)
Corresponding author. Department of Linguistics, Simon Fraser University, 8888 University Dr., Burnaby,
B.C. V5A 1S6 Canada. E-mail: [email protected].
Department of Computer Science, University of Toronto, 10 Kings College Road, Room 3302, Toronto,
Ontario M5S 3G4 Canada. E-mail: [email protected].
School of Computing Science, Simon Fraser University, 8888 University Dr., Burnaby, B.C. V5A 1S6
Canada. E-mail: [email protected].
Department of Computer Science, University of British Columbia, 201-2366 Main Mall, Vancouver, B.C.
V6T 1Z4 Canada. E-mail: [email protected].
Department of Linguistics, University of Potsdam, Karl-Liebknecht-Str. 24-25. D-14476 Golm, Germany.
E-mail: [email protected].
Submission received: 14 December 2009; revised submission received: 22 August 2010; accepted for
publication: 28 September 2010.
Computational Linguistics
towards a subject topic, person, or idea (Osgood, Suci, and Tannenbaum 1957). When
used in the analysis of public opinion, such as the automated interpretation of on-line
product reviews, semantic orientation can be extremely helpful in marketing, measures
of popularity and success, and compiling reviews.
The analysis and automatic extraction of semantic orientation can be found under
different umbrella terms: sentiment analysis (Pang and Lee 2008), subjectivity (Lyons
1981; Langacker 1985), opinion mining (Pang and Lee 2008), analysis of stance (Biber
and Finegan 1988; Conrad and Biber 2000), appraisal (Martin and White 2005), point
of view (Wiebe 1994; Scheibman 2002), evidentiality (Chafe and Nichols 1986), and a
few others, without expanding into neighboring disciplines and the study of emotion
(Ketal 1975; Ortony, Clore, and Collins 1988) and affect (Batson, Shaw, and Oleson 1992).
In this article, sentiment analysis refers to the general method to extract subjectivity
and polarity from text (potentially also speech), and semantic orientation refers to the
polarity and strength of words, phrases, or texts. Our concern is primarily with the
semantic orientation of texts, but we extract the sentiment of words and phrases towards
that goal.
There exist two main approaches to the problem of extracting sentiment automatically.1 The lexicon-based approach involves calculating orientation for a document from the semantic orientation of words or phrases in the document (Turney 2002). The text classification approach involves building classifiers from labeled instances of texts or sentences (Pang, Lee, and Vaithyanathan 2002), essentially a supervised classification task. The latter approach could also be described as a statistical or machine-learning approach. We follow the first method, in which we use dictionaries of words annotated with the word's semantic orientation, or polarity.
Dictionaries for lexicon-based approaches can be created manually, as we describe
in this article (see also Stone et al. 1966; Tong 2001), or automatically, using seed words
to expand the list of words (Hatzivassiloglou and McKeown 1997; Turney 2002; Turney
and Littman 2003). Much of the lexicon-based research has focused on using adjectives as indicators of the semantic orientation of text (Hatzivassiloglou and McKeown
1997; Wiebe 2000; Hu and Liu 2004; Taboada, Anthony, and Voll 2006).2 First, a list
of adjectives and corresponding SO values is compiled into a dictionary. Then, for
any given text, all adjectives are extracted and annotated with their SO value, using
the dictionary scores. The SO scores are in turn aggregated into a single score for
the text.
The majority of the statistical text classification research builds Support Vector Machine classifiers, trained on a particular data set using features such as unigrams or bigrams, and with or without part-of-speech labels, although the most successful features seem to be basic unigrams (Pang, Lee, and Vaithyanathan 2002; Salvetti, Reichenbach, and Lewis 2006). Classifiers built using supervised methods reach quite a high accuracy in detecting the polarity of a text (Chaovalit and Zhou 2005; Kennedy and Inkpen 2006; Boiy et al. 2007; Bartlett and Albright 2008). However, although such classifiers perform very well in the domain that they are trained on, their performance drops precipitously (almost to chance) when the same classifier is used in
1 Pang and Lee (2008) provide an excellent recent survey of the opinion mining or sentiment analysis
problem and the approaches used to tackle it.
2 With some exceptions: Turney (2002) uses two-word phrases; Whitelaw, Garg, and Argamon (2005) adjective phrases; and Benamara et al. (2007) adjectives with adverbial modifiers. See also Section 2.1. We should also point out that Turney does not create a static dictionary, but rather scores two-word phrases on the fly.
a different domain (Aue and Gamon [2005]; see also the discussion about domain specificity in Pang and Lee [2008, section 4.4]).3 Consider, for example, an experiment using the Polarity Dataset, a corpus containing 2,000 movie reviews, in which Brooke (2009) extracted the 100 most positive and negative unigram features from an SVM classifier that reached 85.1% accuracy. Many of these features were quite predictable: worst, waste, unfortunately, and mess are among the most negative, whereas memorable, wonderful, laughs, and enjoyed are all highly positive. Other features are domain-specific and somewhat inexplicable: If the writer, director, plot, or script are mentioned, the review is likely to be disfavorable towards the movie, whereas the mention of performances, the ending, or even flaws, indicates a good movie. Closed-class function words appear frequently; for instance, as, yet, with, and both are all extremely positive, whereas since, have, though, and those have negative weight. Names also figure prominently, a problem noted by other researchers (Finn and Kushmerick 2003; Kennedy and Inkpen 2006). Perhaps most telling is the inclusion of unigrams like 2, video, tv, and series in the list of negative words. The polarity of these words actually makes some sense in context: Sequels and movies adapted from video games or TV series do tend to be less well-received than the average movie. However, these real-world facts are not the sort of knowledge a sentiment classifier ought to be learning; within the domain of movie reviews such facts are prejudicial, and in other domains (e.g., video games or TV shows) they are either irrelevant or a source of noise.
Another area where the lexicon-based model might be preferable to a classifier model is in simulating the effect of linguistic context. On reading any document, it becomes apparent that aspects of the local context of a word need to be taken into account in SO assessment, such as negation (e.g., not good) and intensification (e.g., very good), aspects that Polanyi and Zaenen (2006) named contextual valence shifters. Research by Kennedy and Inkpen (2006) concentrated on implementing those insights. They dealt with negation and intensification by creating separate features, namely, the appearance of good might be either good (no modification), not good (negated good), int good (intensified good), or dim good (diminished good). The classifier, however, cannot determine that these four types of good are in any way related, and so in order to train accurately there must be enough examples of all four in the training corpus. Moreover, we show in Section 2.4 that expanding the scope to two-word phrases does not deal with negation adequately, as it is often a long-distance phenomenon. Recent work has begun to address this issue. For instance, Choi and Cardie (2008) present a classifier that treats negation from a compositional point of view by first calculating polarity of terms independently, and then applying inference rules to arrive at a combined polarity score. As we shall see in Section 2, our lexicon-based model handles negation and intensification in a way that generalizes to all words that have a semantic orientation value.
A middle ground exists, however, with semi-supervised approaches to the problem. Read and Carroll (2009), for instance, use semi-supervised methods to build domain-independent polarity classifiers. Read and Carroll built different classifiers and show that they are more robust across domains. Their classifiers are, in effect, dictionary-based, differing only in the methodology used to build the dictionary. Li et al. (2010) use co-training to incorporate labeled and unlabeled examples, also making use of
3 Blitzer, Dredze, and Pereira (2007) do show some success in transferring knowledge across domains, so that the classifier does not have to be re-built entirely from scratch.
a distinction between sentences with a first person subject and with other subjects. Other hybrid methods include those of Andreevskaia and Bergler (2008), Dang, Zhang, and Chen (2010), Dasgupta and Ng (2009), Goldberg and Zhu (2006), or Prabowo and Thelwall (2009). Wan (2009) uses co-training in a method that uses English labeled data and an English classifier to learn a classifier for Chinese.
In our approach, we seek methods that operate at a deep level of analysis, incorporating semantic orientation of individual words and contextual valence shifters, yet do not aim at a full linguistic analysis (one that involves analysis of word senses or argument structure), although further work in that direction is possible.
In this article, starting in Section 2, we describe the Semantic Orientation CALculator (SO-CAL) that we have developed over the last few years. We first extract sentiment-bearing words (including adjectives, verbs, nouns, and adverbs), and use them to calculate semantic orientation, taking into account valence shifters (intensifiers, downtoners, negation, and irrealis markers). We show that this lexicon-based method performs well, and that it is robust across domains and texts. One of the criticisms raised against lexicon-based methods is that the dictionaries are unreliable, as they are either built automatically or hand-ranked by humans (Andreevskaia and Bergler 2008). In Section 3, we present the results of several experiments that show that our dictionaries are robust and reliable, both against other existing dictionaries, and as compared to values assigned by humans (through the use of the Mechanical Turk interface). Section 4 provides comparisons to other work, and Section 5 conclusions.
and Hovy 2004; Esuli and Sebastiani 2006). The dictionary may also be produced
automatically via association, where the score for each new adjective is calculated using
the frequency of the proximity of that adjective with respect to one or more seed words.
Seed words are a small set of words with strong negative or positive associations, such
as excellent or abysmal. In principle, a positive adjective should occur more frequently
alongside the positive seed words, and thus will obtain a positive score, whereas
negative adjectives will occur most often in the vicinity of negative seed words, thus
obtaining a negative score. The association is usually calculated following Turney's method for computing mutual information (Turney 2002; Turney and Littman 2003),
but see also Rao and Ravichandran (2009) and Velikovich et al. (2010) for other methods
using seed words.
Previous versions of SO-CAL (Taboada and Grieve 2004; Taboada, Anthony, and
Voll 2006) relied on an adjective dictionary to predict the overall SO of a document,
using a simple aggregate-and-average method: The individual scores for each adjective
in a document are added together and then divided by the total number of adjectives
in that document.4 As we describe subsequently, the current version of SO-CAL takes
other parts of speech into account, and makes use of more sophisticated methods to
determine the true contribution of each word.
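The earlier aggregate-and-average method can be sketched in a few lines. This is an illustration only, with hypothetical helper names and a toy dictionary; SO-CAL's current scoring, as described below, is more elaborate.

```python
# Sketch of aggregate-and-average scoring: sum the SO values of the
# dictionary adjectives found in a document and divide by their count.
# Dictionary contents here are illustrative, not SO-CAL's.
def aggregate_and_average(adjectives, so_dict):
    scores = [so_dict[a] for a in adjectives if a in so_dict]
    if not scores:
        return 0.0  # no sentiment-bearing adjectives in the document
    return sum(scores) / len(scores)

so_dict = {"excellent": 5, "good": 3, "sleazy": -3}
print(aggregate_and_average(["good", "excellent", "long"], so_dict))  # 4.0
```

Note that adjectives absent from the dictionary (here, long) simply do not contribute to the average.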
It is important to note that how a dictionary is created affects the overall accuracy of subsequent results. In Taboada, Anthony, and Voll (2006) we report on experiments using different search engines and operators in trying to create dictionaries semi-automatically. We found that, although usable, dictionaries created using the Google search engine were unstable. When rerun, the results for each word were subject to change, sometimes by extreme amounts, something that Kilgarriff (2007) also notes, arguing against the use of Google for linguistic research of this type. An alternative would be to use a sufficiently large static corpus, as Turney (2006) does to measure relational similarity across word pairs.
Automatically or semi-automatically created dictionaries have some advantages. We found many novel words in our initial Google-generated dictionary. For instance, unlistenable was tagged accurately as highly negative, an advantage that Velikovich et al. (2010) point out. However, in light of the lack of stability for automatically generated dictionaries, we decided to create manual ones. These were produced by hand-tagging all adjectives found in our development corpus, a 400-text corpus of reviews (see the following) on a scale ranging from −5 for extremely negative to +5 for extremely positive, where 0 indicates a neutral word (excluded from our dictionaries). Positive and negative were decided on the basis of the word's prior polarity, that is, its meaning in most contexts. We do not deal with word sense disambiguation but suspect that using even a simple method to disambiguate would be beneficial. Some word sense ambiguities are addressed by taking part of speech into account. For instance, as we mention in Section 3.4, plot is only negative when it is a verb, but should not be so in a noun dictionary; novel is a positive adjective, but a neutral noun.
To build the system and run our experiments, we use the corpus described in
Taboada and Grieve (2004) and Taboada, Anthony, and Voll (2006), which consists of a
400-text collection of Epinions reviews extracted from eight different categories: books,
cars, computers, cookware, hotels, movies, music, and phones, a corpus we named
Epinions 1. Within each collection, the reviews were split into 25 positive and 25
negative reviews, for a total of 50 in each category, and a grand total of 400 reviews in
the corpus (279,761 words). We determined whether a review was positive or negative through the "recommended" or "not recommended" feature provided by the review's author.
Although the sentences have comparable literal meanings, the plus-marked nouns,
verbs, and adverbs in Example (1a) indicate the positive orientation of the speaker
towards the situation, whereas the minus-marked words in Example (1b) have the
opposite effect. It is the combination of these words in each of the sentences that conveys
the semantic orientation for the entire sentence.5
In order to make use of this additional information, we created separate noun, verb, and adverb dictionaries, hand-ranked using the same +5 to −5 scale as our adjective dictionary. The enhanced dictionaries contain 2,252 adjective entries, 1,142 nouns, 903 verbs, and 745 adverbs.6 The SO-carrying words in these dictionaries were taken from
a variety of sources, the three largest being Epinions 1, the 400-text corpus described
in the previous section; a 100-text subset of the 2,000 movie reviews in the Polarity
Dataset (Pang, Lee, and Vaithyanathan 2002; Pang and Lee 2004, 2005);7 and positive
and negative words from the General Inquirer dictionary (Stone et al. 1966; Stone 1997).8
The sources provide a fairly good range in terms of register: The Epinions and movie
reviews represent informal language, with words such as ass-kicking and nifty; at the
other end of the spectrum, the General Inquirer was clearly built from much more
formal texts, and contributed words such as adroit and jubilant, which may be more
useful in the processing of literary reviews (Taboada, Gillies, and McFetridge 2006;
Taboada et al. 2008) or other more formal texts.
Each of the open-class words was assigned a hand-ranked SO value between −5 and 5 (neutral or zero-value words were excluded) by a native English speaker. The numerical values were chosen to reflect both the prior polarity and the strength of the word, averaged across likely interpretations. The dictionaries were later reviewed by a committee of three other researchers in order to minimize the subjectivity of ranking SO by hand. Examples are shown in Table 1.
One difficulty with nouns and verbs is that they often have both neutral and non-neutral connotations. In the case of inspire (or determination), there is a very positive meaning (Example (2)) as well as a rather neutral meaning (Example (3)).
(2)
(3)
5 Something that Turney (2002) already partially addressed, by extracting two-word phrases.
6 Each dictionary also has associated with it a stop-word list. For instance, the adjective dictionary has a
stop-word list that includes more, much, and many, which are tagged as adjectives by the Brill tagger.
7 Available from www.cs.cornell.edu/People/pabo/movie-review-data/.
8 Available from www.wjh.harvard.edu/inquirer/.
Table 1
Examples of words in the noun and verb dictionaries.

Word                      SO Value
monstrosity               −5
hate (noun and verb)      −4
disgust                   −3
sham                      −3
fabricate                 −2
delay (noun and verb)     −1
determination             1
inspire                   2
inspiration               2
endear                    3
relish (verb)             4
masterpiece               5
Except when one sense was very uncommon, the value chosen reflected an averaging across possible interpretations. In some cases, the verb and related noun have a different SO value. For instance, exaggerate is −1, whereas exaggeration is −2, and the same values are applied to complicate and complication, respectively. We find that grammatical metaphor (Halliday 1985), that is, the use of a noun to refer to an action, adds a more negative connotation to negative words.
All nouns and verbs encountered in the text are lemmatized,9 and the form (singular
or plural, past tense or present tense) is not taken into account in the calculation of SO
value. As with the adjectives, there are more negative nouns and verbs than positive
ones.10
The adverb dictionary was built automatically using our adjective dictionary, by matching adverbs ending in -ly to their potentially corresponding adjective, except for a small selection of words that were added or modified by hand. When SO-CAL encountered a word tagged as an adverb that was not already in its dictionary, it would stem the word and try to match it to an adjective in the main dictionary. This worked quite well for most adverbs, resulting in semantic orientation values that seem appropriate (see examples in Table 2).
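The -ly matching step can be sketched as follows. This is a simplification with a toy adjective dictionary; SO-CAL's stemming and the hand corrections mentioned above are more involved.

```python
# Sketch: derive an adverb's SO value by stripping -ly and looking up
# the resulting adjective in the adjective dictionary (toy values).
ADJ_SO = {"excruciating": -5, "foolish": -2, "hilarious": 4}

def adverb_so(adverb, adj_so=ADJ_SO):
    if not adverb.endswith("ly"):
        return None
    stem = adverb[:-2]                    # "foolishly" -> "foolish"
    if stem in adj_so:
        return adj_so[stem]
    if adverb.endswith("ily"):            # e.g. "happily" -> "happy"
        alt = adverb[:-3] + "y"
        if alt in adj_so:
            return adj_so[alt]
    return None                           # fall back to hand ranking

print(adverb_so("hilariously"))   # 4
```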
In other cases (for example, essentially) there is a mismatch between the meaning (or usage pattern) of the adverb when compared to the adjective it is based on, and the value was manually corrected.
Although the vast majority of the entries are single words, SO-CAL allows for multi-word entries written in a regular-expression-like language; in particular, the verb dictionary contains 152 multi-word expressions (mostly phrasal verbs, e.g., fall apart), and the intensifier dictionary, described subsequently, contains 35 multi-word entries (e.g., a little bit). Multi-word expressions take precedence over single-word expressions; for instance, funny by itself is positive (+2), but if the phrase act funny appears, it is given a negative value (−1).
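This precedence can be sketched with a greedy two-token lookahead. The values for act funny (−1) and funny (+2) come from the text; the phrasal-verb value for fall apart is invented for illustration, and the matcher stands in for SO-CAL's fuller expression language.

```python
# Multi-word entries are matched before single words: a two-token
# window is checked first, and only on failure is the single word used.
MULTI_SO = {("act", "funny"): -1, ("fall", "apart"): -3}   # -3 is invented
SINGLE_SO = {"funny": 2}

def score_tokens(tokens):
    scores, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in MULTI_SO:     # multi-word wins
            scores.append(MULTI_SO[tuple(tokens[i:i + 2])])
            i += 2
        else:
            if tokens[i] in SINGLE_SO:
                scores.append(SINGLE_SO[tokens[i]])
            i += 1
    return scores

print(score_tokens("they act funny".split()))   # [-1], not [2]
```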
9 Lemmatization is a simple process of stripping any endings from words not in the dictionary, according
to their part of speech. After stripping, we perform a new dictionary look-up.
10 The ratio for adjectives is 47:53 positive to negative, and for nouns it is 41:59.
Table 2
Examples from the adverb dictionary.

Word                SO Value
excruciatingly      −5
inexcusably         −3
foolishly           −2
satisfactorily      1
purposefully        2
hilariously         4
Table 3
Percentages for some intensifiers.

Intensifier          Modifier (%)
slightly             −50
somewhat             −30
pretty               −10
really               +15
very                 +25
extraordinarily      +50
(the) most           +100
intensifiers using simple addition and subtraction; that is, if a positive adjective has an SO value of 2, an amplified adjective would have an SO value of 3, and a downtoned adjective an SO value of 1. One problem with this kind of approach is that it does not account for the wide range of intensifiers within the same subcategory. Extraordinarily, for instance, is a much stronger amplifier than rather. Another concern is that the amplification of already loud items should involve a greater overall increase in intensity when compared to more subdued counterparts (compare truly fantastic with truly okay); in short, intensification should also depend on the item being intensified.11
In SO-CAL, intensification is modeled using modifiers, with each intensifying word having a percentage associated with it; amplifiers are positive, whereas downtoners are negative, as shown in Table 3.
For example, if sleazy has an SO value of −3, somewhat sleazy would have an SO value of: −3 × (100% − 30%) = −2.1. If excellent has an SO value of 5, most excellent would have an SO value of: 5 × (100% + 100%) = 10. Intensifiers are applied recursively, starting from the closest to the SO-valued word: If good has an SO value of 3, then really very good has an SO value of (3 × [100% + 25%]) × (100% + 15%) = 4.3.
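The percentage arithmetic can be written out directly, using the values from Table 3; this is a sketch of the calculation only, with floating-point results rounded for display (the exact value for really very good is 4.3125).

```python
# Recursive intensification: each modifier scales the running SO value
# by (100% + its percentage), applied closest-to-head first.
INTENSIFIERS = {"slightly": -0.50, "somewhat": -0.30, "pretty": -0.10,
                "really": 0.15, "very": 0.25,
                "extraordinarily": 0.50, "most": 1.00}

def intensify(so_value, modifiers):
    """Modifiers listed closest-first: really very good -> ["very", "really"]."""
    for m in modifiers:
        so_value *= 1 + INTENSIFIERS[m]
    return so_value

print(round(intensify(-3, ["somewhat"]), 2))        # somewhat sleazy: -2.1
print(intensify(5, ["most"]))                       # most excellent: 10.0
print(round(intensify(3, ["very", "really"]), 2))   # really very good: 4.31
```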
Because our intensifiers are implemented using a percentage scale, they are able to fully capture the variety of intensifying words as well as the SO value of the item being modified. This scale can be applied to other parts of speech, given that adjectives, adverbs, and verbs use the same set of intensifiers, as seen in Example (4), where really modifies an adjective (fantastic), an adverb (well), and a verb (enjoyed).
(4)
Nouns, however, are modified exclusively by adjectives. We are able to take into account some kinds of modification using our main adjective dictionary; there is a small class of adjectives which would not necessarily amplify or downtone correctly if considered in isolation, however, as seen in the following (invented) examples. Here, adjectives such as total do not have a semantic orientation of their own, but, just like adverbial intensifiers, contribute to the interpretation of the word that follows them; total failure is presumably worse than just failure. Thus, we have a separate dictionary for adjectival intensifiers. When an intensifying adjective appears next to
11 Martin and White (2005, page 139) also suggest that the effect is different according to the polarity of the item being intensified. We have not explored that possibility.
Besides adverbs and adjectives, other intensifiers are quantifiers (a great deal of). We also included three other kinds of intensification that are common within our genre: the use of all capital letters, the use of exclamation marks, and the use of discourse connective but to indicate more salient information (e.g., ...but the movie was GREAT!).12
In all, our intensifier dictionary contains 177 entries, some of them multi-word expressions.
2.4 Negation
The obvious approach to negation is simply to reverse the polarity of the lexical item next to a negator, changing good (+3) into not good (−3). This we may refer to as switch negation (Saurí 2008). There are a number of subtleties related to negation that need to be taken into account, however. One is the fact that there are negators, including not, none, nobody, never, and nothing, and other words, such as without or lack (verb and noun), which have an equivalent effect, some of which might occur at a significant distance from the lexical item which they affect; a backwards search is required to find these negators, one that is tailored to the particular part of speech involved. We assume that for adjectives and adverbs the negation is fairly local, though it is sometimes necessary to look past determiners, copulas, and certain verbs, as we see in Example (6).
(6)
Negation search in SO-CAL includes two options: Look backwards until a clause boundary marker is reached;13 or look backwards as long as the words/tags found are in a backward search skip list, with a different list for each part of speech. The first approach is fairly liberal, and allows us to capture the true effects of negation raising (Horn 1989), where the negator for a verb moves up and attaches to the verb in the matrix clause. In the following examples the don't that negates the verb think is actually negating the embedded clause.
(7)
I don't wish to reveal much else about the plot because I don't think it is worth mentioning.
12 The discourse connective but plays a role in linking clauses and sentences in a rhetorical relation (Mann
and Thompson 1988). There are more sophisticated ways of making use of those relations, but we have
not implemented them yet.
13 Clause boundary markers include punctuation and sentential connectives, including some ambiguous
ones such as and and but.
(8) Based on other reviews, I don't think this will be a problem for a typical household environment.
The second approach is more conservative. In Example (7), the search would only go as
far as it, because adjectives (worth), copulas, determiners, and certain basic verbs are on
the list of words to be skipped (allowing negation of adjectives within VPs and NPs),
but pronouns are not. Similarly, verbs allow negation on the other side of to, and nouns
look past adjectives as well as determiners and copulas. This conservative approach
seems to work better, and is what we use in all the experiments in this article.14
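The conservative option can be sketched like this, with hypothetical word-based skip lists; SO-CAL's actual lists are per part of speech and operate over part-of-speech tags as well as words.

```python
# Conservative backward negation search: from a sentiment word, step left
# while tokens are on the skip list; report a negation if a negator is
# the first non-skippable token found.
NEGATORS = {"not", "none", "nobody", "never", "nothing", "without"}
ADJ_SKIP = {"a", "an", "the", "is", "was", "be", "very", "really"}

def negator_before(tokens, index, skip=ADJ_SKIP):
    i = index - 1
    while i >= 0 and tokens[i] in skip:
        i -= 1                      # skip determiners, copulas, etc.
    return i >= 0 and tokens[i] in NEGATORS

tokens = "this is not a very good movie".split()
print(negator_before(tokens, tokens.index("good")))   # True
```

Because pronouns are not on the skip list, a search launched from worth in Example (7) stops at it without finding a negator, as described above.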
Another issue is whether a polarity flip (switch negation) is the best way to quantify negation. Though it seems to work well in certain cases (Choi and Cardie 2008), it fails miserably in others (Liu and Seneff 2009). Consider excellent, a +5 adjective: If we negate it, we get not excellent, which intuitively is a far cry from atrocious, a −5 adjective. In fact, not excellent seems more positive than not good, which would negate to a −3. In order to capture these pragmatic intuitions, we implemented another method of negation, a polarity shift (shift negation). Instead of changing the sign, the SO value is shifted toward the opposite polarity by a fixed amount (in our current implementation, 4). Thus a +2 adjective is negated to a −2, but the negation of a −3 adjective (for instance, sleazy) is only slightly positive, an effect we could call damning with faint praise. Here are a few examples from our corpus.
(9)
In each case, the negation of a strongly positive or negative value reflects a mixed opinion which is correctly captured in the shifted value. Further (invented) examples are presented in Example (10).
(10)
a.
b.
c.
d.
As shown in the last example, it is very difficult to negate a strongly positive word without implying that a less positive one is to some extent true, and thus our negator becomes a downtoner.
A related problem for the polarity flip model, as noted by Kennedy and Inkpen (2006), is that negative polarity items interact with intensifiers in undesirable ways. Not very good, for instance, comes out more negative than not good. Another way to handle this problem while preserving the notion of a polarity flip is to allow the negative item to flip the polarity of both the adjective and the intensifier; in this way, an amplifier becomes a downtoner:
Not good = 3 × −1 = −3
Not very good = 3 × (100% + 25% × −1) × −1 = −2.25
14 Full parsing is also an option, but the speed of the parser could pose problems if the goal is to process
text on-line. Parsing would still produce ambiguities, and may not be able to correctly interpret scope.
Another option is to use parser results to learn the scope (Councill, McDonald, and Velikovich 2010).
Compare with the polarity shift version, which is only marginally negative:
Not good = 3 − 4 = −1
Not very good = 3 × (100% + 25%) − 4 = −0.25
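The two models can be put side by side as a sketch, using the amounts given above: flip multiplies by −1, shift moves the value 4 toward the opposite polarity.

```python
# Switch (flip) negation vs. shift negation, per the worked figures above.
def switch_negate(so):
    return -so                                   # not good: 3 -> -3

def shift_negate(so, amount=4):
    return so - amount if so > 0 else so + amount

print(switch_negate(3))        # -3
print(shift_negate(3))         # not good: -1
print(shift_negate(5))         # not excellent: 1 (far from atrocious, -5)
print(shift_negate(-3))        # not sleazy: 1, "damning with faint praise"
print(shift_negate(3 * 1.25))  # not very good: -0.25
```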
The problems with polarity shift could probably be resolved by fine-tuning SO values and modifiers; the polarity flip model seems fundamentally flawed, however. Polarity shift seems to better reflect the pragmatic reality of negation, and is supported by Horn (1989), who suggests that affirmative and negative sentences are not symmetrical.
One other interesting aspect of the pragmatics of negation is that negative statements tend to be perceived as more marked than their affirmative counterparts, both pragmatically and psychologically (Osgood and Richards 1973; Horn 1989, chapter 3). This markedness is true in terms of linguistic form, with negative forms being marked across languages (Greenberg 1966), and it is also manifested as (token) frequency distribution, with negatives being less frequent (Boucher and Osgood 1969).15 Negation tends to be expressed in euphemistic ways, which makes negative sentiment more difficult to identify in general.
In our treatment of negation, we consider mostly negators, but not negative polarity
items (NPIs), such as any, anything, ever, or at all. In some cases, searching for an NPI
would be more effective than searching for a negator. NPIs occur in negative sentences,
but also in nonveridical contexts (Zwarts 1995; Giannakidou 1998), which also affect
semantic orientation. For instance, any occurs in contexts other than negative sentences,
as shown in Example (11), from Giannakidou (2001, page 99), where in all cases the
presence of any is due to a nonveridical situation. Using NPIs would allow us to reduce
semantic orientation values in such contexts. We address some of these issues through
irrealis blocking, as we explain in the next section.
(11)
a.
b.
c.
d.
Similarly, negation calculation does not include what Choi and Cardie (2008) term
content word negators, words such as eliminate. Most of those are included in the
respective dictionaries (i.e., the verb dictionary for eliminate) with negative polarity.
When they occur in a sentence, aggregation with other sentiment words in the sentence
would probably yield a result similar to the compositional approach of Choi and Cardie
or Moilanen and Pulman (2007).
2.5 Irrealis Blocking
There are a number of markers that indicate that the words appearing in a sentence
might not be reliable for the purposes of sentiment analysis. We refer to these using
the linguistic term irrealis, usually applied in non-factual contexts. English does not
make extensive use of the subjunctive for this purpose, as opposed to other languages,
such as Spanish, which tend to use the subjunctive mood to indicate that what is being
15 Some researchers argue that there is a negative bias in the human representation of experience (negative
events are more salient), and the positive bias found by Boucher and Osgood is the result of euphemisms
and political correctness in language (Jing-Schmidt 2007).
expressed is not factual. However, English has a few other mechanisms to convey
irrealis. Word order, modal verbs, and private-state verbs related to expectation fulfill
that function. The imperative mood also conveys that the action expressed has not
occurred.
Irrealis markers can change the meaning of sentiment-bearing words in very subtle
ways. In some cases, such as Example (12), the right approach is to reverse the SO of
good, which is in the scope of the modal verb would. This interpretation is supported by
the contrast in the but clause. In Example (13), on the other hand, the modal should not
reverse the positive evaluation conveyed by best.
(12)
I thought this movie would be as good as the Grinch, but unfortunately, it wasn't.
(13)
But for adults, this movie could be one of the best of the holiday season.
The approach currently implemented consists of ignoring the semantic orientation of any word in the scope of an irrealis marker (i.e., within the same clause). In
Example (14), the positive value of great is simply ignored.
(14)
Our list of irrealis markers includes modals, conditional markers (if), negative
polarity items like any and anything, certain (mostly intensional) verbs (expect, doubt),
questions, and words enclosed in quotes (which may be factual, but not necessarily
reflective of the author's opinion).
There is good reason to include NPIs as irrealis blockers rather than as full-fledged
negators: NPIs often appear in embedded alternatives which are not generally marked
with question marks and where negation would not be appropriate. In the following
example, any is part of the complement of wonder, which has an implicit alternative
(whether there are going to be any problems with that ... or not).
(15)
There is one case, at least, where it is clear that the SO value of a term should
not be nullified by an irrealis blocker, as in Example (16), where the question mark
currently blocks the negative orientation of amateurish crap. The question is rhetorical
in this case, but we have no way of distinguishing it from a real question. Although not
very common, this kind of off-hand opinion, buried in a question, imperative, or modal
clause, is often quite strong and very reliable. SO-CAL looks for markers of definiteness
within close proximity of SO-carrying words (within the NP, such as the determiner
this), and ignores irrealis blocking if an irrealis marker is found.
(16)
... he can get away with marketing this amateurish crap and still stay on the
bestseller list?
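The blocking strategy just described, including the definiteness override, can be sketched as follows. This is an illustrative simplification, not SO-CAL's actual code: the marker lists, the two-token window, and the function itself are our own assumptions.

```python
# Illustrative sketch of irrealis blocking; marker lists, window size,
# and function are our own simplifications, not SO-CAL's actual code.
IRREALIS_MARKERS = {"would", "could", "should", "might", "if", "any",
                    "anything", "expect", "doubt"}
DEFINITE_DETERMINERS = {"this", "that", "these", "those"}

def clause_so(tokens, so_dict):
    """Sum SO values in a clause, ignoring words in the scope of an
    irrealis marker unless a definiteness marker (e.g., 'this') occurs
    just before the SO-carrying word, as in 'this amateurish crap'."""
    has_irrealis = any(t in IRREALIS_MARKERS for t in tokens)
    total = 0
    for i, tok in enumerate(tokens):
        so = so_dict.get(tok, 0)
        if so == 0:
            continue
        definite = any(t in DEFINITE_DETERMINERS
                       for t in tokens[max(0, i - 2):i])
        if has_irrealis and not definite:
            continue  # blocked: word is in an irrealis context
        total += so
    return total
```

With this sketch, a word like great contributes nothing in the scope of the modal would, while amateurish crap in Example (16) keeps its negative value because of the determiner this.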
Computational Linguistics
different approach, instead supposing that negative expressions, being relatively rare,
are given more cognitive weight when they do appear. Thus we increase the final SO
of any negative expression (after other modifiers have applied) by a fixed amount
(currently 50%). This seems to have essentially the same effect in our experiments, and
is more theoretically satisfying.
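This negative weighting can be stated as a one-line sketch (the function name is ours):

```python
def weight_negative(so, multiplier=1.5):
    """Increase the final SO of a negative expression by a fixed amount
    (50% here), after other modifiers apply; positives are unchanged."""
    return so * multiplier if so < 0 else so
```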
Pang, Lee, and Vaithyanathan (2002) found that their machine-learning classifier
performed better when a binary feature was used indicating the presence of a unigram
in the text, instead of a numerical feature indicating the number of appearances. Counting each word only once does not seem to work equally well for word-counting models.
We have, however, improved overall performance by decreasing the weight of words
that appear more often in the text: The nth appearance of a word in the text will have
only 1/n of its full SO value.17 Consider the following invented example.
(17)
Overall, the film was excellent. The acting was excellent, the plot was excellent,
and the direction was just plain excellent.
Pragmatically, the repetition of excellent suggests that the writer lacks additional substantive commentary, and is simply using a generic positive word. We could also impose
an upper limit on the distance of repetitions, and decrease the weight only when they
appear close to each other. Repetitive weighting does not apply to words that have
been intensified, the rationale being that the purpose of the intensifier is to draw special
attention to them.
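The repetition discount can be sketched as follows; as an assumption of the sketch, intensified words are passed in as a set rather than detected in context:

```python
from collections import defaultdict

def repetition_weighted_so(words, so_dict, intensified=frozenset()):
    """The nth appearance of a word contributes 1/n of its SO value;
    intensified words keep full value. A simplified sketch: intensified
    words are supplied as a set, rather than detected in context."""
    counts = defaultdict(int)
    total = 0.0
    for w in words:
        so = so_dict.get(w, 0)
        if so == 0:
            continue
        if w in intensified:
            total += so          # intensifier draws attention: no discount
            continue
        counts[w] += 1
        total += so / counts[w]  # full value, then 1/2, 1/3, ...
    return total
```

For the four occurrences of excellent in Example (17), a word worth 5 would contribute 5 + 2.5 + 5/3 + 1.25 rather than 20.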
Another reason to tone down words that appear often in a text is that a word that
appears regularly is more likely to have a neutral sense. This is particularly true of
nouns. In one example from our corpus, the words death, turmoil, and war each appear
twice. A single use of any of these words might indicate a comment (e.g., I was bored to
death), but repeated use suggests a descriptive narrative.
2.7 Other Features of SO-CAL
Two other features merit discussion: weighting and multiple cut-offs. First of all, SO-CAL incorporates an option to assign different weights to sentences or portions of a text.
Taboada and Grieve (2004) improved performance of an earlier version of the SO calculator by assigning the most weight at the two-thirds mark of a text, and significantly
less at the beginning. The current version has a user-configurable form of this weighting
system, allowing any span of the text (with the end points represented by fractions of
the entire text) to be given a certain weight. An even more flexible and powerful system
is provided by the XML weighting option. When this option is enabled, XML tag pairs in
the text (e.g., <topic>, </topic>) can be used as a signal to the calculator that any words
appearing between these tags should be multiplied by a certain given weight. This gives
SO-CAL an interface to outside modules. For example, one module could pre-process
the text and tag spans that are believed to be topic sentences, another module could
provide discourse information such as rhetorical relations (Mann and Thompson 1988),
and a third module could label the sentences that seem to be subjective. Armed with
this information, SO-CAL can disregard or de-emphasize parts of the text that are less
relevant to sentiment analysis. This weighting feature is used in Taboada, Brooke, and
17 One of the reviewers points out that this is similar to the use of term frequency (tf-idf) in information
retrieval (Salton and McGill 1983). See also Paltoglou and Thelwall (2010) for a use of information
retrieval techniques in sentiment analysis.
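The XML weighting option described above can be sketched minimally. The function is illustrative, not SO-CAL's actual code, and assumes flat, well-formed tag pairs such as <topic> ... </topic>:

```python
import re

def xml_weighted_tokens(text, weights):
    """Pair each word with a multiplier taken from enclosing XML tag
    pairs, e.g. weights={"topic": 2.0}. A sketch assuming flat,
    well-formed tags, as in '<topic> ... </topic>'."""
    weight, out = 1.0, []
    for tok in re.findall(r"</?\w+>|\w+", text):
        m = re.match(r"<(/?)(\w+)>$", tok)
        if m:
            closing, tag = m.groups()
            if tag in weights:
                weight = 1.0 if closing else weights[tag]
        else:
            out.append((tok, weight))
    return out
```

An outside module that tags topic sentences could then have its spans multiplied by the given weight before aggregation.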
18 One hundred reviews of the Polarity Dataset were used for development, and thus those are excluded
from our testing. The performance difference between using the full 2,000 texts or 1,900 is negligible.
19 The development corpus (Epinions 1) and two annotated versions of it, for rhetorical relations and
Appraisal, are available from the project's Web site: www.sfu.ca/mtaboada/research/nsercproject.html.
in the dictionaries. The performance is constant across review domains, however, and
remains very good in completely new domains, which shows that there was no overfitting of the original set.
Table 4 shows a comparison using the current version of SO-CAL with various
dictionary alternatives. These results were obtained by comparing the output of SO-CAL to the "recommended" or "not recommended" field of the reviews. An output
above zero is considered positive (recommended), and negative if below zero.
The Simple dictionary is a version of our main dictionary that has been simplified to
+2/-2 values, switch negation, and +1/-1 intensification, following Polanyi and Zaenen
(2006). Only-Adj excludes dictionaries other than our main adjective dictionary, and
the One-Word dictionary uses all the dictionaries, but disregards multi-word expressions. Asterisks indicate a statistically significant difference using chi-square tests, with
respect to the full version of SO-CAL, with all features enabled and at default settings.
These results indicate a clear benefit to creating hand-ranked, fine-grained,
multiple-part-of-speech dictionaries for lexicon-based sentiment analysis; the full dictionary outperforms all but the One-Word dictionary to a significant degree (p < 0.05) in
the corpora as a whole. It is important to note that some of the parameters and features
that we have described so far (the fixed number for negative shifting, percentages for
intensifiers, negative weighting, etc.) were fine-tuned in the process of creating the
software, mostly by developing and testing on Epinions 1. Once we were theoretically
and experimentally satisfied that the features were reasonable, we tested the final set of
parameters on the other corpora.
Table 5 shows the performance of SO-CAL with a number of different options, on
all corpora (recall that all but Epinions 1 are completely unseen data). "Neg w" and
"rep w" refer to the use of negative weighting (the SO of negative terms is increased
by 50%) and repetition weighting (the nth appearance of a word in the text has 1/n of
its full SO value). Space considerations preclude a full discussion of the contribution of
each part of speech and sub-feature, but see Brooke (2009) for a full range of tests using
these data. Here the asterisks indicate a statistically significant difference compared to
the preceding set of options.
As we can see in the table, the separate features contribute to performance. Negation and intensification together increase performance significantly. One of the most
important gains comes from negative weighting, with repetition weighting also contributing in some, but not all, of the corpora. Although the difference is small, we
see here that shifted polarity negation does, on average, perform better than switched
polarity negation. We have not presented all the combinations of features, but we know
from other experiments, that, for instance, basic negation is more important than basic intensification.
Table 4
Comparison of performance using different dictionaries.

Dictionary   Epinions 1   Epinions 2   Movie    Camera   Overall
Simple       76.75        76.50        69.79*   78.71    75.11*
Only-Adj     72.25*       74.50        76.63    71.98*   73.93*
One-Word     80.75        80.00        75.68    79.54    78.23
Full         80.25        80.00        76.37    80.16    78.74
Table 5
Performance of SO-CAL using various options.

SO-CAL options   Epinions 1   Epinions 2   Movie    Camera   Overall
(1)              65.50        65.25        68.05    64.70    66.04
(2)              67.75        69.25        67.25    71.50    70.10
(3)              73.47*       67.25*       70.00*   68.35*   71.35*
(4)              71.00        71.25        74.95    71.37    72.66*
(5)              81.50*       78.25*       75.08    78.24*   77.32*
(6)              80.25        80.00        76.37    80.16*   78.74*
(7)              80.00        80.00        75.57    80.04    78.37
*Statistically significant compared to the preceding set of options (Table 4), p < 0.05.
Table 6
Performance across review types and on positive and negative reviews.

             Epinions 1                  Epinions 2
Subcorpus    Pos-F  Neg-F  Accuracy     Pos-F  Neg-F  Accuracy
Books        0.69   0.74   0.72         0.69   0.77   0.74
Cars         0.90   0.89   0.90         0.80   0.75   0.78
Computers    0.94   0.94   0.94         0.90   0.89   0.90
Cookware     0.74   0.58   0.68         0.79   0.76   0.78
Hotels       0.76   0.67   0.72         0.80   0.70   0.76
Movies       0.84   0.84   0.84         0.76   0.79   0.78
Music        0.82   0.82   0.82         0.83   0.81   0.82
Phones       0.81   0.78   0.80         0.85   0.83   0.84
Total        0.81   0.79   0.80         0.81   0.79   0.80
extracted all the sentences that contained subjective positive and negative expressions,
in all levels of intensity (low, medium, high, and extreme). The extracted set contains
663 positive and 1,211 negative sentences.
The data from Mike Thelwall consist of comments posted on MySpace.com. The
annotation is done on a 1 to 5 scale, where 1 indicates no emotion. As a consequence,
we focused on the comments with scores of 4 and 5. Because each comment had both
a positive and negative label, we labeled positive those with a higher positive score
and vice versa for negative labels, and excluded comments with the same score for both
(i.e., neutral). This yielded a total of 83 comments (59 positive, 24 negative).
The data from Alina Andreevskaia consist of individual sentences from both news
and blogs, annotated according to whether they are negative, positive, or neutral. We
used only the negative and positive sentences (788 from news, and 802 from blogs,
equally divided between positive and negative).
The Affective Text data from Rada Mihalcea and Carlo Strapparava was used in the
2007 SemEval task. It contains 1,000 news headlines annotated with a range between
-100 (very negative) and 100 (very positive). We excluded six headlines that had been
labeled as 0 (therefore neutral), yielding 468 positive and 526 negative headlines. In
addition to the full evaluation, Strapparava and Mihalcea (2007) also propose a coarse
evaluation, where headlines with scores -100 to -50 are classified as negative, and
those 50 to 100 as positive. Excluding the headlines in the middle gives us 155 positive
headlines and 255 negative ones.
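The mapping from valence scores to polarity labels described above can be sketched as follows (the function name is ours):

```python
def headline_label(score, coarse=False):
    """Map an Affective Text valence score in [-100, 100] to a polarity
    label (a sketch; the function name is ours). Under the coarse
    scheme, only scores of -50 and below or 50 and above are kept."""
    if coarse:
        if score <= -50:
            return "negative"
        if score >= 50:
            return "positive"
        return None  # excluded from the coarse evaluation
    if score == 0:
        return None  # neutral headlines are excluded
    return "positive" if score > 0 else "negative"
```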
Table 7 shows the results of the evaluation. Included in the table is a baseline for
each data set, assigning polarity to the most frequent class for the data. These data sets
include much smaller spans of text than are found in consumer reviews, with some
sentences or headlines not containing any words from the SO-CAL dictionaries. This
ranged from about 21% of the total in the MySpace comments to 54% in the headlines.21
Two approaches were used in this cross-domain evaluation when SO-CAL encountered
texts for which it found no words in its dictionaries (SO-empty texts). First, the back-off method involves using the most frequent polarity for the corpus (or positive, when
they are equal), and assigning that polarity to all SO-empty texts. This method provides
results that can be directly compared to other results on these data sets, although, like
the baseline, it assumes some knowledge about the polarity balance of the corpus. The
figures in the first section of Table 7 suggest robust performance as compared to a most-frequent-class baseline, including modest improvement over the relevant cross-domain
results of Andreevskaia and Bergler (2008).22 Moilanen, Pulman, and Zhang (2010) also
use the headlines data, and obtain a polarity classification accuracy of 77.94%, below our
results excluding empty.23
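The two evaluation modes can be sketched as follows; as a simplifying assumption, None stands for an SO-empty text:

```python
def evaluate_so_empty(predictions, golds, majority="positive"):
    """Two evaluation modes for SO-empty texts (predicted as None):
    back-off assigns the corpus's most frequent polarity to them;
    'excluding empty' drops them. A sketch of the procedure above."""
    backoff = sum((p if p is not None else majority) == g
                  for p, g in zip(predictions, golds)) / len(golds)
    nonempty = [(p, g) for p, g in zip(predictions, golds)
                if p is not None]
    excluding = (sum(p == g for p, g in nonempty) / len(nonempty)
                 if nonempty else 0.0)
    return backoff, excluding
```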
21 By default, SO-CAL assigns a zero to such texts, which is usually interpreted to mean that the text is
neither positive nor negative. However, in a task where we know a priori that all texts are either positive
or negative, this can be a poor strategy, because we will get all of these empty texts wrong: When there
are a significant number of empty texts, performance can be worse than guessing. Note that the problem
of how to interpret empty texts is not a major issue for the full text reviews where we typically apply
SO-CAL, because there are very few of them; for instance, out of the 2,400 texts in the Camera corpus,
only 4 were assigned a zero by SO-CAL. Guessing the polarity or removing those four texts entirely has
no effect on the accuracy reported in Table 5.
22 Their ensemble classier had 73.3% accuracy in news, but only 70.9% in blogs, and their performance in
the Polarity Dataset was 62.1%, or over 14% lower than ours.
23 Our results are not comparable to those of Thelwall et al. (2010) on the MySpace comments, as they
classify the comments on a 1-5 scale (obtaining average accuracy of 60.6% and 72.8% in positive and
negative comments, respectively), whereas we have a much simpler two-point scale (positive or
negative).
Table 7
Performance of SO-CAL in other domains.

                          MPQA    MySpace   News    Blogs   Headlines   Headlines
                                                                        (coarse)
SO-CAL (back-off)         73.64   81.93     75.31   71.57   62.17       74.63
Baseline
(most frequent class)     64.64   71.08     50.00   50.00   52.92       62.20

SO-CAL
(excluding empty)         79.38   78.69     82.33   77.76   79.83       88.98
Baseline
(most frequent class,
excluding empty)          66.94   69.93     50.00   51.10   59.87       67.37

% SO-empty                28.61   21.12     21.70   25.25   54.00       43.98
The second method used in evaluating SO-CAL on SO-empty texts is to only classify
texts for which it has direct evidence to make a judgment. Thus, we exclude such
SO-empty texts from the evaluation. The second part of Table 7 shows the results of
this evaluation. The results are strikingly similar to the performance we saw on full
review texts, with most attaining a minimum of 75-80% accuracy. Although missing
vocabulary (domain-specic or otherwise) undoubtedly plays a role, the results provide
strong evidence that relative text size is the primary cause of SO-empty texts in these
data sets. When the SO-empty texts are removed, the results are entirely comparable to
those that we saw in the previous section. Although sentence-level polarity detection is
a more difficult task, and not one that SO-CAL was specifically designed for, the system
has performed well on this task, here, and in related work (Murray et al. 2008; Brooke
and Hurst 2009).
3. Validating the Dictionaries
To a certain degree, acceptable performance across a variety of data sets, and, in particular, improved performance when the full granularity of the dictionaries is used (see
Table 5), provides evidence for the validity of SO-CALs dictionary rankings. Recall also
that the individual word ratings provided by a single researcher were reviewed by a
larger committee, mitigating some of the subjectivity involved. Nevertheless, some independent measure of how well the dictionary rankings correspond to the intuitions of
English speakers would be valuable, particularly if we wish to compare our dictionaries
with automatically generated ones.
The most straightforward way to investigate this problem would be to ask one or
more annotators to re-rank our dictionaries, and compute the inter-annotator agreement. However, besides the difculty and time-consuming nature of the task, any
simple metric derived from such a process would provide information that was useful
only in the context of the absolute values of our -5 to +5 scale. For instance, if our re-ranker is often more conservative than our original rankings (ranking most SO 5 words
as 4, SO 4 words as 3, etc.), the absolute agreement might approach zero, but we would
like to be able to claim that the rankings actually show a great deal of consistency given
24 www.mturk.com/.
corpus), including a word in the analysis only if it appeared at least five times in both
corpora. That gave us a collection of 483 commonly occurring adjectives; these, as well
as our intensier dictionary, were the focus of our evaluation. We also investigated a
set of nouns chosen using the same rubric. In most cases, the results were comparable
to those for adjectives; however, with a smaller test set (only 184 words), the data were
generally messier, and so we omit the details here.
The basic premise behind our evaluation technique can be described as follows: The
distributional spread of answers in a simple, three-way decision task should directly
reflect the relative distance of words on an SO (semantic orientation) scale. In particular,
we can validate our dictionaries without forcing Turkers to use our 11-point scale
(including zero), making their task significantly less onerous as well as less subject to
individual bias. We chose to derive the data in two ways: one task where the goal is to
decide whether a word is positive, negative, or neutral; and another where two polar
words are presented for comparison (Is one stronger, or are they the same strength?). In
the former task, we would expect to see more errors, that is, cases where a polar term
in our dictionary is labeled neutral, in words that are in fact more neutral (SO value 1
versus SO value 5). Similarly, in the second task we would expect to see more "equal"
judgments of words which are only 1 SO unit apart than those which are 3 SO units
apart. More formally, we predict that a good ranking of words subjected to this testing
should have the following characteristics:
- In the single-word (neutral/negative/positive) task, the proportion of neutral judgments for a polar word should decrease as the word's absolute SO value increases.
- In the word-pair task, the proportion of "same" judgments should decrease as the difference in SO value between the two words increases.
For the first task, that is, the neutral/negative/positive single word decision task,
we included neutral words from our Epinions corpus which had originally been excluded from our dictionaries. Purely random sampling would have resulted in very
little data from the high SO end of the spectrum, so we randomly selected first by SO
(neutral SO = 0) and then by word, re-sampling from the first step if we had randomly
selected a word that had been used before. In the end, we tested 400 adjectives chosen
using this method. For each word, we solicited six judgments through Mechanical Turk.
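This two-step sampling procedure might be sketched as follows; the function and data layout are our own illustrative assumptions:

```python
import random

def sample_words(words_by_so, n, seed=1):
    """Sample first by SO value, then by word, re-sampling whenever a
    word has already been used: a sketch of the two-step scheme that
    avoids under-sampling the sparse high-SO end of the spectrum.
    Assumes n does not exceed the total number of distinct words."""
    rng = random.Random(seed)
    so_values = list(words_by_so)
    used, sample = set(), []
    while len(sample) < n:
        so = rng.choice(so_values)          # uniform over SO values
        word = rng.choice(words_by_so[so])  # then uniform over words
        if word not in used:
            used.add(word)
            sample.append((word, so))
    return sample
```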
Preparing data for the word comparison task was slightly more involved, because
we did not want to remove words from consideration just because they had been used
once. Note that we first segregated the words by polarity: We compared positive words
with positive words and negative words with negative words. This first test yielded a
nice wide range of comparisons by randomly selecting first by SO and then by word, as
well as by lowering the probability of picking high (absolute) SO words, and discounting words which had been used recently (in early attempts we saw high absolute SO
words like great and terrible appearing over and over again, sometimes in consecutive
queries). Though the odds of this occurring were low, we explicitly disallowed duplicate
pairs. Once we settled on a method, we created 500 pairs of positive adjectives and
500 pairs of negative adjectives. Again, for each pair we solicited six judgments.
In addition to our standard pair comparison data sets, we also created, using
the same method, four data sets which compared negated words to words of opposite
polarity (i.e., not bad and good). The primary goal here was not to evaluate the ranking
of the words, but rather to see how well our two models of negation (switch and shift)
correspond to human intuition across a wide variety of cases. To do so, we assume that
our dictionary is generally correct, and then use the SO values after negation as input.
Finally, we wanted to evaluate the intensifier dictionary, again using pair-wise
comparisons of strength. To this end, we selected 500 pairs of adverbial modifiers (e.g.,
very). Similar to the main dictionary pairs, we randomly selected first by modifier value,
and then by word, discounting a pair if one of the words had appeared in the 10 most
recently selected pairs. These words were presented with a uniform adjective pairing
(likeable), to assist the Turkers in interpreting them.
3.2 Evaluation
Figure 1 shows the distribution of responses by SO value in the single-word identification task. The graph is very close to what we predicted. Neutral judgments peak at
0 SO, but are also present for those SO values in the neighborhood of 0, decreasing as we
increase our SO distance from the original. The effect is not quite linear, which might
reflect either on our scale (it is not as linear as we presume) or, more likely, the fact
that the distance between 0 and 1/-1 is simply a much more relevant distance for the
purposes of the task; unlike words that differ in strength, the difference between 0 and
1 is theoretically a difference in kind, between a word that has a positive or negative
connotation, and one that is purely descriptive. Another limitation of this method, with
respect to confirming our dictionary rankings, is the fact that it does not illuminate the
edges of the spectrum, as the distributions hit their maximums before the 5/-5 extreme.
Because we asked six Turkers to provide responses for each word, we can also
calculate average percentage of pairwise agreement (the number of pairs of Turkers
who agreed, divided by the total number of possible pairings), which for this task was
67.7%, well above chance but also far from perfect agreement. Note that we are not
trying to establish reliability in the traditional sense. Our method depends on a certain
Figure 1
Distribution of responses by SO value, single-word task.
amount of disagreement; if it were simply the case that some -1 words were judged
by all rankers as neutral (or positive), the best explanation for that fact would be errors
in individual SO values. If Turkers, however, generally agree about SO 5/-5 words,
but generally disagree about SO 1/-1 words, this unreliability actually reflects the
SO scale. This is indeed the pattern we see in the data: Average pairwise agreement is
60.1% for 1/-1 words, but 98.2% for 5/-5 (see Andreevskaia and Bergler [2006] for
similar results in a study of inter-annotator agreement in adjective dictionaries).25
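Average pairwise agreement, as defined above (pairs of Turkers who agreed, divided by all possible pairs), can be computed as:

```python
from itertools import combinations

def pairwise_agreement(judgments):
    """Average percentage of pairwise agreement: the number of pairs of
    raters who agreed, divided by the total number of possible pairs."""
    pairs = list(combinations(judgments, 2))
    return sum(a == b for a, b in pairs) / len(pairs)
```

The figures in footnote 25 follow directly: one dissenter among six raters gives 10/15 (66.7%), and two dissenters with the same answer give 7/15 (46.7%).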
Interestingly, although we expected relatively equal numbers of positive and negative judgments at SO = 0, that was not the result. Instead, words with SO = 0 were
sometimes interpreted as positive, but almost never interpreted as negative. This is
most likely attributable to the default status of positive, and the marked character of
negative expression (Boucher and Osgood 1969; Horn 1989; Jing-Schmidt 2007); neutral
description might be taken as being vaguely positive, but it would not be mistaken for
negative expression.26
For the word-pair task, we categorize the distribution data by the difference in SO
between the two words, putting negative and positive words in separate tables. For
instance, a -4 difference with negative adjectives means the result of comparing a -1
word to a -5 word, and a +2 difference corresponds to comparing a -5 with a
-3, a -4 with a -2, or a -3 with a -1. (We always compare words with the same sign,
i.e., negative to negative.) In early testing, we found that Turkers almost completely
ignored the "same" category, and so we took steps (changing the instructions and the
order of presentation) to try to counteract this effect. Still, the "same" designation was
underused. There are a number of possible explanations for this, all of which probably
have some merit. One is that our scale is far too coarse-grained, that we are collapsing
distinguishable words into a single classication. The trade-off here is with ease of
ranking; if we provided, for instance, 20 SO values instead of 10, it would be more
difficult to provide confident rankings, and would probably yield little in the way of
tangible benefits.27 Another potential confounding factor is that words within the same
SO category often vary considerably on some other dimension, and it is not natural to
think of them as being equivalent. For instance, we judged savory, lush, and jolly to be
equally positive, but they are applied to very different kinds of things, and so are not
easily compared. And, of course, even assuming our 10-point scale, there are words in
our dictionary that do not belong together in the same category; our focus here is on
the big picture, but we can use this data to identify words which are problematic and
improve the dictionary in the next iteration.
The results in Figures 2 and 3 for the adjective word pair task are otherwise very
encouraging. Unlike the single word task, we see a clear linear pattern that covers
the entire SO spectrum (though, again, there is noise). At SO value difference = 0,
"same" reaches a maximum, and positive and negative judgments are almost evenly
distributed. The average pairwise agreement on this task was somewhat lower, 60.0%
25 Note also that another drawback with pairwise agreement is that agreement does not change linearly
with respect to the number of dissenters. For example, in the six-rater task, a single disagreement drops
agreement from 100% to 66.7%; a second disagreement drops the score to 40% if it differs from the first
dissenting answer, or 46.7% if it is the same.
26 This is the opposite result from the impressions reported by Cabral and Hortacsu (2010), where, in an
evaluation of comments for eBay sellers, neutral comments were perceived as close to negative.
27 Other evidence that suggests making our scale more fine-grained is unlikely to help: When two words
were difficult to distinguish, we often saw three different answers across the six Turkers. For example, for
-3 SO words fat and ignorant, three Turkers judged them the same, two judged ignorant as stronger, and
one judged fat as stronger.
Figure 2
Distribution of responses by SO difference for positive adjectives, word-pair task.
and 63.7% for Figure 2 and Figure 3, respectively. This is not surprising, because the
vast majority of the data come from the difficult-to-judge range between +2 and -2.
Outside of this range, agreement was much higher: Both experiments showed roughly
50% agreement when the SO value of the two words was the same, and an increase of
approximately 10% for each point difference in SO value.
Figure 4 shows the results for adverbial intensifiers. Pairwise agreement here was
higher, at 68.4%. The basic trends are visible; there is, however, a lot of noise throughout.
This is the drawback of having a relatively fine-grained scale, and in retrospect we
perhaps should have followed the model for adjectives and split our words further
into downplayers and amplifiers. The other reason for fluctuations, particularly at the
extremes, was our inclusion of comparative and pragmatic intensifiers like more, less, the
most, barely, hardly, almost, and not only, which, unlike regular scalar intensifiers (very,
immensely), are very difficult to interpret outside of a discourse context, and are not
easily compared.
For the negation evaluation task, we can state directly the result of comparing a
positive word with a negated negative word: Outside of colloquial expressions such as
not bad, it is nearly impossible to express any positive force by negating a negative; the
percentage of negated negatives that were ranked higher than positives was about 5%
Figure 3
Distribution of responses by SO difference for negative adjectives, word-pair task.
Figure 4
Distribution of responses by modifier value difference for adverbial intensifiers.
Table 8
Distribution percentages for the negative/negated positive SO comparison task.

Negated          Negative word SO
positive     -1            -2            -3            -4            -5
word SO    pos neg neu   pos neg neu   pos neg neu   pos neg neu   pos neg neu
+1          50  19  32    47  32  21    39  26  35    36  36  29    25  45  30
+2          16  54  30    11  66  23    17  67  17    31  45  23    25  58  17
+3           8  78  14     7  67  27    10  84   5    20  68  12    17  61  22
+4           4  82  14     8  72  20     3  95   3    12  82   5     0  95   5
+5           0  95   5     4  86  11     0  90  10     0  93   7     6  72  22
(a result that was duplicated for noun negation), concentrated mostly on SO 1 positive
words (which, as we have seen, are sometimes viewed as neutral). This result is not
predicted by either of our models of negation (switch and shift), but it may be somewhat
irrelevant because negated negatives, being essentially a double negative, are fairly rare.
The main use of negation, we have found, is to negate a positive word.
Table 8 shows the distribution percentages for the negative/negated positive SO
comparison task. Here, "pos" refers to the percentage of people who rated the negated
positive word as being stronger, "neg" refers to the percentage of people who rated
the negative word as being stronger, and "neu" refers to a judgment of "same". Pairwise
agreement across raters on this task was only 51.2%, suggesting that the comparisons
involving negatives are the most difcult of our tasks.28
As the SO value of the negative word increases, we of course expect that it is judged
stronger, a pattern visible from left to right in Table 8. The more interesting direction
is from top to bottom: If the switch model is correct, we expect increasing judgments
in favor of the (negated) positive word, but if the shift model is correct, we would
see the opposite. The results in Table 8 are not conclusive. There are aspects of the
28 We in fact saw even lower agreement than this after our initial data collection. We investigated the low
agreement, and attributed it to a single Turker (identied by his/her ID) who exhibited below-chance
agreement with other Turkers. A visual inspection of the data also indicated that this Turker, who
provided more responses than any other, either did not understand the task, or was deliberately
sabotaging our results. We removed this Turkers data, and solicited a new set of responses.
table consistent with our shift model, for instance a general decrease in pos judgments
between pos SO 3 and 5 for lower neg SO. However, there are a number of discrepancies.
For instance, the shift model would predict that a negated +1 word is stronger than a
-2 word (+1 becomes -3 under negation), which is often not the case. Note also that
the shift trend seems somewhat reversed for higher negative SOs. In general, the shift
model accounts for 45.2% of Mechanical Turk (MT) judgments (see our definition of MT
correspondence, in the following section), whereas the switch model accounts for 33.4%.
Negation is clearly a more complicated phenomenon than either of these simple models
can entirely represent, although shifting does a somewhat better job of capturing the
general trends.
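The two negation models under comparison can be sketched as follows (a minimal illustration; the function names are our own, and the shift amount of 4 follows the example above, in which a negated +1 word becomes −3):

```python
def negate_switch(so):
    """Switch model: negation simply flips the polarity of the SO value."""
    return -so

def negate_shift(so, amount=4):
    """Shift model: negation moves the SO value a fixed amount
    (4 in this sketch) toward the opposite polarity."""
    return so - amount if so > 0 else so + amount

# A mildly positive word (+1) under each model:
print(negate_switch(1))  # -1
print(negate_shift(1))   # -3
```

Under the shift model, negating a strongly positive word thus yields only a mildly negative value, whereas the switch model makes it strongly negative.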
3.3 Dictionary Comparisons
We now turn to using the data that we have collected to evaluate other dictionaries
and scales. We use the same Mechanical Turk judgments as in the previous section,
with six Turkers per word or pair. For simplicity, we look only at the single-word task
and pairwise comparison of negative adjectives. We chose negative words because
they are better distinguished by our automatic classiers. Note that our denition of
negative adjective is tied to our original SO ratings, and has been integrated into the
selection of pairs for Mechanical Turk. At this stage, we hope to have shown that the
adjective dictionary does a sufciently accurate job of distinguishing positive, negative,
and neutral words, and provides a good range of words within those categories, with
which other dictionaries can be tested. Here, we will use the term Mechanical Turk
(MT) correspondence as follows:

MT correspondence = (number of MT judgments that match the dictionary's prediction) / (total number of MT judgments)
For example, if one rater thought A was more positive than B, and the other thought
they were of the same strength, then an SO dictionary which predicts either of these
results would have 50% MT correspondence (on this word), whereas a dictionary where
the SO value of B is greater than A would have 0% MT correspondence. As an absolute
measure, correspondence is somewhat misleading: Because there are disagreements
among Turkers, it is impossible for a dictionary to reach 100% MT correspondence.
For instance, the highest possible MT correspondence in the single-word task is 79%,
and the highest possible MT correspondence for the negative adjective task is 76.8%.
MT correspondence is useful as a relative measure, however, to compare how well the
dictionaries predict MT judgments.
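The measure can be sketched as the fraction of individual Turker judgments that the dictionary's prediction matches (a minimal illustration; the function and data layout are our own):

```python
def mt_correspondence(predictions, judgments):
    """Fraction of Mechanical Turk judgments matching the dictionary's
    prediction for each item.

    predictions: dict mapping item -> predicted label
                 (e.g., 'A', 'B', or 'same' for a pair comparison)
    judgments:   dict mapping item -> list of individual Turker labels
    """
    matches = total = 0
    for item, pred in predictions.items():
        for label in judgments[item]:
            matches += (label == pred)
            total += 1
    return matches / total

# The worked example from the text: one rater thought A was more
# positive, the other thought the two words were of the same strength.
judg = {('A', 'B'): ['A', 'same']}
print(mt_correspondence({('A', 'B'): 'A'}, judg))  # 0.5
print(mt_correspondence({('A', 'B'): 'B'}, judg))  # 0.0
```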
Our first comparison is with the dictionary of adjectives that was derived using
the SO-PMI method (Turney 2002), using Google hit counts (Taboada, Anthony, and
Voll 2006). The SO values for the words tested here vary from −8.2 to 5.74. We have
already noted in Section 2 that using this dictionary instead of our manually-ranked
one has a strongly negative effect on performance. Because the Google dictionary is
continuous, we place the individual SO values into evenly spaced buckets so that we
can graph their distribution. For easy comparison with our dictionary, we present the
results when buckets equivalent to our 11-point SO scale are used. The results for the
single-word task are given in Figure 5.29
29 When bucketing for the single-word task, we used a neutral (zero) bucket that was twice the size of the
other buckets, reflecting the fact that zero is a more significant point on the scale in this context.
Figure 5
Distribution of responses by adjective SO value for Google PMI dictionary, single-word task.
Figure 6
Distribution of responses by adjective SO value for Google PMI dictionary, negative word-pair
task.
30 Distinguishing neutral and polar terms, sentences, or texts is, in general, a hard problem (Wilson, Wiebe,
and Hwa 2004; Pang and Lee 2005).
nearly 55% is possible if the number of buckets is increased significantly, an effect which
is due at least partially to the fact that the "same" designation is so underused that it is
generally preferable to always guess that one of the adjectives is stronger than the other.
Along with the results in the previous figure, this suggests that this method actually
performs fairly well at distinguishing the strength of negative adjectives; the problem
with automated methods in general seems to be that they have difficulty properly
distinguishing neutral and positive terms.
Our next comparison is with the Subjectivity dictionary of Wilson, Wiebe, and
Hoffmann (2005). Words are rated for polarity (positive or negative) and strength
(weak or strong), meaning that their scale is much more coarse-grained than ours.
The dictionary is derived from both manual and automatic sources. It is fairly comprehensive (over 8,000 entries), so we assume that any word not mentioned in the dictionary is neutral. Figure 7 shows the result for the single-word task.
The curves are comparable to those in Figure 1; the neutral peak is significantly
lower, however, and the positive and negative curves do not reach their maximum. This
is exactly what we would expect if words of varying strength are being collapsed into a
single category. The overall MT correspondence, however, is comparable (71.8%).
The negative adjective pair comparison task (shown in Figure 8) provides further
evidence for this (Strong/Weak means a weak negative word compared with a strong
negative word).
The MT correspondence is only 48.7% in this task. There is a clear preference for the
predicted judgment in weak/strong comparisons, although the distinction is far from
unequivocal, and the overall change in neutrality across the options is minimal. This
may be partially attributed to the fact that the strong/weak designation for this dictionary is defined in terms of whether the word strongly or weakly indicates subjectivity,
not whether the term itself is strong or weak (a subtle distinction). However, the results
suggest that the scale is too coarse to capture the full range of semantic orientation.
Another publicly available corpus is SentiWordNet (Esuli and Sebastiani 2006;
Baccianella, Esuli, and Sebastiani 2010), an extension of WordNet (Fellbaum 1998)
where each synset is annotated with labels indicating how objective, positive, and
negative the terms in the synset are. We use the average across senses for each word
given in version 3.0 (see discussion in the next section). Figure 9 gives the result for the
Figure 7
Distribution of responses by adjective SO value for Subjectivity dictionary, single-word task.
Figure 8
Distribution of responses by adjective SO value for Subjectivity dictionary, negative word-pair
task.
Figure 9
Distribution of responses by adjective SO value for SentiWordNet, single-word task.
31 Although the values in SentiWordNet itself are calculated automatically, they are based on the
knowledge from WordNet. This is why we believe it is not a fully automatic dictionary. In addition,
version 3.0 includes the possibility of user feedback.
Figure 10
Distribution of responses by adjective SO value for SentiWordNet, negative word-pair task.
by hand is not necessarily a subjective task. Reflecting on our experience, we can say
that the manually created dictionaries are superior for two main reasons. First of all,
we tended to exclude words with ambiguous meaning, or that convey sentiment only
in some contexts, but not in most. Secondly, judicious restraint is necessary when
expanding the dictionaries. We found that adding more words to the dictionaries did
not always help performance, because new words added noise. The type of noise that
we refer to is that deriving from problems with word sense disambiguation, part-of-speech tagging, or simply the strength of the word itself. Words with stronger positive or
negative connotations tend to be more informative. We decided to exclude, of
course, all neutral words (those that would have a value of 0), but also words with only a
mild sentiment orientation, although there are some 1 and −1 words in the dictionaries.
3.4 SO-CAL with Other Dictionaries
The previous section provided comparisons of our dictionary to existing dictionaries.
In this section, we use those lexicons and others to carry out a full comparison using
SO-CAL. For each of the dictionaries discussed below, we substituted it for our set
of manually ranked dictionaries as the source of SO values, and tested accuracy across
different corpora. Accuracy in this case is calculated for the polarity identification task,
that is, deciding whether a text is negative or positive, using the author's own ranking
(recommended or not recommended, or number of stars).
In our comparisons, we tested two options: Full uses all the default SO-CAL features
described in Section 2, including intensification and negation.32 Basic, on the other hand,
is just a sum of the SO value of words in relevant texts, with none of the SO-CAL features
enabled.
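The Basic configuration amounts to word-level SO summation, which can be sketched as follows (a simplified illustration with hypothetical SO values; SO-CAL itself adds intensification, negation, and the other features described in Section 2 on top of this):

```python
def basic_polarity(tokens, so_dict):
    """Basic configuration: sum the SO values of all words found in
    the dictionary and classify the text by the sign of the total."""
    total = sum(so_dict.get(t, 0) for t in tokens)
    return 'positive' if total > 0 else 'negative'

# A toy dictionary and text:
so_dict = {'good': 3, 'terrible': -4, 'fine': 1}
print(basic_polarity('a good camera with a fine lens'.split(), so_dict))
# positive
```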
The first dictionary that we incorporated into SO-CAL was the Google-generated
PMI-based dictionary described in Taboada, Anthony, and Voll (2006), and mentioned
earlier in this article.
32 Except for tests with the Maryland dictionary, where we disabled negative weighting because with
weighting the performance was close to chance (all negative correct, all positive wrong). We believe that
this is because the dictionary contains a disproportionately large number of negative words (likely the
result of expanding existing dictionaries, which also tend to include more negative than positive words).
The Maryland dictionary (Mohammad, Dorr, and Dunne 2009) is a very large
collection of words and phrases (around 70,000) extracted from the Macquarie Thesaurus. The dictionary is not classified according to part of speech, and only contains
information on whether the word is positive or negative. To integrate it into our system,
we assigned all positive words an SO value of 3, and all negative words a value of −3.33
We used the same type of quantification for the General Inquirer (GI; Stone et al.
1966), which also has only positive and negative tags; a word was included in the
dictionary if any of the senses listed in the GI were polar.
The Subjectivity dictionary is the collection of subjective expressions compiled by
Wilson, Wiebe, and Hoffmann (2005), also used in our Mechanical Turk experiments in
the previous section. The Subjectivity dictionary only contains a distinction between
weak and strong opinion words. For our tests, weak words were assigned 2 or −2
values, depending on whether they were positive or negative, and strong words were
assigned 4 or −4.
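The quantification schemes just described amount to a simple mapping onto our numerical scale; the following is a minimal sketch (the function names and simplified label format are our own):

```python
def binary_so(polarity):
    """Maryland / GI style: positive/negative tags only,
    mapped to +3 / -3 on our scale."""
    return 3 if polarity == 'positive' else -3

def subjectivity_so(polarity, strength):
    """Subjectivity dictionary: weak words map to +/-2,
    strong words to +/-4."""
    value = 2 if strength == 'weak' else 4
    return value if polarity == 'positive' else -value

print(binary_so('negative'))                  # -3
print(subjectivity_so('negative', 'strong'))  # -4
```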
The SentiWordNet dictionary (Esuli and Sebastiani 2006; Baccianella, Esuli, and
Sebastiani 2010), also used in the previous section, was built using WordNet (Fellbaum
1998), and retains its synset structure. There are two main versions of SentiWordNet
available, 1.0 and 3.0, and two straightforward methods to calculate an SO value:34
Use the first-sense SO, or average the SO across senses. For the first-sense method,
we calculate the SO value of a word w (of a given POS) based on its first sense f as
follows:
SO(w) = 5 · (Pos(f) − Neg(f))

For the averaging-across-senses method, SO is calculated as

SO(w) = (5 / |senses(w)|) · Σ_{x ∈ senses(w)} (Pos(x) − Neg(x))
that is, the difference between the positive and negative scores provided by SentiWordNet (each in the 0 to 1 range), averaged across all word senses for the desired part of
speech, and multiplied by 5 to provide SO values in the −5 to 5 range. Table 9 contains a
comparison of performance (using simple word averaging, no SO-CAL features) in our
various corpora for each version of SentiWordNet using each method. What is surprising is that the best dictionary using just basic word counts (1.0, first sense) is actually
the worst dictionary when using SO-CAL, and the best dictionary using all SO-CAL
features (3.0, average across senses) is the worst dictionary when features are disabled.
We believe this effect is due almost entirely to the degree of positive bias in the various
dictionaries. The 3.0 average dictionary is the most positively biased, which results in
degraded basic performance (in the Camera corpus, only 20.5% of negative reviews
are correctly identied). When negative weighting is applied, however, it reaches an
almost perfect balance between positive and negative accuracy (70.7% to 71.3% in the
Camera corpus), which optimizes overall performance. We cannot therefore conclude
definitively that any of the SentiWordNet dictionaries is superior for our task; in fact,
33 We chose 3 as a value because it is the middle value between the 2 and 4 that we assigned to weak and
strong words in the Subjectivity Dictionary, as explained subsequently.
34 A third alternative would be to calculate a weighted average using sense frequency information;
SentiWordNet does not include such information, however. Integrating this information from other
sources, though certainly possible, would take us well beyond off-the-shelf usage, and, we believe,
would provide only marginal benefit.
Table 9
Comparison of performance of different dictionaries derived from SentiWordNet.

Ver.  Method   Test   Epinions 1  Epinions 2  Movie  Camera  Overall
1.0   Average  Basic  59.25       62.50       62.89  59.92   61.18
1.0   Average  Full   66.50       66.50       61.89  67.00   65.02
1.0   First    Basic  60.25       62.75       62.00  60.79   61.35
1.0   First    Full   65.00       64.50       62.89  66.67   64.96
3.0   Average  Basic  56.75       60.25       60.10  58.37   59.03
3.0   Average  Full   67.50       71.50       66.21  71.00   68.98
3.0   First    Basic  61.50       60.75       59.58  61.42   60.69
3.0   First    Full   64.50       69.25       65.73  67.37   66.58
it is likely that they are roughly equivalent. We use the top-performing 3.0 average
dictionary here and elsewhere.
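The two mappings above can be sketched as follows (a minimal illustration; lists of (Pos, Neg) score pairs stand in for the actual SentiWordNet lookup, and the example scores are hypothetical):

```python
def swn_first_sense_so(senses):
    """First-sense method: senses is a list of (pos, neg) score pairs,
    ordered as in SentiWordNet; use only the first sense."""
    pos, neg = senses[0]
    return 5 * (pos - neg)

def swn_average_so(senses):
    """Average-across-senses method: mean of (pos - neg) over all
    senses, scaled to the -5 to 5 range."""
    return 5 * sum(p - n for p, n in senses) / len(senses)

# A hypothetical word with two senses:
senses = [(0.75, 0.0), (0.25, 0.5)]
print(swn_first_sense_so(senses))  # 3.75
print(swn_average_so(senses))      # 1.25
```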
Table 10 shows the performance of the various dictionaries when run within
SO-CAL. For all dictionaries and corpora, the performance of the original SO-CAL
dictionary is significantly better (p < 0.05). We have already discussed the Google
dictionary, which contains only adjectives, and whose results are not reliable (see also
Taboada, Anthony, and Voll 2006). The Maryland dictionary suffers from too much
coverage: Most words in a text are identified by this dictionary as containing some
form of subjectivity or opinion, but a cursory examination of the texts reveals that this
is not the case. In some cases, the problem is part-of-speech assignment (the Maryland
dictionary is not classified according to part of speech). For example, the noun plot was
classified as negative when referring to a movie's plot. We imagine this is negative
in the dictionary because of the negative meaning of the verb plot. Similarly, novel as
a noun is classified as positive, although we believe this ought to be the case in the
adjective use only. More problematic is the presence of words such as book, cotton, here,
legal, reading, saying, or year.
Table 10
Comparison of performance using different dictionaries with SO-CAL.

Dictionary           Epinions 1  Epinions 2  Movie  Camera  Overall
Google-Full          62.00       58.50       66.31  61.25   62.98
Google-Basic         53.25       53.50       67.42  51.40   59.25
Maryland-Full-NoW    58.00       63.75       67.42  59.46   62.65
Maryland-Basic       56.50       56.00       62.26  53.79   58.16
GI-Full              68.00       70.50       64.21  72.33   68.02
GI-Basic             62.50       59.00       65.68  63.87   64.23
SentiWordNet-Full    66.50       66.50       61.89  67.00   65.02
SentiWordNet-Basic   59.25       62.50       62.89  59.92   61.47
Subjectivity-Full    72.75       71.75       65.42  77.21   72.04
Subjectivity-Basic   64.75       63.50       68.63  64.83   66.51
SO-CAL-Full          80.25       80.00       76.37  80.16   78.74
SO-CAL-Basic         65.50       65.25       68.05  64.70   66.04
SentiWordNet performs better than either the Google or Maryland dictionaries, but
it is still somewhat low; again, we believe it suffers from the same problem of too much
coverage: Potentially, every word in WordNet will receive a score, and many of those are
not sentiment-bearing words. The General Inquirer lexicon (Stone et al. 1966), the only
other fully manual dictionary considered here, does comparatively well despite
being relatively small. Finally, the Subjectivity dictionary, with the added strong/weak
distinctions, is the closest in performance to our dictionary, though significantly worse
when all features are enabled. The comparison is not completely fair to the Subjectivity
dictionary, as it was built to recognize subjectivity, not polarity.
We must note that the comparison is different for the Maryland dictionary, where
we turned off negative weighting. This resulted in anomalously high performance on
the Movies corpus, despite poor performance elsewhere. In general, there is significantly less positive bias in the movie review domain, most likely due to the use of
negative terms in plot and character description (Taboada, Brooke, and Stede 2009);
thus the negative weighting that is appropriate for other domains is often excessive for
movie reviews.
Comparing the performance of various dictionaries with or without SO-CAL features, two facts are apparent. First, SO-CAL features are generally beneficial no matter
what dictionary is used (in fact, all Overall improvements from Basic to Full in Table 10
are statistically significant); the only exceptions are due to negative weighting in the
movie domain, which for most of the dictionaries causes a drop in performance.35
Second, the benefit provided by SO-CAL seems to be somewhat dependent on the
reliability of the dictionary: In general, automatically derived SO dictionaries derive
less benefit from the use of linguistic features, and the effects are, on the whole, much
less consistent. This is in fact the same conclusion we reached in other work where
we compared automatically translated dictionaries to manually built ones for Spanish
(Brooke, Tofiloski, and Taboada 2009). Interestingly, the Subjectivity dictionary performs
slightly above the SO-CAL dictionary in some data sets when no features are enabled
(which we might attribute to a mixture of basic reliability with respect to polarity and
an appropriate level of coverage), but its lack of granularity seems to blunt the benefit
of SO-CAL features, which were designed to take advantage of a finer-grained SO
scale; this effect is even more pronounced in binary dictionaries like the GI. We
can summarize this result as follows: When using lexical methods, the effectiveness of
any linguistic enhancements will to some extent depend on the characteristics of the
underlying lexicon and, as such, the two cannot be considered in isolation.
4. Other Related Work
The SO-CAL improvements described in this article have been directly inspired by the
work of Polanyi and Zaenen (2006), who proposed that valence shifters change the
base value of a word. We have implemented their idea in the form of intensifiers and
downtoners, adding a treatment of negation that does not involve switching polarity,
but instead shifting the value of a word when in the scope of a negator.
The bulk of the work in sentiment analysis has focused on classication at either
the sentence level, for example, the subjectivity/polarity detection of Wiebe and Riloff
(2005), or alternatively at the level of the entire text. With regard to the latter, two major
35 When negative weighting is excluded (for example, the results for the Maryland dictionary in Table 10),
SO-CAL features have a positive effect on performance in the movie domain.
the sentence level, exploring the types of syntactic patterns that indicate subjectivity
and sentiment is also a possibility (Greene and Resnik 2009). Syntactic patterns can also
be used to distinguish different types of opinion and appraisal (Bednarek 2009).
Our current work focuses on developing discourse parsing methods, both general
and specific to the review genre. At the same time, we will investigate different aggregation strategies for the different types of relations in the text (see also Asher, Benamara,
and Mathieu [2008, 2009] for preliminary work in this area), and build on existing
discourse parsing systems and proposals (Schilder 2002; Soricut and Marcu 2003; Subba
and Di Eugenio 2009).
The main conclusion of our work is that lexicon-based methods for sentiment analysis are robust, result in good cross-domain performance, and can be easily enhanced
with multiple sources of knowledge (Taboada, Brooke, and Stede 2009). SO-CAL has
performed well on blog postings (Murray et al. 2008) and video game reviews (Brooke
and Hurst 2009), without any need for further development or training.
In related work, we have also shown that creating a new version of SO-CAL for
a new language, Spanish, is as fast as building text classifiers for the new language,
and results in better performance (Brooke 2009; Brooke, Tofiloski, and Taboada 2009).
SO-CAL has also been successfully deployed for the detection of sentence-level polarity
(Brooke and Hurst 2009).
Acknowledgments
This work was supported by grants to
Maite Taboada from the Natural Sciences
and Engineering Research Council of
Canada (Discovery Grant 261104-2008 and a
University Faculty Award), and from the
Social Sciences and Humanities Research
Council of Canada (410-2006-1009). We thank
members of the Sentiment Research Group
at SFU for their feedback, and in particular
Vita Markman and Ping Yang for their help
in ranking our dictionaries. Thanks to Janyce
Wiebe and to Rada Mihalcea and Carlo
Strapparava for making their data public.
Mike Thelwall and Alina Andreevskaia
generously shared their data with us.
The three anonymous reviewers and
Robert Dale provided detailed comments
and suggestions. Finally, our appreciation
to colloquia audiences in Hamburg and
Saarbrücken.
References
Akkaya, Cem, Alexander Conrad, Janyce
Wiebe, and Rada Mihalcea. 2010. Amazon
Mechanical Turk for subjectivity word
sense disambiguation. In Proceedings of the
NAACL HLT 2010 Workshop on Creating
Speech and Language Data with Amazon's
Mechanical Turk, pages 195–203,
Los Angeles, CA.
Andreevskaia, Alina and Sabine Bergler.
2006. Mining WordNet for fuzzy
sentiment: Sentiment tag extraction
parse-and-paraphrase paradigm. In
Proceedings of the 2009 Conference on
Empirical Methods in Natural Language
Processing, pages 161–169, Singapore.
Lyons, John. 1981. Language, Meaning and
Context. Fontana, London.
Mann, William C. and Sandra A. Thompson.
1988. Rhetorical structure theory: Toward a
functional theory of text organization. Text,
8(3):243–281.
Martin, James R. and Peter R. R. White. 2005.
The Language of Evaluation. Palgrave,
New York.
Mellebeek, Bart, Francesc Benavent,
Jens Grivolla, Joan Codina, Marta R.
Costa-jussà, and Rafael Banchs. 2010.
Opinion mining of Spanish customer
comments with non-expert annotations
on Mechanical Turk. In Proceedings of the
NAACL HLT 2010 Workshop on Creating
Speech and Language Data with Amazon's
Mechanical Turk, pages 114–121,
Los Angeles, CA.
Mohammad, Saif, Bonnie Dorr, and Cody
Dunne. 2009. Generating high-coverage
semantic orientation lexicons from overtly
marked words and a thesaurus. In
Proceedings of the Conference on Empirical
Methods in Natural Language Processing
(EMNLP-2009), pages 599–608, Singapore.
Mohammad, Saif and Peter Turney. 2010.
Emotions evoked by common words
and phrases: Using Mechanical Turk to
create an emotion lexicon. In Proceedings
of the NAACL HLT 2010 Workshop on
Computational Approaches to Analysis and
Generation of Emotion in Text, pages 26–34,
Los Angeles, CA.
Moilanen, Karo and Stephen Pulman. 2007.
Sentiment composition. In Proceedings of
Recent Advances in Natural Language
Processing, pages 27–29, Borovets.
Moilanen, Karo, Stephen Pulman, and
Yue Zhang. 2010. Packed feelings and
ordered sentiments: Sentiment parsing
with quasi-compositional polarity
sequencing and compression. In
Proceedings of the 1st Workshop on
Computational Approaches to Subjectivity
and Sentiment Analysis (WASSA 2010),
pages 36–43, Lisbon.
Mullen, Tony and Nigel Collier. 2004.
Sentiment analysis using support vector
machines with diverse information
sources. In Proceedings of the Conference on
Empirical Methods in Natural Language
Processing, pages 412–418, Barcelona.
Murray, Gabriel, Shafiq Joty, Giuseppe
Carenini, and Raymond Ng. 2008. The