
SemEval-2015 Task 10: Sentiment Analysis in Twitter

Sara Rosenthal, Columbia University
Preslav Nakov, Qatar Computing Research Institute
Svetlana Kiritchenko, National Research Council Canada
Saif M. Mohammad, National Research Council Canada
Alan Ritter, The Ohio State University
Veselin Stoyanov, Facebook

Abstract

In this paper, we describe the 2015 iteration of the SemEval shared task on Sentiment Analysis in Twitter. This was the most popular sentiment analysis shared task to date, with more than 40 teams participating in each of the last three years. This year's shared task competition consisted of five sentiment prediction subtasks. Two were reruns from previous years: (A) sentiment expressed by a phrase in the context of a tweet, and (B) overall sentiment of a tweet. We further included three new subtasks asking to predict (C) the sentiment towards a topic in a single tweet, (D) the overall sentiment towards a topic in a set of tweets, and (E) the degree of prior polarity of a phrase.

1 Introduction

Social media such as Weblogs, microblogs, and discussion forums are used daily to express personal thoughts, which allows researchers to gain valuable insight into the opinions of a very large number of individuals, i.e., at a scale that was simply not possible a few years ago. As a result, nowadays, sentiment analysis is commonly used to study the public opinion towards persons, objects, and events. In particular, opinion mining and opinion detection are applied to product reviews (Hu and Liu, 2004), for agreement detection (Hillard et al., 2003), and even for sarcasm identification (González-Ibáñez et al., 2011; Liebrecht et al., 2013).

Early work on detecting sentiment focused on newswire text (Wiebe et al., 2005; Baccianella et al., 2010; Pang et al., 2002; Hu and Liu, 2004). As later research turned towards social media, people realized this presented a number of new challenges. Misspellings, poor grammatical structure, emoticons, acronyms, and slang were common in these new media, and were explored by a number of researchers (Barbosa and Feng, 2010; Bifet et al., 2011; Davidov et al., 2010; Jansen et al., 2009; Kouloumpis et al., 2011; O'Connor et al., 2010; Pak and Paroubek, 2010). Later, specialized shared tasks emerged, e.g., at SemEval (Nakov et al., 2013; Rosenthal et al., 2014), which compared teams against each other in a controlled environment using the same training and testing datasets. These shared tasks had the side effect of fostering the emergence of a number of new resources, which eventually spread well beyond SemEval, e.g., NRC's Hashtag Sentiment lexicon and the Sentiment140 lexicon (Mohammad et al., 2013).1

Below, we discuss the public evaluation done as part of SemEval-2015 Task 10. In its third year, the SemEval task on Sentiment Analysis in Twitter has once again attracted a large number of participants: 41 teams across five subtasks, with most teams participating in more than one subtask.

This year the task included reruns of two legacy subtasks, which asked to detect the sentiment expressed in a tweet or by a particular phrase in a tweet. The task further added three new subtasks. The first two focused on the sentiment towards a given topic in a single tweet or in a set of tweets, respectively. The third new subtask focused on determining the strength of prior association of Twitter terms with positive sentiment; this acts as an intrinsic evaluation of automatic methods that build Twitter-specific sentiment lexicons with real-valued sentiment association scores.

1 http://www.purl.com/net/lexicons

In the remainder of this paper, we first introduce the problem of sentiment polarity classification and our subtasks. We then describe the process of creating the training, development, and testing datasets. We list and briefly describe the participating systems, the results, and the lessons learned. Finally, we compare the task to other related efforts and we point to possible directions for future research.

2 Task Description

Below, we describe the five subtasks of SemEval-2015 Task 10 on Sentiment Analysis in Twitter.

• Subtask A. Contextual Polarity Disambiguation: Given an instance of a word/phrase in the context of a message, determine whether it expresses a positive, a negative or a neutral sentiment in that context.

• Subtask B. Message Polarity Classification: Given a message, determine whether it expresses a positive, a negative, or a neutral/objective sentiment. If both positive and negative sentiment are expressed, the stronger one should be chosen.

• Subtask C. Topic-Based Message Polarity Classification: Given a message and a topic, decide whether the message expresses a positive, a negative, or a neutral sentiment towards the topic. If both positive and negative sentiment are expressed, the stronger one should be chosen.

• Subtask D. Detecting Trend Towards a Topic: Given a set of messages on a given topic from the same period of time, classify the overall sentiment towards the topic in these messages as (a) strongly positive, (b) weakly positive, (c) neutral, (d) weakly negative, or (e) strongly negative.

• Subtask E. Determining Strength of Association of Twitter Terms with Positive Sentiment (Degree of Prior Polarity): Given a word/phrase, propose a score between 0 (lowest) and 1 (highest) that is indicative of the strength of association of that word/phrase with positive sentiment. If a word/phrase is more positive than another one, it should be assigned a relatively higher score.

3 Datasets

In this section, we describe the process of collecting and annotating our datasets of short social media text messages. We focus our discussion on the 2015 datasets; more detail about the 2013 and the 2014 datasets can be found in (Nakov et al., 2013) and (Rosenthal et al., 2014).

3.1 Data Collection

3.1.1 Subtasks A–D

First, we gathered tweets that express sentiment about popular topics. For this purpose, we extracted named entities from millions of tweets, using a Twitter-tuned NER system (Ritter et al., 2011). Our initial training set was collected over a one-year period spanning from January 2012 to January 2013. Each subsequent Twitter test set was collected a few months prior to the corresponding evaluation. We used the public streaming Twitter API to download the tweets.

We then identified popular topics as those named entities that are frequently mentioned in association with a specific date (Ritter et al., 2012). Given this set of automatically identified topics, we gathered tweets from the same time period which mentioned the named entities. The testing messages had different topics from training and spanned later periods.

The collected tweets were greatly skewed towards the neutral class. In order to reduce the class imbalance, we removed messages that contained no sentiment-bearing words, using SentiWordNet as a repository of sentiment words. Any word listed in SentiWordNet 3.0 with at least one sense having a positive or a negative sentiment score greater than 0.3 was considered a sentiment-bearing word.2

For subtasks C and D, we did some manual pruning based on the topics. First, we excluded topics that were incomprehensible, ambiguous (e.g., Barcelona, which is a name of a sports team and also of a place), or were too general (e.g., Paris, which is a name of a big city). Second, we discarded tweets that were just mentioning the topic, but were not really about the topic. Finally, we discarded topics with too few tweets, namely less than 10.

2 Filtering based on an existing lexicon does bias the dataset to some degree; however, note that the text still contains sentiment expressions outside those in the lexicon.
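For illustration, the SentiWordNet-based filter described above can be sketched in a few lines of Python. This is an editorial sketch using NLTK's SentiWordNet interface; the organizers' actual filtering script is not public, so function names and the example tweets are ours.

```python
# Requires: nltk.download('wordnet'); nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

def is_sentiment_bearing(word, threshold=0.3):
    """True if some sense of `word` has a positive or negative score > threshold."""
    for synset in swn.senti_synsets(word):
        if synset.pos_score() > threshold or synset.neg_score() > threshold:
            return True
    return False

def keep_message(tokens):
    """Keep a tweet only if it contains at least one sentiment-bearing word."""
    return any(is_sentiment_bearing(tok.lower()) for tok in tokens)

# Hypothetical examples: a tweet with no sentiment-bearing words would be dropped.
print(keep_message("I love this phone".split()))        # likely True
print(keep_message("The meeting is at noon".split()))   # likely False
```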

Instructions: Subjective words are ones which convey an opinion or sentiment. Given a Twitter message, identify
whether it is objective, positive, negative, or neutral. Then, identify each subjective word or phrase in the context of the
sentence and mark the position of its start and end in the text boxes below. The number above each word indicates its
position. The word/phrase will be generated in the adjacent textbox so that you can confirm that you chose the correct
range. Choose the polarity of the word or phrase by selecting one of the radio buttons: positive, negative, or neutral.
If a sentence is not subjective please select the checkbox indicating that “There are no subjective words/phrases”. If
a tweet is sarcastic, please select the checkbox indicating that “The tweet is sarcastic”. Please read the examples and
invalid responses before beginning if this is your first time answering this hit.

Figure 1: The instructions we gave to the workers on Mechanical Turk, followed by a screenshot.

3.1.2 Subtask E

We selected high-frequency target terms from the Sentiment140 and the Hashtag Sentiment tweet corpora (Kiritchenko et al., 2014). In order to reduce the skewness towards the neutral class, we selected terms from different ranges of automatically determined sentiment values as provided by the corresponding Sentiment140 and Hashtag Sentiment lexicons. The term set comprised regular English words, hashtagged words (e.g., #loveumom), misspelled or creatively spelled words (e.g., parlament or happeeee), abbreviations, shortenings, and slang. Some terms were negated expressions such as no fun. (It is known that negation impacts the sentiment of its scope in complex ways (Zhu et al., 2014).) We annotated these terms for degree of sentiment manually. Further details about the data collection and the annotation process can be found in Section 3.2.2 as well as in (Kiritchenko et al., 2014).

The trial dataset consisted of 200 instances, and no training dataset was provided. Note, however, that the trial data was large enough to be used as a development set, or even as a training set. Moreover, the participants were free to use any additional manually or automatically generated resources when building their systems for subtask E. The test set included 1,315 instances.

3.2 Annotation

Below we describe the data annotation process.

3.2.1 Subtasks A–D

We used Amazon's Mechanical Turk for the annotations of subtasks A–D. Each tweet message was annotated by five Mechanical Turk workers, also known as Turkers. The annotations for subtasks A–D were done concurrently, in a single task. A Turker had to mark all the subjective words/phrases in the tweet message by indicating their start and end positions and to say whether each subjective word/phrase was positive, negative, or neutral (subtask A). He/she also had to indicate the overall polarity of the tweet message in general (subtask B) as well as the overall polarity of the message towards the given target topic (subtasks C and D). The instructions we gave to the Turkers, along with an example, are shown in Figure 1. We further made available to the Turkers several additional examples, which we show in Table 1.

Providing all the required annotations for a given tweet message constituted a Human Intelligence Task, or a HIT. In order to qualify to work on our HITs, a Turker had to have an approval rate greater than 95% and should have completed at least 50 approved HITs.

Authorities are only too aware that Kashgar is 4,000 kilometres (2,500 miles) from Beijing but only a tenth of
the distance from the Pakistani border, and are desperate to ensure instability or militancy does not leak over the
frontiers.
Taiwan-made products stood a good chance of becoming even more competitive thanks to wider access to overseas
markets and lower costs for material imports, he said.
“March appears to be a more reasonable estimate while earlier admission cannot be entirely ruled out,” according
to Chen, also Taiwan’s chief WTO negotiator.
friday evening plans were great, but saturday’s plans didnt go as expected – i went dancing & it was an ok club,
but terribly crowded :-(
WHY THE HELL DO YOU GUYS ALL HAVE MRS. KENNEDY! SHES A FUCKING DOUCHE
AT&T was okay but whenever they do something nice in the name of customer service it seems like a favor, while
T-Mobile makes that a normal everyday thin
obama should be impeached on TREASON charges. Our Nuclear arsenal was TOP Secret. Till HE told our enemies
what we had. #Coward #Traitor
My graduation speech: “I’d like to thanks Google, Wikipedia and my computer!” :D #iThingteens

Table 1: List of example sentences and annotations we provided to the Turkers. All subjective phrases are italicized
and color-coded: positive phrases are in green, negative ones are in red, and neutral ones are in blue.

I would love to watch Vampire Diaries :) and some Heroes! Great combination 9/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination 11/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination 10/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination 13/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination 12/13
I would love to watch Vampire Diaries :) and some Heroes! Great combination

Table 2: Example of a sentence annotated for subjectivity on Mechanical Turk. Words and phrases that were marked
as subjective are in bold italic. The first five rows are annotations provided by Turkers, and the final row shows their
intersection. The last column shows the token-level accuracy for each annotation compared to the intersection.

Corpus                 Pos.    Neg.   Obj./Neu.   Total
Twitter2013-train      5,895   3,131     471      9,497
Twitter2013-dev          648     430      57      1,135
Twitter2013-test       2,734   1,541     160      4,435
SMS2013-test           1,071   1,104     159      2,334
Twitter2014-test       1,807     578      88      2,473
Twitter2014-sarcasm       82      37       5        124
LiveJournal2014-test     660     511     144      1,315
Twitter2015-test       1,899   1,008     190      3,097

Table 3: Dataset statistics for subtask A.

We further discarded the following types of message annotations:

• containing overlapping subjective phrases;
• marked as subjective but having no annotated subjective phrases;
• with every single word marked as subjective;
• with no overall sentiment marked;
• with no topic sentiment marked.

Recall that each tweet message was annotated by five different Turkers. We consolidated these annotations for subtask A using intersection, as shown in the last row of Table 2. A word had to appear in 3/5 of the annotations in order to be considered subjective. It further had to be labeled with a particular polarity (positive, negative, or neutral) by three of the five Turkers in order to receive that polarity label. As the example shows, this effectively shortens the spans of the annotated phrases, often to single words, as it is hard to agree on long phrases.

We also experimented with two alternative methods for combining annotations: (i) by computing the union of the annotations for the sentence, and (ii) by taking the annotations by the Turker who has annotated the highest number of HITs. However, our manual analysis has shown that both alternatives performed worse than using the intersection.
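For illustration, the intersection-based consolidation for subtask A can be sketched as follows. This is an editorial sketch with hypothetical data structures (each annotation is represented as a mapping from token position to polarity), not the organizers' code.

```python
from collections import Counter

def consolidate(annotations, n_required=3):
    """Consolidate five Turker annotations for one tweet (subtask A).

    `annotations` is a list of dicts mapping token position -> polarity
    ('positive', 'negative', or 'neutral') for the tokens each Turker marked
    as subjective. A position is kept if at least `n_required` Turkers marked
    it; it receives a polarity only if at least `n_required` agree on it.
    """
    marked = Counter(pos for ann in annotations for pos in ann)
    gold = {}
    for pos, count in marked.items():
        if count < n_required:
            continue  # not enough Turkers marked this token as subjective
        labels = Counter(ann[pos] for ann in annotations if pos in ann)
        label, votes = labels.most_common(1)[0]
        if votes >= n_required:
            gold[pos] = label
    return gold

# Five hypothetical annotations of the same tweet, keyed by token position.
anns = [{3: 'positive', 4: 'positive'}, {3: 'positive'},
        {3: 'positive', 7: 'negative'}, {3: 'positive', 4: 'positive'},
        {4: 'positive'}]
print(consolidate(anns))  # {3: 'positive', 4: 'positive'}
```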

Corpus                 Pos.    Neg.   Obj./Neu.   Total
Twitter2013-train      3,662   1,466    4,600     9,728
Twitter2013-dev          575     340      739     1,654
Twitter2013-test       1,572     601    1,640     3,813
SMS2013-test             492     394    1,207     2,093
Twitter2014-test         982     202      669     1,853
Twitter2014-sarcasm       33      40       13        86
LiveJournal2014-test     427     304      411     1,142
Twitter2015-test       1,040     365      987     2,392

Table 4: Dataset statistics for subtask B.

Corpus   Topics   Pos.   Neg.   Obj./Neu.   Total
Train        44    142     56       288       530
Test        137    870    260     1,256     2,386

Table 5: Twitter-2015 statistics for subtasks C & D.

For subtasks B and C, we consolidated the tweet-level annotations using majority voting, requiring that the winning label be proposed by at least three of the five Turkers; we discarded all tweets for which a 3/5 majority could not be achieved. As in previous years, we combined the objective and the neutral labels, which Turkers tended to mix up.

We used these consolidated annotations as gold labels for subtasks A, B, C & D. The statistics for all datasets for these subtasks are shown in Tables 3, 4, and 5, respectively. Each dataset is marked with the year of the SemEval edition it was produced for. An annotated example from each source (Twitter, SMS, LiveJournal) is shown in Table 6; examples for sentiment towards a topic can be seen in Table 7.

3.2.2 Subtask E

Subtask E asks systems to propose a numerical score for the positiveness of a given word or phrase. Many studies have shown that people are actually quite bad at assigning such absolute scores: inter-annotator agreement is low, and annotators struggle even to remain self-consistent. In contrast, it is much easier to make relative judgments, e.g., to say whether one word is more positive than another. Moreover, it is possible to derive an absolute score from pairwise judgments, but this requires a much larger number of annotations. Fortunately, there are schemes that allow us to infer more pairwise annotations from fewer judgments.

One such annotation scheme is MaxDiff (Louviere, 1991), which is widely used in market surveys (Almquist and Lee, 2009); it was also used in a previous SemEval task (Jurgens et al., 2012).

In MaxDiff, the annotator is presented with four terms and asked which term is most positive and which is least positive. By answering just these two questions, five out of six pairwise rankings become known. Consider a set in which a judge evaluates A, B, C, and D. If she says that A and D are the most and the least positive, we can infer the following: A > B, A > C, A > D, B > D, C > D. The responses to the MaxDiff questions can then be easily translated into a ranking for all the terms and also into a real-valued score for each term. We crowdsourced the MaxDiff questions on CrowdFlower, recruiting ten annotators per MaxDiff example. Further details can be found in Section 6.1.2 of (Kiritchenko et al., 2014).

3.3 Lower & Upper Bounds

When building a system to solve a task, it is good to know how well we should expect it to perform. One good reference point is agreement between annotators. Unfortunately, as we derive annotations by agreement, we cannot calculate standard statistics such as Kappa. Instead, we decided to measure the agreement between our gold standard annotations (derived by agreement) and the annotations proposed by the best Turker, the worst Turker, and the average Turker (with respect to the gold/consensus annotation for a particular message). Given a HIT, we just calculate the overlaps as shown in the last column in Table 2, and then we calculate the best, the worst, and the average, which are respectively 13/13, 9/13, and 11/13 in the example. Finally, we average these statistics over all HITs that contributed to a given dataset, to produce lower, average, and upper averages for that dataset. The accuracy (with respect to the gold/consensus annotation) for different averages is shown in Table 8. Since the overall polarity of a message is chosen based on majority, the upper bound for subtask B is 100%. These averages give a good indication about how well we can expect the systems to perform. We can see that even if we used the best annotator for each HIT, it would still not be possible to get perfect accuracy, and thus we should also not expect it from a system.
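The lower, average, and upper bounds can be computed with a simple procedure; the sketch below is an editorial illustration (function names are ours), assuming that each Turker's overlap with the consensus is expressed as a token-level accuracy, as in the last column of Table 2.

```python
def hit_overlaps(turker_accuracies):
    """Given per-Turker accuracies against the consensus for one HIT
    (e.g., [9/13, 11/13, 10/13, 13/13, 12/13]), return (worst, average, best)."""
    return (min(turker_accuracies),
            sum(turker_accuracies) / len(turker_accuracies),
            max(turker_accuracies))

def dataset_bounds(hits):
    """Average the worst/average/best overlaps over all HITs in a dataset,
    yielding lower, average, and upper bounds as reported in Table 8."""
    per_hit = [hit_overlaps(h) for h in hits]
    n = len(per_hit)
    lower = sum(w for w, _, _ in per_hit) / n
    avg = sum(a for _, a, _ in per_hit) / n
    upper = sum(b for _, _, b in per_hit) / n
    return lower, avg, upper

# The example HIT from Table 2: per-Turker accuracies 9/13 ... 13/13.
print(hit_overlaps([9/13, 11/13, 10/13, 13/13, 12/13]))  # worst 9/13, avg 11/13, best 13/13
```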

Source            Message                                                                   Message-Level Polarity
Twitter           Why would you [still]- wear shorts when it's this cold?! I [love]+ how    positive
                  Britain see's a bit of sun and they're [like 'OOOH]+ LET'S STRIP!'
SMS               [Sorry]- I think tonight [cannot]- and I [not feeling well]- after my     negative
                  rest.
LiveJournal       [Cool]+ posts , dude ; very [colorful]+ , and [artsy]+ .                  positive
Twitter Sarcasm   [Thanks]+ manager for putting me on the schedule for Sunday               negative

Table 6: Example annotations for each source of messages. The subjective phrases are marked in [. . .], and are
followed by their polarity (subtask A); the message-level polarity is shown in the last column (subtask B).

Topic          Message                                                                 Message-Level   Topic-Level
                                                                                       Polarity        Polarity
leeds united   Saturday without Leeds United is like Sunday dinner it doesn't feel     negative        positive
               normal at all (Ryan)
demi lovato    Who are you tomorrow? Will you make me smile or just bring me           neutral         positive
               sorrow? #HottieOfTheWeek Demi Lovato

Table 7: Example of annotations in Twitter showing differences between topic- and message-level polarity.

Corpus                 Subtask A              Subtask B
                       Low    Avg    Up       Avg
Twitter2013-train      75.1   89.7   97.9     77.6
Twitter2013-dev        66.6   85.3   97.1     86.4
Twitter2013-test       76.8   90.3   98.0     75.9
SMS2013-test           75.9   97.5   89.6     77.5
LiveJournal2014-test   61.7   82.3   94.5     76.2
Twitter2014-test       75.3   88.9   97.5     74.7
Sarcasm2014-test       62.6   83.1   95.6     71.2
Twitter2015-test       73.2   87.6   96.8     75.7

Table 8: Average (over all HITs) overlap of the gold annotations with the worst, average, and best Turker for each HIT, for subtasks A and B.

3.4 Tweets Delivery

Due to restrictions in Twitter's terms of service, we could not deliver the annotated tweets to the participants directly. Instead, we released annotation indexes and labels, a list of corresponding Twitter IDs, and a download script that extracts the corresponding tweets via the Twitter API.3

As a result, different teams had access to different numbers of training tweets depending on when they did the downloading. However, our analysis has shown that this did not have a major impact, and many high-scoring teams had less training data compared to some lower-scoring ones.

3 https://dev.twitter.com

4 Scoring

4.1 Subtasks A–C: Phrase-Level, Message-Level, and Topic-Level Polarity

The participating systems were required to perform a three-way classification, i.e., to assign one of the following three labels: positive, negative, or objective/neutral. We evaluated the systems in terms of a macro-averaged F1 score for predicting positive and negative phrases/messages.

We first computed positive precision, Ppos, as follows: we found the number of phrases/messages that a system correctly predicted to be positive, and we divided that number by the total number of examples it predicted to be positive. To compute positive recall, Rpos, we found the number of phrases/messages correctly predicted to be positive and we divided that number by the total number of positives in the gold standard. We then calculated an F1 score for the positive class as follows: Fpos = 2 Ppos Rpos / (Ppos + Rpos). We carried out similar computations for the negative phrases/messages, Fneg. The overall score was then computed as the average of the F1 scores for the positive and for the negative classes: F = (Fpos + Fneg) / 2.

We provided the participants with a scorer that outputs the overall score F, as well as P, R, and F1 scores for each class (positive, negative, neutral) and for each test set.
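For concreteness, the macro-averaged F1 described above can be computed as follows. This is an editorial sketch of the formulas, not the scorer that was distributed to participants.

```python
def semeval_f1(gold, predicted):
    """Macro-averaged F1 over the positive and negative classes only
    (neutral labels still affect precision and recall through errors)."""
    def f1_for(label):
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        n_pred = sum(1 for p in predicted if p == label)
        n_gold = sum(1 for g in gold if g == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return (f1_for('positive') + f1_for('negative')) / 2

# Hypothetical toy example.
gold = ['positive', 'negative', 'neutral', 'positive']
pred = ['positive', 'neutral', 'neutral', 'negative']
print(round(semeval_f1(gold, pred), 3))  # 0.333
```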

4.2 Subtask D: Overall Polarity Towards a Topic

This subtask asks to predict the overall sentiment of a set of tweets towards a given topic. In other words, to predict the ratio ri of positive (Posi) tweets to the number of positive and negative sentiment tweets in the set of tweets about the i-th topic:

ri = Posi / (Posi + Negi)

Note that neutral tweets do not participate in the above formula; they have only an indirect impact on the calculation, similarly to subtasks A–C.

We use the following two evaluation measures for subtask D:

• AvgDiff (official score): Calculates the absolute difference between the predicted r'i and the gold ri for each i, and then averages this difference over all topics.

• AvgLevelDiff (unofficial score): This calculation is the same as AvgDiff, but with r'i and ri first remapped to five coarse numerical categories: 5 (strongly positive), 4 (weakly positive), 3 (mixed), 2 (weakly negative), and 1 (strongly negative). We define this remapping based on intervals as follows:

  – 5: 0.8 < x ≤ 1.0
  – 4: 0.6 < x ≤ 0.8
  – 3: 0.4 < x ≤ 0.6
  – 2: 0.2 < x ≤ 0.4
  – 1: 0.0 ≤ x ≤ 0.2

4.3 Subtask E: Degree of Prior Polarity

The scores proposed by the participating systems were evaluated by first ranking the terms according to the proposed sentiment score and then comparing this ranked list to a ranked list obtained from aggregating the human ranking annotations. We used Kendall's rank correlation (Kendall's τ) as the official evaluation metric to compare the ranked lists (Kendall, 1938). We also calculated scores for Spearman's rank correlation (Lehmann and D'Abrera, 2006), as an unofficial score.

Team ID                  Affiliation
CIS-positiv              University of Munich
CLaC-SentiPipe           CLaC Labs, Concordia University
DIEGOLab                 Arizona State University
ECNU                     East China Normal University
elirf                    Universitat Politècnica de València
Frisbee                  Frisbee
Gradiant-Analytics       Gradiant
GTI                      AtlantTIC Center, University of Vigo
IHS-RD                   IHS Inc.
iitpsemeval              Indian Institute of Technology, Patna
IIIT-H                   IIIT, Hyderabad
INESC-ID                 IST, INESC-ID
IOA                      Institute of Acoustics, Chinese Academy of Sciences
KLUEless                 FAU Erlangen-Nürnberg
lsislif                  Aix-Marseille University
NLP                      NLP
RGUSentimentMiners123    Robert Gordon University
RoseMerry                The University of Melbourne
Sentibase                IIIT, Hyderabad
SeNTU                    Nanyang Technological University, Singapore
SHELLFBK                 Fondazione Bruno Kessler
sigma2320                Peking University
Splusplus                Beihang University
SWASH                    Swarthmore College
SWATAC                   Swarthmore College
SWATCMW                  Swarthmore College
SWATCS65                 Swarthmore College
Swiss-Chocolate          Zurich University of Applied Sciences
TwitterHawk              University of Massachusetts, Lowell
UDLAP2014                Universidad de las Américas Puebla, Mexico
UIR-PKU                  University of International Relations
UMDuluth-CS8761          University of Minnesota, Duluth
UNIBA                    University of Bari Aldo Moro
unitn                    University of Trento
UPF-taln                 Universitat Pompeu Fabra
WarwickDCS               University of Warwick
Webis                    Bauhaus-Universität Weimar
whu-iss                  International Software School, Wuhan University
Whu-Nlp                  Computer School, Wuhan University
wxiaoac                  Hong Kong University of Science and Technology
ZWJYYC                   Peking University

Table 9: The participating teams and their affiliations.

5 Participants and Results

The task attracted 41 teams: 11 teams participated in subtask A, 40 in subtask B, 7 in subtask C, 6 in subtask D, and 10 in subtask E. The IDs and affiliations of the participating teams are shown in Table 9.

5.1 Subtask A: Phrase-Level Polarity

The results (macro-averaged F1 score) for subtask A are shown in Table 10. The official results on the new Twitter2015-test dataset are shown in the last column, while the first five columns show F1 on the 2013 and on the 2014 progress test datasets:4 Twitter2013-test, SMS2013-test, Twitter2014-test, Twitter2014-sarcasm, and LiveJournal2014-test. There is an index for each result showing the relative rank of that result within the respective column. The participating systems are ranked by their score on the Twitter2015-test dataset, which is the official ranking for subtask A; all remaining rankings are secondary.

4 Note that the 2013 and the 2014 test datasets were made available for development, but it was explicitly forbidden to use them for training.
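To make the subtask D measures (Section 4.2) and the subtask E correlation metrics (Section 4.3) concrete, here is a minimal editorial sketch. It is not the official scorer; the function names are ours, and the rank correlations rely on SciPy's kendalltau and spearmanr.

```python
from scipy.stats import kendalltau, spearmanr

def avg_diff(predicted, gold):
    """Subtask D official score: mean absolute difference between the
    predicted and the gold positive ratio r_i, averaged over topics."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)

def to_level(x):
    """Remap a ratio in [0, 1] to the five coarse categories used by AvgLevelDiff."""
    for level, low in ((5, 0.8), (4, 0.6), (3, 0.4), (2, 0.2)):
        if x > low:
            return level
    return 1

def avg_level_diff(predicted, gold):
    """Subtask D unofficial score: AvgDiff after remapping ratios to levels 1-5."""
    return sum(abs(to_level(p) - to_level(g)) for p, g in zip(predicted, gold)) / len(gold)

def subtask_e_scores(system_scores, gold_scores):
    """Subtask E: Kendall's tau (official) and Spearman's rho (unofficial)."""
    tau, _ = kendalltau(system_scores, gold_scores)
    rho, _ = spearmanr(system_scores, gold_scores)
    return tau, rho

# Hypothetical example with two topics.
print(avg_diff([0.9, 0.4], [0.8, 0.55]))        # 0.125
print(avg_level_diff([0.9, 0.4], [0.8, 0.55]))  # 1.0  (levels 5 vs 4 and 2 vs 3)
```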

                          2013: Progress           2014: Progress                               2015: Official
#   System           Tweet        SMS          Tweet        Tweet-sarcasm   LiveJournal    Tweet
1   unitn            90.10 (1)    88.60 (2)    87.12 (1)    73.65 (5)       84.46 (2)      84.79 (1)
2   KLUEless         88.56 (2)    88.62 (1)    84.99 (3)    75.59 (4)       83.94 (4)      84.51 (2)
3   IOA              83.90 (7)    84.18 (7)    85.37 (2)    71.58 (6)       85.61 (1)      82.76 (3)
4   WarwickDCS       84.08 (6)    84.40 (5)    83.89 (5)    78.03 (2)       83.18 (5)      82.46 (4)
5   TwitterHawk      82.87 (8)    83.64 (8)    84.05 (4)    75.62 (3)       83.97 (3)      82.32 (5)
6   iitpsemeval      85.81 (3)    85.86 (3)    82.73 (6)    65.71 (9)       81.76 (7)      81.31 (6)
7   ECNU             85.28 (4)    84.70 (4)    82.09 (7)    70.96 (7)       82.49 (6)      81.08 (7)
8   Whu-Nlp          79.76 (9)    81.78 (9)    81.69 (8)    63.14 (11)      80.87 (9)      78.84 (8)
9   GTI              84.64 (5)    84.37 (6)    79.48 (9)    81.53 (1)       81.61 (8)      77.27 (9)
10  whu-iss          74.02 (10)   70.26 (11)   72.20 (10)   69.33 (8)       73.57 (10)     71.35 (10)
11  UMDuluth-CS8761  72.71 (11)   71.80 (10)   69.84 (11)   64.53 (10)      71.53 (11)     66.21 (11)
    baseline         38.1         31.5         42.2         39.8            33.4           38.0

Table 10: Results for subtask A: Phrase-Level Polarity. The systems are ordered by their score on the Twitter2015
test dataset; the rank on each individual dataset is given in parentheses.

There were fewer participants this year, probably due to having a new similar subtask: C. Notably, many of the participating teams were newcomers.

We can see that all systems beat the majority class baseline by 25-40 F1 points absolute on all datasets. The winning team unitn (using deep convolutional neural networks) achieved an F1 of 84.79 on Twitter2015-test, followed closely by KLUEless (using logistic regression) with F1=84.51.

Looking at the progress datasets, we can see that unitn was also first on both progress Tweet datasets, and second on SMS and on LiveJournal. KLUEless won SMS and was second on Twitter2013-test. The best result on LiveJournal was achieved by IOA, who were also second on Twitter2014-test and third on the official Twitter2015-test. None of these teams was ranked in the top-3 on Twitter2014-sarcasm, where the best team was GTI, followed by WarwickDCS.

Compared to 2014, there is an improvement on Twitter2014-test from 86.63 in 2014 (NRC-Canada) to 87.12 in 2015 (unitn). The best result on Twitter2013-test of 90.10 (unitn) this year is very close to the best in 2014 (90.14 by NRC-Canada). Similarly, the best result on LiveJournal stays exactly the same, i.e., F1=85.61 (SentiKLUE in 2014 and IOA in 2015). However, there is slight degradation for SMS2013-test from 89.31 (ECNU) in 2014 to 88.62 (KLUEless) in 2015. The results also degraded for Twitter2014-sarcasm from 82.75 (senti.ue) to 81.53 (GTI).

5.2 Subtask B: Message-Level Polarity

The results for subtask B are shown in Table 11. Again, we show results on the five progress test datasets from 2013 and 2014, in addition to those for the official Twitter2015-test dataset.

Subtask B attracted 40 teams, both newcomers and returning, similarly to 2013 and 2014. All managed to beat the baseline, with the exception of one system for Twitter2015-test and one for Twitter2014-test. There is a cluster of four teams at the top: Webis (an ensemble combining four Twitter sentiment classification approaches that participated in previous editions) with an F1 of 64.84, unitn with 64.59, lsislif (logistic regression with special weighting for positives and negatives) with 64.27, and INESC-ID (word embeddings) with 64.17.

The last column in the table shows the results for the 2015 sarcastic tweets. Note that, unlike in 2014, this time they were not collected separately and did not have a special #sarcasm tag; instead, they are a subset of 75 tweets from Twitter2015-test that were flagged as sarcastic by the human annotators. The top system is IOA with an F1 of 65.77, followed by INESC-ID with 64.91, and NLP with 63.62.

Looking at the progress datasets, we can see that the second-ranked unitn is also second on SMS and on Twitter2014-test, and third on Twitter2013-test. INESC-ID in turn is third on Twitter2014-test and also third on Twitter2014-sarcasm. Webis and lsislif were less strong on the progress datasets.

              2013: Progress    2014: Progress                              2015: Official
#  System     Tweet   SMS       Tweet   Tweet-sarcasm   LiveJournal         Tweet   Tweet-sarcasm
1 Webis 68.4910 63.9214 70.867 49.3312 71.6414 64.841 53.5922
2 unitn 72.792 68.372 73.602 55.445 72.4812 64.592 55.0119
3 lsislif 71.344 63.4217 71.545 46.5722 73.0110 64.273 46.0033
4 INESC-ID? 71.973 63.7815 72.523 56.233 69.7822 64.174 64.912
5 Splusplus 72.801 67.165 74.421 42.8631 75.341 63.735 60.997
6 wxiaoac 66.4316 64.0413 68.9611 54.387 73.369 63.006 52.2226
7 IOA 71.325 68.143 71.864 51.489 74.522 62.627 65.771
8 Swiss-Chocolate 68.809 65.566 68.7412 48.2216 73.954 62.618 54.6620
9 CLaC-SentiPipe 70.427 63.0518 70.1610 51.4310 73.596 62.009 58.559
10 TwitterHawk 68.4411 62.1220 70.649 56.024 70.1719 61.9910 61.246
11 SWATCS65 68.2112 65.498 67.2314 37.2339 73.378 61.8911 52.6424
12 UNIBA 61.6629 65.507 65.1125 37.3038 70.0520 61.5512 48.1632
13 KLUEless 70.646 67.664 70.896 45.3626 73.507 61.2013 56.1917
14 NLP 66.9614 61.0525 67.4513 39.8734 66.1231 60.9314 63.623
15 ZWJYYC 69.568 64.7211 70.778 46.3423 71.6015 60.7715 52.4025
16 Gradiant-Analytics 65.2922 61.9721 66.8717 59.111 72.6311 60.6216 56.4516
17 IIIT-H 65.6820 62.2519 67.0416 57.502 69.9121 59.8317 62.755
18 ECNU 65.2523 68.491 66.3720 45.8725 74.403 59.7218 52.6723
19 CIS-positiv 64.8224 65.1410 66.0521 49.2314 71.4716 59.5719 57.7411
20 SWASH 63.0727 56.4934 62.9331 48.4215 69.4324 59.2620 54.3021
21 GTI 64.0325 63.5016 65.6522 55.386 70.5017 58.9521 57.0213
22 iitpsemeval 60.7831 60.5626 65.0926 47.3219 73.705 58.8022 58.1810
23 elirf 57.0532 60.2028 61.1735 45.9824 68.3328 58.5823 43.9134
24 SWATAC 65.8619 61.3024 66.6419 39.4535 68.6727 58.4324 50.6627
25 UIR-PKU? 67.4113 64.6712 67.1815 52.588 70.4418 57.6525 59.438
26 SWATCMW 65.6721 65.439 65.6223 37.4836 69.5223 57.6026 56.6914
27 WarwickDCS 66.5715 61.9222 65.4724 45.0328 68.9825 57.3227 56.5815
28 SeNTU 63.5026 60.5327 66.8518 45.1827 68.7026 57.0628 49.5329
29 DIEGOLab 62.4928 58.6030 63.9928 47.6218 63.7434 56.7229 55.5618
30 Sentibase 61.5630 59.2629 63.2930 47.0720 67.5529 56.6730 62.964
31 Whu-Nlp 65.9718 61.3123 63.9329 46.9321 71.8313 56.3931 22.2540
32 UPF-taln 66.1517 57.8431 65.0527 50.9311 64.5032 55.5932 41.6335
33 RGUSentimentMiners123 56.4134 57.1432 59.4436 44.7229 64.3933 53.7333 48.2131
34 IHS-RD? 55.0635 57.0833 61.3932 37.3237 66.9930 52.6534 36.0237
35 RoseMerry 52.3337 53.0036 61.2734 49.2513 62.5435 51.1835 49.6228
36 Frisbee 49.3738 46.5938 53.9238 42.0732 57.9438 49.1936 48.2630
37 UMDuluth-CS8761 54.1736 50.6437 55.8237 43.7430 60.2337 47.7737 34.4038
38 UDLAP2014 41.9339 39.3539 45.9339 41.0433 50.1139 42.1038 40.5936
39 SHELLFBK 32.1440 26.1440 32.2040 35.5840 34.0640 32.4539 25.7339
40 whu-iss 56.5133 54.2835 61.3133 47.7817 61.9836 24.8040 57.7312
baseline 29.2 19.0 34.6 27.7 27.2 30.3 30.2

Table 11: Results for subtask B: Message-Level Polarity. The systems are ordered by their score on the Twitter2015
test dataset; the rankings on the individual datasets are indicated with a subscript. Systems with late submissions for
the progress test datasets (but with timely submissions for the official 2015 test dataset) are marked with a ? .

Compared to 2014, there is improvement on Twitter2013-test from 72.12 (TeamX) to 72.80 (Splusplus), on Twitter2014-test from 70.96 (TeamX) to 74.42 (Splusplus), on Twitter2014-sarcasm from 58.16 (NRC-Canada) to 59.11 (Gradiant-Analytics), and on LiveJournal from 74.84 (NRC-Canada) to 75.34 (Splusplus), but not on SMS: 70.28 (NRC-Canada) vs. 68.49 (ECNU).

#  System            Tweet        Tweet-sarcasm
1  TwitterHawk       50.51 (1)    31.30 (2)
2  KLUEless          45.48 (2)    39.26 (1)
3  Whu-Nlp           40.70 (3)    23.37 (5)
4  whu-iss           25.62 (4)    28.90 (4)
5  ECNU              25.38 (5)    16.20 (6)
6  WarwickDCS        22.79 (6)    13.57 (7)
7  UMDuluth-CS8761   18.99 (7)    29.91 (3)
   baseline          26.7         26.4

Table 12: Results for Subtask C: Topic-Level Polarity. The systems are ordered by the official 2015 score; the rank on each dataset is given in parentheses.

#  Team              avgDiff   avgLevelDiff
1  KLUEless          0.202     0.810
2  Whu-Nlp           0.210     0.869
3  TwitterHawk       0.214     0.978
4  whu-iss           0.256     1.007
5  ECNU              0.300     1.190
6  UMDuluth-CS8761   0.309     1.314
   baseline          0.277     0.985

Table 13: Results for Subtask D: Trend Towards a Topic. The systems are sorted by the official 2015 score.

5.3 Subtask C: Topic-Level Polarity

The results for subtask C are shown in Table 12. This proved to be a hard subtask, and only three of the seven teams that participated in it managed to improve over a majority vote baseline. These three teams, TwitterHawk (using subtask B data to help with subtask C) with F1=50.51, KLUEless (which ignored the topics as if it was subtask B) with F1=45.48, and Whu-Nlp with F1=40.70, achieved scores that outperform the rest by a sizable margin: 15-25 points absolute more than the fourth team.

Note that, despite the apparent similarity, subtask C is much harder than subtask B: the top-3 teams achieved an F1 of 64-65 for subtask B vs. an F1 of 41-51 for subtask C. This cannot be blamed on the class distribution, as the difference in performance of the majority class baseline is much smaller: 30.3 for B vs. 26.7 for C.

Finally, the last column in the table reports the results for the 75 sarcastic 2015 tweets. The winner here is KLUEless with an F1 of 39.26, followed by TwitterHawk with F1=31.30, and then by UMDuluth-CS8761 with F1=29.91.

5.4 Subtask D: Trend Towards a Topic

The results for subtask D are shown in Table 13. This subtask is closely related to subtask C (in fact, one obvious way to solve D is to solve C and then to calculate the proportion), and thus it has attracted the same teams, except for one. Again, only three of the participating teams managed to improve over the baseline; not surprisingly, those were the same three teams that were in the top-3 for subtask C. However, the ranking is different from that in subtask C, e.g., TwitterHawk has dropped to third position, while KLUEless and Whu-Nlp have each climbed one position up to positions 1 and 2, respectively.

Finally, note that avgDiff and avgLevelDiff yielded the same rankings.

5.5 Subtask E: Degree of Prior Polarity

Ten teams participated in subtask E. Many chose an unsupervised approach and leveraged newly-created and pre-existing sentiment lexicons such as the Hashtag Sentiment Lexicon, the Sentiment140 Lexicon (Kiritchenko et al., 2014), the MPQA Subjectivity Lexicon (Wilson et al., 2005), and SentiWordNet (Baccianella et al., 2010), among others. Several participants further automatically created their own sentiment lexicons from large collections of tweets. Three teams, including the winner INESC-ID, adopted a supervised approach and used word embeddings (supplemented with lexicon features) to train a regression model.

The results are presented in Table 14. The last row shows the performance of a lexicon-based baseline. For this baseline, we chose the two most frequently used existing, publicly available, and automatically generated sentiment lexicons: the Hashtag Sentiment Lexicon and the Sentiment140 Lexicon (Kiritchenko et al., 2014).5 These lexicons have real-valued sentiment scores for most of the terms in the test set. For negated phrases, we use the scores of the corresponding negated entries in the lexicons. For each term, we take its score from the Sentiment140 Lexicon if present; otherwise, we take the term's score from the Hashtag Sentiment Lexicon. For terms not found in any lexicon, we use the score of 0, which indicates a neutral term in these lexicons. The top three teams were able to improve over the baseline.

5 http://www.purl.com/net/lexicons
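The baseline lookup just described amounts to a simple fallback chain. The sketch below is an editorial illustration in which the two lexicons are represented as plain dictionaries mapping a term (including negated entries such as "no fun") to a real-valued score; the lexicon fragments shown are hypothetical, not actual lexicon entries.

```python
def baseline_score(term, sentiment140, hashtag_lexicon):
    """Lexicon-based baseline for subtask E: prefer the Sentiment140 score,
    fall back to the Hashtag Sentiment Lexicon, and default to 0 (neutral).
    Negated phrases are looked up as their negated entries."""
    if term in sentiment140:
        return sentiment140[term]
    if term in hashtag_lexicon:
        return hashtag_lexicon[term]
    return 0.0

# Hypothetical lexicon fragments, for illustration only.
s140 = {'#loveumom': 2.3, 'no fun': -1.7}
hashtag = {'happeeee': 1.1}
for t in ['#loveumom', 'happeeee', 'parlament']:
    print(t, baseline_score(t, s140, hashtag))
```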

Team                   Kendall's τ coefficient   Spearman's ρ coefficient
INESC-ID               0.6251                    0.8172
lsislif                0.6211                    0.8202
ECNU                   0.5907                    0.7861
CLaC-SentiPipe         0.5836                    0.7771
KLUEless               0.5771                    0.7662
UMDuluth-CS8761-10     0.5733                    0.7618
IHS-RD-Belarus         0.5143                    0.7121
sigma2320              0.5132                    0.7086
iitpsemeval            0.4131                    0.5859
RGUSentminers123       0.2537                    0.3728
Baseline               0.5842                    0.7843

Table 14: Results for Subtask E: Degree of Prior Polarity. The systems are ordered by their Kendall's τ score, which was the official score.

6 Discussion

As in the previous two years, almost all systems used supervised learning. Popular machine learning approaches included SVM, maximum entropy, CRFs, and linear regression. In several of the subtasks, the top system used deep neural networks and word embeddings, and some systems benefited from special weighting of the positive and negative examples.

Once again, the most important features were those derived from sentiment lexicons. Other important features included bag-of-words features, hashtags, handling of negation, word shape and punctuation features, elongated words, etc. Moreover, tweet pre-processing and normalization were an important part of the processing pipeline.

Note that this year we did not make a distinction between constrained and unconstrained systems, and participants were free to use any additional data, resources, and tools they wished to.

Overall, the task has attracted a total of 41 teams, which is comparable to previous editions: there were 46 teams in 2014, and 44 in 2013. As in previous years, subtask B was most popular, attracting almost all teams (40 out of 41). However, subtask A attracted just a quarter of the participants (11 out of 41), compared to about half in previous years, most likely due to the introduction of two new, very related subtasks C and D (with 6 and 7 participants, respectively). There was also a fifth subtask (E, with 10 participants), which further contributed to the participant split.

We should further note that our task was part of a larger Sentiment Track, together with three other closely related tasks, which were also interested in sentiment analysis: Task 9 on CLIPEval Implicit Polarity of Events, Task 11 on Sentiment Analysis of Figurative Language in Twitter, and Task 12 on Aspect Based Sentiment Analysis. Another related task was Task 1 on Paraphrase and Semantic Similarity in Twitter, from the Text Similarity and Question Answering track, which also focused on tweets.

7 Conclusion

We have described the five subtasks organized as part of SemEval-2015 Task 10 on Sentiment Analysis in Twitter: detecting sentiment of terms in context (subtask A), classifying the sentiment of an entire tweet, SMS message or blog post (subtask B), predicting polarity towards a topic (subtask C), quantifying polarity towards a topic (subtask D), and proposing real-valued prior sentiment scores for Twitter terms (subtask E). Over 40 teams participated in these subtasks, using various techniques.

We plan a new edition of the task as part of SemEval-2016, where we will focus on sentiment with respect to a topic, but this time on a five-point scale, which is used for human review ratings on popular websites such as Amazon, TripAdvisor, Yelp, etc. From a research perspective, moving to an ordered five-point scale means moving from binary classification to ordinal regression.

We further plan to continue the trend detection subtask, which represents a move from classification to quantification, and is on par with what applications need. They are not interested in the sentiment of a particular tweet but rather in the percentage of tweets that are positive/negative.

Finally, we plan a new subtask on trend detection, but using a five-point scale, which would get us even closer to what business (e.g., marketing studies) and researchers (e.g., in political science or public policy) want nowadays. From a research perspective, this is a problem of ordinal quantification.

Acknowledgements

The authors would like to thank SIGLEX for supporting subtasks A–D, and the National Research Council Canada for funding subtask E.

References

Eric Almquist and Jason Lee. 2009. What do customers really want? Harvard Business Review.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC '10, pages 2200–2204, Valletta, Malta.

Luciano Barbosa and Junlan Feng. 2010. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 36–44, Beijing, China.

Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, and Ricard Gavaldà. 2011. Detecting sentiment change in Twitter streaming data. Journal of Machine Learning Research, Proceedings Track, 17:5–11.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcasm in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL '10, pages 107–116, Uppsala, Sweden.

Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Short Papers, ACL-HLT '11, pages 581–586, Portland, Oregon, USA.

Dustin Hillard, Mari Ostendorf, and Elizabeth Shriberg. 2003. Detection of agreement vs. disagreement in meetings: Training with unlabeled data. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Volume 2, NAACL '03, pages 34–36, Edmonton, Canada.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, New York, NY, USA.

Bernard Jansen, Mimi Zhang, Kate Sobel, and Abdur Chowdury. 2009. Twitter power: Tweets as electronic word of mouth. J. Am. Soc. Inf. Sci. Technol., 60(11):2169–2188.

David Jurgens, Saif Mohammad, Peter Turney, and Keith Holyoak. 2012. SemEval-2012 Task 2: Measuring degrees of relational similarity. In Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 356–364, Montréal, Canada.

Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, pages 81–93.

Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research (JAIR), 50:723–762.

Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter sentiment analysis: The good the bad and the OMG! In Proceedings of the Fifth International Conference on Weblogs and Social Media, ICWSM '11, pages 538–541, Barcelona, Catalonia, Spain.

Erich Leo Lehmann and Howard J. M. D'Abrera. 2006. Nonparametrics: Statistical Methods Based on Ranks. Springer, New York.

Christine Liebrecht, Florian Kunneman, and Antal Van den Bosch. 2013. The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 29–37, Atlanta, Georgia, USA.

Jordan J. Louviere. 1991. Best-worst scaling: A model for the largest difference judgments. Technical report, University of Alberta.

Saif Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation, SemEval '13, pages 321–327, Atlanta, Georgia, USA.

Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment analysis in Twitter. In Proceedings of the Seventh International Workshop on Semantic Evaluation, SemEval '13, pages 312–320, Atlanta, Georgia, USA.

Brendan O'Connor, Ramnath Balasubramanyan, Bryan Routledge, and Noah Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the Fourth International Conference on Weblogs and Social Media, ICWSM '10, pages 122–129, Washington, DC, USA.

Alexander Pak and Patrick Paroubek. 2010. Twitter based system: Using Twitter for disambiguating sentiment ambiguous adjectives. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 436–439, Uppsala, Sweden.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 79–86, Philadelphia, Pennsylvania, USA.

Alan Ritter, Sam Clark, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1524–1534, Edinburgh, Scotland, UK.
Alan Ritter, Oren Etzioni, Sam Clark, et al. 2012. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 1104–1112, Beijing, China.

Sara Rosenthal, Alan Ritter, Preslav Nakov, and Veselin Stoyanov. 2014. SemEval-2014 Task 9: Sentiment analysis in Twitter. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval '14, pages 73–80, Dublin, Ireland.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT-EMNLP '05, pages 347–354, Vancouver, British Columbia, Canada.

Xiaodan Zhu, Hongyu Guo, Saif Mohammad, and Svetlana Kiritchenko. 2014. An empirical study on the effect of negation words on sentiment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL '14, pages 304–313, Baltimore, Maryland, USA.

