
Enhanced Sentiment Learning Using Twitter Hashtags and Smileys

Dmitry Davidov*¹   Oren Tsur*²   Ari Rappoport²
¹ ICNC / ² Institute of Computer Science, The Hebrew University
{oren,arir}@cs.huji.ac.il

* Both authors equally contributed to this paper.

Coling 2010: Poster Volume, pages 241–249, Beijing, August 2010

Abstract

Automated identification of diverse sentiment types can be beneficial for many NLP systems such as review summarization and public media analysis. In some of these systems there is an option of assigning a sentiment value to a single sentence or a very short text.

In this paper we propose a supervised sentiment classification framework which is based on data from Twitter, a popular microblogging service. By utilizing 50 Twitter tags and 15 smileys as sentiment labels, this framework avoids the need for labor-intensive manual annotation, allowing identification and classification of diverse sentiment types of short texts. We evaluate the contribution of different feature types for sentiment classification and show that our framework successfully identifies sentiment types of untagged sentences. The quality of the sentiment identification was also confirmed by human judges. We also explore dependencies and overlap between different sentiment types represented by smileys and Twitter hashtags.

1 Introduction

A huge amount of social media including news, forums, product reviews and blogs contains numerous sentiment-based sentences. Sentiment is defined as "a personal belief or judgment that is not founded on proof or certainty" [1]. Sentiment expressions may describe the mood of the writer (happy/sad/bored/grateful/...) or the opinion of the writer towards some specific entity (X is great / I hate X, etc.).

[1] WordNet 2.1 definitions.

Automated identification of diverse sentiment types can be beneficial for many NLP systems such as review summarization systems, dialogue systems and public media analysis systems. Sometimes the user directly requests articles or sentences with a certain sentiment value (e.g., "Give me all positive reviews of product X" / "Show me articles which explain why movie X is boring"). In other cases, obtaining the sentiment value can greatly enhance information extraction tasks like review summarization. While the majority of existing sentiment extraction systems focus on polarity identification (e.g., positive vs. negative reviews) or on extraction of a handful of pre-specified mood labels, there are many useful and relatively unexplored sentiment types.

Sentiment extraction systems usually require an extensive set of manually supplied sentiment words or a handcrafted sentiment-specific dataset. With the recent popularity of article tagging, some social media types like blogs allow users to add sentiment tags to articles. This makes it possible to use blogs as a large user-labeled dataset for sentiment learning and identification. However, the set of sentiment tags in most blog platforms is somewhat restricted. Moreover, the assigned tag applies to the whole blog post, while finer-grained sentiment extraction is needed (McDonald et al., 2007).

With the recent popularity of the Twitter microblogging service, a huge amount of frequently
self-standing short textual sentences (tweets) became openly available to the research community. Many of these tweets contain a wide variety of user-defined hashtags. Some of these tags are sentiment tags which assign one or more sentiment values to a tweet. In this paper we propose a way to utilize such tagged Twitter data for classification of a wide variety of sentiment types from text.

We utilize 50 Twitter tags and 15 smileys as sentiment labels which allow us to build a classifier for dozens of sentiment types for short textual sentences. In our study we use four different feature types (punctuation, words, n-grams and patterns) for sentiment classification and evaluate the contribution of each feature type to this task. We show that our framework successfully identifies the sentiment types of untagged tweets. We confirm the quality of our algorithm using human judges.

We also explore the dependencies and overlap between different sentiment types represented by smileys and Twitter tags.

Section 2 describes related work. Section 3 details the classification features and the algorithm, while Section 4 describes the dataset and labels. Automated and manual evaluation protocols and results are presented in Section 5, followed by a short discussion.

2 Related work

Sentiment analysis tasks typically combine two different tasks: (1) identifying sentiment expressions, and (2) determining the polarity (sometimes called valence) of the expressed sentiment. These tasks are closely related, as the purpose of most works is to determine whether a sentence bears a positive or a negative (implicit or explicit) opinion about the target of the sentiment.

Several works (Wiebe, 2000; Turney, 2002; Riloff, 2003; Whitelaw et al., 2005) use lexical resources and decide whether a sentence expresses a sentiment by the presence of lexical items (sentiment words). Others combine additional feature types for this decision (Yu and Hatzivassiloglou, 2003; Kim and Hovy, 2004; Wilson et al., 2005; Bloom et al., 2007; McDonald et al., 2007; Titov and McDonald, 2008a; Melville et al., 2009). It was suggested that sentiment words may have different senses (Esuli and Sebastiani, 2006; Andreevskaia and Bergler, 2006; Wiebe and Mihalcea, 2006), thus word sense disambiguation can improve sentiment analysis systems (Akkaya et al., 2009). All works mentioned above identify evaluative sentiment expressions and their polarity.

Another line of work aims at identifying a broader range of sentiment classes expressing various emotions such as happiness, sadness, boredom, fear and gratitude, regardless of (or in addition to) positive or negative evaluations. Mihalcea and Liu (2006) derive lists of words and phrases with a happiness factor from a corpus of blog posts, where each post is annotated by the blogger with a mood label. Balog et al. (2006) use the mood annotation of blog posts coupled with news data in order to discover the events that drive the dominant moods expressed in blogs. Mishne (2005) used an ontology of over 100 moods assigned to blog posts to classify blog texts according to moods. While (Mishne, 2005) classifies a blog entry (post), (Mihalcea and Liu, 2006) assign a happiness factor to specific words and expressions. Mishne used a much broader range of moods. Strapparava and Mihalcea (2008) classify blog posts and news headlines into six sentiment categories.

While most of the works on sentiment analysis focus on full text, some works address sentiment analysis at the phrase and sentence level, see (Yu and Hatzivassiloglou, 2003; Wilson et al., 2005; McDonald et al., 2007; Titov and McDonald, 2008a; Titov and McDonald, 2008b; Wilson et al., 2009; Tsur et al., 2010) among others.

Only a few studies analyze the sentiment and polarity of tweets targeted at major brands. Jansen et al. (2009) used a commercial sentiment analyzer as well as a manually labeled corpus. Davidov et al. (2010) analyze the use of the #sarcasm hashtag and its contribution to automatic recognition of sarcastic tweets. To the best of our knowledge, there are no works employing Twitter hashtags to learn a wide range of emotions and the relations between the different emotions.
3 Sentiment classification framework

Below we propose a set of classification features and present the algorithm for sentiment classification.

3.1 Classification features

We utilize four basic feature types for sentiment classification: single word features, n-gram features, pattern features and punctuation features. For the classification, all feature types are combined into a single feature vector.

3.1.1 Word-based and n-gram-based features

Each word appearing in a sentence serves as a binary feature with weight equal to the inverted count of this word in the Twitter corpus. We also took each consecutive word sequence containing 2–5 words as a binary n-gram feature, using a similar weighting strategy. Thus n-gram features always have a higher weight than the features of their component words, and rare words have a higher weight than common words. Words or n-grams appearing in less than 0.5% of the training set sentences do not constitute a feature. ASCII smileys and other punctuation sequences containing two or more consecutive punctuation symbols were used as single-word features. Word features also include the substituted meta-words for URLs, references and hashtags (see Subsection 4.1).

3.1.2 Pattern-based features

Our main feature type is based on surface patterns. For automated extraction of patterns, we followed the pattern definitions given in (Davidov and Rappoport, 2006). We classified words into high-frequency words (HFWs) and content words (CWs). A word whose corpus frequency is more (less) than FH (FC) is considered to be a HFW (CW). We estimate word frequency from the training set rather than from an external corpus. Unlike (Davidov and Rappoport, 2006), we consider all single punctuation characters or consecutive sequences of punctuation characters as HFWs. We also consider the URL, REF, and HASHTAG tags as HFWs for pattern extraction. We define a pattern as an ordered sequence of high-frequency words and slots for content words. Following (Davidov and Rappoport, 2008), the FH and FC thresholds were set to 1000 words per million (upper bound for FC) and 100 words per million (lower bound for FH) [2].

[2] Note that the FH and FC bounds allow overlap between some HFWs and CWs. See (Davidov and Rappoport, 2008) for a short discussion.

The patterns allow 2–6 HFWs and 1–5 slots for CWs. To avoid collection of patterns which capture only a part of a meaningful multiword expression, we require patterns to start and to end with a HFW. Thus a minimal pattern is of the form [HFW] [CW slot] [HFW]. For each sentence it is possible to generate dozens of different patterns that may overlap. As with word and n-gram features, we do not treat as features any patterns which appear in less than 0.5% of the training set sentences.

Since each feature vector is based on a single sentence (tweet), we would like to allow approximate pattern matching to enhance learning flexibility. The value of a pattern feature p is estimated according to one of the following four scenarios [3]:

  1/count(p)              Exact match – all the pattern components appear in the
                          sentence in the correct order without any additional words.

  α/count(p)              Sparse match – same as exact match, but additional
                          non-matching words can be inserted between the pattern
                          components.

  (γ·n)/(N·count(p))      Incomplete match – only n > 1 of the N pattern components
                          appear in the sentence, while some non-matching words can
                          be inserted in-between. At least one of the appearing
                          components should be a HFW.

  0                       No match – nothing or only a single pattern component
                          appears in the sentence.

[3] As with word and n-gram features, the maximal feature weight of a pattern p is defined as the inverse count of the pattern in the complete Twitter corpus.

Here 0 ≤ α ≤ 1 and 0 ≤ γ ≤ 1 are parameters we use to assign reduced scores for imperfect matches. Since the patterns we use are relatively long, exact matches are uncommon, and taking advantage of partial matches allows us to significantly reduce the sparsity of the feature vectors. We used α = γ = 0.1 in all experiments.

This pattern-based framework was proven efficient for sarcasm detection in (Tsur et al., 2010; Davidov et al., 2010).
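To make the two feature types above concrete, here is a minimal Python sketch of the word and n-gram features of Section 3.1.1. It assumes tweets are already tokenized into lists of strings and that counts are taken over the training tweets (the paper weights by counts in the full Twitter corpus); the function names and the dictionary representation of feature vectors are illustrative choices, not part of the paper.

```python
from collections import Counter

def build_vocabulary(tweets, min_fraction=0.005):
    """Collect words and 2-5 word n-grams, keeping only those that appear
    in at least min_fraction (0.5%) of the training sentences."""
    doc_freq, corpus_count = Counter(), Counter()
    for tokens in tweets:
        seen = set()
        for n in range(1, 6):                       # single words and 2-5-grams
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                corpus_count[gram] += 1
                seen.add(gram)
        doc_freq.update(seen)
    kept = {g for g, df in doc_freq.items() if df >= min_fraction * len(tweets)}
    return kept, corpus_count

def word_ngram_features(tokens, kept, corpus_count):
    """Binary features weighted by the inverted corpus count, so rare items
    (and longer n-grams) receive higher weights than common single words."""
    features = {}
    for n in range(1, 6):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in kept:
                features[gram] = 1.0 / corpus_count[gram]
    return features
```

The pattern-feature value of Section 3.1.2 can be sketched in the same style. The greedy left-to-right alignment used below for the sparse and incomplete cases is an assumption (the text does not spell out the matching procedure), and the "CW" slot marker and the `hfw` set are illustrative.

```python
def pattern_value(pattern, tokens, hfw, count_p, alpha=0.1, gamma=0.1):
    """Score a single pattern against a tokenized sentence following the four
    scenarios above: exact, sparse, incomplete and no match.

    pattern  -- list of components: concrete HFWs and "CW" for content-word slots
    hfw      -- set of high-frequency words
    count_p  -- corpus count of the pattern (for the 1/count(p) weighting)
    """
    def fits(component, token):
        return token not in hfw if component == "CW" else component == token

    N = len(pattern)
    # Exact match: the components align with a contiguous token window.
    for i in range(len(tokens) - N + 1):
        if all(fits(c, t) for c, t in zip(pattern, tokens[i:i + N])):
            return 1.0 / count_p

    # Sparse / incomplete match: align components left to right, allowing
    # extra non-matching tokens in between (a greedy approximation).
    matched, j = [], 0
    for token in tokens:
        if j < N and fits(pattern[j], token):
            matched.append(pattern[j])
            j += 1
    n = len(matched)
    if n == N:
        return alpha / count_p                        # sparse match
    if n > 1 and any(c != "CW" for c in matched):     # at least one HFW matched
        return gamma * n / (N * count_p)              # incomplete match
    return 0.0                                        # no match
```

Under this scoring, a long pattern that is only partially present still contributes a small, length-proportional value, which is what reduces the sparsity of the feature vectors.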
3.1.3 Efficiency of feature selection

Since we avoid selecting textual features which have a training set frequency below 0.5%, we perform feature selection incrementally, at each stage using the frequencies of the features obtained during the previous stages. Thus we first estimate the frequencies of single words in the training set, then we only consider creating n-grams from single words with sufficient frequency, and finally we only consider patterns composed from sufficiently frequent words and n-grams.

3.1.4 Punctuation-based features

In addition to pattern-based features we used the following generic features: (1) sentence length in words, (2) number of "!" characters in the sentence, (3) number of "?" characters in the sentence, (4) number of quotes in the sentence, and (5) number of capitalized/all-capitals words in the sentence. All these features were normalized by dividing them by the maximal observed value times the averaged maximal value of the other feature groups; thus the maximal weight of each of these features is equal to the averaged weight of a single pattern/word/n-gram feature.

3.2 Classification algorithm

In order to assign a sentiment label to new examples in the test set we use a k-nearest neighbors (kNN)-like strategy. We construct a feature vector for each example in the training and the test set. We would like to assign a sentiment class to each example in the test set. For each feature vector v in the test set, we compute the Euclidean distance to each of the matching vectors in the training set, where matching vectors are defined as ones which share at least one pattern/n-gram/word feature with v.

Let ti, i = 1...k be the k vectors with the lowest Euclidean distance to v [4], with assigned labels Li, i = 1...k. We calculate the mean distance d(ti, v) for this set of vectors and drop from the set up to five outliers for which the distance is more than twice the mean distance. The label assigned to v is the label of the majority of the remaining vectors.

[4] We used k = 10 for all experiments.

If a similar number of remaining vectors have different labels, we assign to the test vector the most frequent of these labels according to their frequency in the dataset. If no matching vectors are found for v, we assign the default "no sentiment" label, since there are significantly more non-sentiment sentences than sentiment sentences in Twitter.
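A compact sketch of this kNN-like rule, assuming feature vectors are represented as sparse dictionaries; `label_freq` (the overall label frequencies used for tie-breaking) and the other helper names are illustrative, not from the paper.

```python
import math
from collections import Counter

def knn_label(v, training, label_freq, k=10, default="no sentiment"):
    """kNN-like assignment of Section 3.2.

    v          -- sparse feature dict for the test example
    training   -- list of (feature_dict, label) pairs
    label_freq -- Counter of label frequencies in the dataset (tie-breaking)
    """
    def distance(a, b):
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(f, 0.0) - b.get(f, 0.0)) ** 2 for f in keys))

    # Matching vectors share at least one feature with v.
    matching = [(distance(vec, v), lab) for vec, lab in training if set(vec) & set(v)]
    if not matching:
        return default

    nearest = sorted(matching)[:k]
    mean_d = sum(d for d, _ in nearest) / len(nearest)

    kept, dropped = [], 0
    for d, lab in nearest:
        if d > 2 * mean_d and dropped < 5:      # drop up to five outliers
            dropped += 1
        else:
            kept.append(lab)

    counts = Counter(kept)
    best = max(counts.values())
    tied = [lab for lab, c in counts.items() if c == best]
    # Ties go to the label that is most frequent in the whole dataset.
    return max(tied, key=lambda lab: label_freq[lab])
```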
4 Twitter dataset and sentiment tags

In our experiments we used an extensive Twitter data collection as training and testing sets. In our training sets we utilize sentiment hashtags and smileys as classification labels. Below we describe this dataset in detail.

4.1 Twitter dataset

We have used a Twitter dataset generously provided to us by Brendan O'Connor. This dataset includes over 475 million tweets, comprising roughly 15% of all public, non-"low quality" tweets created from May 2009 to January 2010. Tweets are short sentences limited to 140 UTF-8 characters. All non-English tweets and tweets which contain fewer than 5 proper English words [5] were removed from the dataset.

[5] Identification of proper English words was based on an available WordNet-based English dictionary.

Apart from simple text, tweets may contain URL addresses, references to other Twitter users (appearing as @<user>) or content tags (also called hashtags) assigned by the tweeter (#<tag>), which we use as labels for our supervised classification framework.

Two examples of typical tweets are: "#ipad #sucks and 6,510 people agree. See more on Ipad sucks page: http://j.mp/4OiYyg?", and "Pay no mind to those who talk behind ur back, it simply means that u're 2 steps ahead. #ihatequotes". Note that in the first example the hashtagged words are a grammatical part of the sentence (it becomes meaningless without them), while #ihatequotes in the second example is a mere sentiment label and not part of the sentence. Also note that hashtags can be composed of multiple words (with no spaces).

During preprocessing, we have replaced URL links, hashtags and references by URL/REF/TAG meta-words. This substitution obviously had some effect on the pattern recognition phase (see Section 3.1.2); however, our algorithm is robust enough to overcome this distortion.
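A minimal sketch of this substitution step, assuming simple regular expressions are sufficient to spot URLs, @-references and hashtags; the exact expressions are illustrative choices.

```python
import re

URL_RE = re.compile(r"https?://\S+")
REF_RE = re.compile(r"@\w+")
TAG_RE = re.compile(r"#\w+")

def preprocess(tweet):
    """Replace URLs, @-references and hashtags with the URL/REF/TAG
    meta-words that later serve as HFWs during pattern extraction."""
    tweet = URL_RE.sub("URL", tweet)
    tweet = REF_RE.sub("REF", tweet)
    tweet = TAG_RE.sub("TAG", tweet)
    return tweet

# preprocess("#ipad #sucks ... See more on Ipad sucks page: http://j.mp/4OiYyg?")
# -> "TAG TAG ... See more on Ipad sucks page: URL"
```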
4.2 Hashtag-based sentiment labels

The Twitter dataset contains over 2.5 million different user-defined hashtags. Many tweets include more than a single tag, and 3852 "frequent" tags appear in more than 1000 different tweets. Two human judges manually annotated these frequent tags into five different categories: 1 – strong sentiment (e.g., #sucks in the example above), 2 – most likely sentiment (e.g., #notcute), 3 – context-dependent sentiment (e.g., #shoutsout), 4 – focused sentiment (e.g., #tmobilesucks, where the target of the sentiment is part of the hashtag), and 5 – no sentiment (e.g., #obama). Table 1 shows the annotation results and the percentage of similarly assigned values for each category.

  Category             # of tags   % agreement
  Strong sentiment            52            87
  Likely sentiment            70            66
  Context-dependent          110            61
  Focused                     45            75
  No sentiment              3564            99

Table 1: Annotation results (2 judges) for the 3852 most frequent Twitter tags. The second column displays the number of tags in each category, and the last column shows the % of tags annotated similarly by the two judges.

We selected 50 hashtags annotated "1" or "2" by both judges. For each of these tags we automatically sampled 1000 tweets, resulting in 50000 labeled tweets. We avoided sampling tweets which include more than one of the sampled hashtags. As a no-sentiment dataset we randomly sampled 10000 tweets with no hashtags/smileys from the whole dataset, assuming that such a random sample is unlikely to contain a significant amount of sentiment sentences.

4.3 Smiley-based sentiment labels

While there exist many "official" lists of possible ASCII smileys, most of these smileys are infrequent or not commonly accepted and used as sentiment indicators by online communities. We used the Amazon Mechanical Turk (AMT) service in order to obtain a list of the most commonly used and unambiguous ASCII smileys. We asked each of ten AMT human subjects to provide at least 6 commonly used ASCII mood-indicating smileys, together with one or more single-word descriptions of the smiley-related mood state. From the obtained list of smileys we selected a subset of 15 smileys which were (1) provided by at least three human subjects, (2) described by at least two human subjects using the same single-word description, and (3) appear at least 1000 times in our Twitter dataset. We then sampled 1000 tweets for each of these smileys, using these smileys as sentiment tags in the sentiment classification framework described in the previous section.

5 Evaluation and Results

The purpose of our evaluation was to learn how well our framework can identify and distinguish between sentiment types defined by tags or smileys, and to test whether our framework can be successfully used to identify sentiment types in new untagged sentences.

5.1 Evaluation using cross-validation

In the first experiment we evaluated the consistency and quality of sentiment classification using cross-validation over the training set. Fully automated evaluation allowed us to test the performance of our algorithm under several different feature settings: Pn+W-M-Pt-, Pn+W+M-Pt-, Pn+W+M+Pt-, Pn-W-M-Pt+ and FULL, where +/- stands for utilization/omission of the following feature types: Pn: punctuation, W: word, M: n-grams (M stands for 'multi'), Pt: patterns. FULL stands for utilization of all feature types.

In this experimental setting, the training set was divided into 10 parts and a 10-fold cross-validation test was executed. Each time, we use 9 parts as the labeled training data for feature selection and construction of labeled vectors, and the remaining part is used as a test set. The process was repeated ten times. To avoid utilization of the labels as strong features in the test set, we removed all instances of the involved label hashtags/smileys from the tweets used as the test set.
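The label-stripping step is what keeps this evaluation honest, so here is a sketch of a single fold under the assumption that `extract_features` is the feature pipeline of Section 3 and `classify` the kNN-like classifier of Section 3.2 (both passed in as callables); for brevity only the example's own label is stripped, whereas the full protocol removes all involved label hashtags/smileys.

```python
def strip_label_marks(text, label):
    """Remove the label hashtag/smiley from a test tweet so the classifier
    cannot simply read the answer off the text."""
    return text.replace(label, " ")

def one_fold(examples, fold, extract_features, classify, n_folds=10):
    """Run a single fold of the 10-fold cross-validation of Section 5.1.
    examples is a list of (text, label) pairs."""
    test  = [ex for i, ex in enumerate(examples) if i % n_folds == fold]
    train = [ex for i, ex in enumerate(examples) if i % n_folds != fold]
    train_vectors = [(extract_features(text), label) for text, label in train]
    correct = 0
    for text, label in test:
        cleaned = strip_label_marks(text, label)
        if classify(extract_features(cleaned), train_vectors) == label:
            correct += 1
    return correct / len(test)
```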
Multi-class classification. Under multi-class classification we attempt to assign a single label (51 labels in the case of hashtags and 16 labels in the case of smileys) to each of the vectors in the test set. Note that the random baseline for this task is 0.02 (0.06) for hashtags (smileys). Table 2 shows the performance of our framework for these tasks.

  Setup         Smileys   Hashtags
  random           0.06       0.02
  Pn+W-M-Pt-       0.16       0.06
  Pn+W+M-Pt-       0.25       0.15
  Pn+W+M+Pt-       0.29       0.18
  Pn-W-M-Pt+       0.5        0.26
  FULL             0.64       0.31

Table 2: Multi-class classification results for smileys and hashtags. The table shows the averaged harmonic f-score for 10-fold cross-validation. 51 (16) sentiment classes were used for hashtags (smileys).

Results are significantly above the random baseline and definitely nontrivial considering the equal class sizes in the test set. While still relatively low (0.31 for hashtags and 0.64 for smileys), we observe much better performance for smileys, which is expected due to the lower number of sentiment types.

The relatively low performance on hashtags can be explained by the ambiguity of the hashtags and some overlap of sentiments. Examination of classified sentences reveals that many of them can be reasonably assigned to more than one of the available hashtags or smileys. Thus a tweet "I'm reading stuff that I DON'T understand again! hahaha...wth am I doing" may reasonably match the tags #sarcasm, #damn, #haha, #lol, #humor, #angry etc. Close examination of the incorrectly classified examples also reveals that a substantial amount of tweets utilize hashtags to explicitly indicate the specific hashtagged sentiment; in these cases no sentiment value could be perceived by readers unless indicated explicitly, e.g., "De Blob game review posted on our blog. #fun". Obviously, our framework fails to process such cases and captures noise, since no sentiment data is present in the processed text labeled with a specific sentiment label.

Binary classification. In the binary classification experiments, we classified a sentence as either appropriate for a particular tag or as not bearing any sentiment [6]. For each of the 50 (15) labels for hashtags (smileys) we performed a binary classification, providing as training/test sets only positive examples of the specific sentiment label together with non-sentiment examples. Table 3 shows averaged results for this case and specific results for selected tags. We can see that our framework successfully identifies diverse sentiment types. Obviously the results are much better than those of multi-class classification, and the observed > 0.8 precision confirms the usefulness of the proposed framework for sentiment classification of a variety of different sentiment types.

[6] Note that this is a useful application in itself, as a filter that extracts sentiment sentences from a corpus for further focused study/processing.

  Hashtags      Avg    #hate   #jealous   #cute   #outrageous
  Pn+W-M-Pt-    0.57   0.6     0.55       0.63    0.53
  Pn+W+M-Pt-    0.64   0.64    0.67       0.66    0.6
  Pn+W+M+Pt-    0.69   0.66    0.67       0.69    0.64
  Pn-W-M-Pt+    0.73   0.75    0.7        0.69    0.69
  FULL          0.8    0.83    0.76       0.71    0.78

  Smileys       Avg    :)      ;)         X(      :D
  Pn+W-M-Pt-    0.64   0.66    0.67       0.56    0.65
  Pn+W+M-Pt-    0.7    0.73    0.72       0.64    0.69
  Pn+W+M+Pt-    0.7    0.74    0.75       0.66    0.69
  Pn-W-M-Pt+    0.75   0.78    0.75       0.68    0.72
  FULL          0.86   0.87    0.9        0.74    0.81

Table 3: Binary classification results for smileys and hashtags. The Avg column shows the averaged harmonic f-score for 10-fold cross-validation over all 50 (15) sentiment hashtags (smileys).

We can see that even in the binary classification setting, classification of smiley-labeled sentences is a substantially easier task compared to classification of hashtag-labeled tweets. Comparing the contribution of the different feature types, we can see that punctuation, word and pattern features each provide a substantial boost to classification quality, while we observe only a marginal boost when adding n-grams as classification features. We can also see that pattern features contribute to the performance more than all other features together.
5.2 Evaluation with human judges

In the second set of experiments we evaluated our framework on a test set of unseen and untagged tweets (thus tweets that were not part of the training data), comparing its output to tags assigned by human judges. We applied our framework with its FULL setting, learning the sentiment tags from the training set for hashtags and smileys (separately), and executed the framework on the reduced Twitter dataset (without untagged data), allowing it to identify at least five sentences for each sentiment class.

In order to make the evaluation harsher, we removed all tweets containing at least one of the relevant classification hashtags (or smileys). For each of the resulting 250 sentences for hashtags and 75 sentences for smileys we generated an 'assignment task'. Each task presents a human judge with a sentence and a list of ten possible hashtags. One tag from this list was provided by our algorithm, 8 other tags were sampled from the remaining 49 (14) available sentiment tags, and the tenth tag is from the list of frequent non-sentiment tags (e.g., travel or obama). The human judge was requested to select the 0–2 most appropriate tags from the list. Allowing assignment of multiple tags conforms to the observation that even short sentences may express several different sentiment types, and to the observation that some of the selected sentiment tags might express similar sentiment types.

We used the Amazon Mechanical Turk service to present the tasks to English-speaking subjects. Each subject was given 50 tasks for Twitter hashtags or 25 questions for smileys. To ensure the quality of assignments, we added to each test five manually selected, clearly sentiment-bearing assignment tasks from the tagged Twitter sentences used in the training set. Each set was presented to four subjects. If a human subject failed to provide the intended "correct" answer to at least two of the control set questions, we rejected him/her from the calculation. In our evaluation the algorithm is considered to be correct if one of the tags selected by a human judge was also selected by the algorithm. Table 4 shows results for the human judgement classification. The agreement score for this task was κ = 0.41 (we consider agreement when at least one of the two selected items is shared).

  Setup      % Correct   % No sentiment   Control
  Smileys       84%            6%            92%
  Hashtags      77%           10%            90%

Table 4: Results of human evaluation. The "% No sentiment" column indicates the percentage of sentences for which judges found no appropriate tag in the list; the "Control" column shows performance on the control set.

Table 4 shows that the majority of tags selected by humans matched those selected by the algorithm. Precision of smiley tags is substantially higher than that of hashtag labels, due to the lower number of possible smileys and the lesser ambiguity of smileys in comparison to hashtags.

5.3 Exploration of feature dependencies

Our algorithm assigns a single sentiment type to each tweet. However, as discussed above, some sentiment types overlap (e.g., #awesome and #amazing). Many sentences may express several types of sentiment (e.g., #fun and #scary in "Oh My God http://goo.gl/fb/K2N5z #entertainment #fun #pictures #photography #scary #teaparty"). We would like to estimate such inter-sentiment dependencies and overlap automatically from the labeled data. We use two different methods for overlap estimation: tag co-occurrence and feature overlap.

5.3.1 Tag co-occurrence

Many tweets contain more than a single hashtag or a single smiley type. As mentioned, we exclude such tweets from the training set to reduce ambiguity. However, such tag co-appearances can be used for sentiment overlap estimation. We calculated the relative co-occurrence frequencies of some hashtags and smileys. Table 5 shows some of the observed co-appearance ratios. As expected, some of the observed tags frequently co-appear with other similar tags.

  Hashtags   #happy   #sad   #crazy   #bored
  #sad         0.67      -        -        -
  #crazy       0.67   0.25        -        -
  #bored       0.05   0.42     0.35        -
  #fun         1.21   0.06     1.17     0.43

  Smileys        :)     ;)       :(       X(
  ;)           3.35      -        -        -
  :(           3.12   0.53        -        -
  X(           1.74   0.47     2.18        -
  :S           1.74   0.42     1.4      0.15

Table 5: Percentage of co-appearance of tags in the Twitter corpus.

Interestingly, it appears that a relatively high ratio of co-appearance of tags is with opposite meanings (e.g., "#ilove eating but #ihate feeling fat lol" or "happy days of training going to end in a few days #sad #happy"). This is possibly due to frequently expressed contrasting sentiment types in the same sentence – a fascinating phenomenon reflecting the great complexity of the human emotional state (and its expression).
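A small sketch of how such co-appearance counts can be collected, assuming each tweet has been reduced to its set of hashtags; the normalization that turns the raw pair counts into the percentages of Table 5 is not spelled out in the text, so it is left to the caller.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tweet_tags, targets):
    """Count how often pairs of the selected sentiment hashtags appear in the
    same tweet (Section 5.3.1).

    tweet_tags -- iterable of per-tweet hashtag sets
    targets    -- the sentiment hashtags of interest
    Returns raw pair counts and single-tag counts; relative co-occurrence
    ratios can then be derived by normalizing the pair counts.
    """
    pair_counts, tag_counts = Counter(), Counter()
    targets = set(targets)
    for tags in tweet_tags:
        present = sorted(tags & targets)
        tag_counts.update(present)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return pair_counts, tag_counts
```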
5.3.2 Feature overlap

In our framework we have created a set of feature vectors for each of the Twitter sentiment tags. Comparison of the shared features in these feature vector sets allows us to estimate dependencies between different sentiment types even when direct tag co-occurrence data is very sparse. A feature is considered to be shared between two different sentiment labels if for both sentiment labels there is at least a single example in the training set which has a positive value for this feature. In order to automatically analyze such dependencies we calculate the percentage of shared word/n-gram/pattern features between different sentiment labels. Table 6 shows the observed feature overlap values for selected sentiment tags.

  Hashtags   #happy   #sad   #crazy   #bored
  #sad         12.8      -        -        -
  #crazy       14.2    3.5        -        -
  #bored        2.4   11.1      2.1        -
  #fun         19.6    2.1     15        4.4

  Smileys        :)     ;)       :(       X(
  ;)           35.9      -        -        -
  :(           31.9   10.5        -        -
  X(            8.1   10.2     36          -
  :S           10.5   12.6     21.6      6.1

Table 6: Percentage of shared features in the feature vectors for different tags.

We observe that the trend of results obtained by comparison of shared feature vectors is similar to those obtained by means of label co-occurrence, although the numbers of shared features are higher. These results, demonstrating the pattern-based similarity of conflicting, sometimes contradicting, emotions are interesting from a psychological and cognitive perspective.
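A sketch of the shared-feature computation, assuming each label's training examples are available as sparse feature dictionaries; dividing by the union of the two feature sets is one reasonable reading of "percentage of shared features", not necessarily the exact denominator used in the paper.

```python
def shared_feature_percentage(vectors_a, vectors_b):
    """Feature overlap between two sentiment labels (Section 5.3.2): a feature
    counts as 'used' by a label if at least one of its training vectors has a
    positive value for it, and as shared if both labels use it."""
    used_a = {f for vec in vectors_a for f, v in vec.items() if v > 0}
    used_b = {f for vec in vectors_b for f, v in vec.items() if v > 0}
    union = used_a | used_b
    if not union:
        return 0.0
    return 100.0 * len(used_a & used_b) / len(union)
```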
6 Conclusion

We presented a framework which allows automatic identification and classification of various sentiment types in short text fragments, based on Twitter data. Our framework is a supervised classification framework which utilizes Twitter hashtags and smileys as training labels. The substantial coverage and size of the processed Twitter data allowed us to identify dozens of sentiment types without any labor-intensive manually labeled training sets or pre-provided sentiment-specific features or sentiment words.

We evaluated diverse feature types for sentiment extraction, including punctuation, patterns, words and n-grams, confirming that each feature type contributes to the sentiment classification framework. We also proposed two different methods which allow automatic identification of sentiment type overlap and inter-dependencies. In the future these methods can be used for automated clustering of sentiment types and sentiment dependency rules. While hashtag labels are specific to Twitter data, the obtained feature vectors are not heavily Twitter-specific, and in the future we would like to explore the applicability of Twitter data for sentiment multi-class identification and classification in other domains.

References

Akkaya, Cem, Janyce Wiebe, and Rada Mihalcea. 2009. Subjectivity word sense disambiguation. In EMNLP.

Andreevskaia, A. and S. Bergler. 2006. Mining WordNet for fuzzy sentiment: Sentiment tag extraction from WordNet glosses. In EACL.

Balog, Krisztian, Gilad Mishne, and Maarten de Rijke. 2006. Why are they excited? Identifying and explaining spikes in blog mood levels. In EACL.

Bloom, Kenneth, Navendu Garg, and Shlomo Argamon. 2007. Extracting appraisal expressions. In HLT/NAACL.

Davidov, D. and A. Rappoport. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In COLING-ACL.

Davidov, D. and A. Rappoport. 2008. Unsupervised discovery of generic relationships using pattern clusters and its evaluation by automatically generated SAT analogy questions. In ACL.

Davidov, D., O. Tsur, and A. Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In CoNLL.

Esuli, Andrea and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In LREC.

Jansen, B.J., M. Zhang, K. Sobel, and A. Chowdury. 2009. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology.

Kim, S.M. and E. Hovy. 2004. Determining the sentiment of opinions. In COLING.

McDonald, Ryan, Kerry Hannan, Tyler Neylon, Mike Wells, and Jeff Reynar. 2007. Structured models for fine-to-coarse sentiment analysis. In ACL.

Melville, Prem, Wojciech Gryc, and Richard D. Lawrence. 2009. Sentiment analysis of blogs by combining lexical knowledge with text classification. In KDD. ACM.

Mihalcea, Rada and Hugo Liu. 2006. A corpus-based approach to finding happiness. In AAAI 2006 Symposium on Computational Approaches to Analysing Weblogs. AAAI Press.

Mishne, Gilad. 2005. Experiments with mood classification in blog posts. In Proceedings of the 1st Workshop on Stylistic Analysis of Text.

Riloff, Ellen. 2003. Learning extraction patterns for subjective expressions. In EMNLP.

Strapparava, Carlo and Rada Mihalcea. 2008. Learning to identify emotions in text. In SAC.

Titov, Ivan and Ryan McDonald. 2008a. A joint model of text and aspect ratings for sentiment summarization. In ACL/HLT, June.

Titov, Ivan and Ryan McDonald. 2008b. Modeling online reviews with multi-grain topic models. In WWW, pages 111–120, New York, NY, USA. ACM.

Tsur, Oren, Dmitry Davidov, and Ari Rappoport. 2010. ICWSM – a great catchy name: Semi-supervised recognition of sarcastic sentences in product reviews. In AAAI-ICWSM.

Turney, Peter D. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In ACL '02, volume 40.

Whitelaw, Casey, Navendu Garg, and Shlomo Argamon. 2005. Using appraisal groups for sentiment analysis. In CIKM.

Wiebe, Janyce and Rada Mihalcea. 2006. Word sense and subjectivity. In COLING/ACL, Sydney, AUS.

Wiebe, Janyce M. 2000. Learning subjective adjectives from corpora. In AAAI.

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In HLT/EMNLP.

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2009. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics, 35(3):399–433.

Yu, Hong and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In EMNLP.