Enhanced Sentiment Learning Using Twitter Hashtags and Smileys
self-standing short textual sentences (tweets) became openly available for the research community. Many of these tweets contain a wide variety of user-defined hashtags. Some of these tags are sentiment tags which assign one or more sentiment values to a tweet. In this paper we propose a way to utilize such tagged Twitter data for classification of a wide variety of sentiment types from text.

We utilize 50 Twitter tags and 15 smileys as sentiment labels which allow us to build a classifier for dozens of sentiment types for short textual sentences. In our study we use four different feature types (punctuation, words, n-grams and patterns) for sentiment classification and evaluate the contribution of each feature type for this task. We show that our framework successfully identifies sentiment types of the untagged tweets. We confirm the quality of our algorithm using human judges.

We also explore the dependencies and overlap between different sentiment types represented by smileys and Twitter tags.

Section 2 describes related work. Section 3 details classification features and the algorithm, while Section 4 describes the dataset and labels. Automated and manual evaluation protocols and results are presented in Section 5, followed by a short discussion.
2 Related work
Sentiment analysis tasks typically combine two different tasks: (1) identifying sentiment expressions, and (2) determining the polarity (sometimes called valence) of the expressed sentiment. These tasks are closely related, as the purpose of most works is to determine whether a sentence bears a positive or a negative (implicit or explicit) opinion about the target of the sentiment.

Several works (Wiebe, 2000; Turney, 2002; Riloff, 2003; Whitelaw et al., 2005) use lexical resources and decide whether a sentence expresses a sentiment by the presence of lexical items (sentiment words). Others combine additional feature types for this decision (Yu and Hatzivassiloglou, 2003; Kim and Hovy, 2004; Wilson et al., 2005; Bloom et al., 2007; McDonald et al., 2007; Titov and McDonald, 2008a; Melville et al., 2009). It was suggested that sentiment words may have different senses (Esuli and Sebastiani, 2006; Andreevskaia and Bergler, 2006; Wiebe and Mihalcea, 2006); thus word sense disambiguation can improve sentiment analysis systems (Akkaya et al., 2009). All works mentioned above identify evaluative sentiment expressions and their polarity.

Another line of works aims at identifying a broader range of sentiment classes expressing various emotions such as happiness, sadness, boredom, fear, and gratitude, regardless of (or in addition to) positive or negative evaluations. Mihalcea and Liu (2006) derive lists of words and phrases with a happiness factor from a corpus of blog posts, where each post is annotated by the blogger with a mood label. Balog et al. (2006) use the mood annotation of blog posts coupled with news data in order to discover the events that drive the dominant moods expressed in blogs. Mishne (2005) used an ontology of over 100 moods assigned to blog posts to classify blog texts according to moods. While Mishne (2005) classifies a blog entry (post), Mihalcea and Liu (2006) assign a happiness factor to specific words and expressions; Mishne used a much broader range of moods. Strapparava and Mihalcea (2008) classify blog posts and news headlines into six sentiment categories.

While most of the works on sentiment analysis focus on full text, some works address sentiment analysis at the phrase and sentence level; see (Yu and Hatzivassiloglou, 2003; Wilson et al., 2005; McDonald et al., 2007; Titov and McDonald, 2008a; Titov and McDonald, 2008b; Wilson et al., 2009; Tsur et al., 2010), among others.

Only a few studies analyze the sentiment and polarity of tweets targeted at major brands. Jansen et al. (2009) used a commercial sentiment analyzer as well as a manually labeled corpus. Davidov et al. (2010) analyze the use of the #sarcasm hashtag and its contribution to automatic recognition of sarcastic tweets. To the best of our knowledge, there are no works employing Twitter hashtags to learn a wide range of emotions and the relations between the different emotions.
3 Sentiment classification framework

Below we propose a set of classification features and present the algorithm for sentiment classification.
3.1 Classification features

We utilize four basic feature types for sentiment classification: single-word features, n-gram features, pattern features and punctuation features. For the classification, all feature types are combined into a single feature vector.

3.1.1 Word-based and n-gram-based features

Each word appearing in a sentence serves as a binary feature, with weight equal to the inverted count of this word in the Twitter corpus. We also took each consecutive word sequence containing 2-5 words as a binary n-gram feature, using a similar weighting strategy. Thus n-gram features always have a higher weight than the features of their component words, and rare words have a higher weight than common words. Words or n-grams appearing in less than 0.5% of the training set sentences do not constitute a feature. ASCII smileys and other punctuation sequences containing two or more consecutive punctuation symbols were used as single-word features. Word features also include the substituted meta-words for URLs, references and hashtags (see Subsection 4.1).
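To make the weighting scheme concrete, here is a minimal Python sketch of word and n-gram feature extraction. It assumes a hypothetical corpus_counts mapping from words/n-grams to their Twitter-corpus counts, and a min_count cutoff standing in for the 0.5%-of-training-sentences filter; it is an illustration, not the authors' implementation:

```python
def word_ngram_features(tokens, corpus_counts, min_count, max_n=5):
    """Binary word/n-gram features weighted by inverted corpus count.

    An n-gram can never be more frequent than its component words, so
    its weight 1/count is always at least as high as theirs, and rare
    words automatically outweigh common ones.
    """
    features = {}
    for n in range(1, max_n + 1):          # single words plus 2-5 grams
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            count = corpus_counts.get(gram, 0)
            if count >= min_count:         # drop too-rare candidates
                features[gram] = 1.0 / count
    return features
```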
3.1.2 Pattern-based features

Our main feature type is based on surface patterns. For automated extraction of patterns, we followed the pattern definitions given in (Davidov and Rappoport, 2006). We classified words into high-frequency words (HFWs) and content words (CWs). A word whose corpus frequency is more (less) than FH (FC) is considered to be a HFW (CW). We estimate word frequency from the training set rather than from an external corpus. Unlike (Davidov and Rappoport, 2006), we consider all single punctuation characters or consecutive sequences of punctuation characters as HFWs. We also consider URL, REF, and HASHTAG tags as HFWs for pattern extraction. We define a pattern as an ordered sequence of high-frequency words and slots for content words. Following (Davidov and Rappoport, 2008), the FH and FC thresholds were set to 1000 words per million (upper bound for FC) and 100 words per million (lower bound for FH).[2]

The patterns allow 2-6 HFWs and 1-5 slots for CWs. To avoid collection of patterns which capture only a part of a meaningful multiword expression, we require patterns to start and to end with a HFW. Thus a minimal pattern is of the form [HFW] [CW slot] [HFW]. For each sentence it is possible to generate dozens of different patterns that may overlap. As with word and n-gram features, we do not treat as features any patterns which appear in less than 0.5% of the training set sentences.

Since each feature vector is based on a single sentence (tweet), we would like to allow approximate pattern matching to enhance learning flexibility. The value of a pattern feature p is estimated according to one of the following four scenarios[3]:

  1/count(p): Exact match – all the pattern components appear in the sentence in correct order without any additional words.

  α/count(p): Sparse match – same as exact match, but additional non-matching words can be inserted between pattern components.

  (γ·n)/(N·count(p)): Incomplete match – only n > 1 of the N pattern components appear in the sentence, while some non-matching words can be inserted in-between. At least one of the appearing components should be a HFW.

  0: No match – nothing or only a single pattern component appears in the sentence.

Here 0 ≤ α ≤ 1 and 0 ≤ γ ≤ 1 are parameters we use to assign reduced scores for imperfect matches. Since the patterns we use are relatively long, exact matches are uncommon, and taking advantage of partial matches allows us to significantly reduce the sparsity of the feature vectors. We used α = γ = 0.1 in all experiments. This pattern-based framework was proven efficient for sarcasm detection in (Tsur et al., 2010; Davidov et al., 2010).

[2] Note that the FH and FC bounds allow overlap between some HFWs and CWs. See (Davidov and Rappoport, 2008) for a short discussion.
[3] As with word and n-gram features, the maximal feature weight of a pattern p is defined as the inverse count of the pattern in the complete Twitter corpus.
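As a rough illustration of the four scenarios, the sketch below scores a single pattern against a tokenized sentence. It simplifies the scheme described above: CW slots are treated as concrete tokens, the number of appearing components is computed with a longest-common-subsequence pass, and the HFW condition is checked only by membership; is_hfw and count_p are assumed inputs:

```python
def pattern_feature_value(pattern, tokens, count_p, is_hfw,
                          alpha=0.1, gamma=0.1):
    """Return 1/count(p), alpha/count(p), gamma*n/(N*count(p)) or 0
    for exact, sparse, incomplete and no match, respectively."""
    N = len(pattern)
    # Exact match: components appear contiguously and in order.
    for i in range(len(tokens) - N + 1):
        if tokens[i:i + N] == pattern:
            return 1.0 / count_p
    # LCS length = number of components appearing in order,
    # allowing non-matching words in between.
    lcs = [[0] * (len(tokens) + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        for j in range(1, len(tokens) + 1):
            lcs[i][j] = (lcs[i - 1][j - 1] + 1
                         if pattern[i - 1] == tokens[j - 1]
                         else max(lcs[i - 1][j], lcs[i][j - 1]))
    n = lcs[N][len(tokens)]
    if n == N:                        # sparse match: all components present
        return alpha / count_p
    if n > 1 and any(is_hfw(c) and c in tokens for c in pattern):
        return gamma * n / (N * count_p)   # incomplete match
    return 0.0                        # no match / single component only
```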
3.1.3 Efficiency of feature selection

Since we avoid selection of textual features which have a training set frequency below 0.5%, we perform feature selection incrementally, at each stage using the frequencies of the features obtained during the previous stages. Thus we first estimate the frequencies of single words in the training set; then we only consider creation of n-grams from single words with sufficient frequency; finally, we only consider patterns composed from sufficiently frequent words and n-grams.
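A compact sketch of this staged counting, with hypothetical names (min_count again standing in for the 0.5% threshold); the pattern stage would be built analogously on top of the surviving words and n-grams:

```python
from collections import Counter

def staged_counts(sentences, min_count):
    """Stage 1: count single words.  Stage 2: count only those n-grams
    whose component words all passed stage 1, so we never enumerate
    n-grams over words already known to be too rare."""
    words = Counter(w for s in sentences for w in s)
    kept = {w for w, c in words.items() if c >= min_count}
    ngrams = Counter(" ".join(s[i:i + n])
                     for s in sentences
                     for n in range(2, 6)
                     for i in range(len(s) - n + 1)
                     if all(w in kept for w in s[i:i + n]))
    return words, ngrams
```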
3.1.4 Punctuation-based features

In addition to pattern-based features we used the following generic features: (1) sentence length in words, (2) number of "!" characters in the sentence, (3) number of "?" characters in the sentence, (4) number of quotes in the sentence, and (5) number of capitalized/all-capitals words in the sentence. All these features were normalized by dividing them by the maximal observed value, times the averaged maximal value of the other feature groups; thus the maximal weight of each of these features is equal to the averaged weight of a single pattern/word/n-gram feature.
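One way to realize these five features and the stated normalization (a sketch; the scaling below simply forces each feature's maximum to equal avg_other_weight, the averaged weight of a single pattern/word/n-gram feature, which is the outcome the text describes):

```python
def punctuation_features(sentence):
    """The five generic features of Section 3.1.4, unnormalized."""
    words = sentence.split()
    return {"length":   len(words),
            "exclaim":  sentence.count("!"),
            "question": sentence.count("?"),
            "quotes":   sentence.count('"'),
            "capitals": sum(1 for w in words if w[:1].isupper())}

def normalize(raw, max_observed, avg_other_weight):
    """Scale each feature so its maximal observed value maps to the
    averaged weight of a single pattern/word/n-gram feature."""
    return {name: value / max_observed[name] * avg_other_weight
            for name, value in raw.items()}
```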
3.2 Classification algorithm

In order to assign a sentiment label to new examples in the test set, we use a k-nearest neighbors (kNN)-like strategy. We construct a feature vector for each example in the training and the test set. We would like to assign a sentiment class to each example in the test set. For each feature vector v in the test set, we compute the Euclidean distance to each of the matching vectors in the training set, where matching vectors are defined as ones which share at least one pattern/n-gram/word feature with v.

Let ti, i = 1...k, be the k vectors with the lowest Euclidean distance to v,[4] with assigned labels Li, i = 1...k. We calculate the mean distance d(ti, v) for this set of vectors and drop from the set up to five outliers for which the distance was more than twice the mean distance. The label assigned to v is the label of the majority of the remaining vectors.

If a similar number of remaining vectors have different labels, we assign to the test vector the most frequent of these labels according to their frequency in the dataset. If there are no matching vectors found for v, we assign the default "no sentiment" label, since there are significantly more non-sentiment sentences than sentiment sentences in Twitter.

[4] We used k = 10 for all experiments.
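The following sketch illustrates this kNN-like assignment over sparse vectors represented as dicts. It is a simplification: the "at most five outliers" cap and the tie-breaking by overall label frequency are omitted:

```python
import math
from collections import Counter

def classify(v, training, k=10, default="no sentiment"):
    """Assign a label to test vector v (a dict feature -> weight).
    `training` is a list of (vector, label) pairs.  Vectors 'match'
    if they share at least one feature with v."""
    def euclidean(a, b):
        return math.sqrt(sum((a.get(f, 0.0) - b.get(f, 0.0)) ** 2
                             for f in set(a) | set(b)))
    matches = [(euclidean(v, t), lbl)
               for t, lbl in training if v.keys() & t.keys()]
    if not matches:
        return default                 # Twitter is mostly non-sentiment
    nearest = sorted(matches, key=lambda m: m[0])[:k]
    mean_d = sum(d for d, _ in nearest) / len(nearest)
    kept = [lbl for d, lbl in nearest if d <= 2 * mean_d]  # drop outliers
    return Counter(kept).most_common(1)[0][0]   # majority label
```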
4 Twitter dataset and sentiment tags

In our experiments we used an extensive Twitter data collection as training and testing sets. In our training sets we utilize sentiment hashtags and smileys as classification labels. Below we describe this dataset in detail.

4.1 Twitter dataset

We have used a Twitter dataset generously provided to us by Brendan O'Connor. This dataset includes over 475 million tweets, comprising roughly 15% of all public, non-"low quality" tweets created from May 2009 to Jan 2010. Tweets are short sentences limited to 140 UTF-8 characters. All non-English tweets and tweets which contain less than 5 proper English words[5] were removed from the dataset.

Apart from simple text, tweets may contain URL addresses, references to other Twitter users (appearing as @<user>) or content tags (also called hashtags) assigned by the tweeter (#<tag>), which we use as labels for our supervised classification framework.

Two examples of typical tweets are: "#ipad #sucks and 6,510 people agree. See more on Ipad sucks page: http://j.mp/4OiYyg?", and "Pay no mind to those who talk behind ur back, it simply means that u're 2 steps ahead. #ihatequotes". Note that in the first example the hashtagged words are a grammatical part of the sentence (it becomes meaningless without them), while #ihatequotes in the second example is a mere sentiment label and not part of the sentence. Also note that hashtags can be composed of multiple words (with no spaces).

[5] Identification of proper English words was based on an available WN-based English dictionary.
Category            # of tags   % agreement
Strong sentiment    52          87
Likely sentiment    70          66
Context-dependent   110         61
Focused             45          75
No sentiment        3564        99

Table 1: Annotation results (2 judges) for the 3852 most frequent Twitter tags. The second column displays the number of tags in each category, and the last column shows the percentage of tags annotated similarly by the two judges.
During preprocessing, we have replaced URL links, hashtags and references by URL/REF/TAG meta-words. This substitution obviously had some effect on the pattern recognition phase (see Section 3.1.2), however, our algorithm is robust enough to overcome this distortion.
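A minimal sketch of this substitution step (the regular expressions are illustrative, not the authors' exact ones):

```python
import re

def substitute_meta_words(tweet):
    """Replace URLs, @-references and #hashtags with the URL/REF/TAG
    meta-words used during feature and pattern extraction."""
    tweet = re.sub(r"https?://\S+", "URL", tweet)
    tweet = re.sub(r"@\w+", "REF", tweet)
    tweet = re.sub(r"#\w+", "TAG", tweet)
    return tweet
```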
We used the Amazon Mechanical Turk (AMT) service in order to obtain a list of the most commonly used and unambiguous ASCII smileys. We asked each of ten AMT human subjects to provide at least 6 commonly used ASCII mood-indicating smileys, together with one or more single-word descriptions of the smiley-related mood state. From the obtained list of smileys we selected a subset of 15 smileys which were (1) provided by at least three human subjects, (2) described by at least two human subjects using the same single-word description, and (3) appear at least 1000 times in our Twitter dataset. We then sampled 1000 tweets for each of these smileys, using these smileys as sentiment tags in the sentiment classification framework described in the previous section.
Setup         Smileys   Hashtags
random        0.06      0.02
Pn+W-M-Pt-    0.16      0.06
Pn+W+M-Pt-    0.25      0.15
Pn+W+M+Pt-    0.29      0.18
Pn-W-M-Pt+    0.5       0.26
FULL          0.64      0.31

Table 2: Multi-class classification results for smileys and hashtags. The table shows averaged harmonic f-score for 10-fold cross validation. 51 (16) sentiment classes were used for hashtags (smileys).

Hashtags      Avg    #hate   #jealous   #cute   #outrageous
Pn+W-M-Pt-    0.57   0.6     0.55       0.63    0.53
Pn+W+M-Pt-    0.64   0.64    0.67       0.66    0.6
Pn+W+M+Pt-    0.69   0.66    0.67       0.69    0.64
Pn-W-M-Pt+    0.73   0.75    0.7        0.69    0.69
FULL          0.8    0.83    0.76       0.71    0.78

Smileys       Avg    :)      ;)         X(      :d
Pn+W-M-Pt-    0.64   0.66    0.67       0.56    0.65
Pn+W+M-Pt-    0.7    0.73    0.72       0.64    0.69
Pn+W+M+Pt-    0.7    0.74    0.75       0.66    0.69
Pn-W-M-Pt+    0.75   0.78    0.75       0.68    0.72
FULL          0.86   0.87    0.9        0.74    0.81

Table 3: Binary classification results for selected individual hashtags and smileys (averaged harmonic f-score, 10-fold cross validation).
ing data), comparing its output to tags assigned by human judges. We applied our framework with its FULL setting, learning the sentiment tags from the training set for hashtags and smileys (separately), and executed the framework on the reduced Twitter dataset (without untagged data), allowing it to identify at least five sentences for each sentiment class.
In order to make the evaluation harsher, we removed all tweets containing at least one of the relevant classification hashtags (or smileys). For each of the resulting 250 sentences for hashtags, and 75 sentences for smileys, we generated an 'assignment task'. Each task presents a human judge with a sentence and a list of ten possible hashtags. One tag from this list was provided by our algorithm, 8 other tags were sampled from the remaining 49 (14) available sentiment tags, and the tenth tag is from the list of frequent non-sentiment tags (e.g. travel or obama). The human judge was requested to select the 0-2 most appropriate tags from the list. Allowing assignment of multiple tags conforms to the observation that even short sentences may express several different sentiment types, and to the observation that some of the selected sentiment tags might express similar sentiment types.
We used the Amazon Mechanical Turk service to present the tasks to English-speaking subjects. Each subject was given 50 tasks for Twitter hashtags or 25 questions for smileys. To ensure the quality of assignments, we added to each test five manually selected, clearly sentiment-bearing assignment tasks from the tagged Twitter sentences used in the training set. Each set was presented to four subjects. If a human subject failed to provide the intended "correct" answer to at least two of the control-set questions, we rejected him/her from the calculation. In our evaluation the algorithm is considered to be correct if one of the tags selected by a human judge was also selected by the algorithm. Table 4 shows results for human judgement classification. The agreement score for this task was κ = 0.41 (we consider agreement when at least one of two selected items are shared).

Table 4 shows that the majority of tags selected by humans matched those selected by the algorithm. Precision of smiley tags is substantially higher than that of hashtag labels, due to the lesser number of possible smileys and the lesser ambiguity of smileys in comparison to hashtags.

Setup      % Correct   % No sentiment   Control
Smileys    84%         6%               92%
Hashtags   77%         10%              90%

Table 4: Results of human evaluation. The '% No sentiment' column indicates the percentage of sentences for which judges found no appropriate tag in the list; the 'Control' column shows performance on the control set.

5.3 Exploration of feature dependencies

Our algorithm assigns a single sentiment type to each tweet. However, as discussed above, some sentiment types overlap (e.g., #awesome and #amazing). Many sentences may express several types of sentiment (e.g., #fun and #scary in "Oh My God http://goo.gl/fb/K2N5z #entertainment #fun #pictures #photography #scary #teaparty"). We would like to estimate such inter-sentiment dependencies and overlap automatically from the labeled data. We use two different methods for overlap estimation: tag co-occurrence and feature overlap.

5.3.1 Tag co-occurrence

Many tweets contain more than a single hashtag or a single smiley type. As mentioned, we exclude such tweets from the training set to reduce ambiguity. However, such tag co-appearances can be used for sentiment overlap estimation. We calculated the relative co-occurrence frequencies of some hashtags and smileys. Table 5 shows some of the observed co-appearance ratios. As expected, some of the observed tags frequently co-appear with other similar tags.

Hashtags   #happy   #sad   #crazy   #bored
#sad       0.67     -      -        -
#crazy     0.67     0.25   -        -
#bored     0.05     0.42   0.35     -
#fun       1.21     0.06   1.17     0.43

Smileys    :)       ;)     :(       X(
;)         3.35     -      -        -
:(         3.12     0.53   -        -
X(         1.74     0.47   2.18     -
:S         1.74     0.42   1.4      0.15

Table 5: Percentage of co-appearance of tags in the Twitter corpus.
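The co-appearance ratios of Table 5 can be estimated along these lines (a sketch; the exact normalization behind "relative co-occurrence frequencies" is an assumption here — co-count divided by the count of the row tag, as a percentage):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_ratios(tag_sets):
    """tag_sets: one set of sentiment tags per tweet (the multi-tag
    tweets that were excluded from training).  Returns, for each
    ordered pair (a, b), 100 * co-count(a, b) / count(a)."""
    singles = Counter(t for tags in tag_sets for t in tags)
    pairs = Counter(p for tags in tag_sets
                    for p in combinations(sorted(tags), 2))
    ratios = {}
    for (a, b), c in pairs.items():
        ratios[(a, b)] = 100.0 * c / singles[a]
        ratios[(b, a)] = 100.0 * c / singles[b]
    return ratios
```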
Interestingly, it appears that a relatively high ratio of co-appearance is between tags with opposite meanings (e.g., "#ilove eating but #ihate feeling fat lol" or "happy days of training going to end in a few days #sad #happy"). This is possibly due to frequently expressed contrasting sentiment types in the same sentence – a fascinating phenomenon reflecting the great complexity of the human emotional state (and expression).
5.3.2 Feature overlap

In our framework we have created a set of feature vectors for each of the Twitter sentiment tags. Comparison of shared features in feature vector sets allows us to estimate dependencies between different sentiment types even when direct tag co-occurrence data is very sparse. A feature is considered to be shared between two different sentiment labels if for both sentiment labels there is at least a single example in the training set which has a positive value of this feature. In order to automatically analyze such dependencies, we calculate the percentage of shared word/n-gram/pattern features between different sentiment labels. Table 6 shows the observed feature overlap values for selected sentiment tags.

Hashtags   #happy   #sad   #crazy   #bored
#sad       12.8     -      -        -
#crazy     14.2     3.5    -        -
#bored     2.4      11.1   2.1      -
#fun       19.6     2.1    15       4.4

Smileys    :)       ;)     :(       X(
;)         35.9     -      -        -
:(         31.9     10.5   -        -
X(         8.1      10.2   36       -
:S         10.5     12.6   21.6     6.1

Table 6: Percentage of shared features in feature vectors for different tags.
We observe that the trend of results obtained by comparison of shared feature vectors is similar to that obtained by means of label co-occurrence, although the numbers of shared features are higher. These results, demonstrating the pattern-based similarity of conflicting, sometimes contradicting, emotions, are interesting from a psychological and cognitive perspective.
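A sketch of the shared-feature computation (the denominator is an assumption — the text does not spell out the normalization; here the smaller of the two feature sets is used):

```python
def shared_feature_percentage(features_a, features_b):
    """features_a / features_b: the sets of features that have a
    positive value in at least one training vector of each label."""
    shared = features_a & features_b
    return 100.0 * len(shared) / min(len(features_a), len(features_b))
```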
6 Conclusion

We presented a framework which allows automatic identification and classification of various sentiment types in short text fragments, based on Twitter data. Our framework is a supervised classification framework which utilizes Twitter hashtags and smileys as training labels. The substantial coverage and size of the processed Twitter data allowed us to identify dozens of sentiment types without any labor-intensive manually labeled training sets or pre-provided sentiment-specific features or sentiment words.

We evaluated diverse feature types for sentiment extraction, including punctuation, patterns, words and n-grams, confirming that each feature type contributes to the sentiment classification framework. We also proposed two different methods which allow automatic identification of sentiment type overlap and inter-dependencies. In the future these methods can be used for automated clustering of sentiment types and sentiment dependency rules. While hashtag labels are specific to Twitter data, the obtained feature vectors are not heavily Twitter-specific, and in the future we would like to explore the applicability of Twitter data for sentiment multi-class identification and classification in other domains.

References

Akkaya, Cem, Janyce Wiebe, and Rada Mihalcea. 2009. Subjectivity word sense disambiguation. In EMNLP.

Andreevskaia, A. and S. Bergler. 2006. Mining WordNet for fuzzy sentiment: Sentiment tag extraction from WordNet glosses. In EACL.

Balog, Krisztian, Gilad Mishne, and Maarten de Rijke. 2006. Why are they excited? Identifying and explaining spikes in blog mood levels. In EACL.

Bloom, Kenneth, Navendu Garg, and Shlomo Argamon. 2007. Extracting appraisal expressions. In HLT/NAACL.

Davidov, D. and A. Rappoport. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In COLING-ACL.
Davidov, D. and A. Rappoport. 2008. Unsupervised discovery of generic relationships using pattern clusters and its evaluation by automatically generated SAT analogy questions. In ACL.

Davidov, D., O. Tsur, and A. Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In CoNLL.

Esuli, Andrea and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In LREC.

Jansen, B.J., M. Zhang, K. Sobel, and A. Chowdury. 2009. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology.

Kim, S.M. and E. Hovy. 2004. Determining the sentiment of opinions. In COLING.

McDonald, Ryan, Kerry Hannan, Tyler Neylon, Mike Wells, and Jeff Reynar. 2007. Structured models for fine-to-coarse sentiment analysis. In ACL.

Whitelaw, Casey, Navendu Garg, and Shlomo Argamon. 2005. Using appraisal groups for sentiment analysis. In CIKM.

Wiebe, Janyce and Rada Mihalcea. 2006. Word sense and subjectivity. In COLING/ACL, Sydney, AUS.

Wiebe, Janyce M. 2000. Learning subjective adjectives from corpora. In AAAI.

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In HLT/EMNLP.

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2009. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics, 35(3):399–433.

Yu, Hong and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In EMNLP.