
Digital Communications and Networks 2 (2016) 108–121


Sarcastic sentiment detection in tweets streamed in real time: a big data approach

S.K. Bharti, B. Vachha, R.K. Pradhan, K.S. Babu, S.K. Jena
Department of Computer Science & Engineering, National Institute of Technology, Rourkela 769008, India

Article history: Received 20 February 2016; Received in revised form 16 May 2016; Accepted 15 June 2016; Available online 12 July 2016

Keywords: Big data; Flume; Hadoop; Hive; MapReduce; Sarcasm; Sentiment; Tweets

Abstract: Sarcasm is a type of sentiment where people express their negative feelings using positive or intensified positive words in the text. While speaking, people often use heavy tonal stress and certain gestural clues like rolling of the eyes, hand movement, etc. to reveal sarcasm. In textual data, these tonal and gestural clues are missing, making sarcasm detection very difficult for an average human. Due to these challenges, researchers show interest in sarcasm detection of social media text, especially in tweets. The rapid growth of tweets in volume and their analysis pose major challenges. In this paper, we propose a Hadoop based framework that captures real time tweets and processes them with a set of algorithms which identify sarcastic sentiment effectively. We observe that the elapsed time for analyzing and processing under the Hadoop based framework significantly outperforms the conventional methods and is more suited for real time streaming tweets.

© 2016 Chongqing University of Posts and Telecommunications. Production and Hosting by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

With the advent of smart mobile devices and the high-speed Internet, users are able to engage with social media services like Facebook, Twitter, Instagram, etc. The volume of social data being generated is growing rapidly. Statistics from Global WebIndex show a 17% yearly increase in mobile users, with the total number of unique mobile users reaching 3.7 billion people [1]. Social networking websites have become a well-established platform for users to express their feelings and opinions on various topics, such as events, individuals or products. Social media channels have become a popular platform to discuss ideas and to interact with people worldwide. For instance, Facebook claims to have 1.59 billion monthly active users, each one being a friend with 130 people on average [2]. Similarly, Twitter claims to have more than 500 million users, out of which more than 332 million are active [1]. Users post more than 340 million tweets and 1.6 billion search queries every day [1].

With such large volumes of data being generated, a number of challenges are posed. Some of them are accessing, storing and processing data, verifying data sources, dealing with misinformation and fusing various types of data [3]. Moreover, almost 80% of the generated data is unstructured [4]. As technology developed, people were given more and more ways to interact, from simple text messaging and message boards to more engaging and engrossing channels such as images and videos. These days, social media channels are usually the first to get feedback about current events and trends from their user base, allowing them to provide companies with invaluable data that can be used to position their products in the market as well as to gather rapid feedback from customers.

When an event commences or a product is launched, people start tweeting, writing reviews, posting comments, etc. on social media. People turn to social media platforms to read reviews from other users about a product before they decide whether to purchase it or not. Organizations also depend on these sites to know the response of users to their products and subsequently use the feedback to improve their products. However, finding and verifying the legitimacy of opinions or reviews is a formidable task. It is difficult to manually read through all the reviews and determine which of the opinions expressed are sarcastic. In addition, the common reader will have difficulty in recognizing sarcasm in tweets or product reviews, which may end up misleading them.

A tweet or a review may not state the exact orientation of the user directly, i.e., it may be sarcastically expressed. Sarcasm is a kind of sentiment which acts as an interfering factor in any text that can flip the polarity [5]. For example, 'I love being ignored #sarcasm'. Here, "love" expresses a positive sentiment in a negative context. Therefore, the tweet is classified as sarcastic. Unlike a simple negation, sarcastic tweets contain positive words or even intensified positive words to convey a negative opinion or vice versa. This creates a need for the large volumes of reviews, tweets or feedback messages to be analyzed rapidly to predict their exact orientation. Moreover, each tweet may have to pass through a set of algorithms to be accurately classified.

In this paper, we propose a Hadoop-based framework [6] that allows the user to acquire and store tweets in a distributed environment [7] and process them for detecting sarcastic content in real time using the MapReduce [8] programming model. The mapper class works as a partitioner: it divides the large volume of tweets into small chunks and distributes them among the nodes in the Hadoop cluster. The reducer class works as a combiner and is responsible for collecting processed tweets from each node in the cluster and assembling them to produce the final output. Apache Flume [9,10] is used for capturing tweets in real time as it is highly reliable, distributed and configurable. Flume uses an elegant design to make data loading easy and efficient from several sources into the Hadoop Distributed File System (HDFS) [11]. For processing these tweets stored in the HDFS, we use Apache Hive [12]. It provides us with an SQL-like language called HiveQL to convert queries into mapper and reducer classes [12]. Further, we use natural language processing (NLP) techniques like POS tagging [13], parsing [14], text mining [15,16] and sentiment analysis [17] to identify sarcasm in these processed tweets.

This paper compares and contrasts the time requirements of our approach when run on a standard non-Hadoop implementation as well as on a Hadoop deployment to find the improvement in performance when we use Hadoop. For real time applications where millions of tweets need to be processed as fast as possible, we observe that the time taken by the single node approach grows much faster than that of the Hadoop implementation. This suggests that for higher volumes of data it is more advantageous to use the proposed deployment for sarcasm analysis.

The contributions of this paper are as follows:

1. Capturing and processing real time tweets using Apache Flume and Hive under the Hadoop framework.
2. We propose a set of algorithms to detect sarcasm in tweets under the Hadoop framework.
3. We propose another set of algorithms to detect sarcasm in tweets, implemented without the Hadoop framework.

The rest of this paper is organized as follows. Section 2 presents related work on capturing and processing data acquired through the Twitter streaming API, followed by sarcasm analysis of the captured data. Section 3 explains the preliminaries of this research. The proposed scheme is described in Section 4. Section 5 presents the performance analysis of the proposed schemes. Finally, the conclusion and recommendations for future work are drawn in Section 6.

2. Related work

In this section, the literature survey is presented in two parts. First, capturing and preprocessing of real time tweets is surveyed, and then the literature on sarcasm detection follows.

2.1. Capturing and preprocessing of tweets in large volume

Rapid adoption and growth of social networking platforms enable users to generate data at an alarming rate. Storing and processing such large data sets becomes a complex problem. Twitter is one such social networking platform that generates data continuously. In the existing literature, most of the researchers used Tweepy (an easy-to-use Python library for accessing the Twitter API) and Twitter4J (a Java library for accessing the Twitter API) for aggregation of tweets from Twitter [5,18–22]. The Twitter Application Programming Interface (API) [23] provides a streaming API [24] to allow developers to obtain real time access to tweets. Bifet and Frank [25] discuss the challenges of capturing Twitter data streams. Tufekci [26] examined the methodological and conceptual challenges for social media based big data operations, with special attention to the validity and representativeness of big data analysis of social media. Due to some restrictions placed by Twitter on the use of their retrieval APIs, one can only download a limited amount of tweets in a specified time frame using these APIs and libraries. Getting a larger amount of tweets in real time is a challenging task. There is a need for efficient techniques to acquire a large amount of tweets from Twitter. Researchers are evaluating the feasibility of using the Hadoop ecosystem [6] for the storage and processing [22,27–29] of large amounts of tweets from Twitter. Shirahatti et al. [27] used Apache Flume [10] with the Hadoop ecosystem to collect tweets from Twitter. Ha et al. [22] used Topsy with the Hadoop ecosystem for gathering tweets from Twitter. Furthermore, they analyzed the sentiment and emotion information of the collected tweets in their research. Taylor et al. [28] used the Hadoop framework in applications in the bioinformatics domain.

2.2. Sarcasm sentiment analysis

Sarcasm sentiment analysis is a rapidly growing area of NLP, with research ranging from word, phrase and sentence level classification [5,18,19,30] to document [31] and concept level classification [21]. Research is progressing in finding ways for efficient analysis of sentiments with better accuracy in written text as well as analyzing irony, humor and sarcasm within social media data. Sarcastic sentiment detection is classified into three categories based on the text features used for classification, namely lexical, pragmatic and hyperbolic, as shown in Fig. 1.

Fig. 1. Classification of sarcasm detection based on text features used.

2.2.1. Lexical feature based classification

Text properties such as unigrams, bigrams, n-grams, etc. are classified as lexical features of a text. Several authors used these features to identify sarcasm. Kreuz et al. [32] introduced this concept for the first time and observed that lexical features play a vital role in detecting irony and sarcasm in text. Kreuz et al. [33], in their subsequent work, used these lexical features along with syntactic features to detect sarcastic tweets. Davidov et al. [30] used pattern-based (high-frequency words and content words) and punctuation-based methods to build a weighted k-nearest neighbor (kNN) classification model to perform sarcasm detection.

Tsur et al. [34] observed that bigram based features produce better results in detecting sarcasm in tweets and Amazon product reviews. González-Ibánez et al. [18] explored numerous lexical features (derived from LIWC [35] and WordNet Affect [36]) to identify sarcasm. Riloff et al. [5] used a well-constructed lexicon based approach to detect sarcasm, and for lexicon generation they used unigram, bigram and trigram features. Bharti et al. [19] considered bigrams and trigrams to generate bags of lexicons for sentiment and situation in tweets. Barbieri et al. [37] considered seven lexical features to detect sarcasm through its inner structure, such as unexpectedness, the intensity of the terms or imbalance between registers.

2.2.2. Pragmatic feature based classification

The use of symbolic and figurative text in tweets is frequent due to the limitation on the message length of a tweet. These symbolic and figurative texts are called pragmatic features (such as smilies, emoticons, replies, @user, etc.). They are among the most powerful features for identifying sarcasm in tweets, and several authors have used them in their work to detect sarcasm. Pragmatic features are one of the key features used by Kreuz et al. [33] to detect sarcasm in text. Carvalho et al. [38] used pragmatic features like emoticons and special punctuation to detect irony in newspaper text data. González-Ibánez et al. [18] further explored this feature with some more parameters like smilies and replies and developed a sarcasm detection system using the pragmatic features of Twitter data. Tayal et al. [39] also used pragmatic features in political tweets to predict which party would win the election. Similarly, Rajadesingan et al. [40] used psychological and behavioral features of users' present and past tweets to detect sarcasm.

2.2.3. Hyperbole feature based classification

Hyperbole is another key feature often used in sarcasm detection from textual data. A hyperbolic text contains one of the text properties such as an intensifier, interjection, quotes, punctuation, etc. Previous authors used these hyperbole features and achieved good accuracy in their research on detecting sarcasm in tweets. Utsumi [41] discussed extreme adjectives and adverbs and how the presence of these two intensifies the text. Most often, it provides an implicit way to display negative attitudes, i.e., sarcasm. Kreuz et al. [33] discussed other hyperbolic terms such as interjections and punctuation. They have shown how hyperbole is useful in sarcasm detection. Filatova [31] used the hyperbole features in document level text. According to that work, the phrase or sentence level is not sufficient for good accuracy, and the text context in the document is considered to improve the accuracy. Liebrecht et al. [42] explained hyperbole features with examples of utterances: 'Fantastic weather' when it rains is identified as sarcastic with more ease than the utterance without a hyperbole ('the weather is good' when it rains). Lunando et al. [20] declared that a tweet containing interjection words such as wow, aha, yay, etc. has a higher chance of being sarcastic. They developed a system for sarcasm detection for Indonesian social media. Tungthamthiti et al. [21] explored concept level knowledge using the hyperbolic words in sentences and gave an indirect contradiction between sentiment and situation, such as raining and bad weather, which are conceptually the same. Therefore, if 'raining' is present in any sentence, then one can assume 'bad weather'. Bharti et al. [19] considered interjection as a hyperbole feature to detect sarcasm in tweets that start with an interjection.

Based on the classification, a consolidated summary of previous studies related to sarcasm identification is shown in Table 1. It provides the types of approaches used by previous authors (denoted as A1 and A2), various types of sarcasm occurring in tweets (denoted as T1, T2, T3, T4, T5, T6, and T7), text features (denoted as F1, F2, and F3) and datasets from different domains (denoted as D1, D2, D3, D4, and D5), mostly from Twitter data. The details are shown in Table 2.

From Table 1, it is observed that only Bharti et al. [19] have worked on sarcasm types T2 and T3. Lunando et al. [20] discussed that tweets with interjections are classified as sarcastic. Further, Rajadesingan et al. [40] are the only authors who worked on sarcasm type T4.

Table 1
Previous studies in sarcasm detection in text.

Study Approaches Types of sarcasm Type of feature Domains

A1 A2 T1 T2 T3 T4 T5 T6 T7 F1 F2 F3 D1 D2 D3 D4 D5

A11 A12 F31 F32 F33 F34

Kreuz et al.(1995) ✓ ✓ ✓ ✓ ✓ ✓
Utsumi et al. (2000) ✓ ✓ ✓ ✓ ✓
Verma et al. (2004) ✓ ✓ ✓ ✓ ✓
Bhattacharyya et al. (2004) ✓ ✓ ✓ ✓ ✓
Kreuz et al. (2007) ✓ ✓ ✓ ✓ ✓ ✓
Chaumartin et al. (2007) ✓ ✓ ✓ ✓
Carvalho et al. (2009) ✓ ✓ ✓ ✓
Tsur et al. (2010) ✓ ✓ ✓ ✓
Davidov et al. (2010) ✓ ✓ ✓ ✓ ✓ ✓
González-Ibánez (2011) ✓ ✓ ✓ ✓ ✓
Filatova et al. (2012) ✓ ✓ ✓ ✓ ✓ ✓
Riloff et al. (2013) ✓ ✓ ✓ ✓ ✓
Lunando et al. (2013) ✓ ✓ ✓ ✓ ✓
Liebrecht et al. (2013) ✓ ✓ ✓ ✓ ✓
Lukin et al. (2013) ✓ ✓ ✓ ✓ ✓
Tungthamthiti et al. (2014) ✓ ✓ ✓ ✓ ✓
Peng et al. (2014) ✓ ✓ ✓ ✓ ✓
Raquel et al. (2014) ✓ ✓ ✓ ✓ ✓
Kunneman et al. (2014) ✓ ✓ ✓ ✓ ✓ ✓ ✓
Barbieri et al. (2014) ✓ ✓ ✓ ✓
Tayal et al. (2014) ✓ ✓ ✓ ✓ ✓
Pielage et al. (2014) ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Rajadesingan et al. (2015) ✓ ✓ ✓ ✓ ✓ ✓
Bharti et al. (2015) ✓ ✓ ✓ ✓ ✓ ✓ ✓

Table 2
Types, features and domains of sarcasm detection.

Types of Approaches used in sarcasm detection


A1 Machine learning based
A11 Supervised
A12 Semi-supervised
A2 Corpus based

Types of sarcasm occur in text


T1 Contrast between positive sentiment and negative situation
T2 Contrast between negative sentiment and positive situation
T3 Tweet starts with an interjection word
T4 Likes and Dislikes contradiction – behavior based
T5 Tweet contradicting universal facts
T6 Tweet carries positive sentiment with antonym pair
T7 Tweet contradicting time dependent facts

Types of features
F1 Lexical – unigram, bigram, trigram, n-gram, #hashtag
F2 Pragmatic – smilies, emoticons, replies
F3 Hyperbole – Interjection, Intensifier, Punctuation Mark, Quotes
F31 Interjection – yay, oh, wow, yeah, nah, aha, etc.
F32 Intensifier – adverb, adjectives
F33 Punctuation Mark – !!!!!, ????
F34 Quotes – “ ” , ‘ ’

Types of domains
D1 Tweets of Twitter
D2 Online product reviews
D3 Website comments
D4 Google Books
D5 Online discussion forums

Most of the researchers identified sarcasm in tweets of type T1. None of the authors has worked on sarcasm types T5, T6 and T7 until now. In this work, we consider these research gaps as challenges and propose a set of algorithms to tackle them.

3. Preliminaries

This section describes the overall framework for capturing and analyzing tweets streamed in real time. In addition, the architecture of the Hadoop HDFS, followed by POS tagging, parsing and sentiment analysis of a given phrase or sentence, is elaborated.

3.1. Framework for sarcasm analysis in real time tweets

The proposed system uses the Hadoop framework to process and store the tweets streamed in real time. These tweets are retrieved from Twitter using the Twitter streaming API (Twitter4j) as shown in Fig. 2. The Flume module is responsible for communicating with the Twitter streaming API and retrieving tweets matching certain criteria, trends or keywords. The tweets retrieved from Flume are in JavaScript Object Notation (JSON) format, which is passed on to the HDFS. Oozie is a module in Hadoop that provides the output from one stage as the input to the next. Oozie is used to partition the incoming tweets into blocks of tweets, partitioned on an hourly basis. These partitions are passed onto the Hive module, which then parses the incoming JSON tweets into a format suitable for consumption by the sarcasm detection engine (SDE). These parsed tweets are stored again in the HDFS and later retrieved by the SDE for further processing and attainment of the final sentiment summarization.

Fig. 2. System model for capturing and analyzing sarcasm sentiment in tweets.

3.2. Parallel HDFS

To increase the throughput of the system and handle the massive volume of tweets, the parallel architecture of HDFS shown in Fig. 3 is used. The overall file system consists of a metadata file, a master node and multiple slave nodes that are managed by the master node.

Fig. 3. Parallel HDFS architecture.

A metadata file contains two subfiles, namely the fsimage and edits files. The fsimage contains the complete state of the file system at a given instance of time, and the edits file contains the log of changes to the file system after the most recent fsimage was made.

The master node contains three entities, namely the name node, secondary name node and data node. All three entities in the name node can communicate with each other. The name node is responsible for the overall functioning of the file system. A secondary name node is responsible for updating and maintaining the name node as well as managing the updates to the metadata. The Job tracker is a service in Hadoop that interfaces between the name node and the task trackers and matches the jobs with the closest available task tracker.

The slave node contains two entities, namely the data node and the task tracker. Both entities can communicate with each other within the slave node. The data node is responsible for handling the data blocks and providing the services for storage and retrieval of the data as requested by the name node. The task tracker is responsible for processing the input according to user requirements and returning the output.

In the parallel HDFS architecture, the name node communicates with the various data nodes in the slave nodes while, simultaneously, the job tracker in the name node coordinates with the task trackers on the slaves in parallel, resulting in a high rate of output which is fed into the SDE.

3.3. Sarcasm detection engine

To identify the sentiment of a given tweet, it passes through the MapReduce functions for sentiment classification. The tweet is classified as either negative, positive or neutral by the detection engine. Fig. 4 depicts an automated SDE which takes tweets as an input and produces the actual sentiment of the tweet as an output. Once the tweet is classified as either positive or negative, further checks are required to confirm whether it has an actual positive/negative sentiment or a sarcastic sentiment.

Fig. 4. Sarcasm detection engine.

3.4. Parts-of-speech tagging

Parts-of-speech (POS) tagging divides sentences or paragraphs into words and assigns the corresponding parts-of-speech information to each word based on its relationship with adjacent and related words in a phrase, sentence, or paragraph. In this paper, a Hidden Markov Model (HMM) based POS tagger [13] is used to identify the correct POS tag information of given words. For example, the POS tag information for the sentence "Love has no finite coverage" is love-NN, has-VBZ, no-DT, finite-JJ, and coverage-NN, where NN, JJ, VBZ and DT denote the notations for noun, adjective, verb and determiner, respectively. The Penn Treebank tag set [43] notations are used to assign a tag to each particular word. It is a Brown corpus style of tagging having 44 tags.

3.5. Parsing

Parsing is a process of analyzing grammatical structure, identifying the parts of speech and the syntactic relations of words in sentences. When a sentence is passed through a parser, the parser divides the sentence into words and identifies the POS tag information. With the help of the POS information and syntactic relations, it forms units like subject, verb, and object, then determines the relations between these units and generates a parse tree. In this paper, a Python based package called TextBlob has been used for parsing. An example of parsing for the text "I love waiting forever for my doctor" is I/PRP/B-NP/O, love/NN/I-NP/O, waiting/VBG/B-VP/O, forever/RB/B-ADVP/O, for/IN/B-PP/B-PNP, my/PRP$/B-NP/I-PNP, doctor/NN/I-NP/I-PNP. With the help of the parse data, two examples of parse trees are shown in Figs. 5 and 6.

Fig. 5. Parse tree for a tweet: I love waiting forever for my doctor.

Fig. 6. Parse tree for a tweet: I hate Australia in cricket because they always win.
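As an illustration of this step, the short Python sketch below runs TextBlob over the example tweet. It is not the authors' exact pipeline code, and the tags it prints may differ slightly from the example above depending on the installed corpora.

```python
# Minimal parsing sketch with TextBlob (assumes the textblob package and its
# corpora are installed; tags are illustrative, not guaranteed to match above).
from textblob import TextBlob

tweet = "I love waiting forever for my doctor"
blob = TextBlob(tweet)

print(blob.tags)          # [(word, POS tag), ...]
print(blob.parse())       # word/POS/chunk/PNP string, e.g. I/PRP/B-NP/O ...
print(blob.noun_phrases)  # candidate noun phrases usable as lexicon entries
```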
3.6. Sentiment analysis

Sentiment analysis is a mechanism to recognize one's opinion, polarity, attitude and orientation towards any target like movies, individuals, events, sports, products, organizations, locations, services, etc. To identify the sentiment in a given phrase, we use pre-defined lists of positive and negative words such as SentiWordNet [44], which is a standard list of positive and negative English words. Using the SentiWordNet lists along with Eqs. (1)–(3), we find the sentiment score for a given phrase or sentence:

PR = PWP / TWP    (1)

NR = NWP / TWP    (2)

Sentiment Score = PR − NR    (3)

where PR is the positive ratio, NR the negative ratio, PWP the number of positive words in the given phrase, NWP the number of negative words in the given phrase, and TWP the total number of words in the given phrase.
Parts-of-speech (POS) tagging divides sentences or paragraphs
into words and assigning corresponding parts-of-speech in-
formation to each word based on their relationship with adjacent 4. Proposed scheme
and related words in a phrase, sentence, or paragraph. In this
paper, a Hidden Markov Model (HMM) based POS tagger [13] is There is an increasing need for automatic techniques to capture
used to identify the correct POS tag information of given words. and process real time tweets and analyze their sarcastic sentiment.
For example: POS tag information for the sentence “Love has no It provides useful information for market analysis and risk man-
finite coverage” is love-NN, has-VBZ, no-DT, finite-JJ, and coverage- agement applications. Therefore, we propose the following
NN. Where NN, JJ, VBZ and DT denote the notations for noun,
adjective, verb and determiner, respectively. The Penn Treebank
tag [43] set notations are used to assign a tag to the particular
word. It is a brown corpus style of tagging having 44 tags.

3.5. Parsing

Parsing is a process of analyzing grammatical structure, iden-


tifying its parts of speech and syntactic relations of words in
sentences. When a sentence is passed through a parser, the parser
divides the sentence into words and identifies the POS tag Fig. 6. Parse tree for a tweet: I hate Australia in cricket because they always win.

- Capturing and processing real time tweets using Flume and Hive.
- An HMM-based algorithm for POS tagging.
- MapReduce functions for three approaches to detect sarcasm in tweets:
  1. Parsing_based_lexicon_generation_algorithm.
  2. Interjection_word_start.
  3. Positive_sentiment_with_antonym_pair.
- Other approaches to detect sarcasm in tweets:
  1. Tweet_contradicting_universal_facts.
  2. Tweet_contradicting_time_dependent_facts.
  3. Likes_dislikes_contradiction.
4.1. Capturing and processing real time streaming tweets using Flume and Hive

The Twitter Streaming API returns a constant stream of tweets in JSON format which is then stored in the HDFS, as shown in Fig. 2. To avoid issues related to security and writing code that requires complicated integration with secure clusters, we prefer to use the existing components within Cloudera Hadoop [29]. This allows us to directly store the data retrieved by the API into the HDFS. We use Apache Flume to store the data in the HDFS. Flume is a data ingestion system that is defined by setting up channels in which data flows between sources and sinks. Each piece of data is an event, and each such event goes through a channel. The Twitter API does the work of the source here, and the sink is a system that writes out the data to the HDFS. Along with the data capture, the Flume module allows us to set up custom filters and keyword-based searches that allow us to further narrow down the tweets to just the ones relevant to our requirements.

Once the data from the Twitter API is fed into the HDFS, the data must be pre-processed to convert the tweets stored in JSON format into usable text for the SDE. We make use of the Oozie module for handling the workflow, which is scheduled to run at periodic intervals. We configure Oozie to partition the data in the HDFS on the basis of hourly retrievals and load the last hour's data into Hive, as shown in Fig. 2. Hive is another module in Hadoop that allows one to translate and load data with the help of the Serializer–Deserializer. This allows us to convert the JSON tweets into a query-able format, and we then add these entries back into the HDFS for processing by the SDE.
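For illustration only, the sketch below shows the kind of JSON flattening that Hive's Serializer–Deserializer performs in this pipeline. It is not the Hive code itself; the field names simply follow the public Twitter JSON layout, and the sample record is invented.

```python
# Illustrative flattening of one raw JSON tweet into the fields consumed by the
# SDE; in the actual pipeline this step is handled by the Hive JSON SerDe.
import json

def flatten(raw_json_line):
    tweet = json.loads(raw_json_line)
    return {
        "id": tweet.get("id_str"),
        "user": tweet.get("user", {}).get("screen_name"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text", "").replace("\n", " "),
    }

sample = ('{"id_str": "1", "created_at": "Mon Feb 01 10:00:00 +0000 2016", '
          '"user": {"screen_name": "someone"}, '
          '"text": "I love being ignored #sarcasm"}')
print(flatten(sample))
```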

4.2. HMM-based POS tagging

In this paper, an HMM-based POS tagger is deployed to evaluate accurate POS tag information for the Twitter dataset, as shown in Algorithms 1 and 2. Algorithm 1 trains the system using 500,000 pre-tagged (according to the Penn Treebank style) American English words from the American National Corpus (ANC) [45,46]. Algorithm 2 evaluates the POS tag information of the words in the given dataset.

Algorithm 1. POS_training.

Algorithm 2. POS_testing.

According to Algorithm 1, the HMM uses pre-tagged American English words [45,46] as an input and creates three dictionary objects, namely WT, TT and T. WT stores the number of occurrences of each word with its associated tag in the training corpus. Similarly, TT stores the number of occurrences of the bigram tags in the corpus, and T stores the number of occurrences of the unigram tags. For each word in a sentence, it checks whether the word is the starting word of the sentence or not. If a word is the starting word, then it assumes the previous tag to be '$'. Otherwise, the previous tag is the tag of the previous word in the respective sentence. It increments the occurrence counts of the various tags through the dictionary objects WT, TT and T. Finally, it creates a probability table using the dictionary objects WT, TT and T.

Algorithm 2 finds all the possible tags of a given word (for tag evaluation) using the pre-tagged corpus [45,46] and applies Eq. (4) [47] if the word is the starting word of the respective sentence; otherwise, it applies Eq. (5) [47]. Next, it selects the tag whose probability value is maximum. For example, after encountering a POS tag determiner (DT), such as 'the', the probability that the next word is a noun might be 40% and that it is a verb 20%. Once the model finishes its training, it is used to determine whether 'can' in 'the can' is a noun (as it should be) or a verb:

t* = argmax over t ∈ APT of [TT($, t) / T($)] × [WT(word, t) / T(t)]    (4)

t* = argmax over t ∈ APT of [TT(P, t) / T(P)] × [WT(word, t) / T(t)]    (5)

where APT is the set of all possible tags and P denotes the tag of the previous word.
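The tag-selection step of Eqs. (4) and (5) can be sketched as follows; WT, TT and T stand for the dictionaries built by Algorithm 1, and the tiny counts are invented purely to reproduce the 'the can' example.

```python
# Sketch of the tag-selection step in Eqs. (4) and (5); the toy counts below
# stand in for the word-tag (WT), tag-bigram (TT) and tag (T) dictionaries.
WT = {("can", "NN"): 30, ("can", "MD"): 70, ("the", "DT"): 500}
TT = {("$", "DT"): 400, ("DT", "NN"): 250, ("DT", "MD"): 20}
T  = {"$": 1000, "DT": 500, "NN": 600, "MD": 300}

def best_tag(word, prev_tag, all_possible_tags):
    def score(t):
        trans = TT.get((prev_tag, t), 0) / T.get(prev_tag, 1)   # TT(P, t) / T(P)
        emit  = WT.get((word, t), 0) / T.get(t, 1)              # WT(word, t) / T(t)
        return trans * emit
    return max(all_possible_tags, key=score)

print(best_tag("can", "DT", ["NN", "MD"]))   # 'NN': after a determiner, noun wins
```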

4.3. MapReduce functions for sarcasm analysis

Here, the Map function comprises three approaches to detect sarcasm. Each of the approaches is detailed below.
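The paper does not list the mapper and reducer code; the Hadoop Streaming style skeleton below is therefore only an assumption about how one such detection function could be wired as a map/reduce pair, with is_sarcastic() standing in for any of the rule-based checks described in the following subsections.

```python
#!/usr/bin/env python3
# Hadoop Streaming style skeleton (an assumption, not the authors' code).
# The same script can serve as the mapper and the reducer of a streaming job.
import sys
from itertools import groupby

def is_sarcastic(tweet_text):
    # placeholder rule; any of the checks below could be plugged in here
    return "#sarcasm" in tweet_text.lower()

def run_map():
    for line in sys.stdin:                     # one tweet per input line
        label = "sarcastic" if is_sarcastic(line) else "non_sarcastic"
        sys.stdout.write(label + "\t1\n")

def run_reduce():                              # mapper output arrives sorted by key
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for label, group in groupby(pairs, key=lambda kv: kv[0]):
        sys.stdout.write("%s\t%d\n" % (label, sum(int(v) for _, v in group)))

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    run_reduce() if role == "reduce" else run_map()
```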

4.3.1. Parsing based lexicon generation algorithm

The MapReduce function for the parsing based lexicon generation algorithm (PBLGA) is based on our previous study [19]. It takes tweets as an input from the HDFS and parses them into phrases such as noun phrases (NP), verb phrases (VP), adjective phrases (ADJP), etc. These phrases are stored in the phrase file for further processing. The phrase file is subsequently passed to the rule-based classifier to classify sentiment phrases and situation phrases, as shown in the mapper part of Fig. 7, which stores them in the sentiment phrase file and the situation phrase file. Then, the output of the mapper class (the sentiment phrase file and the situation phrase file) passes to the reducer class as an input. The reducer class calculates the sentiment score (as explained in Section 3.6) of each phrase in both the sentiment and the situation phrase file. Then, it outputs an aggregated positive or negative score for each phrase in terms of the sentiment and situation of the tweet. Based on whether the score is positive or negative, the phrases are stored in the corresponding phrase file, as shown in the reducer class of Fig. 7. PBLGA generates four files, namely the positive sentiment, negative sentiment, positive situation and negative situation files, as an output. Furthermore, we use these four files to detect sarcasm in tweets whose structure shows a contradiction between positive sentiment and negative situation, and vice versa, as shown in Algorithm 3.

Fig. 7. Procedure to obtain sentiment and situation phrases from tweets.

Algorithm 3. PBLGA_testing.

According to Algorithm 3, it takes the testing tweets and the four bags of lexicons generated using PBLGA. If the testing tweet matches any positive sentiment phrase from the positive sentiment file, it subsequently checks for any match with a negative situation against the negative situation file. If both checks match, the testing tweet is sarcastic; similarly, it checks for sarcasm with a negative sentiment in a positive situation. Otherwise, the given tweet is not sarcastic. Both algorithms are executed under the Hadoop framework as well as without the Hadoop framework to compare the running time.
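A compact sketch of this check is given below; the four phrase sets are toy stand-ins for the lexicon files generated by PBLGA.

```python
# Minimal sketch of the Algorithm 3 check; the phrase sets below contain
# made-up entries standing in for the PBLGA lexicon files.
pos_sentiment = {"i love", "so happy"}
neg_sentiment = {"i hate", "so annoyed"}
pos_situation = {"won the match"}
neg_situation = {"being ignored", "waiting forever"}

def contains_any(text, phrases):
    return any(p in text for p in phrases)

def pblga_is_sarcastic(tweet):
    text = tweet.lower()
    return (contains_any(text, pos_sentiment) and contains_any(text, neg_situation)) or \
           (contains_any(text, neg_sentiment) and contains_any(text, pos_situation))

print(pblga_is_sarcastic("I love being ignored"))   # True
```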

4.3.2. Interjection word start

The MapReduce function for interjection word start (IWS) is also based on [19], as shown in Fig. 8. This approach is applicable to tweets that start with an interjection word such as aha, wow, nah, uh, etc. In this approach, the tweet that is sent to the mapper is first parsed into its constituent tags using Algorithms 1 and 2. Then, the tags are separated into the first tag, second tag and remaining tags of each tweet. The output of this stage gives us three lists: the list of first tags, which stores the first tag of each tweet, the list of second tags, which stores the second tag of each tweet, and the list of remaining tags, which stores the remaining tags of each tweet. The lists are then passed to a rule based pattern, as given in the mapper class of Fig. 8, which checks that if the first tag is an interjection, i.e., UH (the interjection tag notation), and the second tag is either an adjective or an adverb, the tweet is classified as sarcastic. Otherwise, it checks that if the first tag is an interjection and the remaining tags contain either adverbs followed by adjectives, adjectives followed by nouns, or adverbs followed by verbs, the tweet is sarcastic; else it is not sarcastic. If the pattern does not find any match in a given tweet, the tweet is not sarcastic. The algorithm IWS is also executed under the Hadoop framework as well as without the Hadoop framework to compare the running time.

Fig. 8. Procedure to detect sarcasm in tweets that start with an interjection word.

Algorithm 4. Tweet_contradict_universal_facts.
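The IWS rule can be sketched as follows, assuming the tweet has already been converted into a list of Penn Treebank tags by Algorithms 1 and 2; the example tag sequence is illustrative.

```python
# Sketch of the IWS rule over an already tagged tweet (list of Penn Treebank tags).
ADJ  = {"JJ", "JJR", "JJS"}
ADV  = {"RB", "RBR", "RBS"}
NOUN = {"NN", "NNS", "NNP", "NNPS"}
VERB = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def iws_is_sarcastic(tags):
    if not tags or tags[0] != "UH":                 # must start with an interjection
        return False
    if len(tags) > 1 and tags[1] in ADJ | ADV:      # UH + adjective/adverb
        return True
    rest = tags[1:]
    for a, b in zip(rest, rest[1:]):                # UH ... ADV+ADJ / ADJ+NOUN / ADV+VERB
        if (a in ADV and b in ADJ) or (a in ADJ and b in NOUN) or (a in ADV and b in VERB):
            return True
    return False

print(iws_is_sarcastic(["UH", "RB", "JJ", "NN"]))   # e.g. "Wow, really great plan" -> True
```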

4.3.3. Positive sentiment with antonym pair

The MapReduce function for positive sentiment with antonym pair (PSWAP) is a novel approach, shown in Fig. 9, to determine whether a tweet is sarcastic or not. The tweet that is sent to the mapper is first parsed into its constituent tags using Algorithms 1 and 2. The output of this stage gives us a bag of tags, which is then passed to a rule based classifier, as given in the mapper class of Fig. 9, which looks for antonym pairs of certain tags such as nouns, verbs, adjectives and adverbs. If any antonym pair is found, the tweet is stored in a separate file. The reducer class is responsible for generating a sentiment score using Eqs. (1)–(3) for the tweets contained in the file of antonym tweets, which are sorted according to their sentiment score into positive and negative sentiment tweets. It then classifies all the positive sentiment tweets as sarcastic, as shown in the reducer class of Fig. 9. In this approach, the antonym pairs of nouns, verbs, adjectives and adverbs are taken from NLTK WordNet [48]. The algorithm PSWAP is executed under the Hadoop framework as well as without the Hadoop framework to compare the running time.

Fig. 9. Procedure to detect sarcasm in positive sentiment tweets with antonym pair.
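A sketch of the antonym-pair check using NLTK WordNet [48] is shown below; the WordNet corpus is assumed to be installed, and the sentiment score is assumed to come from Eqs. (1)–(3).

```python
# Sketch of the PSWAP antonym check with NLTK WordNet (requires the wordnet corpus).
from nltk.corpus import wordnet as wn

def antonyms(word):
    return {ant.name() for syn in wn.synsets(word)
            for lemma in syn.lemmas() for ant in lemma.antonyms()}

def has_antonym_pair(words):
    wordset = set(words)
    return any(antonyms(w) & wordset for w in wordset)

def pswap_is_sarcastic(tweet, sentiment_score):
    # sentiment_score is the value computed with Eqs. (1)-(3)
    return sentiment_score > 0 and has_antonym_pair(tweet.lower().split())

print(has_antonym_pair(["love", "to", "hate", "mondays"]))   # True: love/hate
```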

4.4. Other approaches for sarcasm detection in tweets

We propose three other novel approaches to identify sarcasm in three different tweet types, i.e., T4, T5 and T7, as shown in Table 2. Due to the unavailability of the various aspects needed to model these algorithms in the Hadoop framework, they were not deployed under Hadoop. However, the methods were implemented without the Hadoop framework. Each of the methods is described below.

4.4.1. Tweets contradicting with universal facts

Tweets contradicting with universal facts (TCUF) is based on universal facts. In this approach, universal facts are used as a feature to identify sarcasm in tweets, as shown in Algorithm 4. For example, 'the sun rises in the east' is a universal fact. Algorithm 4 takes the corpus of universal fact sentences as an input and generates a list of 〈key, value〉 pairs for every sentence in the corpus. To generate a 〈key, value〉 pair, it finds the triplet of (subject, verb, object) values according to the Rusu_Triplets [49] method for every sentence. Furthermore, it combines the subject and verb together as the key and the object as the value. The 〈key, value〉 pair for the sentence "the sun rises in the east" is 〈(sun, rises), east〉.

Identifying sarcasm in tweets using universal facts is shown in Algorithm 5. It takes the universal facts 〈key, value〉 pair file and the testing tweets as input and extracts the triplet values (subject, object, verb) from the test tweets using the Rusu_Triplets [49] method. Furthermore, we form the 〈key, value〉 pair of each testing tweet using the subject, verb, and object. If the key of the testing tweet matches any key in the universal fact 〈key, value〉 pair file, its value is checked against the corresponding value in the universal fact 〈key, value〉 pair file. If both the 〈key, value〉 pairs match, the current testing tweet is not sarcastic. Otherwise, the tweet is sarcastic.

Algorithm 5. TCUF_testing_tweets.
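The following toy sketch illustrates the TCUF matching logic; the triplet() helper is only a crude stand-in for the Rusu_Triplets [49] extractor, and the facts dictionary is a tiny version of the universal-fact 〈key, value〉 file.

```python
# Toy sketch of the TCUF check in Algorithms 4 and 5.
facts = {("sun", "rises"): "east"}           # <(subject, verb), object>

def triplet(sentence):
    # placeholder: a real implementation would derive (subject, verb, object)
    # from the parse tree as in [49]
    words = sentence.lower().replace(".", "").split()
    return words[1], words[2], words[-1]     # crude (subject, verb, object) guess

def tcuf_is_sarcastic(tweet):
    subj, verb, obj = triplet(tweet)
    key = (subj, verb)
    if key in facts:                         # a fact about the same subject/verb exists
        return facts[key] != obj             # contradiction => sarcastic
    return False                             # no matching fact: no decision here

print(tcuf_is_sarcastic("the sun rises in the west"))   # True
print(tcuf_is_sarcastic("the sun rises in the east"))   # False
```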

Table 3
Experimental environment.

Components         OS                CPU                                           Memory   HDD
Primary server     Ubuntu 14.04 x64  Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)     24 GB    1 TB
Secondary server   Ubuntu 14.04 x64  Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)     8 GB     1 TB
Data server 1      Ubuntu 14.04 x64  Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)     4 GB     20 GB
Data server 2      Ubuntu 14.04 x64  Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)     4 GB     20 GB
Data server 3      Ubuntu 14.04 x64  Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)     4 GB     20 GB

Table 4
Datasets captured for experiment and analysis.

Datasets   No. of tweets (approx.)   Extraction period (h)
Set 1      5,000                     1
Set 2      51,000                    9
Set 3      100,000                   21
Set 4      250,000                   50
Set 5      1,050,000                 187

Fig. 10. Elapsed time for POS tagging under the Hadoop framework vs without the Hadoop framework.

4.4.2. Tweets contradicting with time-dependent facts

Tweets contradicting with time-dependent facts (TCTDF) is based on temporal facts. In this approach, time-dependent facts (ones that may change over a certain time period) are used as a feature to identify sarcasm in tweets, as shown in Algorithm 6. For instance, '@MirzaSania becomes world number one. Great day for Indian tennis' is a time-dependent fact sentence: after some time, someone else will be the number one tennis player. Newspaper headlines are used as a corpus of time-dependent facts. Algorithm 6 uses newspaper headlines as an input corpus and generates a list of 〈key, value〉 pairs for every headline in the corpus. To generate a 〈key, value〉 pair, it finds the triplet of (subject, verb, object) values according to the Rusu_Triplets [49] method for every sentence. Furthermore, it combines the subject and verb together as the key and combines the object and time-stamp as the value. The time-stamp is the news headline date. The 〈key, value〉 pair for the sentence 'Wow, Australia won the cricket world cup again in 2015' is 〈(Australia, won), (cricket world cup, 2015)〉.

Algorithm 6. Tweet_contradict_time_dependent_facts.

Fig. 11. Processing time to analyze sarcasm in tweets using PBLGA under the Hadoop framework vs without the Hadoop framework.

Identifying sarcasm in tweets using time-dependent facts is similar to TCUF, as shown in Algorithm 7. The only difference is in the value of the 〈key, value〉 pair. While matching the 〈key, value〉 pair of a testing tweet with the 〈key, value〉 pairs in the file to identify sarcasm using the TCTDF approach, one needs to match the object as well as the time-stamp together as the value. If both match, the current testing tweet is not sarcastic; else it is sarcastic.

Algorithm 7. TCTDF_testing_tweets.
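The TCTDF variant can be sketched in the same way; the only change is that the value part of the pair carries both the object and the headline time-stamp. The data below is purely illustrative.

```python
# Toy sketch of the TCTDF check: the value must match on object AND time-stamp.
headline_facts = {("australia", "won"): ("cricket world cup", "2015")}

def tctdf_is_sarcastic(subj, verb, obj, timestamp):
    key = (subj.lower(), verb.lower())
    if key in headline_facts:
        return headline_facts[key] != (obj.lower(), timestamp)
    return False

print(tctdf_is_sarcastic("Australia", "won", "cricket world cup", "2015"))  # False
print(tctdf_is_sarcastic("Australia", "won", "cricket world cup", "2016"))  # True
```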

4.4.3. Likes dislikes contradiction

Likes dislikes contradiction (LDC) is based on the behavioral features of the Twitter user. It is given in Algorithm 8. Here, the algorithm observes a user's behavior using their past tweets. It analyzes the user's tweet history in the profile and generates lists of behaviors reflecting their likes and dislikes. To generate the likes and dislikes lists of a particular user, one needs to crawl through all the past tweets from the user's Twitter account as an input for Algorithm 8. Next, the algorithm calculates the sentiment score of all the tweets in the corpus using Eqs. (1)–(3). Later, it classifies the tweets as positive sentiment or negative sentiment using the sentiment score (if the sentiment score is >0.0, the tweet is positive; otherwise, the tweet is negative). Then both the positive and negative tweets are stored in separate files. From the positive sentiment tweet file, one needs to extract the triplet value (subject, object, verb) for every tweet in the file using the Rusu_Triplets [49] method. If the subject value is a pronoun such as 'I' or 'We', the 'object' value of that tweet is appended to the likes list. Otherwise, the 'subject' value of that tweet is appended to the likes list. Similarly, in the negative sentiment tweet file, one needs to extract the triplet value (subject, object, verb) for every tweet in the file using the Rusu_Triplets [49] method. If the subject value is a pronoun such as 'I' or 'We', the 'object' value of that tweet is appended to the dislikes list. Otherwise, the 'subject' value of that tweet is appended to the dislikes list. For example, consider '@Modi is doing good job for India'. Since the tweet is positive (the word 'good' is present) and the subject of this particular tweet is 'Modi', "Modi" is appended to the likes list of that particular user.

Algorithm 8. Likes_and_Dislikes_Contradiction.

The method to identify sarcasm in tweets using behavioral features (likes, dislikes) is shown in Algorithm 9. The algorithm considers the testing tweets and the lists of likes and dislikes of the particular user as input parameters. While testing sarcasm in tweets, one needs to calculate the sentiment score of the tweet and then extract the triplet (subject, verb and object) of that tweet. If the tweet is positive and the subject is not a pronoun, the subject value is checked against the likes list. If the subject value is found in the likes list, the tweet is not sarcastic; if it is found in the dislikes list, the tweet is sarcastic. Similarly, if the subject value is a pronoun and the tweet is positive, the object value is checked against the likes list. If it is found there, the tweet is not sarcastic; if it is found in the dislikes list, the tweet is sarcastic. In a similar fashion, one identifies sarcasm for negative tweets as well.

Algorithm 9. LDC_testing_tweets.

Fig. 12. Processing time to analyze sarcasm in tweets using IWS under the Hadoop framework vs without the Hadoop framework.

Fig. 13. Processing time to analyze sarcasm in tweets using PBLGA under the Hadoop framework vs without the Hadoop framework.

Fig. 14. Processing time to analyze sarcasm in tweets using PBLGA, IWS and PSWAP (combined approach) under the Hadoop framework vs without the Hadoop framework.
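A minimal sketch of the LDC decision rule is given below; the likes/dislikes sets are toy stand-ins for the lists mined from a user's history, and the subject, object and polarity are assumed to come from the triplet extraction [49] and the sentiment score of Eqs. (1)–(3).

```python
# Toy sketch of the LDC check in Algorithm 9.
likes, dislikes = {"modi", "cricket"}, {"mondays", "traffic"}

def ldc_is_sarcastic(subject, obj, tweet_is_positive):
    target = obj if subject in {"i", "we"} else subject   # pronoun subject -> use the object
    if target in likes:
        return not tweet_is_positive   # negative tweet about a liked topic -> sarcastic
    if target in dislikes:
        return tweet_is_positive       # positive tweet about a disliked topic -> sarcastic
    return False                       # unknown target: no behavioral evidence

# "I just love traffic" -> subject 'i', object 'traffic', positive sentiment
print(ldc_is_sarcastic("i", "traffic", True))    # True
```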

5. Results and discussion

This section describes the experimental results of the proposed scheme. We started with an experimental setup where a five node cluster is deployed under the Hadoop framework. Five datasets are crawled using Apache Flume and the Twitter streaming API. We also discuss the time consumption of the proposed approach under the Hadoop framework as well as without the Hadoop framework and make a comparison. We also discuss all the approaches with the precision, recall and F-score measures.

5.1. Experimental environment

Our experimental setup consists of a five node cluster with the specifications shown in Table 3. The master node consists of an Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz) processor with 6 cores running the Ubuntu 14.04 operating system with 24 GB of main memory. The remaining four nodes were virtual machines, and all the VMs ran on a single machine. The secondary name node server is another Ubuntu 14.04 machine running on an Intel Xeon E5-2620 with 8 GB of main memory. The remaining three slave nodes, responsible for processing the data, consist of three Ubuntu 14.04 machines running an Intel Xeon E5-2620 with 4 GB of main memory.

5.2. Datasets collection for experiment and analysis

The datasets for the experimental analysis are shown in Table 4. There are five sets of tweets crawled from Twitter using the Twitter Streaming API and processed through Flume before being stored in the HDFS. In total, 1.45 million tweets were collected using keywords #sarcasm, #sarcastic, sarcasm, sarcastic, happy, enjoy, sad, good, bad, love, joyful, hate, etc. After pre-processing, approximately 156,000 tweets were found to be sarcastic (tweets ending with #sarcasm or #sarcastic). The remaining tweets, approximately 1.294 million, were not sarcastic. Every set contained a different number of tweets. Depending on the number of tweets in each set, the crawling time (in hours) is given in Table 4.

5.3. Execution time for POS tagging

In this paper, POS tagging is an essential phase for all the proposed approaches. Therefore, we used Algorithms 1 and 2 to find POS information for all the datasets (approximately 1.45 million tweets). We deployed the algorithms both under Hadoop as well as without the Hadoop framework and estimated the elapsed time as shown in Fig. 10. The solid line shows the time taken (approx. 674 s) for POS tagging (approx. 10.5 million tweets) without the Hadoop framework, while the dotted line shows the time (approx. 225 s) for POS tagging (approx. 10.5 million tweets) under the Hadoop framework. Tweets were in different sets and we ran the POS tag algorithm separately for each set. Therefore, the graph in Fig. 10 shows the maximum time (674 s) for 10.5 million tweets.

5.4. Execution time for sarcasm detection algorithm

There are three proposed approaches, namely PBLGA, IWS and PSWAP, which are deployed under the Hadoop framework to analyze the estimated time for sarcasm detection in tweets. We pass tagged tweets as an input to all three approaches. Therefore, the tagging time is not considered in the proposed approaches for sarcasm analysis. Then, we compared the elapsed time under the Hadoop framework vs without the Hadoop framework for all three approaches, as shown in Figs. 11–13. The PBLGA approach takes approx. 3,386 s to analyze sarcasm in 1.4 million tweets without the Hadoop framework and approx. 1,400 s under the Hadoop framework. The IWS approach takes approx. 25 s to analyze sarcasm in 1.4 million tweets without the Hadoop framework and approx. 9 s under the Hadoop framework. The PSWAP approach takes approx. 7,786 s to analyze sarcasm in 1.4 million tweets without the Hadoop framework and approx. 2,663 s under the Hadoop framework. Finally, we combined all three approaches and ran them on 1.4 million tweets. We then compared the elapsed time under the Hadoop framework vs without the Hadoop framework for the combined approach, as shown in Fig. 14: it takes approx. 11,609 s to analyze sarcasm in 1.4 million tweets without the Hadoop framework (indicated with the solid line) and approx. 4,147 s under the Hadoop framework (indicated with the dotted line).

5.5. Statistical evaluation metrics

There are three statistical parameters, namely precision, recall and F-score, which are used to evaluate our proposed approaches. Precision shows how much of the extracted information is relevant, and recall shows how much of the relevant information is extracted correctly. F-score is the harmonic mean of precision and recall. Eqs. (6), (7), and (8) show the formulae to calculate precision, recall and F-score, respectively:

Precision = Tp / (Tp + Fp)    (6)

Recall = Tp / (Tp + Fn)    (7)

F-Score = (2 × Precision × Recall) / (Precision + Recall)    (8)

where Tp is true positive, Fp is false positive, and Fn is false negative.
5.3. Execution time for POS tagging where Tp is true positive, Fp is false positive, and Fn is false negative.
Experimental datasets consist of a mixture of sarcastic and
In this paper, POS tagging is an essential phase for all the non-sarcastic tweets. In this paper, we assume the tweets with the
proposed approaches. Therefore, we used Algorithms 1 and 2 to hashtag sarcasm or sarcastic (#sarcasm or #sarcastic) as sarcastic
find POS information for all the datasets (approximately tweets. The datasets consist of a total of 1.4 million tweets. Among
1.45 million tweets). We deployed algorithms on both Hadoop as these tweets, 156,000 were sarcastic and the rest was non-sar-
well as without the Hadoop framework and estimated the elapsed castic. Experimental results in terms of precision , recall and
time as shown in Fig. 10. The solid line shows time taken (approx. F − score was the same under both the Hadoop and the non-
674 s) for POS tagging (approx. 10.5 million tweets) without the Hadoop framework. The only difference was algorithm processing
Hadoop framework, while the dotted line shows time (approx. time due to the parallel architecture of HDFS. Experimental results
225 s) for POS tagging (approx. 10.5 million tweets) under the are shown in Table 5.

5.6. Discussion on experimental results

Among the six proposed approaches, PBLGA and IWS were earlier implemented and discussed in [19] with a small set of test data (approx. 3,000 tweets for each experiment) and deployed in a non-Hadoop framework. In this work, we deployed PSWAP (a novel approach) along with PBLGA and IWS in both a Hadoop and a non-Hadoop framework to check the efficiency in terms of time. PBLGA generates four lexicon files, namely positive sentiment, negative situation, positive situation, and negative sentiment, using 156,000 sarcastic tweets. The PBLGA algorithm used 1.45 million tweets as test data. While testing, PBLGA checks each tweet's structure for the contradiction between positive sentiment and negative situation and vice versa to classify it as sarcastic or non-sarcastic. For 1.45 million tweets, PBLGA takes approx. 3,386 s in the non-Hadoop framework and approx. 1,400 s in the Hadoop framework. PBLGA consumes most of its time accessing the four lexicon files for every tweet to check the tweet-structure condition. IWS does not require any training set to identify tweets as sarcastic; therefore, it takes the minimal processing time in both frameworks (25 s without Hadoop and 9 s with the Hadoop framework). PSWAP requires a list of antonym pairs for nouns, adjectives, adverbs, and verbs to identify sarcasm in tweets. Therefore, it takes approx. 7,786 s for 1.45 million tweets in the non-Hadoop framework and approx. 2,663 s for 1.45 million tweets in the Hadoop framework. PSWAP consumes most of its time searching antonym pairs for all four tags (noun, adjective, adverb, and verb) for every tweet. Finally, we combined all three approaches together and tested them. In the combined approach, the F-score value attained is 97%, but the execution time is higher as it checks all three approaches sequentially for every tweet until one of them detects sarcasm.

Three more novel algorithms were proposed, namely TCUF, TCTDF and LDC. These three algorithms are implemented using conventional methods with small datasets. Presently, sufficient datasets are not available to us to deploy these algorithms under the Hadoop framework. TCUF requires a corpus of universal facts, and the accuracy of this approach depends on the universal facts set. We crawled approximately 5,000 universal facts from Google and Wikipedia for experimentation. TCTDF requires a corpus of time-dependent facts, and the accuracy of this approach depends on those time-dependent facts. Presently, we trained TCTDF with 10,000 news article headlines as time-dependent facts. LDC requires Twitter users' profile information and their past tweet history. In this work, we tested LDC using ten Twitter users' profiles and their past tweet histories.

6. Conclusion and future work

Sarcasm detection and analysis in social media provides invaluable insight into the current public opinion on trends and events in real time. In this paper, six algorithms, namely PBLGA, IWS, PSWAP, TCUF, TCTDF, and LDC, were proposed to detect sarcasm in tweets collected from Twitter. Three of the algorithms were run both with and without the Hadoop framework, and the running time of each algorithm was shown. The processing time under the Hadoop framework with data nodes was reduced by up to 66% on 1.45 million tweets.

In the future, sufficient datasets suitable for the other three algorithms, namely LDC, TCUF and TCTDF, need to be obtained so that they can be deployed under the Hadoop framework.

References

[1] D. Chaffey, Global Social Media Research Summary 2016. URL 〈http://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research/〉.
[2] W. Tan, M.B. Blake, I. Saleh, S. Dustdar, Social-network-sourced big data analytics, Internet Comput. 17 (5) (2013) 62–69.
[3] Z.N. Gastelum, K.M. Whattam, State-of-the-Art of Social Media Analytics Research, Pacific Northwest National Laboratory, 2013, pp. 1–9.
[4] P. Zikopoulos, C. Eaton, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Osborne Media, 2011.
[5] E. Riloff, A. Qadir, P. Surve, L. De Silva, N. Gilbert, R. Huang, Sarcasm as contrast between a positive sentiment and negative situation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013, pp. 704–714.
[6] Hadoop. URL 〈http://hadoop.apache.org/〉.
[7] S. Fitzgerald, I. Foster, C. Kesselman, G. Von Laszewski, W. Smith, S. Tuecke, A directory service for configuring high-performance distributed computations, in: Proceedings on High Performance Distributed Computing, IEEE, 1997, pp. 365–375.
[8] J. Dean, S. Ghemawat, Mapreduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113.
[9] S. Hoffman, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing Ltd, 2013.
[10] Flume. URL 〈http://flume.apache.org/〉.
[11] K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop distributed file system, in: Proceedings of 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, 2010, pp. 1–10.
[12] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy, Hive: a warehousing solution over a map-reduce framework, Proc. VLDB Endow. 2 (2) (2009) 1626–1629.
[13] S.M. Thede, M.P. Harper, A second-order hidden Markov model for part-of-speech tagging, in: Proceedings of the 37th Annual Meeting on Computational Linguistics, ACL, 1999, pp. 175–182.
[14] D. Klein, C.D. Manning, Accurate unlexicalized parsing, in: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, ACL, 2003, pp. 423–430.
[15] K. Park, K. Hwang, A bio-text mining system based on natural language processing, J. KISS: Comput. Pract. 17 (4) (2011) 205–213.
[16] Q. Mei, C. Zhai, Discovering evolutionary theme patterns from text: an exploration of temporal text mining, in: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ACM, 2005, pp. 198–207.
[17] B. Liu, Sentiment analysis and opinion mining, Synth. Lect. Hum. Lang. Technol. 5 (1) (2012) 1–167.
[18] R. González-Ibánez, S. Muresan, N. Wacholder, Identifying sarcasm in twitter: a closer look, in: Proceedings of the 49th Annual Meeting on Human Language Technologies, ACL, 2011, pp. 581–586.
[19] S.K. Bharti, K.S. Babu, S.K. Jena, Parsing-based sarcasm sentiment recognition in twitter data, in: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), ACM, 2015, pp. 1373–1380.
[20] E. Lunando, A. Purwarianti, Indonesian social media sentiment analysis with sarcasm detection, in: International Conference on Advanced Computer Science and Information Systems (ICACSIS), IEEE, 2013, pp. 195–198.
[21] P. Tungthamthiti, S. Kiyoaki, M. Mohd, Recognition of sarcasm in tweets based on concept level sentiment analysis and supervised learning approaches, in: 28th Pacific Asia Conference on Language, Information and Computation, 2014, pp. 404–413.
[22] I. Ha, B. Back, B. Ahn, Mapreduce functions to analyze sentiment information from social big data, Int. J. Distrib. Sens. Netw. 2015 (1) (2015) 1–11.
[23] Twitter streaming api. URL 〈http://apiwiki.twitter.com/〉, 2010.
[24] J. Kalucki, Twitter streaming api. URL 〈http://apiwiki.twitter.com/Streaming-API-Documentation/〉, 2010.
[25] A. Bifet, E. Frank, Sentiment knowledge discovery in twitter streaming data, in: 13th International Conference on Discovery Science, Springer, 2010, pp. 1–15.
[26] Z. Tufekci, Big questions for social media big data: representativeness, validity and other methodological pitfalls, arXiv preprint arXiv:1403.7400.
[27] A.P. Shirahatti, N. Patil, D. Kubasad, A. Mujawar, Sentiment Analysis on Twitter Data Using Hadoop.
[28] R.C. Taylor, An overview of the Hadoop/mapreduce/hbase framework and its current applications in bioinformatics, BMC Bioinform. 11 (Suppl 12) (2010) 1–6.
[29] M. Kornacker, J. Erickson, Cloudera Impala: Real Time Queries in Apache Hadoop, for Real. URL 〈http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real〉.
[30] D. Davidov, O. Tsur, A. Rappoport, Semi-supervised recognition of sarcastic sentences in twitter and amazon, in: Proceedings of the Fourteenth Conference on Computational Natural Language Learning, ACL, 2010, pp. 107–116.
[31] E. Filatova, Irony and sarcasm: corpus generation and analysis using crowdsourcing, in: Proceedings of Language Resources and Evaluation Conference, 2012, pp. 392–398.
[32] R.J. Kreuz, R.M. Roberts, Two cues for verbal irony: hyperbole and the ironic tone of voice, Metaphor Symb. 10 (1) (1995) 21–31.

[33] R.J. Kreuz, G.M. Caucci, Lexical influences on the perception of sarcasm, in: Proceedings of the Workshop on Computational Approaches to Figurative Language, ACL, 2007, pp. 1–4.
[34] O. Tsur, D. Davidov, A. Rappoport, Icwsm—a great catchy name: semi-supervised recognition of sarcastic sentences in online product reviews, in: Proceedings of the International Conference on Weblogs and Social Media, 2010, pp. 162–169.
[35] J.W. Pennebaker, M.E. Francis, R.J. Booth, Linguistic Inquiry and Word Count: LIWC 2001, vol. 71, no. 1, Lawrence Erlbaum Associates, Mahway, 2001, pp. 1–11.
[36] C. Strapparava, A. Valitutti, et al., Wordnet affect: an affective extension of wordnet, in: Proceedings of Language Resources and Evaluation Conference, vol. 4, 2004, pp. 1083–1086.
[37] F. Barbieri, H. Saggion, F. Ronzano, Modelling sarcasm in twitter, a novel approach, in: Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2014, pp. 50–58.
[38] P. Carvalho, L. Sarmento, M.J. Silva, E. De Oliveira, Clues for detecting irony in user-generated contents: oh...!! it's so easy ;-), in: Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, ACM, 2009, pp. 53–56.
[39] D. Tayal, S. Yadav, K. Gupta, B. Rajput, K. Kumari, Polarity detection of sarcastic political tweets, in: Proceedings of International Conference on Computing for Sustainable Global Development (INDIACom), IEEE, 2014, pp. 625–628.
[40] A. Rajadesingan, R. Zafarani, H. Liu, Sarcasm detection on twitter: a behavioral modeling approach, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ACM, 2015, pp. 97–106.
[41] A. Utsumi, Verbal irony as implicit display of ironic environment: distinguishing ironic utterances from nonirony, J. Pragmat. 32 (12) (2000) 1777–1806.
[42] C. Liebrecht, F. Kunneman, A. van den Bosch, The perfect solution for detecting sarcasm in tweets #not, in: Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, ACL, New Brunswick, NJ, 2013, pp. 29–37.
[43] M.P. Marcus, M.A. Marcinkiewicz, B. Santorini, Building a large annotated corpus of English: the Penn treebank, Comput. Linguist. 19 (2) (1993) 313–330.
[44] A. Esuli, F. Sebastiani, Sentiwordnet: a publicly available lexical resource for opinion mining, in: Proceedings of Language Resources and Evaluation Conference, 2006, pp. 417–422.
[45] N. Ide, K. Suderman, The american national corpus first release, in: Proceedings of Language Resources and Evaluation Conference, Citeseer, 2004.
[46] N. Ide, C. Macleod, The american national corpus: a standardized resource of American English, in: Proceedings of Corpus Linguistics, 2001.
[47] E. Charniak, Statistical techniques for natural language parsing, AI Mag. 18 (4) (1997) 33–43.
[48] J. Perkins, Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing Ltd, 2010.
[49] D. Rusu, L. Dali, B. Fortuna, M. Grobelnik, D. Mladenic, Triplet extraction from sentences, in: Proceedings of the 10th International Multiconference on Information Society—IS, 2007, pp. 8–12.

Santosh Kumar Bharti is currently pursuing his Ph.D. in Computer Science & Engineering from National Institute of Technology Rourkela, India. His research interests include opinion mining and sarcasm sentiment detection.

Bakhtyar Vachha is currently pursuing his M.Tech in Computer Science & Engineering from National Institute of Technology Rourkela, India. His research interests include network security and big data.

Ramkrushna Pradhan is currently pursuing his M.Tech dual degree in Computer Science & Engineering from National Institute of Technology Rourkela, India. His research interests include speech translation, social media analysis and big data.

Korra Sathya Babu is working as an Assistant Professor in the Department of Computer Science & Engineering, National Institute of Technology Rourkela, India.

Sanjay Kumar Jena is working as a Professor in the Department of Computer Science & Engineering, National Institute of Technology Rourkela, India.
