Sample Project Report
Submitted by
HARIHARAN S (211516205033)
of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
Certified that the candidates were examined in the university project viva-voce
held on at Panimalar Institute of Technology, Chennai
600 123.
Sarcasm is a sophisticated form of irony widely used in social networks and micro-blogging
websites. It is sometimes used to convey implicit information within the message an
individual transmits. Sarcasm can serve various functions, such as criticism or mockery;
however, it is hard even for humans to recognize. Recognizing sarcastic statements is
therefore very helpful for improving automatic sentiment analysis of data collected from
micro-blogging websites or social networks. Sentiment analysis refers to the identification
and aggregation of attitudes and opinions expressed by Internet users towards a specific
topic. In this paper, we propose a pattern-based approach to detect sarcasm on
Twitter. We propose four sets of features that cover the different forms of sarcasm we
defined, and use them to classify tweets as sarcastic or non-sarcastic. Our proposed
approach reaches an accuracy of 83.1% with a precision equal to 91.1%.
We also study the importance of each of the proposed sets of features and measure its
added value to the classification. In particular, we emphasize the importance of pattern-
based features for the detection of sarcastic statements.
TABLE OF CONTENTS
CHAPTER TITLE PAGE
NO. NO.
ABSTRACT I
LIST OF ABBREVIATIONS IV
1. INTRODUCTION 1
2. SYSTEM DESCRIPTION 3
2.1 Existing system 3
2.2 Proposed system 4
3. LITERATURE SURVEY 5
4. MACHINE LEARNING 10
4.1 Machine Learning Description 10
4.2 Types of Machine Learning 11
4.3 Steps in Machine Learning 13
5. DATA COLLECTION 21
5.1 Feature Selection 22
6. DATA PRE-PROCESSING 23
7. TRAINING THE SYSTEM 27
7.1 Model Selection 28
7.2 Classification 32
9. SYSTEM DESIGN 41
9.1 Architecture Diagram 41
1. INTRODUCTION
Sarcasm is sometimes used as merely a synonym of irony, but the word has a more
specific sense: irony that's meant to mock or convey contempt. This meaning is
found in its etymology. In Greek, sarkazein meant "to tear flesh; to wound." When
you use sarcasm against someone, you really tear into them. A clever person coined the variant
spelling sarchasm (a blend of sarcasm and chasm) and defined it as "the gap
between the author of sarcastic wit and the person who doesn't get it."
Natural language processing (NLP) is a field that takes ideas from machine learning and
applies them to text data. Your email spam filter is an application of NLP; there is a
learning algorithm that learns how to differentiate a spam email from a regular email by
looking at the text content of the email. It had just come out in the news that the
U.S. Secret Service was looking for a sarcasm detector to improve their intelligence
coming from Twitter, and I was curious to try it myself. It wasn't clear to me that this
was possible because sarcasm is a complicated concept. Let's go back to the spam filter
example for a
minute. If you look at a spam filter algorithm, the features that will be most
relevant to the classification of emails will be certain keywords: "Free
access" or "Enlarge your ...", for instance. A good learning algorithm will learn the
vocabulary associated with spam emails, so when presented with an email which
contains words in that vocabulary the classifier will classify that email as spam.
My initial intuition was that sarcasm detection is more complicated than spam
detection, because I didn't think there was a vocabulary associated with sarcastic
sentences. I thought sarcasm is hidden in the tone and the ambivalence of the
sentence. Merriam-Webster defines sarcasm as the use of words that mean the
opposite of what you really want to say especially in order to insult someone, to
show irritation, or to be funny. So to detect sarcasm properly a computer would
have to figure out that you meant the opposite of what you just said. It is
sometimes hard for humans to detect sarcasm, and humans have a much better
grasp of the English language than computers do, so this was not going to be an
easy task.
2. SYSTEM DESCRIPTION
Sentiment analysis without sarcasm detection cannot recognize the polarity of
reviews correctly. This result shows the significance of sarcasm extraction even if
the number of sarcastic sentences in reviews is small. In the experiment, we
compared our method with a baseline based on a simple rule; our method
outperformed the baseline. However, other approaches to extract sarcastic
sentences exist, such as Riloff's method, and comparison with state-of-the-art
methods is important future work to evaluate our method. In addition, the accuracy
of our method was insufficient, especially the precision rate. This result is due to
the lack of analysis: although we analyzed sarcastic sentences in our data, the data
contains only 70 sarcastic sentences. Collecting new sarcastic sentences and
analyzing them manually are important next steps.
3. LITERATURE SURVEY
Description:
In this paper, we review one of the challenging problems in the opinion mining
task: sarcasm detection. Many researchers have tried to explore properties of
sarcasm such as theories of sarcasm, syntactical properties, the psycholinguistics
of sarcasm, lexical features, semantic properties, etc. Studies done in the last 15
years have not only made progress on semantic features, but also show an
increasing number of analysis methods using a machine-learning approach to
process data. For this reason, this paper tries to explain the most commonly used
methods to detect sarcasm. Lastly, we present the results of our findings, which
might help other researchers gain better results in the future.
Description:
The extraction of sarcastic sentences contributes directly to the improvement of
the accuracy of sentiment analysis tasks. In this study, we propose an extraction
method for sarcastic sentences in product reviews. First, we analyze sarcastic
sentences in product reviews and classify the sentences into 8 classes by focusing
on evaluation expressions. Next, we generate classification rules for each class and
use them to extract sarcastic sentences. Our method consists of three stages:
judgment processes based on rules for the 8 classes, boosting rules, and rejection
rules. In the experiment, we compare our method with a baseline based on a
simple rule. The experimental results show the effectiveness of our method.
Description:
The presence of sarcasm in text can hamper the performance of sentiment analysis.
The challenge is to detect the existence of sarcasm in texts. This challenge is
compounded when bilingual texts are considered, for example using Malay social
media data. In this paper a feature extraction process is proposed to detect sarcasm
in bilingual texts; more specifically, public comments on economics-related posts
on Facebook. Four categories of features that can be extracted using natural
language processing are considered: lexical, pragmatic, prosodic and syntactic. We
also investigated the use of idiosyncratic features to capture the peculiar and odd
comments found in a text. To determine the effectiveness of the proposed process,
a non-linear Support Vector Machine was used to classify texts, in terms of the
identified features, according to whether they included sarcastic content or not. The
results obtained demonstrate that a combination of syntactic, pragmatic and
prosodic features produced the best performance with an F-measure score of 0.852.
Description:
Opinion mining and sentiment analysis refer to the identification and the
aggregation of attitudes or opinions expressed by internet users towards a specific
topic. However, due to the limitation in terms of characters (i.e. 140 characters per
tweet) and the use of informal language, state-of-the-art approaches to sentiment
analysis perform worse on Twitter than when they are applied to longer texts.
Moreover, the presence of sarcasm makes the task even more challenging.
Sarcasm is when a person conveys implicit information, usually the opposite of
what is said, within the message he transmits. In this paper we propose
a method that makes use of a minimal set of features, yet, efficiently classifies
tweets regardless of their topic. We also study the importance of detecting sarcastic
tweets automatically, and demonstrate how the accuracy of sentiment analysis can
be enhanced knowing which tweets are sarcastic and which are not.
Dave, A. D., & Desai, N. P.
Description:
During the last decade, the majority of research has been carried out in the area of
sentiment analysis of textual data available on the web. Sentiment analysis has its
challenges, and one of them is sarcasm. Classification of sarcastic sentences is a
difficult task due to representation variations in the textual form of sentences. This
can affect many Natural Language Processing based applications. Sarcasm is a
kind of representation that conveys a different sentiment than the one presented. In
our study we have tried to identify different supervised classification techniques
mainly used for sarcasm detection and their features. We have also analyzed the
results of the classification techniques on textual data available in various
languages on review-related sites, social media sites and micro-blogging sites.
Furthermore, for each method studied, our paper presents an analysis of the data
set generation and feature selection process used. We also carried out a
preliminary experiment to detect sarcastic sentences in the Hindi language. We
trained an SVM classifier with 10-fold cross-validation, with a simple bag-of-words
as features and TF-IDF as the frequency measure of the features. We found that this
simple model based on the bag-of-words feature accurately classified 50% of
sarcastic sentences. Thus, the preliminary experiment revealed that a simple
bag-of-words is not sufficient for sarcasm detection.
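The bag-of-words with TF-IDF weighting used in the experiment above can be illustrated in pure Python. The toy corpus below is hypothetical, for illustration only; a real experiment would use a library such as scikit-learn.

```python
import math

# Toy corpus: each document is a bag of words.
docs = [
    ["great", "movie", "great", "acting"],
    ["boring", "movie"],
    ["great", "fun"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency within the document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)           # inverse document frequency
    return tf * idf

# "movie" appears in 2 of 3 documents, so its idf (and weight) is low;
# "boring" appears in only 1 of 3, so it is more distinctive.
print(round(tf_idf("boring", docs[1], docs), 4))
print(round(tf_idf("movie", docs[1], docs), 4))
```

Words shared by most documents thus contribute little, which is exactly why bag-of-words alone struggles on a subtle signal like sarcasm.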
4. MACHINE LEARNING
We know humans learn from their past experiences and machines follow
instructions given by humans, but what if humans could train machines to learn
from past data? Machine learning is more than just learning; it is also about
understanding and reasoning. In machine learning, computers learn patterns from a
set of data. Once a system learns those patterns, it can apply the lessons to new,
unseen data. Machine learning is an exciting field whose usage is exploding, and it
will continue to reshape modern business and technology for the next decade. It is
therefore a good time to learn machine learning: it is disrupting entire industries,
from agriculture to health care to finance.
Machine learning is improving global business practices across marketing, human
resources and e-commerce, and it powers emerging technologies such as the
self-driving cars of the future, virtual reality and augmented reality. Machine-learning
startups also attract huge investments every week, so a great deal of activity is
happening in the field.
It is the practice of teaching computers how to learn patterns from data, often for
making decisions or predictions. Practical machine learning focuses on intuition
and simplicity, with a strong emphasis on results whereas academic machine
learning focuses on Math and Theory, with a strong emphasis on writing
algorithms from scratch.
It is very hard to write explicit programs that solve problems like recognizing a
face, which is why we let machines learn such tasks from data. We can extend this
definition to our Artificial Intelligence systems, although learning in neural
networks is very different from learning in rule-based systems.
It is the art and science of giving computers the ability to learn to make decisions
from data without being explicitly programmed. The value of machine learning is
only just beginning to show itself. There is a lot of data in the world today
generated not only by people, but also by computers, phones and other devices. We
see machine learning all around us in the products we use today. However, it isn’t
always apparent that machine learning is behind it all. Today, machine learning’s
immediate applications are already quite wide-ranging, including image
recognition, fraud detection and recommendation systems, as well as text and
speech systems too.
These powerful capabilities can be applied to a wide range of fields, from diabetic
retinopathy and skin cancer detection to retail and of course, transportation in the
form of self-parking and self-driving vehicles. It wasn’t that long ago that when a
company or product had machine learning in its offerings, it was considered novel.
Broadly, every machine learning algorithm can be classified into three categories:
supervised learning, unsupervised learning and reinforcement learning. Supervised
learning can be further classified into regression and classification. We will look at
the characteristics of each category one by one: what supervised learning is, what
unsupervised learning is, what parameters are associated with each, and what kind
of problem each category tries to solve; reinforcement learning is covered later.
So, let us begin with supervised learning.
In supervised learning, consider a simple functional mapping from X to Y. We
supply a large amount of input data X, and a label Y is associated with each
individual record. The task in supervised learning is to learn the function f that
predicts Y from X: given the input data and the labels (in short, the output data),
we must find this prediction function f. Let us try to understand this with a very
simple example of regression: you fit the data with a regression algorithm and
eventually obtain a model.
This model is nothing but the prediction function f. Let us try to understand, with
the simple example of predicting house prices, how this problem can be solved
with regression. Consider a house-price dataset with two columns, where each
row is a single record: the first column gives the size of an individual house, and
the second gives the price associated with that house. A large amount of such data
is given; for example, one data point might read as a house of two thousand one
hundred and four square feet together with its price. Now let us see what
classification is. Classification is another kind of supervised learning algorithm.
In regression, all outputs are continuous; classification holds all the properties of
a supervised learning algorithm, but differs from regression in that the output
values are discrete rather than continuous.
So there is a fixed number of output values: ten, five, seven, nine, or even a
thousand classes are possible. Let us try to understand with a very simple
example: plot the data on a two-dimensional graph. In the earlier case the outputs
were continuous, but here the classification algorithm tries to find the boundary
between the kinds of data, so one group of data lies in one class and the other
group lies in the second class. Whenever we are given a new example, the
algorithm tries to classify it into one of these groups. There are only two output
values associated with this problem and no continuous values; that is the only
difference between regression and classification. That concludes supervised
learning.
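As an illustration of the classification idea above, here is a minimal sketch in pure Python of a two-class classifier that learns a decision boundary from labeled training data. The data and the midpoint-between-means rule are hypothetical simplifications, for illustration only.

```python
# Minimal sketch: classify 1-D points into two discrete classes by
# learning a decision boundary from labeled training data.

def train_threshold_classifier(xs, ys):
    """Learn a boundary as the midpoint between the two class means."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    m0 = sum(class0) / len(class0)
    m1 = sum(class1) / len(class1)
    boundary = (m0 + m1) / 2
    # predict class 1 for points on the side of the boundary
    # closer to class 1's mean
    def predict(x):
        return 1 if (x > boundary) == (m1 > m0) else 0
    return predict

# training data: feature value -> discrete label (0 or 1)
xs = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
ys = [0,   0,   0,   1,   1,   1]

predict = train_threshold_classifier(xs, ys)
print(predict(1.2))  # class 0
print(predict(8.8))  # class 1
```

The output is discrete (0 or 1), unlike regression, where the model would return a continuous value such as a price.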
But in order to train a model, we need to collect data to train on. We will call
colour and alcohol content our features from now on. We get some equipment to
do our measurements: a spectrometer for measuring the colour and a hydrometer
to measure the alcohol content. Once the equipment (and the booze) is all set up,
it is time for the first real step of machine learning: gathering data. In this case,
the data we collect will be the colour and alcohol content of each drink. This will
yield a table of colour, alcohol content, and whether the drink is beer or wine. This
will be our training data.
So, a few hours of measurements later, we have gathered our training data (and
had a few drinks, perhaps). Now it is time for the next step of machine learning:
data preparation. This is also a good time to do any pertinent visualizations of
your data, helping you see whether there are any relevant relationships between
different variables, as well as whether there are any data imbalances. For instance,
if we collected far more data points about beer than about wine, the model we
train will be heavily biased towards guessing that virtually everything is a beer,
since it would be right most of the time.
We don’t have to use the same data that the model was trained on for evaluation
since then it would just be able to memorize the questions, just as you wouldn’t
want to use the questions from your math homework on the math exam.
Sometimes the data we collected needs other forms of adjustment and
manipulation: things like de-duplication, normalization, error correction, and
others. These would all happen at the data preparation step. In this case, we don't
have any further data preparation needs, so let's move forward.
The next step in this workflow is choosing a model. There are many models that
researchers and data scientists have created over the years. Some are very well
suited for image data, others for sequences, such as text or music, some for
numerical data, and others for text-based data. In this case, we have just two
features: colour and alcohol percentage. Now we move on to what is often
considered the bulk of machine learning: the training.
We’ll use this data to incrementally improve our model’s ability to predict
whether a given drink is wine or beer. So, let’s look at what that means more
concretely for our data set. When we first start the training, it is as if the model
drew a random line through the data. Then, as each step of the training progresses,
the line moves step by step closer to the ideal separation of the wine and beer.
Once training is complete, it’s time to see if the model is any good. Using
evaluation, this is where that data set that we set aside earlier comes into play.
Evaluation allows us to test our model against data that has never been used for
training. This metric allows us to see how the model might perform against data
that it has not yet seen. This is meant to be representative of how the model might
perform in the real world.
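The hold-out evaluation described above can be sketched in a few lines of Python. The data here is hypothetical; a real project would typically use a library such as scikit-learn for the split.

```python
import random

# Hypothetical labeled data: (colour, alcohol) -> "beer" or "wine".
data = [((0.2 + 0.01 * i, 4.5), "beer") for i in range(50)] + \
       [((0.7 + 0.01 * i, 12.0), "wine") for i in range(50)]

random.seed(0)
random.shuffle(data)

# Set aside 20% of the data for evaluation and never train on it;
# otherwise the model could simply memorize the answers, like reusing
# homework questions on the exam.
split = int(0.8 * len(data))
train_set, eval_set = data[:split], data[split:]

print(len(train_set), len(eval_set))  # 80 20
```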
Python
Python is an interpreted, object-oriented, high-level programming language with
dynamic semantics. Its high-level built in data structures, combined with dynamic
typing and dynamic binding make it very attractive for Rapid Application
Development, as well as for use as a scripting or glue language to connect existing
components together. Python's simple, easy to learn syntax emphasizes readability
and therefore reduces the cost of program maintenance. Python supports modules
and packages, which encourages program modularity and code reuse. The Python
interpreter and the extensive standard library are available in source or binary form
without charge for all major platforms, and can be freely distributed. Often,
programmers fall in love with Python because of the increased productivity it
provides. Since there is no compilation step, the edit-test-debug cycle is incredibly
fast. Debugging Python programs is easy: a bug or bad input will never cause a
segmentation fault. Instead, when the interpreter discovers an error, it raises an
exception. When the program doesn't catch the exception, the interpreter prints a
stack trace. A source level debugger allows inspection of local and global
variables, evaluation of arbitrary expressions, setting breakpoints, stepping through
the code a line at a time, and so on. The debugger is written in Python itself,
testifying to Python's introspective power. On the other hand, often the quickest
way to debug a program is to add a few print statements to the source: the fast edit-
test-debug cycle makes this simple approach very effective
Python: Dynamic programming language which supports several different
programming paradigms:
Procedural programming
Object oriented programming
Functional programming
Standard: Python byte code is executed in the Python interpreter (similar to Java), giving platform-independent code
Extremely versatile language: website development, data analysis, server maintenance, numerical analysis, etc.
Syntax is clear, easy to read and learn (almost pseudo code)
Common language
Intuitive object oriented programming
Full modularity, hierarchical packages
Comprehensive standard library for many tasks
Big community
Simply extendable via C/C++, wrapping of C/C++ libraries
Focus: Programming speed
Anaconda
Anaconda distribution is used by over 12 million users and includes more than
1400 popular data-science packages suitable for Windows, Linux, and MacOS.
Anaconda will enable you to create virtual environments and install packages
needed for data science and deep learning. With virtual environments you can
install specific package versions for a particular project or a tutorial without
worrying about version conflicts.
# install a package with conda and verify it's installed
$ conda install numpy
$ conda list
# take a look at the list of environments you currently have
$ conda info -e
# remove an environment
$ conda env remove --name [my-env-name]
I highly recommend downloading and printing out the Anaconda cheat sheet.
Conda vs pip install
Numpy
NumPy is the fundamental package for scientific computing in Python. It is a
Python library that provides a multidimensional array object, various derived
objects (such as masked arrays and matrices), and an assortment of routines for fast
operations on arrays, including mathematical, logical, shape manipulation, sorting,
selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical
operations, random simulation and much more. At the core of the NumPy package,
is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data
types, with many operations being performed in compiled code for performance.
There are several important differences between NumPy arrays and the standard
Python sequences:
NumPy arrays have a fixed size at creation, unlike Python lists (which can
grow dynamically). Changing the size of an array will create a new array
and delete the original.
The elements in a NumPy array are all required to be of the same data type,
and thus will be the same size in memory. The exception: one can have
arrays of (Python, including NumPy) objects, thereby allowing for arrays of
different sized elements.
NumPy arrays facilitate advanced mathematical and other types of
operations on large numbers of data. Typically, such operations are
executed more efficiently and with less code than is possible using Python’s
built-in sequences.
A growing plethora of scientific and mathematical Python-based packages
are using NumPy arrays; though these typically support Python-sequence
input, they convert such input to NumPy arrays prior to processing, and they
often output NumPy arrays. In other words, in order to efficiently use much
(perhaps even most) of today’s scientific/mathematical Python-based
software, just knowing how to use Python’s built-in sequence types is
insufficient - one also needs to know how to use NumPy arrays.
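A small example (assuming NumPy is installed) of the fixed-size, homogeneous-dtype behaviour described above:

```python
import numpy as np

# NumPy arrays are homogeneous: every element shares one dtype.
a = np.array([1, 2, 3])
print(a.dtype)

# Vectorized operations run in compiled code, element by element.
b = a * 2
print(b.tolist())     # [2, 4, 6]

# "Resizing" does not grow the array in place; it creates a new array
# and leaves the original untouched.
c = np.append(a, 4)
print(a.size, c.size) # 3 4
```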
Pandas
Data preprocessing is an important part of analyzing data, because data is not
always accessible in the preferred format. Various processing steps are necessary
before analyzing the data, such as cleaning, restructuring or merging. NumPy,
SciPy, Cython and pandas are tools available in Python that can be used for fast
processing of data. Pandas is built on top of NumPy and provides a rich set of
functions to process various types of data. Working with pandas is fast, easy and
more expressive than other tools: it combines the fast data processing of NumPy
with flexible data manipulation techniques like those of spreadsheets and
relational databases. Lastly, pandas integrates well with the matplotlib library,
which makes it a very handy tool for analyzing data.
Pandas provides two very useful data structures to process data: Series and
DataFrame. A Series is a one-dimensional array that can store various data types,
including mixed data types; the row labels of a Series are called the index. Any
list, tuple or dictionary can be converted into a Series using the Series
constructor. The DataFrame is the most widely used data structure in pandas.
Note that a Series is used to work with a one-dimensional array, whereas a
DataFrame can be used with two-dimensional arrays. A DataFrame has two
different indexes: a column index and a row index. The most common way to
create a DataFrame is from a dictionary of equal-length lists. Further, all
spreadsheets and text files are read as DataFrames, which makes it a very
important pandas data structure.
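A short sketch of both structures (assuming pandas is installed; the column names and values are hypothetical):

```python
import pandas as pd

# A Series: one-dimensional and labeled; the row labels form the index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # 20

# The most common way to build a DataFrame: a dict of equal-length lists.
df = pd.DataFrame({
    "size_sqft": [2104, 1600, 2400],
    "price":     [400, 330, 369],
})
print(df.shape)          # (3, 2)
print(list(df.columns))  # ['size_sqft', 'price']
```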
5. DATA COLLECTION
To train an algorithm to detect sarcasm, we first need some data to train our
algorithm on. Classification is a supervised learning exercise, which means
we need to have some sentences labeled as sarcastic and sentences labeled as
non-sarcastic so that our classifier can learn the difference between the two.
One option would be to go over an online corpus which might contain some
sarcastic sentences, for example online reviews or comments, and label the
sentences by hand. This can be a very tedious exercise if we want to have a
large data set. The other option is to rely on the people writing the sentences
to tell us whether their sentences are sarcastic or not. This is what we are
going to do. The idea here is to use the Twitter API to stream tweets with the
label #sarcasm (these will be our sarcastic texts) and other tweets that don't
have the label (these will be our non-sarcastic texts). The obvious advantage of
taking our data from Twitter is that we
can have as many samples as we want. Every day people write new sarcastic
tweets, we can simply stream them and store them in a database. I ended up
collecting 20 000 clean sarcastic tweets and 100 000 clean non-sarcastic
tweets over a period of three weeks in June-July 2014 (see section below to
understand what a clean tweet is). Since tweets are often about what is
currently happening in the world, it is important to collect the positive
(sarcastic) and negative (non-sarcastic) samples during the same time period
in order to isolate the sarcasm variable. However, there is a drawback to
taking our data from Twitter: it's noisy. Some people use the #sarcasm hashtag
to point out that their tweet was meant to be sarcastic, but a human
would not have been able to guess that the tweet is sarcastic without the
label #sarcasm (example: What a great summer vacation I've been having so
far :) #sarcasm). One may argue however that this is not really noise since
the tweet is still sarcastic, at least according to the tweet's owner, and that
sarcasm is in the eyes of the beholder. The converse also happens, someone
may write a tweet which is clearly sarcastic but without the label #sarcasm.
There are also instances of sarcastic tweets where the sarcasm is in a linked
picture or article. Sometimes tweets are responses to other tweets, in which
case the sarcasm can only be understood within the context of the previous
tweets. Sometimes the label #sarcasm is meant to indicate that, while the
tweet itself is not sarcastic, some of its hashtags are (example: Time to do
my homework #yay #sarcasm). I will discuss in the next section how to
remove most of that noise, but short of reading all the tweets and labeling
them by hand we cannot remove all the noise.
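The labeling scheme described above, treating the hashtag as a label rather than part of the text, can be sketched as follows. This is pure Python with hypothetical example tweets; a real pipeline would consume the Twitter streaming API instead of a list.

```python
# Label tweets by the presence of #sarcasm. The hashtag itself is a label,
# not a feature, so it must not leak into the text we train on.
def label_tweet(text):
    tokens = text.split()
    sarcastic = any(t.lower() == "#sarcasm" for t in tokens)
    cleaned = " ".join(t for t in tokens if t.lower() != "#sarcasm")
    return cleaned, 1 if sarcastic else 0

tweets = [
    "What a great summer vacation I've been having so far :) #sarcasm",
    "Just finished my homework, off to bed.",
]
for t in tweets:
    print(label_tweet(t))
```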
6. DATA PREPROCESSING
Before extracting features from our text data, it is important to clean it up.
To remove the possibility of having sarcastic tweets in which the sarcasm is
either in an attached link or in response to another tweet, we simply discard
all tweets that have http addresses in them and all tweets that start with the
@ symbol. Ideally we would only collect tweets that are written in English.
When we collect sarcastic tweets, the requirement that it contains the label
#sarcasm makes it very likely that the tweet will be in English. To maximize
the number of English tweets when we collect non-sarcastic tweets, we
require that the location of the tweet is either San Francisco or New York. In
addition to these steps, we remove tweets which contain non-ASCII
characters. We then remove all the hashtags, all the friend tags and all
mentions of the word sarcasm or sarcastic from the remaining tweets. If after
this pruning stage the tweet is at least 3 words long, we add it to our dataset.
We add this last requirement in order to remove some noise from the
sarcastic dataset since I do not believe that one can be sarcastic with only 2
words. Finally, we remove duplicates.
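The pruning rules above (discard tweets with links or replies, drop non-ASCII tweets, strip hashtags, friend tags and mentions of sarcasm, keep only tweets of at least 3 words, then de-duplicate) can be sketched in pure Python. The regular expressions are a simplified stand-in for the real cleaning code.

```python
import re

def clean_tweets(tweets):
    """Apply the pruning rules from this section; returns the kept tweets."""
    kept = []
    for t in tweets:
        if "http" in t or t.startswith("@"):
            continue                      # link or reply: sarcasm may be elsewhere
        if not t.isascii():
            continue                      # keep only ASCII (likely English) tweets
        t = re.sub(r"#\w+|@\w+", "", t)   # strip hashtags and friend tags
        t = re.sub(r"sarcasm|sarcastic", "", t, flags=re.IGNORECASE)
        t = " ".join(t.split())           # normalize whitespace
        if len(t.split()) >= 3:           # too short to be reliably sarcastic
            kept.append(t)
    # finally, remove duplicates while preserving order
    return list(dict.fromkeys(kept))

sample = [
    "I love being cheated on #sarcasm",
    "@friend you had to be there",
    "check this out http://example.com",
    "I love being cheated on #sarcasm",
]
print(clean_tweets(sample))  # ['I love being cheated on']
```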
Analysing the data helps in screening it carefully, which can rectify misleading
results. Pre-processing is done in major steps: first, the text must be parsed to
separate it into words, a step called tokenization; then the words need to be
encoded as integer or floating-point values for use as input to a machine learning
algorithm, a step called feature extraction.
This is the most important phase in the development of the system. Before
applying feature extraction algorithms, the stemming of words was performed.
Stemming is the process in which the words are shortened and normalized to their
stem and their tenses are ignored. For example, “cats running ran cactus cactuses
cacti community communities” will be stemmed to ‘cat run ran cactu cactus cacti
commun’. The root of the word is preserved for better efficiency of feature
extraction and to reduce redundancy. This system takes into account the features
developed from N-grams, sentiments, topics, POS tags, capitalization, etc. The
features from N-grams are mainly unigrams, i.e. containing one word (for
example, "beautiful", "happy", etc.) and bigrams, i.e. containing two words
(for example, "hey there", "whats up"). Next we consider topics as features. Topics
are basically words which have a high probability of appearing together. For
example, "saturday", "night", "party", "fever" are mostly used together. We extract
the topics from the dataset and assign separate scores to them. For example,
according to the training we performed, words like "just what" and "yay" have a
high occurrence in the tweets according to the scores that are generated. The
sentiments from the previous step are loaded and their features are generated. For
better accuracy the tweets are then split into 2 and 3 parts respectively and the
scores are generated.
This is really the meat of the algorithm. The question here is, what are the variables
in a tweet that make it sarcastic or non-sarcastic? And how do we extract them
from the tweet? To this end I engineered several features that might help the
classification of tweets and I tested them on a cross-validation set (I will discuss
metrics for evaluating cross-validation in a later section). The most important
features that came out of this analysis are the following:
n-grams: More precisely, unigrams and bigrams. These are just collections of one
word (example: really, great, awesome, etc.) and two words (example: really
great, super awesome, very weird, etc.). To extract those, each tweet was
tokenized, stemmed and lowercased, and then each n-gram was added to a binary
feature dictionary.
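A minimal sketch of the binary unigram/bigram features, using scikit-learn's CountVectorizer (the report builds its dictionary by hand; the library call is an equivalent shortcut):

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["really great day", "super awesome day"]
# ngram_range=(1, 2): unigrams and bigrams; binary=True: presence, not counts.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, lowercase=True)
X = vectorizer.fit_transform(tweets)
print(sorted(vectorizer.vocabulary_))
```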
Sentiments: My hypothesis here is that sarcastic tweets might be more negative
than non-sarcastic tweets or the other way around. Moreover, there is often a big
contrast of sentiments in sarcastic tweets. What I mean by this is that tweets often
start with a very positive sentiment and end with a very negative sentiment
(example: I love being cheated on #sarcasm). Sentiment analysis of tweets is a
subject on its own, so the idea here is to have something simple that can test my
hypothesis. To this end I first split each tweet into one, two and three parts, and
then do a sentiment analysis on all parts of the three splittings. I used two
distinct sentiment analyzers. The first one is my own quick and dirty
implementation, which uses a dictionary. This dictionary gives a positive and a
negative sentiment score to each word of the English language. By looking up
words in this dictionary, we can give a sentiment score to each part of the
tweets. The other implementation of the sentiment analysis used a Python library
which has a built-in sentiment score function.
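The splitting-and-scoring idea can be sketched in a few lines; the tiny lexicon below is purely hypothetical (the real dictionary scores every English word):

```python
# Hypothetical word scores; a real sentiment lexicon covers the whole vocabulary.
LEXICON = {"love": 3, "great": 3, "cheated": -3, "awful": -4}

def sentiment(words):
    """Sum the lexicon scores of the words (unknown words score 0)."""
    return sum(LEXICON.get(w, 0) for w in words)

def split_scores(words, n_parts):
    """Split the token list into n roughly equal parts and score each part."""
    k = len(words)
    parts = [words[i * k // n_parts:(i + 1) * k // n_parts]
             for i in range(n_parts)]
    return [sentiment(p) for p in parts]

tweet = "i love being cheated on".split()
print(split_scores(tweet, 2))  # positive start, negative end: a sentiment contrast
```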
There are words that are often grouped together in the same tweets
(example: Saturday, party, night, friends, etc.). We call these groups of words
topics. If we first learn the topics, then the classifier will just have to learn which
topics are more associated with sarcasm, and that will make the supervised learning
easier and more accurate. To learn the topics, I used the Python
library gensim, which implements topic modeling using latent Dirichlet
allocation (LDA). We first feed all the tweets to the topic modeler, which learns the
topics. Then each tweet can be decomposed as a sum of topics, which we use as
features.
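The report uses gensim's LDA; scikit-learn's LatentDirichletAllocation gives an equivalent sketch of turning tweets into topic-mixture features (toy corpus, illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["saturday night party fever",
          "party friends saturday night",
          "monday traffic rain work",
          "work traffic monday"]
counts = CountVectorizer().fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each tweet is decomposed into a mixture over the learned topics;
# the mixture weights become the feature vector fed to the classifier.
topic_features = lda.fit_transform(counts)
print(topic_features.shape)  # (4, 2): one 2-topic mixture per tweet
```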
After the feature extraction we search for any null parameters in the data. If
there are any null parameters, we need to fill them with related content. In this
process the stemming of the words is done: similar words are considered as one,
and the model is built.
The metric I used to guide my cross-validation is the F-score. This is a good metric
when we have a lot more samples from one category than from the other
categories. In our case we have 5 times more non-sarcastic tweets than sarcastic
tweets. If we just use accuracy as our metric, that is, the number of correct
predictions divided by the total number of tweets in our cross-validation set, then a
simple classifier which always predicts the tweets as non-sarcastic would get
83% accuracy. This is obviously very misleading, so we need a better metric. We
can do better by considering precision and recall for the sarcastic category.
Precision is the number of sarcastic tweets correctly identified divided by the total
number of tweets classified as sarcastic, while recall is the number of sarcastic
tweets correctly identified divided by the total number of sarcastic tweets in the
cross-validation set. Both precision and recall would be equal to 0% with a dumb
classifier which always predicts tweets to be non-sarcastic, so these are already
much better scores to quantify the quality of a sarcasm classifier. The F-score is
simply the harmonic mean of precision and recall.
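A quick numeric check of why accuracy misleads on a 5:1 imbalanced set (the counts below are illustrative, not the report's dataset):

```python
# 100 sarcastic (1) vs 500 non-sarcastic (0) tweets: a 5:1 imbalance.
y_true = [1] * 100 + [0] * 500
y_pred = [0] * 600  # dumb classifier: always predicts non-sarcastic

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = tp / sum(y_true)  # no sarcastic tweet is ever found

print(round(accuracy, 3), recall)  # 0.833 0.0 despite detecting nothing
```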
i) Supervised learning
Supervised learning is the task of inferring a function from labeled training data.
By fitting to the labeled training set, we want to find the optimal model
parameters to predict unknown labels on other objects (the test set). If the label is
a real number, we call the task regression. If the label is from a limited number of
values, where these values are unordered, then it's classification.
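A tiny illustration of the two supervised settings with scikit-learn (the toy data is chosen purely for the sketch):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[0], [1], [2], [3]]
# Real-valued labels -> regression.
reg = LinearRegression().fit(X, [0.1, 1.0, 2.1, 2.9])
# Unordered categorical labels -> classification.
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
print(reg.predict([[4]]), clf.predict([[3]]))
```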
iii) Semi-supervised learning:
iv) Reinforcement learning:
Reinforcement learning is not like any of our previous tasks, because we have
neither labeled nor unlabeled datasets here. RL is an area of machine learning
concerned with how software agents ought to take actions in an environment to
maximize some notion of cumulative reward.
Imagine you're a robot in some strange place; you can perform activities and
get rewards from the environment for them. After each action your behavior
gets more complex and clever, so you train yourself to behave in the most
effective way at each step. In biology, this is called adaptation to the natural
environment.
7.2 CLASSIFICATION
Naive Bayes
Naive Bayes is based on two assumptions. Firstly, all features of an instance that
needs to be classified contribute equally to the decision (they are equally
important). Secondly, all attributes are statistically independent, meaning that
knowing one attribute's value does not indicate anything about the other
attributes' values, which is not always true in practice. The process of classifying
an instance is done by applying the Bayes rule for each class given the instance.
In the sarcasm detection task, the Bayes rule is applied for each of the two classes
(sarcastic and non-sarcastic), and the class associated with the higher probability
is the predicted class for the instance.
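As a sketch of the Bayes rule for two classes (all probabilities here are made-up illustrative numbers, not estimates from the dataset):

```python
# Hypothetical class priors and per-word likelihoods.
PRIOR = {"sarcastic": 0.2, "non-sarcastic": 0.8}
LIKELIHOOD = {
    "sarcastic":     {"love": 0.30, "cheated": 0.40},
    "non-sarcastic": {"love": 0.25, "cheated": 0.01},
}

def predict(words):
    """Naive Bayes: score each class as prior * product of word likelihoods."""
    scores = {}
    for cls, prior in PRIOR.items():
        score = prior
        for w in words:
            score *= LIKELIHOOD[cls].get(w, 1e-3)  # tiny prob for unseen words
        scores[cls] = score
    return max(scores, key=scores.get)

print(predict(["love", "cheated"]))  # sarcastic: 0.2*0.3*0.4 > 0.8*0.25*0.01
```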
Support Vector Machine
An SVM model is a representation of the examples as points in space, mapped so
that the examples of the separate categories are divided by a clear gap that is as
wide as possible. New examples are then mapped into that same space and
predicted to belong to a category based on which side of the gap they fall.
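A minimal SVC sketch of the widest-gap idea on toy 2-D points (the points and the C value are arbitrary choices for illustration):

```python
from sklearn.svm import SVC

# Two well-separated clusters of 2-D points.
X = [[0, 0], [1, 1], [0, 1], [1, 0], [4, 4], [5, 5], [4, 5], [5, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
clf = SVC(kernel='linear', C=0.1).fit(X, y)
# New points are classified by which side of the separating gap they fall on.
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))
```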
Random Forest
The first algorithm for random decision forests was created by Tin Kam Ho
using the random subspace method, which, in Ho's formulation, is a way to
implement the "stochastic discrimination" approach to classification proposed by
Eugene Kleinberg.
Neural Network
Learning from experience can occur within networks, which can derive
conclusions from a complex and seemingly unrelated set of information.
The first thing you will see here is the ROC curve, and we can determine whether
our ROC curve is good or not by looking at the AUC (Area Under the Curve) and
other parameters, which are derived from the confusion matrix. A confusion
matrix is a table that is often used to describe the performance of a classification
model on a set of test data for which the true values are known. All the measures
except AUC can be calculated using four basic parameters, so let's talk about
those four parameters first.
                            Predicted Class
                     Class=yes         Class=no
Actual  Class=yes    True Positive     False Negative
Class   Class=no     False Positive    True Negative
True positives and true negatives are the observations that are correctly
predicted; we want to minimize false positives and false negatives. These terms
are a bit confusing, so let's take each term one by one and understand it fully.
True Positives (TP) - These are the correctly predicted positive values which
means that the value of actual class is yes and the value of predicted class is also
yes. E.g. if actual class value indicates that this passenger survived and predicted
class tells you the same thing.
True Negatives (TN) - These are the correctly predicted negative values, which
means that the value of the actual class is no and the value of the predicted class
is also no. E.g. if the actual class says this passenger did not survive and the
predicted class tells you the same thing. False positives and false negatives occur
when the actual class contradicts the predicted class.
False Positives (FP) – When actual class is no and predicted class is yes. E.g. if
actual class says this passenger did not survive but predicted class tells you that
this passenger will survive.
False Negatives (FN) – When actual class is yes but predicted class is no. E.g. if
the actual class value indicates that this passenger survived but the predicted
class tells you that the passenger will die.
Once you understand these four parameters then we can calculate Accuracy,
Precision, Recall and F1 score.
Accuracy - Accuracy is the most intuitive performance measure: it is simply the
ratio of correctly predicted observations to the total observations. One might
think that if we have high accuracy then our model is best. Accuracy is a great
measure, but only when you have balanced datasets where the numbers of false
positives and false negatives are almost the same. Therefore, you have to look at
other parameters to evaluate the performance of your model. For our model, we
got 0.803, which means our model is approximately 80% accurate.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision - Precision is the ratio of correctly predicted positive observations to
the total predicted positive observations. The question that this metric answers
is: of all passengers labelled as survived, how many actually survived? High
precision relates to a low false positive rate. We got 0.788 precision, which is
pretty good.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
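Plugging hypothetical confusion-matrix counts into the formulas above (the numbers are made up to show the arithmetic, not results from our model):

```python
# Hypothetical confusion-matrix counts.
TP, FP, FN, TN = 90, 10, 25, 500

accuracy = (TP + TN) / (TP + FP + FN + TN)   # 590 / 625 = 0.944
precision = TP / (TP + FP)                   # 90 / 100  = 0.9
recall = TP / (TP + FN)                      # 90 / 115  ≈ 0.783
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, round(recall, 3), round(f1, 3))
```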
7.4 FEASIBILITY STUDY
The objective of a feasibility study is not only to solve the problem but also to
obtain a sense of its scope. During the study, the problem definition was
crystallized and the aspects of the problem to be included in the system were
determined. Consequently, benefits are estimated with greater accuracy at this
stage. The key considerations are:
i) Economic feasibility
This study is carried out to check the economic impact that the system will have
on the organization. The amount of funds that the company can pour into the
research and development of the system is limited, so the expenditures must be
justified. The developed system is well within the budget, and this was achieved
because most of the technologies used are freely available. Only the customized
products had to be purchased.
ii) Technical feasibility
This study is carried out to check the technical requirements of the system. The
developed system must have modest requirements, as only minimal or null
changes are required for implementing this system.
8. REQUIREMENT SPECIFICATION
Hardware Requirements
RAM : 2 GB (Min)
Software Requirements
9. SYSTEM DESIGN
9.1 Architecture Diagram
9.2 Use Case Diagram:
9.4 Dataflow Diagram:
9.4.2 Data Flow Level 1
9.5 Collaboration Diagram:
10. CONCLUSION AND FUTURE ENHANCEMENT
Ways of improving the existing sarcasm detection algorithms by including
better pre-processing and text mining techniques such as emoji and slang
detection are given. For classifying tweets as sarcastic and non-sarcastic
there are various techniques in use; however, this paper takes up a classification
algorithm and suggests various improvements that contribute directly to the
improvement of accuracy. The project derived analytical views from a social media
dataset and also filtered out or reverse-analyzed sarcastic tweets to achieve a
comprehensive accuracy in the classification of the data that is given. The model
has been tested in real time and can capture live streaming tweets by filtering
through hashtags and then perform immediate classification.
11. APPENDIX I
SOURCE CODE
# Import packages
import os
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv( os.curdir + "/data/feature_list.csv")
data = df
def LR_CV(data):
    acc = []
    logreg = LogisticRegression(C=1e-6, multi_class='ovr', penalty='l2',
                                random_state=0)
    predict = cross_val_predict(logreg, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100
def SVM_CV(data):
    acc = []
    SVM = SVC(C=0.1, kernel='linear')
    predict = cross_val_predict(SVM, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100
def DT_CV(data):
    acc = []
    classifier = DecisionTreeClassifier()
    predict = cross_val_predict(classifier, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100
def NB_CV(data):
    acc = []
    classifier = GaussianNB()
    predict = cross_val_predict(classifier, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100
def NN_CV(data):
    acc = []
    classifier = MLPClassifier(hidden_layer_sizes=(100, 100, 100), max_iter=1000)
    predict = cross_val_predict(classifier, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100
dflr.to_csv(os.curdir + "/data/LR.csv",index=False)
dfdt.to_csv(os.curdir + "/data/DT.csv",index=False)
dfnb.to_csv(os.curdir + "/data/NB.csv",index=False)
dfnn = pd.DataFrame(columns=['Feature', 'Accuracy-NN', 'f1', 'Precision', 'Recall'])
print("Model: " + "NN")
for feature in features:
    tiny_data = data[[feature, 'label']]
    Acc, F1, P, R = NN_CV(tiny_data)
    # print(feature)
    # print("Acc: " + str(Acc) + " F1: " + str(F1) + " P: " + str(P) + " R: " + str(R))
    dfnn.loc[feature] = [feature, Acc, F1, P, R]
dfnn.to_csv(os.curdir + "/data/NN.csv", index=False)
dfrand.to_csv(os.curdir + "/data/RAND.csv",index=False)
dfsvm.to_csv(os.curdir + "/data/SVM.csv",index=False)
Django
from django.shortcuts import render, redirect
from .models import Register, Comments
import os
from django.conf import settings
import pickle
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from django.http import HttpResponse
def home(request):
    return render(request, "sarcasm/Spam.html")

def loginv(request):
    if request.method == "POST":
        name = request.POST['name']
        pwd = request.POST['pwd']
        verified = Register.objects.get(name=name)
        if verified.pwd == pwd:
            return render(request, "sarcasm/Spam.html")
    return render(request, "sarcasm/Login.html")

def register(request):
    if request.method == "POST":
        name = request.POST['name']
        pwd = request.POST['pwd']
        mailid = request.POST['mailid']
        ph = request.POST['ph']
        if name == "" or pwd == "":
            return redirect('/post/')
        else:
            reg = Register(
                name=name,
                pwd=pwd,
                mailid=mailid,
                ph=ph
            )
            reg.save()
            return render(request, "sarcasm/Login.html")
    return render(request, "sarcasm/Register.html")
55
def post(request):
    data = pd.read_csv("E:/ML-Project/Sarcasm-copy/Sarcastic.csv")
    X = data['Tweet']
    Y = data['Class']
    cv = CountVectorizer()
    X = cv.fit(X)
    clf = MultinomialNB()
    file = "E:/ML-Project/Sarcasm-copy/RF.sav"
    loaded_model = pickle.load(open(file, 'rb'))
    data = Comments.objects.filter(spam="Normal")
    if request.method == "GET":
        return render(request, "sarcasm/Post.html", {'data': data})
    else:
        cmd = request.POST["cmd"]
        data = [cmd]
        vect = cv.transform(data).toarray()
        prediction = loaded_model.predict(vect)
        if prediction == 1:
            comments = Comments(
                feed=cmd,
                spam="Sarcastic"
            )
            comments.save()
            data = Comments.objects.filter(spam="Normal")
            return redirect('/post/')
        else:
            comments = Comments(
                feed=cmd,
                spam="Normal"
            )
            comments.save()
            data = Comments.objects.filter(spam="Normal")
            return redirect('/post/')
    # return HttpResponse("Empty")
Model
from django.db import models

class Register(models.Model):
    name = models.CharField(max_length=500)
    pwd = models.CharField(max_length=500)
    mailid = models.CharField(max_length=500)
    ph = models.CharField(max_length=500)

    def __str__(self):
        return self.name

class Comments(models.Model):
    feed = models.CharField(max_length=500)
    spam = models.CharField(max_length=10)
12. APPENDIX II
EXPERIMENTAL RESULTS
Sentiment analysis:
Webpage:
Training the system:
Graphs:
Naïve Bayes
Neural Network
Random Forest
SVM
REFERENCES
6. Hiai, Satoshi, and Kazutaka Shimada. "A sarcasm extraction method
based on patterns of evaluation expressions." 2016 5th IIAI International
Congress on Advanced Applied Informatics (IIAI-AAI). IEEE, 2016.
7. Dave, A. D., & Desai, N. P. (2016, March). A comprehensive study of
classification techniques for sarcasm detection on textual data. In 2016
International Conference on Electrical, Electronics, and Optimization
Techniques (ICEEOT) (pp. 1985-1991). IEEE.
8. Zhang, Meishan, Yue Zhang, and Guohong Fu. "Tweet sarcasm detection
using deep neural network." Proceedings of COLING 2016, The 26th
International Conference on Computational Linguistics: Technical Papers.
2016.
9. Bharti, Santosh Kumar, et al. "Sarcasm analysis on twitter data using
machine learning approaches." Trends in Social Network Analysis.
Springer, Cham, 2017. 51-76.
10. Ahmad, Tanvir, et al. "Satire detection from web documents using
machine learning methods." 2014 International Conference on Soft
Computing and Machine Intelligence. IEEE, 2014.
11. Davidov, Dmitry, Oren Tsur, and Ari Rappoport. "Semi-supervised
recognition of sarcastic sentences in Twitter and Amazon."
12. Riloff, Ellen, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan
Gilbert, and Ruihong Huang. "Sarcasm as contrast between a positive
sentiment and negative situation."
13. Gonzalez-Ibanez, Roberto, Smaranda Muresan, and Nina Wacholder.
"Identifying sarcasm in Twitter: A closer look."
14. Liebrecht, Christine, Florian Kunneman, and Antal Van den Bosch. "The
perfect solution for detecting sarcasm in tweets #not."