
PREDICTION OF SARCASM

Submitted by

HARIHARAN S (211516205033)

PREM SAI REDDY P (211516205083)

SARAVANA KUMAR M (211516205098)

in partial fulfilment for the award of the degree

of

BACHELOR OF TECHNOLOGY

IN
INFORMATION TECHNOLOGY

PANIMALAR INSTITUTE OF TECHNOLOGY

ANNA UNIVERSITY: CHENNAI 600 025


APRIL 2020
PANIMALAR INSTITUTE OF TECHNOLOGY

ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “PREDICTION OF SARCASM” is the bonafide


work of “HARIHARAN S (211516205033), PREM SAI REDDY P
(211516205083) and SARAVANA KUMAR M (211516205098)” who carried
out the project work under my supervision.

SIGNATURE
Dr. R. JOSPHINELEELA, M.E (CSE), Ph.D (CSE),
HEAD OF THE DEPARTMENT,
Department of Information Technology,
Panimalar Institute of Technology,
Poonamallee, Chennai 600 123.

SIGNATURE
Dr. J. CHENNI KUMARAN, B.E (CSE), M.Tech. (CSE), Ph.D (CSE),
PROFESSOR,
Department of Information Technology,
Panimalar Institute of Technology,
Poonamallee, Chennai 600 123.

Certified that the candidates were examined in the university project viva-voce held on __________ at Panimalar Institute of Technology, Chennai 600 123.

INTERNAL EXAMINER EXTERNAL EXAMINER


ACKNOWLEDGEMENT
A project of this magnitude and nature requires kind co-operation and support from
many, for successful completion. We wish to express our sincere thanks to all those who
were involved in the completion of this project.
We seek the blessings of the Founder of our institution, Dr. JEPPIAAR, M.A., Ph.D., who has been a role model and the source of inspiration behind our success in education at his premier institution. Our sincere thanks to the Honorable Chairman of our prestigious institution, Mrs. REMIBAI JEPPIAAR, for her sincere endeavor in educating us in her premier institution.
We would like to express our deep gratitude to our beloved Secretary and
Correspondent Dr.P.CHINNADURAI, M.A, Ph.D, for his kind words and enthusiastic
motivation which inspired us a lot in completing this project.
We also express our sincere thanks and gratitude to our dynamic Directors
Mrs.C.VIJAYA RAJESHWARI, Dr.C.SAKTHI KUMAR, M.E, Ph.D, and
Mrs.S.SARANYA SREE SAKTHI KUMAR, B.E, for providing us with necessary
facilities for completion of this project.
We also express our appreciation and gratefulness to our respected Principal Dr. T.
JAYANTHY, M.E, Ph.D, who helped us in the completion of the project. We wish to
convey our thanks and gratitude to our Head of the Department,
Dr.R.JOSPHINELEELA, M.E(CSE), Ph.D(CSE), for her full support by providing
ample time to complete our project.
Special thanks to our Project Coordinator Mrs.R.DHARANI, M.Tech., Associate
Professor and Internal Guide Dr.J.CHENNI KUMARAN, B.E(CSE), M.Tech.(CSE),
Ph.D(CSE), Professor, for their expert advice, valuable information and guidance
throughout the completion of the project.
Lastly, we thank our parents and friends for providing their extensive moral support and
encouragement during the course of the project.
ABSTRACT

Sarcasm is a sophisticated form of irony widely used on social networks and micro-blogging websites. It is often used to convey implicit information within the message a person transmits, and it can serve various functions such as criticism or mockery. However, sarcasm is hard even for humans to recognize. Recognizing sarcastic statements is therefore very useful for improving automatic sentiment analysis of data collected from micro-blogging websites and social networks. Sentiment analysis refers to the identification and aggregation of attitudes and opinions expressed by Internet users towards a specific topic. In this paper, we propose a pattern-based approach to detect sarcasm on Twitter. We propose four sets of features that cover the different forms of sarcasm we define, and we use them to classify tweets as sarcastic or non-sarcastic. Our proposed approach reaches an accuracy of 83.1% with a precision of 91.1%. We also study the importance of each of the proposed feature sets and measure its added value to the classification. In particular, we emphasize the importance of pattern-based features for the detection of sarcastic statements.

TABLE OF CONTENTS

ABSTRACT
LIST OF ABBREVIATIONS
1. INTRODUCTION
2. SYSTEM DESCRIPTION
   2.1 Existing System
   2.2 Proposed System
3. LITERATURE SURVEY
4. MACHINE LEARNING
   4.1 Machine Learning Description
   4.2 Types of Machine Learning
   4.3 Steps in Machine Learning
5. DATA COLLECTION
   5.1 Feature Selection
6. DATA PRE-PROCESSING
7. TRAINING THE SYSTEM
   7.1 Model Selection
   7.2 Classification
   7.3 Measuring the Model Performance
   7.4 Feasibility Study
8. REQUIREMENT SPECIFICATION
9. SYSTEM DESIGN
   9.1 Architecture Diagram
   9.2 Use Case Diagram
   9.3 Class Diagram
   9.4 Data Flow Diagram

LIST OF ABBREVIATIONS

NLP Natural Language Processing
MW Merriam-Webster
PNI Positive-Negative Identification
SA Sentiment Analysis
SVM Support Vector Machine
VE Virtual Environment
PMD Package Management and Deployment
CSV Comma-Separated Values
AI Artificial Intelligence
ML Machine Learning
IDE Integrated Development Environment
TP True Positive
TN True Negative
FP False Positive
FN False Negative
LDA Latent Dirichlet Allocation
AUC Area Under the Curve
ROC Receiver Operating Characteristic

1. INTRODUCTION

Sarcasm is part of human nature, and perhaps an evolutionarily noble trait. It is the practice of making remarks that clearly mean the opposite of what is said, in order to hurt someone's feelings or to criticize something in a humorous way. Understanding the subtlety of this practice requires second-order interpretation of the speaker's or author's intentions; different parts of the brain must work together to understand sarcasm. Sarcasm appears to exercise the brain more than sincere statements do, and it has a two-faced quality: it is both comical and mean. For these reasons, researchers have shown interest in sarcasm detection in social media text, especially in tweets. The rapid growth of Twitter makes the analysis of its data increasingly important. Sentiment analysis, also known as opinion mining, derives the opinion of a person or the attitude of a speaker. Over the past few years, many researchers have focused their interest on sentiment analysis, particularly in the field of social networks. Machine learning methods pave a new way for sentiment analysis, and for sarcasm detection in particular, by providing a set of algorithms and procedures.

Sarcasm is "a sharp, bitter, or cutting expression or remark; a bitter gibe


or taunt". Sarcasm may employ ambivalence, although sarcasm is not necessarily
ironic. Most noticeable in spoken word, sarcasm is mainly distinguished by the
inflection with which it is spoken and is largely context-dependent. Sarcasm does
not translate into text-only mediums, such as online chat.

Sarcasm is sometimes used as merely a synonym of irony, but the word has a more specific sense: irony that is meant to mock or convey contempt. This meaning is found in its etymology. In Greek, sarkazein meant "to tear flesh; to wound." When you use sarcasm, you really tear into your target. A clever person coined the variant spelling sarchasm (a blend of sarcasm and chasm) and defined it as "the gap between the author of sarcastic wit and the person who doesn't get it."

Natural language processing (NLP) is a field that takes ideas from machine learning and applies them to text data. Your email spam filter is an application of NLP: a learning algorithm learns how to differentiate a spam email from a regular email by looking at the text content of the email. It had just come out in the news that the U.S. Secret Service was looking for a sarcasm detector to improve its intelligence coming from Twitter, and I was curious to try it. It wasn't clear to me that this was possible, because sarcasm is a complicated concept. Let's go back to the spam filter example for a minute. If you look at a spam filter algorithm, the features that will be most relevant to the classification of emails will be certain keywords: Not spam, Free access or Enlarge your ... for instance. A good learning algorithm will learn the vocabulary associated with spam emails, so when presented with an email which contains words in that vocabulary, the classifier will classify that email as spam. My initial intuition was that sarcasm detection is more complicated than spam detection, because I didn't think there was a vocabulary associated with sarcastic sentences. I thought sarcasm was hidden in the tone and the ambivalence of the sentence. Merriam-Webster defines sarcasm as the use of words that mean the opposite of what you really want to say, especially in order to insult someone, to show irritation, or to be funny. So to detect sarcasm properly, a computer would have to figure out that you meant the opposite of what you just said. It is sometimes hard even for humans to detect sarcasm, and humans have a much better grasp of the English language than computers do, so this was not going to be an easy task.

2. SYSTEM DESCRIPTION

2.1 EXISTING SYSTEM

The existing system is an extraction method for sarcastic sentences in product reviews. Sarcasm, which expresses a negative meaning with positive words, often leads to mistakes in sentiment analysis; therefore, sarcasm detection is an important task in sentiment analysis. For this method, sarcastic sentences were collected and analyzed in advance: 70 sentences out of 10,000 reviews were manually labeled as sarcastic, and extraction rules were generated on the basis of the analysis of those sentences. The rate of sarcastic sentences contained in reviews was low (70/10,000). However, 21 sarcastic sentences appeared in the 233 reviews with a 1-point rating, the worst rating in this review dataset; in other words, roughly one in ten of these worst-rated reviews contained sarcastic sentences. This denotes that the detection of sarcastic sentences leads to the improvement of sentiment analysis, namely positive-negative (PN) identification, because conventional PN identification methods without sarcasm detection cannot recognize the polarity of such reviews correctly. This result shows the significance of sarcasm extraction even if the number of sarcastic sentences in reviews is small. In the experiment, the method was compared with a baseline based on a simple rule, and it outperformed the baseline. However, other approaches to extracting sarcastic sentences exist, such as Riloff's method, and comparison with such state-of-the-art methods remains important future work. In addition, the accuracy of the method was insufficient, especially the precision rate, due to the limited analysis: the data contained only 70 sarcastic sentences. Collecting new sarcastic sentences and analyzing them manually is therefore important.

2.2 PROPOSED SYSTEM

Manual analysis of numerous sentences is costly; therefore, generating rules automatically becomes necessary. In this project we use a large dataset and several classifiers to obtain the maximum accuracy: SVM, Random Forest, Logistic Regression, Decision Tree, Neural Networks and Naive Bayes. We use 21 special features along with the usual unigrams and bigrams for classification.

These 21 features were divided into 4 categories:

i) Text expression-based features

ii) Emotion-based features

iii) Familiarity-based features

iv) Contrast-based features

Together, these features improve the accuracy of our sarcasm detection; a rough sketch of how such feature groups might be combined follows.
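As a rough, hypothetical sketch (not the project's actual code), the four feature groups might be combined into one feature dictionary per tweet as follows; every extractor below is a toy stand-in:

import re

def text_expression_features(tweet):
    # Text expression-based features: surface properties of the text (toy subset).
    return {
        "num_exclamations": tweet.count("!"),
        "num_all_caps_words": sum(1 for w in tweet.split() if w.isupper() and len(w) > 1),
    }

def emotion_features(tweet):
    # Emotion-based features: here just emoticon counts as a stand-in.
    return {"num_emoticons": len(re.findall(r"[:;]-?[()]", tweet))}

def familiarity_features(tweet):
    # Familiarity-based features: e.g. Twitter-specific markers.
    return {"num_hashtags": tweet.count("#"), "num_mentions": tweet.count("@")}

def contrast_features(tweet):
    # Contrast-based features: co-occurrence of positive and negative words
    # (tiny illustrative lexicons, not the project's real ones).
    pos = {"love", "great", "awesome"}
    neg = {"hate", "terrible", "cheated"}
    words = set(tweet.lower().split())
    return {"pos_neg_contrast": int(bool(words & pos) and bool(words & neg))}

def extract_features(tweet):
    # Merge the four feature groups into one feature dictionary per tweet.
    features = {}
    for group in (text_expression_features, emotion_features,
                  familiarity_features, contrast_features):
        features.update(group(tweet))
    return features

print(extract_features("I love being cheated on! #sarcasm"))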

3. LITERATURE SURVEY

3.1 A Review on Sarcasm Detection from Machine-Learning Perspective:

Wicana, Setra Genyang, Taha Yasin İbisoglu, and Uraz Yavanoglu.

Description :

In this paper, we review one of the challenging problems in the opinion mining task: sarcasm detection. Many researchers have explored properties of sarcasm such as theories of sarcasm, syntactic properties, the psycholinguistics of sarcasm, lexical features and semantic properties. Studies done in the last 15 years have not only made progress on semantic features, but also show an increasing use of machine-learning approaches to process the data. For this reason, this paper explains the currently most-used methods to detect sarcasm. Lastly, we present the results of our findings, which might help other researchers obtain better results in the future.

Publication: 2017 IEEE 11th International Conference on Semantic Computing (ICSC). IEEE, 2017.

3.2 Sarcasm Extraction Method Based on Patterns of Evaluation Expressions

Hiai, Satoshi, and Kazutaka Shimada.

Description :

Sarcasm presents a negative meaning with positive expressions and is a non-literal form of expression. Sarcasm detection is an important task because it contributes directly to improving the accuracy of sentiment analysis tasks. In this study, we propose an extraction method for sarcastic sentences in product reviews. First, we analyze sarcastic sentences in product reviews and classify the sentences into 8 classes by focusing on evaluation expressions. Next, we generate classification rules for each class and use them to extract sarcastic sentences. Our method consists of three stages: judgment processes based on rules for the 8 classes, boosting rules and rejection rules. In the experiment, we compare our method with a baseline based on a simple rule. The experimental results show the effectiveness of our method.

Publications: 2016 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI). IEEE, 2016.

3.3 Natural Language Processing Based Features for Sarcasm Detection: An


Investigation Using Bilingual Social Media Texts

Suhaimin, Mohd Suhairi Md, et al.

Description :

The presence of sarcasm in text can hamper the performance of sentiment analysis, so the challenge is to detect the existence of sarcasm in texts. This challenge is compounded when bilingual texts are considered, for example Malay social media data. In this paper a feature extraction process is proposed to detect sarcasm using bilingual texts, more specifically public comments on economy-related posts on Facebook. Four categories of features that can be extracted using natural language processing are considered: lexical, pragmatic, prosodic and syntactic. We also investigate the use of idiosyncratic features to capture the peculiar and odd comments found in a text. To determine the effectiveness of the proposed process, a non-linear Support Vector Machine was used to classify texts, in terms of the identified features, according to whether they included sarcastic content or not. The results obtained demonstrate that a combination of syntactic, pragmatic and prosodic features produced the best performance, with an F-measure score of 0.852.

Publications: 2017 8th International Conference on Information Technology (ICIT). IEEE, 2017.

3.4 Opinion Mining in Twitter

Dave, Anandkumar D., and Nikita P. Desai.

Description :

Opinion mining and sentiment analysis refer to the identification and aggregation of attitudes or opinions expressed by internet users towards a specific topic. However, due to the limitation in terms of characters (i.e. 140 characters per tweet) and the use of informal language, the state-of-the-art approaches to sentiment analysis perform worse on Twitter than when they are applied to longer texts. Moreover, the presence of sarcasm makes the task even more challenging: sarcasm is when a person conveys implicit information, usually the opposite of what is said, within the message he transmits. In this paper we propose a method that makes use of a minimal set of features, yet efficiently classifies tweets regardless of their topic. We also study the importance of detecting sarcastic tweets automatically, and demonstrate how the accuracy of sentiment analysis can be enhanced by knowing which tweets are sarcastic and which are not.

Publications: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT). IEEE, 2016.

3.5 A comprehensive study of classification techniques for sarcasm detection


on textual data

Dave, A. D., & Desai, N. P.

Description:

During the last decade the majority of research has been carried out in the area of sentiment analysis of textual data available on the web. Sentiment analysis has its challenges, and one of them is sarcasm. Classification of sarcastic sentences is a difficult task due to variations in the textual form of sentences, and this can affect many Natural Language Processing based applications. Sarcasm is a kind of representation that conveys a different sentiment than the one presented. In our study we identify different supervised classification techniques mainly used for sarcasm detection and their features. We also analyze the results of these classification techniques on textual data available in various languages on review sites, social media sites and micro-blogging sites. Furthermore, for each method studied, our paper presents an analysis of the dataset generation and feature selection process used. We also carried out a preliminary experiment to detect sarcastic sentences in the Hindi language: we trained an SVM classifier with 10-fold cross-validation, with a simple bag-of-words as features and TF-IDF as the frequency measure of the features. We found that this simple bag-of-words model accurately classified 50% of sarcastic sentences. Thus, the preliminary experiment revealed that simple bag-of-words features are not sufficient for sarcasm detection.

Publications: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) (pp. 1985-1991). IEEE.

4. MACHINE LEARNING

4.1 Machine Learning Description

We know humans learn from their past experiences and machines follow instructions given by humans, but what if humans could train machines to learn from past data? Machine learning is more than just learning; it is also about understanding and reasoning. In machine learning, computers learn patterns from a set of data, and once they learn those patterns, they can apply the lessons to new, unseen data. Machine learning is an exciting field whose usage is exploding, and it will continue to reshape modern business and technology for the next decade, disrupting entire industries from agriculture to health care to finance.

Machine learning is improving global business practices across marketing, human resources and e-commerce. It powers emerging technologies such as self-driving cars, and it plays an important role in virtual and augmented reality. In the entrepreneurial arena, machine learning startups attract huge investments every week.

Machine learning is the practice of teaching computers how to learn patterns from data, often for making decisions or predictions. Practical machine learning focuses on intuition and simplicity, with a strong emphasis on results, whereas academic machine learning focuses on math and theory, with a strong emphasis on writing algorithms from scratch.

It is very hard to write explicit programs that solve problems like recognizing a face, which is why we extend this definition to our Artificial Intelligence systems. Note that learning in neural networks is very different from learning in rule-based systems.

Machine learning is the art and science of giving computers the ability to learn to make decisions from data without being explicitly programmed. The value of machine learning is only just beginning to show itself. There is a lot of data in the world today, generated not only by people but also by computers, phones and other devices. We see machine learning all around us in the products we use today; however, it isn't always apparent that machine learning is behind them. Today, machine learning's immediate applications are already quite wide-ranging, including image recognition, fraud detection and recommendation systems, as well as text and speech systems.

These powerful capabilities can be applied to a wide range of fields, from diabetic
retinopathy and skin cancer detection to retail and of course, transportation in the
form of self-parking and self-driving vehicles. It wasn’t that long ago that when a
company or product had machine learning in its offerings, it was considered novel.

4.2 Types of Machine Learning

Broadly, every machine learning algorithm can be classified into one of three categories: supervised learning, unsupervised learning and reinforcement learning. Supervised learning can be further divided into regression and classification. We will look at the characteristics of each category one by one: what supervised and unsupervised learning are, the parameters associated with them, and what kinds of problems each category tries to solve. Reinforcement learning is covered later in this chapter. Let us begin with supervised learning.

Supervised learning is based on a simple functional mapping from inputs X to labels Y. We supply a large amount of input data X, where each individual record has an associated label Y, and the task is to predict the function f that maps inputs to labels. In short, given the input data and the output labels, the goal is to find the predicted function f. A very simple example is regression: we fit the data with a regression algorithm and eventually obtain a model, and this model is nothing but the predicted function f.

Let us see how the problem of predicting house prices can be solved with the help of regression. Consider a house price dataset with two columns, where each row is a single record: the first column gives the size of an individual house and the second gives the price associated with that house. For example, one data point might read as a house of 2,104 square feet with its corresponding price. Given a large amount of such data, regression learns a continuous mapping from size to price.

Classification is another kind of supervised learning. It holds all the properties of a supervised learning algorithm; the only difference from regression is that the output values are discrete rather than continuous.

In classification there is a fixed set of output values: two, five, ten, or even a thousand classes are possible. Consider data plotted on a two-dimensional graph. In regression the output was continuous, but here the classification algorithm tries to find the boundary between the groups of data: one group of points lies in one class and the other group lies in the second class. Whenever we are given a new example, the algorithm classifies it into one of these groups. There are only discrete output values associated with the classes, and that is the only difference between regression and classification. A small illustration follows.
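A minimal illustration of the two settings, assuming scikit-learn and toy data invented for this example:

# Illustrative sketch (not from the report): regression vs. classification.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: continuous output (house size in square feet -> price).
sizes = [[850], [1200], [1500], [2104], [2600]]
prices = [180000, 245000, 300000, 400000, 510000]
reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[1800]]))    # a continuous value

# Classification: discrete output (two classes, 0 or 1).
X = [[1, 1], [2, 1], [8, 9], [9, 8]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[7, 8]]))    # one of the discrete class labels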

4.3 Steps in Machine Learning

In order to train a model, we need to collect data to train on. Suppose we want to tell beer from wine; our features will be colour and alcohol content. We get some equipment to do our measurements: a spectrometer for measuring colour and a hydrometer for measuring alcohol content. Once the equipment (and the drinks) is all set up, it is time for the first real step of machine learning: gathering data. In this case, the data we collect will be the colour and alcohol content of each drink. This yields a table of colour, alcohol content, and whether the drink is beer or wine; this will be our training data.

A few hours of measurements later, we have gathered our training data (and had a few drinks, perhaps). Now it is time for the next step of machine learning: data preparation. This is also a good time to do any pertinent visualizations of your data, helping you see whether there are any relevant relationships between different variables, as well as whether there are any data imbalances. For instance, if we collected way more data points about beer than wine, the model we train will be heavily biased towards guessing that virtually everything is beer, since it would be right most of the time.

We don't use the same data that the model was trained on for evaluation, since then it would just be able to memorize the questions, just as you wouldn't want to use the questions from your math homework on the math exam. Sometimes the data we collect needs other forms of adjustment and manipulation, such as de-duplication, normalization and error correction; this all happens at the data preparation step. In this case, we don't have any further data preparation needs, so let's move forward.

The next step in this workflow is choosing a model. There are many models that researchers and data scientists have created over the years. Some are very well suited for image data, others for sequences such as text or music, and others for numerical data. In this case, we have just two features: colour and alcohol percentage. Now we move on to what is often considered the bulk of machine learning: the training.

We'll use this data to incrementally improve our model's ability to predict whether a given drink is wine or beer. Let's look at what that means more concretely for our dataset. When we first start the training, it's as if the model drew a random line through the data. Then, as each step of the training progresses, the line moves step by step closer to the ideal separation of the wine and the beer.

Once training is complete, it's time to see if the model is any good, using evaluation. This is where the dataset that we set aside earlier comes into play: evaluation allows us to test our model against data that has never been used for training. The resulting metric shows how the model might perform against data that it has not yet seen, and is meant to be representative of how the model might perform in the real world. A short sketch of this workflow follows.
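A small sketch of this gather/split/train/evaluate workflow, assuming scikit-learn and synthetic data:

# Illustrative sketch (not from the report): hold out evaluation data,
# train a model, then measure performance on the unseen split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))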

Python
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed. Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception, and when the program doesn't catch the exception, the interpreter prints a stack trace. A source-level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.
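A tiny illustration of this exception behaviour:

def divide(a, b):
    return a / b

try:
    divide(1, 0)
except ZeroDivisionError as exc:
    # The bad input raised an exception instead of crashing the interpreter.
    print("caught:", exc)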
Python: Dynamic programming language which supports several different
programming paradigms:
 Procedural programming
 Object oriented programming
 Functional programming
Standard: Python byte code is executed in the Python interpreter (similar to Java), giving platform-independent code
 Extremely versatile language
Website development, data analysis, server maintenance, numerical analysis,
 Syntax is clear, easy to read and learn (almost pseudo code)
 Common language
 Intuitive object oriented programming
 Full modularity, hierarchical packages
 Comprehensive standard library for many tasks
 Big community
 Simply extendable via C/C++, wrapping of C/C++ libraries
 Focus: Programming speed
Anaconda

Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. Package versions are managed by the package management system conda. The Anaconda distribution is used by over 12 million users and includes more than 1,400 popular data-science packages suitable for Windows, Linux and macOS.

Anaconda will enable you to create virtual environments and install packages
needed for data science and deep learning. With virtual environments you can
install specific package versions for a particular project or a tutorial without
worrying about version conflicts.

Download Anaconda for your platform and choose the Python 3.6 version: https://www.anaconda.com/download.

By downloading Anaconda, you get conda, Python, Jupyter Notebook and hundreds of other open source packages.

Conda is a package manager to manage virtual environments and install packages.


Here are some helpful commands using conda:

#update conda in your default environment


$ conda upgrade conda
$ conda upgrade --all
# create a new environment with conda
$ conda create -n [my-env-name]
$ conda create -n [my-env-name] python=[python-version]
# activate the environment you created
$ source activate [my-env-name]
# take a look at the environment you created
$ conda info
$ conda list

# install a package with conda and verify it's installed
$ conda install numpy
$ conda list
# take a look at the list of environments you currently have
$ conda info -e
# remove an environment
$ conda env remove --name [my-env-name]

I highly recommend you download and print out the Anaconda cheat sheet.

Conda vs. pip install

You can use either conda or pip for installation in a virtual environment created with conda. They are both open-source package managers. Here are some differences:

 conda install — installs any software package.


 pip install — installs Python packages only, and it's the de facto Python package manager.

Numpy
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more. At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled code for performance. There are several important differences between NumPy arrays and the standard Python sequences:

 NumPy arrays have a fixed size at creation, unlike Python lists (which can
grow dynamically). Changing the size of an array will create a new array
and delete the original.
 The elements in a NumPy array are all required to be of the same data type,
and thus will be the same size in memory. The exception: one can have
arrays of (Python, including NumPy) objects, thereby allowing for arrays of
different sized elements.
 NumPy arrays facilitate advanced mathematical and other types of
operations on large numbers of data. Typically, such operations are
executed more efficiently and with less code than is possible using Python’s
built-in sequences.
 A growing plethora of scientific and mathematical Python-based packages
are using NumPy arrays; though these typically support Python-sequence
input, they convert such input to NumPy arrays prior to processing, and they
often output NumPy arrays. In other words, in order to efficiently use much
(perhaps even most) of today’s scientific/mathematical Python-based
software, just knowing how to use Python’s built-in sequence types is
insufficient - one also needs to know how to use NumPy arrays.
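A small illustration of these differences:

# Small illustration (not from the report): NumPy arrays vs. Python lists.
import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.int64)  # fixed size, homogeneous dtype
b = a * 2                                   # vectorized: runs in compiled code
print(b)                                    # [2 4 6 8]

# The equivalent with built-in sequences needs an explicit Python-level loop.
print([x * 2 for x in [1, 2, 3, 4]])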

Pandas
Data processing is an important part of analyzing data, because data is not always available in the preferred format. Various processing steps are necessary before analyzing the data, such as cleaning, restructuring or merging. NumPy, SciPy, Cython and Pandas are tools available in Python that can be used for fast processing of data. Pandas is built on top of NumPy, and it provides a rich set of functions to process various types of data. Working with Pandas is fast, easy and more expressive than with other tools: Pandas provides data processing as fast as NumPy, along with flexible data manipulation techniques like those of spreadsheets and relational databases. Lastly, Pandas integrates well with the matplotlib library, which makes it a very handy tool for analyzing data.

Pandas provides two very useful data structures to process data: Series and DataFrame. A Series is a one-dimensional array that can store various data types, including mixed data types; the row labels in a Series are called the index, and any list, tuple or dictionary can be converted into a Series.

DataFrame is the most widely used data structure of Pandas. Whereas a Series works with one-dimensional data, a DataFrame works with two-dimensional data and has two different indexes: a column index and a row index. The most common way to create a DataFrame is from a dictionary of equal-length lists, as shown below. Furthermore, all spreadsheets and text files are read in as DataFrames, so it is a very important data structure of Pandas.
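A minimal example of both structures (reconstructed here, since the original snippet did not survive):

# Reconstructed illustration: a Series from a list, and a DataFrame from a
# dictionary of equal-length lists.
import pandas as pd

s = pd.Series([10, 20, 30])          # one-dimensional, with a row index
print(s)

data = {"tweet": ["I love Mondays", "Great, more homework"],
        "sarcastic": [1, 1]}
df = pd.DataFrame(data)              # two-dimensional: row and column indexes
print(df)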

5. DATA COLLECTION

To train an algorithm to detect sarcasm, we first need some data to train our algorithm on. Classification is a supervised learning exercise, which means we need some sentences labeled as sarcastic and some labeled as non-sarcastic so that our classifier can learn the difference between the two. One option would be to go over an online corpus which might contain some sarcastic sentences, for example online reviews or comments, and label the sentences by hand. This can be a very tedious exercise if we want to have a large dataset. The other option is to rely on the people writing the sentences to tell us whether their sentences are sarcastic or not, and this is what we are going to do. The idea here is to use the Twitter API to stream tweets with the label #sarcasm, which will be our sarcastic texts, and tweets that don't have the label, which will be our non-sarcastic texts. The obvious advantage of taking our data from Twitter is that we can have as many samples as we want: every day people write new sarcastic tweets, and we can simply stream them and store them in a database. I ended up collecting 20,000 clean sarcastic tweets and 100,000 clean non-sarcastic tweets over a period of three weeks in June-July 2014 (see the section below to understand what a clean tweet is). Since tweets are often about what is currently happening in the world, it is important to collect the positive (sarcastic) and negative (non-sarcastic) samples during the same time period in order to isolate the sarcasm variable.

However, there is a drawback to taking our data from Twitter: it's noisy. Some people use the #sarcasm hashtag to point out that their tweet was meant to be sarcastic, but a human would not have been able to guess that the tweet is sarcastic without the label #sarcasm (example: What a great summer vacation I've been having so far :) #sarcasm). One may argue, however, that this is not really noise, since the tweet is still sarcastic, at least according to the tweet's owner, and sarcasm is in the eye of the beholder. The converse also happens: someone may write a tweet which is clearly sarcastic but without the label #sarcasm. There are also instances of sarcastic tweets where the sarcasm is in a linked picture or article. Sometimes tweets are responses to other tweets, in which case the sarcasm can only be understood within the context of the previous tweets. Sometimes the label #sarcasm is meant to indicate that, while the tweet itself is not sarcastic, some of its hashtags are (example: Time to do my homework #yay #sarcasm). I will discuss in the next section how to remove most of that noise, but short of reading all the tweets and labeling them by hand we cannot remove all of it.
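A sketch of what such collection might look like with the tweepy library, assuming the older tweepy 3.x streaming API (since removed in tweepy 4) and placeholder credentials; this is not the project's actual collection code:

# Sketch only: streaming #sarcasm-tagged tweets as the positive class.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # placeholders
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

collected = []  # (text, label) pairs

class SarcasmListener(tweepy.StreamListener):
    def on_status(self, status):
        # Every tweet arriving on this stream carries #sarcasm: positive class.
        collected.append((status.text, 1))

stream = tweepy.Stream(auth=auth, listener=SarcasmListener())
stream.filter(track=["#sarcasm"], languages=["en"])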

5.1 Feature Selection

Feature selection is based on deciding which features will make an impact in our project and which features we do not need to use. The features we need are extracted from the dataset, and the other features are left out. A feature can be multi-class as well as single-valued, so we need to decide what form each feature should take.

6. DATA PREPROCESSING

Before extracting features from our text data, it is important to clean it up. To remove the possibility of having sarcastic tweets in which the sarcasm is either in an attached link or in a response to another tweet, we simply discard all tweets that contain http addresses and all tweets that start with the @ symbol. Ideally we would only collect tweets that are written in English. When we collect sarcastic tweets, the requirement that they contain the label #sarcasm makes it very likely that they will be in English; to maximize the number of English tweets when we collect non-sarcastic tweets, we require that the location of the tweet is either San Francisco or New York. In addition to these steps, we remove tweets which contain non-ASCII characters. We then remove all the hashtags, all the friend tags and all mentions of the words sarcasm or sarcastic from the remaining tweets. If after this pruning stage the tweet is at least 3 words long, we add it to our dataset; this last requirement removes some noise from the sarcastic dataset, since one can hardly be sarcastic with only 2 words. Finally, we remove duplicates.
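A sketch of these cleaning rules (a reconstruction, not the project's actual code):

import re

def clean_tweet(text):
    # Return a cleaned tweet, or None if it should be discarded.
    if "http" in text or text.startswith("@"):
        return None                              # links / replies: drop
    if any(ord(c) > 127 for c in text):
        return None                              # non-ASCII characters: drop
    text = re.sub(r"#\w+", "", text)             # remove all hashtags
    text = re.sub(r"@\w+", "", text)             # remove friend tags
    text = re.sub(r"sarcas(m|tic)", "", text, flags=re.IGNORECASE)
    text = " ".join(text.split())                # normalize whitespace
    return text if len(text.split()) >= 3 else None

print(clean_tweet("What a great summer vacation so far :) #sarcasm"))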

Analysing the data helps in screening it carefully, which can prevent misleading results. Pre-processing is done in two major steps:

i) Feature Extraction and Feature Engineering

ii) Feature Cleaning

i) Feature Extraction and Feature Engineering:

The text must first be parsed to extract words, a step called tokenization. Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction.

This is the most important phase in the development of the system. Before applying feature extraction algorithms, the stemming of words is performed. Stemming is the process in which words are shortened and normalized to their stem, ignoring their tenses. For example, "cats running ran cactus cactuses cacti community communities" will be stemmed to "cat run ran cactu cactus cacti commun". The root of the word is preserved for better efficiency of feature extraction and to reduce redundancy. This system takes into account features derived from n-grams, sentiments, topics, POS tags, capitalization, etc. The n-gram features are mainly unigrams, i.e. containing one word (for example, "beautiful", "happy"), and bigrams, i.e. containing two words (for example, "hey there", "what's up"). Next we consider topics as features. Topics are basically words which have a high probability of appearing together; for example, "saturday", "night", "party" and "fever" are often used together. We extract the topics from the dataset and assign separate scores to them. For example, according to our training, words like "just what" and "yay" have a high occurrence in sarcastic tweets according to the scores that are generated. The sentiments from the previous step are loaded and their features are generated; for better accuracy the tweets are then split into 2 and 3 parts respectively and the scores are generated.
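An illustration of stemming and n-gram extraction, assuming NLTK and scikit-learn are installed; exact stems may vary slightly with the stemmer version:

# Illustration only (not the project's code): stemming with NLTK's
# PorterStemmer, then unigram/bigram extraction with scikit-learn.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
words = "cats running ran cactus cactuses cacti community communities".split()
print(" ".join(stemmer.stem(w) for w in words))
# roughly: cat run ran cactu cactus cacti commun commun

vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(["really great party", "super awesome night"])
print(vectorizer.get_feature_names_out())   # the unigrams and bigrams seen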

This is really the meat of the algorithm. The question here is, what are the variables
in a tweet that make it sarcastic or non-sarcastic? And how do we extract them
from the tweet? To this end I engineered several features that might help the
classification of tweets and I tested them on a cross-validation set (I will discuss
metrics for evaluating cross-validation in a later section). The most important
features that came out of this analysis are the following: 

n-grams: More precisely, unigrams and bigrams. These are just collections of one
word (example: really, great, awesome, etc.) and two words (example: really
great, super awesome, very weird, etc.). To extract those, each tweet
was tokenized, stemmed, uncapitalized and then each n-gram was added to a
binary feature dictionary. 

Sentiments: My hypothesis here is that sarcastic tweets might be more negative than non-sarcastic tweets, or the other way around. Moreover, there is often a big contrast of sentiments in sarcastic tweets: tweets often start with a very positive sentiment and end with a very negative sentiment (example: I love being cheated on #sarcasm). Sentiment analysis of tweets is a subject of its own, so the idea here is to have something simple that can test my hypothesis. To this end I first split each tweet into one, two and three parts, and then do a sentiment analysis on all parts of the three splittings. I used two distinct sentiment analyzers. The first one is my own quick and dirty implementation, which uses a dictionary that gives a positive and a negative sentiment score to each word of the English language; by looking up words in this dictionary, we can give a sentiment score to each part of a tweet. The other implementation used a Python library with a built-in sentiment score function.

Topics: There are words that are often grouped together in the same tweets (example: Saturday, party, night, friends, etc.). We call these groups of words topics. If we first learn the topics, then the classifier will just have to learn which topics are more associated with sarcasm, which will make the supervised learning easier and more accurate. To learn the topics, I used the Python library gensim, which implements topic modeling using latent Dirichlet allocation (LDA). We first feed all the tweets to the topic modeler, which learns the topics. Then each tweet can be decomposed as a sum of topics, which we use as features.
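A sketch of the topic-feature step with gensim (toy texts, not the project's data):

# Sketch only: topic features via LDA with gensim.
from gensim import corpora, models

texts = [["saturday", "night", "party", "friends"],
         ["homework", "math", "exam"],
         ["party", "friends", "fever", "night"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
# Each tweet decomposes into a mixture of topics, usable as features.
print(lda.get_document_topics(dictionary.doc2bow(["party", "night"])))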

ii) Feature cleaning:

After the feature extraction, we search for any null values in the data. If there are null values, we fill them with related content. In this process the stemming of words is also applied, so that similar words are considered as one, and then the model is built.

7. TRAINING THE SYSTEM

There is a very wide range of machine learning algorithms to choose from,


most of which are available in the python library Scikit-learn. However, most of
the implementations of these algorithms do not accept sparse matrices as inputs,
and since we have a large number of nominal features coming from our n-grams
features it is imperative that we encode our features in a sparse matrix. Out of the
algorithms that do support sparse matrices in Scikit-learn, I ended up trying naive
Bayes, logistic regression and support vector machine (SVM) with a linear kernel.
I got the best results in cross-validation using SVM with an L2 (Euclidean) regularization coefficient of 0.1.
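A sketch of this setup, assuming scikit-learn; LinearSVC is used here as a linear-kernel SVM that accepts sparse input, with C = 0.1 for the regularization described above:

# Sketch only (not the project's code): linear SVM on sparse n-gram features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

tweets = ["I love being cheated on", "had a nice walk today"]
labels = [1, 0]                       # 1 = sarcastic, 0 = non-sarcastic

vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(tweets)  # a sparse matrix, as required
clf = LinearSVC(C=0.1).fit(X, labels)
print(clf.predict(vectorizer.transform(["what a great day to be sick"])))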

The metric I used to guide my cross-validation is the F-score. This is a good metric when we have many more samples from one category than from the others; in our case we have 5 times more non-sarcastic tweets than sarcastic tweets. If we just use accuracy as our metric, that is, the number of correct predictions divided by the total number of tweets in our cross-validation set, then a simple classifier which always predicts tweets as non-sarcastic would get an 83% accuracy. This is obviously very misleading, so we need a better metric. We can do better by considering precision and recall for the sarcastic category. Precision is the number of sarcastic tweets correctly identified divided by the total number of tweets classified as sarcastic, while recall is the number of sarcastic tweets correctly identified divided by the total number of sarcastic tweets in the cross-validation set. Both precision and recall would be equal to 0% for a dumb classifier which always predicts tweets to be non-sarcastic, so these are already much better scores to quantify the quality of a sarcasm classifier. The F-score is simply the harmonic mean of precision and recall.

7.1 Model Selection

The possible learning paradigms in machine learning are:

i) Supervised learning

ii) Unsupervised learning

iii) Semi-supervised learning

iv) Reinforcement learning

i) Supervised learning

Supervised learning is the task of inferring a function from labeled training data. By fitting to the labeled training set, we want to find the optimal model parameters to predict unknown labels on other objects (the test set). If the label is a real number, we call the task regression; if the label comes from a limited number of unordered values, it is classification.

ii) Unsupervised learning

In unsupervised learning we have less information about objects, in


particular, the train set is unlabeled. What is our goal now? It’s possible to observe
some similarities between groups of objects and include them in appropriate
clusters. Some objects can differ hugely from all clusters, in this way we assume
these objects to be anomalies.

iii) Semi-supervised learning:

Semi-supervised learning tasks include both problems we described earlier: they use labeled and unlabeled data. That is a great opportunity for those who can't afford to label all their data. The method allows us to significantly improve accuracy, because we can use unlabeled data in the training set together with a small amount of labeled data.


iv) Reinforcement learning

Reinforcement learning is not like any of our previous tasks because we don’t have
labeled or unlabeled datasets here. RL is an area of machine learning concerned
with how software agents ought to take actions in some environment to maximize
some notion of cumulative reward.

Imagine you're a robot in some strange place: you can perform activities and get rewards from the environment for them. After each action your behavior becomes more complex and clever, so you are training yourself to behave in the most effective way at each step. In biology, this is called adaptation to the natural environment.

7.2 CLASSIFICATION

Naive Bayes

Naive Bayes is based on two assumptions. Firstly, all features of an instance to be classified contribute equally to the decision (they are equally important). Secondly, all attributes are statistically independent, meaning that knowing one attribute's value does not indicate anything about the other attributes' values, which is not always true in practice. The process of classifying an instance is done by applying Bayes' rule to each class given the instance. For example, in a fraud detection task, the formula is calculated for each of the two classes (fraudulent and legitimate), and the class with the higher probability is the predicted class for the instance.
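An illustrative sketch with scikit-learn's MultinomialNB (toy data, not the project's):

# Sketch only: Naive Bayes text classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free access win money", "meeting notes attached", "enlarge your prize"]
labels = [1, 0, 1]                 # 1 = spam-like, 0 = legitimate

X = CountVectorizer().fit_transform(docs)
nb = MultinomialNB().fit(X, labels)
print(nb.predict_proba(X[1]))      # class probabilities via Bayes' rule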

Support Vector Machine

In machine learning, support-vector machines (SVMs, also called support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use an SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a


non-linear classification using what is called the kernel trick, implicitly mapping
their inputs into high-dimensional feature spaces.

When data is unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data into groups and then map new data to these formed groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.

Random Forest

Random forests or random decision forests are an ensemble learning method


for classification, regression and other tasks that operates by constructing a
multitude of decision trees at training time and outputting the class that is
the mode of the classes (classification) or mean prediction (regression) of the
individual trees. Random decision forests correct for decision trees' habit
of overfitting to their training set. The first algorithm for random decision forests
was created by Tin Kam Ho using the random subspace method, which, in Ho's

formulation, is a way to implement the "stochastic discrimination" approach to
classification proposed by Eugene Kleinberg.

An extension of the algorithm was developed by Leo Breiman and Adele


Cutler, who registered "Random Forests" as a trademark (as of 2019, owned
by Minitab, Inc.). The extension combines Breiman's "bagging" idea and random
selection of features, introduced first by Ho and later independently by Amit
and Geman in order to construct a collection of decision trees with controlled
variance.
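A short illustration with scikit-learn, assuming toy data:

# Sketch only: a random forest as an ensemble of decision trees whose
# majority vote gives the predicted class.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [3, 3]]
y = [0, 1, 0, 0, 1, 1]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict([[2, 1]]))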

Neural Network

A neural network is a network or circuit of neurons, or in a modern sense,


an artificial neural network, composed of artificial neurons or nodes. Thus a neural
network is either a biological neural network, made up of real biological neurons,
or an artificial neural network, for solving artificial intelligence (AI) problems. The
connections of the biological neuron are modeled as weights. A positive weight
reflects an excitatory connection, while negative values mean inhibitory
connections. All inputs are modified by a weight and summed; this activity is referred to as a linear combination. Finally, an activation function controls
the amplitude of the output. For example, an acceptable range of output is usually
between 0 and 1, or it could be −1 and 1.
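A minimal sketch of a single artificial neuron, using a sigmoid as the activation function:

import math

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs: the linear combination described above.
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    # Sigmoid activation keeps the output amplitude in (0, 1).
    return 1 / (1 + math.exp(-z))

print(neuron([0.5, 0.2], [0.8, -0.4], 0.1))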

Unlike von Neumann model computations, artificial neural networks do not


separate memory and processing and operate via the flow of signals through the net
connections, somewhat akin to biological networks.

These artificial networks may be used for predictive modeling, adaptive control


and applications where they can be trained via a dataset. Self-learning resulting

from experience can occur within networks, which can derive conclusions from a
complex and seemingly unrelated set of information.

7.3 Measuring the Model Performance

Confusion Matrix

The first thing you will see here is the ROC curve, and we can determine whether our ROC curve is good or not by looking at the AUC (Area Under the Curve) and other parameters derived from the confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. All the measures except AUC can be calculated from the four cells of the confusion matrix, so let's talk about those four values first.

                          Predicted Class
                          Class = yes        Class = no
Actual    Class = yes     True Positive      False Negative
Class     Class = no      False Positive     True Negative

True positives and true negatives are the observations that are correctly predicted, and we want to maximize them; we want to minimize false positives and false negatives. These terms can be confusing, so let's take each one and understand it fully.

True Positives (TP) – These are the correctly predicted positive values,
meaning that the value of the actual class is yes and the value of the
predicted class is also yes, e.g. the actual class indicates that this
passenger survived and the predicted class tells you the same thing.

True Negatives (TN) – These are the correctly predicted negative values,
meaning that the value of the actual class is no and the value of the
predicted class is also no, e.g. the actual class says this passenger did not
survive and the predicted class tells you the same thing. False positives and
false negatives occur when the actual class contradicts the predicted class.

False Positives (FP) – When the actual class is no but the predicted class is
yes, e.g. the actual class says this passenger did not survive but the
predicted class tells you that this passenger will survive.

False Negatives (FN) – When the actual class is yes but the predicted class is
no, e.g. the actual class indicates that this passenger survived but the
predicted class tells you that the passenger will die.
Once you understand these four parameters, we can calculate Accuracy,
Precision, Recall and the F1 score, as sketched below.
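
As a small sketch of how these four counts are obtained and turned into the metrics discussed next, scikit-learn can compute them directly; the two label vectors below are made-up examples, not the project's data.

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Made-up example labels (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('TP:', tp, 'TN:', tn, 'FP:', fp, 'FN:', fn)

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1 score :', f1_score(y_true, y_pred))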

Accuracy - Accuracy is the most intuitive performance measure: it is simply
the ratio of correctly predicted observations to the total observations. One
might think that if we have high accuracy then our model is best. Accuracy is
a great measure, but only when you have a symmetric dataset where the counts
of false positives and false negatives are almost the same. Otherwise, you
have to look at other parameters to judge the performance of your model. For
our model, we got 0.803, which means our model is approximately 80% accurate.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision - Precision is the ratio of correctly predicted positive
observations to the total predicted positive observations. The question this
metric answers is: of all passengers labelled as survived, how many actually
survived? High precision relates to a low false positive rate. We got a
precision of 0.788, which is pretty good.

Precision = TP / (TP + FP)

Recall (Sensitivity) - Recall is the ratio of correctly predicted positive
observations to all observations in the actual positive class. The question
recall answers is: of all the passengers that actually survived, how many did
we label as such? We got a recall of 0.631, which is good for this model as it
is above 0.5.

Recall = TP / (TP + FN)

F1 score - The F1 score is the weighted average of precision and recall; it
therefore takes both false positives and false negatives into account.
Intuitively it is not as easy to understand as accuracy, but F1 is usually
more useful than accuracy, especially if you have an uneven class
distribution. Accuracy works best when false positives and false negatives
have similar cost; if their costs are very different, it is better to look at
both precision and recall. In our case, the F1 score is 0.701.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
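
Plugging in the scores reported above gives a quick consistency check: F1 = 2 × (0.631 × 0.788) / (0.631 + 0.788) = 0.994 / 1.419 ≈ 0.701, which matches the reported score.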

7.4 FEASIBILITY STUDY

The objective of the feasibility study is not only to solve the problem but
also to obtain a sense of its scope. During the study, the problem definition
was crystallized and the aspects of the problem to be included in the system
were determined. Consequently, benefits can be estimated with greater accuracy
at this stage. The key considerations are:

i) Economic feasibility

ii) Technical feasibility

iii) Social feasibility

i) Economic feasibility

This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can pour into
the research and development of the system is limited, so the expenditures
must be justified. The developed system is well within the budget, and this
was achieved because most of the technologies used are freely available; only
the customized products had to be purchased.

ii) Technical feasibility


This study is carried out to check the technical feasibility, that is, the
technical requirements of the system. Any system developed must not place a
high demand on the available technical resources, as this would lead to high
demands being placed on the client. The developed system must have modest
requirements, since only minimal or no changes are required for implementing
this system.

iii) Social feasibility


This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system
efficiently. The user must not feel threatened by the system but must accept
it as a necessity. The level of acceptance by the users depends solely on the
methods that are employed to educate the user about the system and to make him
familiar with it. His level of confidence must be raised so that he is also
able to make some constructive criticism, which is welcomed, as he is the
final user of the system.

8.REQUIREMENT SPECIFICATION

Hardware Requirements

Processor : Pentium Dual Core 2.3 GHz or higher

Hard Disk : 250 GB or Higher

RAM : 2 GB (Min)

Software Requirements

Operating System : Windows 7 or Higher

Languages used : Python (Pandas, NumPy, scikit-learn)

Tools : Anaconda, Jupyter Notebook and Spyder

Server : Flask (port 5000)

IDE : Python (front end), NLP (back end)

9.SYSTEM DESIGN
9.1 Architecture Diagram

9.2 Use Case Diagram

9.3 Class Diagram

9.4 Dataflow Diagram

9.4.1 Data Flow Level 0

9.4.2 Data Flow Level 1

9.4.3 Data Flow Level 2

9.5 Collaboration Diagram

9.6 Activity Diagram

10. CONCLUSION AND FUTURE ENHANCEMENT
Ways of improving existing sarcasm detection algorithms by including better
pre-processing and text mining techniques such as emoji and slang detection
are presented. Various techniques exist for classifying tweets as sarcastic or
non-sarcastic; this work takes up a classification algorithm and suggests
several improvements that directly contribute to better accuracy. The project
derived analytical views from a social media dataset and also filtered out or
reverse-analyzed sarcastic tweets to achieve comprehensive accuracy in
classifying the given data. The model has been tested in real time and can
capture live streaming tweets by filtering through hashtags and then perform
immediate classification, as sketched below.
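
A minimal sketch of that live-capture step is given below, assuming the tweepy 3.x streaming API; the credential strings are placeholders, and classify() is a hypothetical stand-in for the trained model's predict call.

# A sketch of live hashtag filtering, assuming the tweepy 3.x streaming API.
# The credential strings are placeholders, not working keys.
import tweepy

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

def classify(text):
    # Hypothetical stand-in for the trained model's predict call
    return 'Sarcastic' if 'love mondays' in text.lower() else 'Normal'

class SarcasmListener(tweepy.StreamListener):
    def on_status(self, status):
        # Classify each incoming tweet immediately
        print(classify(status.text), ':', status.text)

stream = tweepy.Stream(auth=auth, listener=SarcasmListener())
stream.filter(track=['#sarcasm'])   # capture live tweets by hashtag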

11.APPENDIX I
SOURCE CODE
# Import packages
import os
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
import warnings

warnings.filterwarnings('ignore')

# Load the per-tweet feature list
df = pd.read_csv(os.curdir + "/data/feature_list.csv")
data = df

# Logistic Regression, evaluated with 10-fold cross-validation
def LR_CV(data):
    acc = []
    logreg = LogisticRegression(C=1e-6, multi_class='ovr', penalty='l2',
                                random_state=0)
    predict = cross_val_predict(logreg, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100

# Support Vector Machine with a linear kernel
def SVM_CV(data):
    acc = []
    SVM = SVC(C=0.1, kernel='linear')
    predict = cross_val_predict(SVM, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100

# Decision Tree
def DT_CV(data):
    acc = []
    classifier = DecisionTreeClassifier()
    predict = cross_val_predict(classifier, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100

# Gaussian Naive Bayes
def NB_CV(data):
    acc = []
    classifier = GaussianNB()
    predict = cross_val_predict(classifier, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100

# Neural Network model (multi-layer perceptron)
def NN_CV(data):
    acc = []
    classifier = MLPClassifier(hidden_layer_sizes=(100, 100, 100),
                               max_iter=1000)
    predict = cross_val_predict(classifier, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict, data['label']))
    # print(metrics.classification_report(data['label'], predict))
    F1 = metrics.f1_score(data['label'], predict)
    P = metrics.precision_score(data['label'], predict)
    R = metrics.recall_score(data['label'], predict)
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100

# Random Forest model (regressor output rounded to class labels)
def RandForest_CV(data):
    acc = []
    classifier = RandomForestRegressor(n_estimators=1000, random_state=42)
    predict = cross_val_predict(classifier, data.drop(['label'], axis=1),
                                data['label'], cv=10)
    acc.append(accuracy_score(predict.round(), data['label']))
    # print(metrics.classification_report(data['label'], predict.round()))
    F1 = metrics.f1_score(data['label'], predict.round())
    P = metrics.precision_score(data['label'], predict.round())
    R = metrics.recall_score(data['label'], predict.round())
    return (float(sum(acc) / len(acc))) * 100, F1 * 100, P * 100, R * 100

features = ['User mention', 'Exclamation', 'Question mark', 'Ellipsis',
            'Interjection', 'UpperCase', 'RepeatLetters', 'SentimentScore',
            'positive word count', 'negative word count', 'polarity flip',
            'Nouns', 'Verbs', 'PositiveIntensifier', 'NegativeIntensifier',
            'Bigrams', 'Trigram', 'Skipgrams', 'Emoji Sentiment',
            'Passive aggressive count', 'Emoji_tweet_polarity flip']

# Calculate the per-feature accuracies for each model
dflr = pd.DataFrame(columns=['Feature', 'Accuracy-LR', 'f1', 'Precision', 'Recall'])
print("Model: " + "LR")
for feature in features:
    tiny_data = data[[feature, 'label']]
    Acc, F1, P, R = LR_CV(tiny_data)
    dflr.loc[feature] = [feature, Acc, F1, P, R]

dflr.to_csv(os.curdir + "/data/LR.csv", index=False)

dfdt = pd.DataFrame(columns=['Feature', 'Accuracy-DT', 'f1', 'Precision', 'Recall'])
print("Model: " + "DT")
for feature in features:
    tiny_data = data[[feature, 'label']]
    Acc, F1, P, R = DT_CV(tiny_data)
    dfdt.loc[feature] = [feature, Acc, F1, P, R]

dfdt.to_csv(os.curdir + "/data/DT.csv", index=False)

dfnb = pd.DataFrame(columns=['Feature', 'Accuracy-NB', 'f1', 'Precision', 'Recall'])
print("Model: " + "NB")
for feature in features:
    tiny_data = data[[feature, 'label']]
    Acc, F1, P, R = NB_CV(tiny_data)
    dfnb.loc[feature] = [feature, Acc, F1, P, R]

dfnb.to_csv(os.curdir + "/data/NB.csv", index=False)
dfnn = pd.DataFrame(columns=['Feature', 'Accuracy-NN', 'f1', 'Precision', 'Recall'])
print("Model: " + "NN")
for feature in features:
    tiny_data = data[[feature, 'label']]
    Acc, F1, P, R = NN_CV(tiny_data)
    dfnn.loc[feature] = [feature, Acc, F1, P, R]

dfnn.to_csv(os.curdir + "/data/NN.csv", index=False)

dfrand = pd.DataFrame(columns=['Feature', 'Accuracy-RandForest', 'f1', 'Precision', 'Recall'])
print("Model: " + "RandForest")
for feature in features:
    tiny_data = data[[feature, 'label']]
    Acc, F1, P, R = RandForest_CV(tiny_data)
    dfrand.loc[feature] = [feature, Acc, F1, P, R]

dfrand.to_csv(os.curdir + "/data/RAND.csv", index=False)

dfsvm = pd.DataFrame(columns=['Feature', 'Accuracy-SVM', 'f1', 'Precision', 'Recall'])
print("Model: " + "SVM")
for feature in features:
    tiny_data = data[[feature, 'label']]
    Acc, F1, P, R = SVM_CV(tiny_data)
    dfsvm.loc[feature] = [feature, Acc, F1, P, R]

dfsvm.to_csv(os.curdir + "/data/SVM.csv", index=False)

Django views (views.py)

from django.shortcuts import render, redirect
from .models import Register, Comments
import os
from django.conf import settings
import pickle
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from django.http import HttpResponse

def home(request):
    return render(request, "sarcasm/Spam.html")

def loginv(request):
    if request.method == "POST":
        name = request.POST['name']
        pwd = request.POST['pwd']
        verified = Register.objects.get(name=name)
        if verified.pwd == pwd:
            return render(request, "sarcasm/Spam.html")
    return render(request, "sarcasm/Login.html")

def register(request):
    if request.method == "POST":
        name = request.POST['name']
        pwd = request.POST['pwd']
        mailid = request.POST['mailid']
        ph = request.POST['ph']
        if name == "" or pwd == "":
            return redirect('/post/')
        else:
            reg = Register(
                name=name,
                pwd=pwd,
                mailid=mailid,
                ph=ph
            )
            reg.save()
            return render(request, "sarcasm/Login.html")
    return render(request, "sarcasm/Register.html")

def post(request):
    # Re-fit the vectorizer on the training tweets so that input text
    # can be transformed with the same vocabulary the model was trained on
    data = pd.read_csv("E:/ML-Project/Sarcasm-copy/Sarcastic.csv")
    X = data['Tweet']
    Y = data['Class']
    cv = CountVectorizer()
    cv.fit(X)
    # Load the trained classifier from disk
    file = "E:/ML-Project/Sarcasm-copy/RF.sav"
    loaded_model = pickle.load(open(file, 'rb'))
    data = Comments.objects.filter(spam="Normal")
    if request.method == "GET":
        return render(request, "sarcasm/Post.html", {'data': data})
    else:
        cmd = request.POST["cmd"]
        vect = cv.transform([cmd]).toarray()
        prediction = loaded_model.predict(vect)
        # Label the comment and save it; only "Normal" comments are listed
        label = "Sarcastic" if prediction == 1 else "Normal"
        comments = Comments(feed=cmd, spam=label)
        comments.save()
        return redirect('/post/')
Model (models.py)

from django.db import models

class Register(models.Model):
    name = models.CharField(max_length=500)
    pwd = models.CharField(max_length=500)
    mailid = models.CharField(max_length=500)
    ph = models.CharField(max_length=500)

    def __str__(self):
        return self.name

class Comments(models.Model):
    feed = models.CharField(max_length=500)
    spam = models.CharField(max_length=10)

12.APPENDIX II

EXPERIMENTAL RESULTS

Sentiment analysis:

Webpage:

Training the system:

Measuring model accuracy:

Graphs:
Naïve Bayes

Neural Network

Random Forest

SVM

REFERENCES

1. Wicana, Setra Genyang, Taha Yasin İbisoglu, and Uraz Yavanoglu. "A review
on sarcasm detection from machine-learning perspective." 2017 IEEE 11th
International Conference on Semantic Computing (ICSC). IEEE, 2017.

2. Selvan, Lokmanyathilak Govindan Sankar, and Teng-Sheng Moh. "A framework
for fast-feedback opinion mining on Twitter data streams." 2015 International
Conference on Collaboration Technologies and Systems (CTS). IEEE, 2015.
3. Suhaimin, Mohd Suhairi Md, et al. "Natural language processing based
features for sarcasm detection: An investigation using bilingual social
media texts." 2017 8th International Conference on Information
Technology (ICIT). IEEE, 2017.
4. Joshi, A., Tripathi, V., Patel, K., Bhattacharyya, P., & Carman, M. (2016).
Are word embedding-based features useful for sarcasm detection?. arXiv
preprint arXiv:1610.00883.
5. Dave, Anandkumar D., and Nikita P. Desai. "A comprehensive study of
classification techniques for sarcasm detection on textual data." 2016
International Conference on Electrical, Electronics, and Optimization
Techniques (ICEEOT). IEEE, 2016.

6. Hiai, Satoshi, and Kazutaka Shimada. "A sarcasm extraction method
based on patterns of evaluation expressions." 2016 5th IIAI International
Congress on Advanced Applied Informatics (IIAI-AAI). IEEE, 2016.
7. Dave, A. D., & Desai, N. P. (2016, March). A comprehensive study of
classification techniques for sarcasm detection on textual data. In  2016
International Conference on Electrical, Electronics, and Optimization
Techniques (ICEEOT) (pp. 1985-1991). IEEE.
8. Zhang, Meishan, Yue Zhang, and Guohong Fu. "Tweet sarcasm detection
using deep neural network." Proceedings of COLING 2016, The 26th
International Conference on Computational Linguistics: Technical Papers.
2016.
9. Bharti, Santosh Kumar, et al. "Sarcasm analysis on twitter data using
machine learning approaches." Trends in Social Network Analysis.
Springer, Cham, 2017. 51-76.
10. Ahmad, Tanvir, et al. "Satire detection from web documents using
machine learning methods." 2014 International Conference on Soft
Computing and Machine Intelligence. IEEE, 2014.
11. Dmitry Davidov, Oren Tsur and Ari Rappoport, Semi-Supervised
Recognition of Sarcastic Sentences in Twitter and Amazon 
12. Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan
Gilbert and Ruihong Huang, Sarcasm as Contrast between a Positive
Sentiment and Negative Situation 
13. Roberto Gonzalez-Ibanez, Smaranda Muresan and Nina
Wacholder, Identifying Sarcasm in Twitter: A Closer Look 
14. Christine Liebrecht, Florian Kunneman and Antal Van den Bosch, The
perfect solution for detecting sarcasm in tweets #not 
