Text Sentiment Analysis
Given a customer review, classify whether the message conveys positive, negative, or neutral
sentiment. For messages conveying both positive and negative sentiment, the stronger of the
two sentiments should be chosen.
1.3 Project Description and Details
Sentiment analysis, also referred to as opinion mining, is a machine learning subtask in which we
want to determine the general sentiment of a given review. Using machine learning techniques
and natural language processing, we can extract the subjective information of a review and
classify it according to its polarity: positive, neutral, or negative. It is a very useful analysis,
since we could determine the overall opinion about a product on sale, or predict the stock market
for a given company (if most people think positively about it, its stock price may well increase),
and so on. Sentiment analysis is far from solved, since language is very complex
(objectivity/subjectivity, negation, vocabulary, grammar), but that is also why it is so
interesting to work on.
In this project we chose to classify reviews from Flipkart, one of India's leading e-commerce
marketplaces, into "positive", "neutral", or "negative" sentiment by building a model based on
probabilities. Various techniques such as data processing, data filtering, and feature extraction
are applied to the reviews before machine learning models such as Naive Bayes are used to find the
sentiment. Data processing involves tokenization, which is the process of splitting the reviews
into individual words called tokens. Tokens can be split on whitespace or punctuation characters,
and can be unigrams or bigrams depending on the classification model used. Tokens acquired after
data processing still carry a portion of raw information that we may or may not find useful for
our application. Thus, the reviews are further filtered by removing stop words, numbers, and
punctuation. Stop words are extremely common words like "is", "am", and "are" that hold no
additional information; their removal is implemented using a CountVectorizer. TF-IDF is a
feature vectorization method used in text mining to find the importance of a term to a document
in the corpus.
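As a sketch of how tokenization, stop-word removal, and unigram/bigram extraction might look in code (the review strings below are invented for illustration, and scikit-learn's CountVectorizer is assumed, as named in the text):

```python
# A minimal sketch: CountVectorizer tokenizes each review, drops English
# stop words, and keeps both unigrams and bigrams as features.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This phone is really good",
    "The battery is bad and the camera is worse",
]

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # surviving unigram/bigram tokens
print(counts.toarray())                    # per-review token counts
```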
Sentiment analysis is done by using Naive Bayes algorithm which finds polarity as below:
Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and
Bayes’ Theorem to predict the tag of a text (like a customer review). They are probabilistic,
which means that they calculate the probability of each tag for a given text, and then output the
tag with the highest one. The way they get these probabilities is by using Bayes’ Theorem, which
describes the probability of a feature, based on prior knowledge of conditions that might be
related to that feature.
Sentiment analysis output: The output for a given review will be +1, 0, or -1. "+1" indicates a
positive review, "0" a neutral review, and "-1" a negative review.
1.4 Definition, Acronyms and Abbreviations
Data Flow Diagram [DFD]: A data flow diagram (DFD) is a graphical representation of
the "flow" of data through an information system, modelling its process aspects. A
DFD is often used as a preliminary step to create an overview of the system without
going into detail, which can later be elaborated.
Low Level Design [LLD]: Low-level design (LLD) is a systematic refinement process.
This process can be used for designing data structures, the required software
architecture, source code and, ultimately, performance algorithms.
SYSTEM REQUIREMENTS SPECIFICATIONS
FUNCTIONAL REQUIREMENTS
User Interface:
Describe the logical characteristics of each interface. This includes sample screen images, GUI
standards, screen layout constraints, standard buttons and functions (e.g., help) that will appear
on every screen. Details of the user interface design should be documented in a separate user
interface specification.
Hardware Interface:
Describe the logical and physical characteristics of each interface. This may include the
supported device types, the nature of the data and control interactions between the software and
the hardware.
Software Interface:
Describe the connections between this product and other specific software components (name
and version), including databases, operating systems, tools, libraries, and integrated commercial
components. Identify the data items or messages coming into the system and going out and
describe the purpose of each. Describe the services needed and the nature of communications.
Refer to documents that describe detailed application programming interface protocols. Identify
data that will be shared across software components.
Communication Interface:
Describe the requirements associated with any communications functions required by this
product, including e-mail, web browser, network server communications protocols, electronic
forms, and so on. Define any pertinent message formatting.
NON-FUNCTIONAL REQUIREMENTS
Performance Requirements:
If there are performance requirements for the product under various circumstances, state them here
and explain their rationale, to help the developers understand the intent and make suitable design
choices. Specify the timing relationships for real time systems. Make such requirements as specific
as possible. You may need to state performance requirements for individual functional
requirements or features.
Safety Requirements:
Specify those requirements that are concerned with possible loss, damage, or harm that could result
from the use of the product. Define any safeguards or actions that must be taken, as well as actions
that must be prevented. Refer to any external policies or regulations that state safety issues that
affect the product’s design or use. Define any safety certifications that must be satisfied.
Security Requirements:
Specify any requirements regarding security or privacy issues surrounding use of the product or
protection of the data used or created by the product. Define any user identity authentication
requirements. Refer to any external policies or regulations containing security issues that affect the
product. Define any security or privacy certifications that must be satisfied.
3.1 Process Model (Data Flow Diagram (DFD))
4.1 Screen Designs (User Interface):
This is the graphical user interface of the Text Sentiment Analysis system, in which a customer
can write his/her review, which is then fed to the trained sentiment model.
The user will get the following response when the entered review's sentiment is positive.
The user will get the following response when the entered review's sentiment is neutral.
The user will get the following response when the entered review's sentiment is negative.
5.1 Function Details, Description and Prototype
Data in the form of raw reviews is retrieved by using Selenium for web scraping to get reviews
from e-commerce websites such as Flipkart. Web scraping is a technique for extracting information
from the internet automatically, using a tool such as Selenium that simulates human web surfing.
Web scraping helps us extract large volumes of data about customers, products, people, stock
markets, etc. It is usually difficult to get this kind of information on a large scale using
traditional data collection methods. We can utilize the data collected from websites such as
e-commerce portals and social media channels to understand customer behavior and sentiment,
buying patterns, and brand attribute associations, which are critical insights for any business.
Selenium uses a driver that opens up a version of your web browser that can be controlled by
Python. This has the advantage that the website you are visiting sees you essentially like any
other human surfer, allowing you to access information in the same way.
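A minimal Selenium sketch of this retrieval step is shown below; the URL and the CSS selector are placeholders, since real product pages need their own selectors and page markup changes frequently:

```python
# Open a Python-controlled browser, load a (placeholder) review page, and
# collect the review texts it contains.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # starts a Chrome window driven by Python
try:
    driver.get("https://www.example.com/product-reviews")  # placeholder URL
    # ".review-text" is a hypothetical selector for review bodies.
    elements = driver.find_elements(By.CSS_SELECTOR, ".review-text")
    reviews = [el.text for el in elements]
    print(f"Scraped {len(reviews)} reviews")
finally:
    driver.quit()  # always release the browser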
5.1.3. Data Filtering:
A review acquired after data processing still carries a portion of raw information that we may or
may not find useful for our application. Thus, the reviews are further filtered by removing stop
words, numbers and punctuation.
Stop words: Reviews contain stop words, extremely common words like "is", "am", and "are" that
hold no additional information. These words serve no purpose, and this feature is implemented
using a list stored in stopfile.dat. We then compare each word in a review with this list and
delete the words matching the stop list.
Removing non-alphabetical characters: Symbols such as "#" and "@" and numbers hold no relevance
for sentiment analysis and are removed using pattern matching.
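A short sketch of this filtering step is given below; it assumes stopfile.dat holds one stop word per line, which is an assumption about the file's format:

```python
# Strip non-alphabetic characters with a regular expression, then drop any
# token that appears in the stop list loaded from stopfile.dat.
import re

with open("stopfile.dat") as f:
    stop_words = {line.strip().lower() for line in f if line.strip()}

def filter_review(review):
    cleaned = re.sub(r"[^A-Za-z\s]", " ", review)  # removes "#", "@", digits, ...
    return [w for w in cleaned.lower().split() if w not in stop_words]

print(filter_review("The camera is good!! #awesome 100%"))
```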
Stemming: Stemming is the process of reducing a word to its word stem by stripping affixes
(suffixes and prefixes), bringing it closer to the root form known as a lemma. Stemming is
important in natural language understanding (NLU) and natural language processing (NLP), and
uses a number of approaches to reduce a word to its base form from whatever inflected form is
encountered. Stemming is also used in query processing and Internet search engines.
Lemmatization: Lemmatization is the process of converting a word to its base form. The
difference between stemming and lemmatization is that lemmatization considers the context and
converts the word to its meaningful base form, whereas stemming just removes the last few
characters, often leading to incorrect meanings and spelling errors. Lemmatization and
stemming are done with the help of TextBlob.
For example, lemmatization would correctly reduce the base form of 'caring' to 'care', whereas
stemming might simply cut off the 'ing' part and convert it to 'car'.
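Since the text says TextBlob is used for both steps, a small sketch contrasting them could look like this (the exact stemmer output depends on the algorithm used):

```python
# TextBlob's Word wraps a string with stem() (Porter stemmer) and lemmatize().
from textblob import Word

w = Word("caring")
print(w.stem())          # Porter stemming; more aggressive stemmers may yield "car"
print(w.lemmatize("v"))  # lemmatized as a verb -> "care"
```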
5.1.4. Feature Extraction:
TF-IDF: TF-IDF is a feature vectorization method used in text mining to find the importance of
a term to a document in the corpus. TF-IDF, short for Term Frequency-Inverse Document Frequency,
is another way to convert textual data to numeric form. The vector value it yields is the
product of these two terms: TF and IDF.
Let’s first look at Term Frequency. Let’s say we have two documents in our corpus as below.
1. I love dogs
2. I hate dogs and knitting
Relative term frequency is calculated for each term within each document as below:

TF(t, d) = (number of times term t appears in d) / (total number of terms in d)

For example, if we calculate the relative term frequency for 'I' in both documents, we get
TF('I', document 1) = 1/3 and TF('I', document 2) = 1/5.
Next, we need the Inverse Document Frequency, which measures how important a word is for
differentiating the documents, calculated as below:

IDF(t) = log(total number of documents / number of documents containing t)

If we calculate the inverse document frequency for 'I', we get IDF('I') = log(2/2) = 0, since
'I' appears in both documents. Once we have the values for TF and IDF, we can calculate TF-IDF
as their product:

TFIDF(t, d) = TF(t, d) × IDF(t)
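These formulas can be implemented directly; the sketch below computes TF-IDF for the two-document toy corpus above (natural log is used here, though the base varies between libraries):

```python
import math

docs = [["i", "love", "dogs"], ["i", "hate", "dogs", "and", "knitting"]]

def tf(term, doc):
    return doc.count(term) / len(doc)      # relative term frequency

def idf(term, docs):
    df = sum(term in doc for doc in docs)  # documents containing the term
    return math.log(len(docs) / df)

for term in ("i", "love", "knitting"):
    print(term, [round(tf(term, d) * idf(term, docs), 3) for d in docs])
# "i" scores 0 in both documents: it appears everywhere, so IDF = log(2/2) = 0.
```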
Word2Vec: Word2vec is a two-layer neural net that processes text. Its input is a text corpus and
its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a
deep neural network, it turns text into a numerical form that deep nets can understand. The
purpose and usefulness of Word2vec is to group the vectors of similar words together in vector
space. That is, it detects similarities mathematically. Word2vec creates vectors that are
distributed numerical representations of word features, features such as the context of individual
words. It does so without human intervention.
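A minimal gensim sketch of training Word2Vec on a toy tokenized corpus is shown below (the gensim 4.x API is assumed; real use needs far more text for meaningful vectors):

```python
from gensim.models import Word2Vec

sentences = [
    ["battery", "life", "is", "great"],
    ["battery", "drains", "fast"],
    ["camera", "quality", "is", "great"],
]

# vector_size: dimensionality of the word vectors; window: context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["battery"][:5])           # first few components of one vector
print(model.wv.most_similar("battery"))  # nearest words in the vector space
```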
Sentiment classification itself is performed with the Naive Bayes algorithm already described in
Section 1.3: the classifier uses Bayes' Theorem to compute the probability of each sentiment tag
for a given review and outputs the tag with the highest probability.
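One plausible realization of this pipeline, sketched with scikit-learn (the labeled training reviews below are invented for illustration):

```python
# TF-IDF features feeding a Multinomial Naive Bayes classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_reviews = [
    "excellent product, totally worth it",
    "works fine, nothing special",
    "terrible quality, stopped working",
]
train_labels = [1, 0, -1]  # +1 positive, 0 neutral, -1 negative

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_reviews, train_labels)

print(model.predict(["really worth the money"]))        # expected: [1]
print(model.predict_proba(["really worth the money"]))  # probability per class
```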
5.2 Data type and Data structure
Data Type:
There are different types of Python data types. Some built-in Python data types are numbers
(int, float, complex), strings (str), and booleans (bool). In Python, unlike C or C++, we need
not declare a data type when creating a variable; we simply assign a value to it. If we want to
see what type of value a variable is currently holding, we can use type().
A string is a sequence of characters. Python supports Unicode characters. Generally, strings are
delimited by either single or double quotes, and can be output to the screen using the print()
function.
Like many other popular programming languages, strings in Python are arrays of bytes
representing Unicode characters. However, Python does not have a separate character data type; a
single character is simply a string with a length of 1. Square brackets can be used to access
elements of the string.
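A short illustration of these points:

```python
x = 42
print(type(x))   # <class 'int'>
x = 3.14         # the same name can be rebound to a value of another type
print(type(x))   # <class 'float'>

s = "sentiment"
print(type(s))   # <class 'str'>
print(s[0])      # 's'    -- square brackets index into the string
print(s[0:4])    # 'sent' -- slicing returns a substring
```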
Data Structure:
There are different types of python data structures. Some built-in python data structures are:
List
A List is a data structure that holds an ordered collection of items i.e. you can store a sequence of
items in a list. This is easy to imagine if you can think of a shopping list where you have a list of
items to buy, except that you probably have each item on a separate line in your shopping list
whereas in Python you put commas in between them. The list of items should be enclosed in
square brackets so that Python understands that you are specifying a list. Once you have created
a list, you can add, remove or search for items in the list. Since we can add and remove items, we
say that a list is a mutable data type i.e. this type can be altered.
Tuples
Tuples are used to hold together multiple objects. Think of them as similar to lists, but without
the extensive functionality that the list class gives you. One major feature of tuples is that they
are immutable like strings i.e. you cannot modify tuples. Tuples are defined by specifying items
separated by commas within an optional pair of parentheses. Tuples are usually used in cases
where a statement or a user-defined function can safely assume that the collection of values (i.e.
the tuple of values used) will not change.
Dictionary
A dictionary is like an address-book where you can find the address or contact details of a person
by knowing only his/her name i.e. we associate keys (name) with values (details). Note that the
key must be unique just like you cannot find out the correct information if you have two persons
with the exact same name. Note that you can use only immutable objects (like strings) for the
keys of a dictionary but you can use either immutable or mutable objects for the values of the
dictionary. This basically translates to say that you should use only simple objects for keys.
Pairs of keys and values are specified in a dictionary using the notation
d = {key1: value1, key2: value2}. Notice that each key is separated from its value by a colon,
the pairs are separated from one another by commas, and all of this is enclosed in a pair of
curly braces.
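The three structures side by side:

```python
shopping = ["apples", "bread", "milk"]  # list: ordered and mutable
shopping.append("eggs")                 # lists can grow...
shopping.remove("bread")                # ...and shrink

point = (3, 4)                          # tuple: immutable once created
# point[0] = 5 would raise a TypeError

contacts = {"Alice": "alice@example.com",  # dict: unique keys map to values
            "Bob": "bob@example.com"}
print(shopping, point[0], contacts["Alice"])
```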
5.3 Data Visualization
Fig 5.3.3 Negative Review
5.4 Algorithm
The basis of Naive Bayes algorithm is Bayes’ theorem or alternatively known as Bayes’ rule or
Bayes’ law. It gives us a method to calculate the conditional probability, i.e., the probability of
an event based on previous knowledge of related events. More formally, Bayes' Theorem
is stated as the following equation:

P(A|B) = P(B|A) × P(A) / P(B)
Let us understand the statement first and then we will look at the proof of the statement. The
components of the above statement are:
P(A|B): probability (conditional probability) of the occurrence of event A given that event B is true.
P(B|A): probability of the occurrence of event B given that event A is true (the likelihood).
P(A) and P(B): probabilities of the occurrence of events A and B respectively.
The terminology in the Bayesian method of probability (more commonly used) is as follows:
P(A) is called the prior probability of the proposition, and P(B) is called the prior
probability of the evidence.
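As a toy numeric application of the theorem to sentiment (all probabilities below are invented for illustration), let A be "the review is positive" and B be "the review contains the word 'great'":

```python
p_positive = 0.6          # prior: 60% of training reviews are positive
p_great_given_pos = 0.20  # "great" appears in 20% of positive reviews
p_great_given_neg = 0.02  # ...and in 2% of non-positive reviews

# Total probability of seeing "great" (law of total probability).
p_great = (p_great_given_pos * p_positive
           + p_great_given_neg * (1 - p_positive))

posterior = p_great_given_pos * p_positive / p_great
print(f"P(positive | 'great') = {posterior:.3f}")  # 0.120 / 0.128 ≈ 0.938
```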
5.4 UNIT TEST PLAN
Unit Testing is a level of software testing where individual units/components of software are
tested. The purpose is to validate that each unit of the software performs as designed. A unit is
the smallest testable part of any software. It usually has one or a few inputs and usually a single
output. In procedural programming, a unit may be an individual program, function, procedure,
etc. In object-oriented programming, the smallest unit is a method, which may belong to a
base/super class, abstract class, or derived/child class.
When is it performed?
Unit testing is the first level of software testing and is performed prior to Integration
Testing.
1. Unit Test Plan Scope (In Scope – Out of Scope)
4.1 Predicting Values Data Sentiment Value
5.5 Standard Error Messages
This document identifies some of the error codes and messages that the software returns.
Specifically, the errors listed here are in the global, or default, domain for the software.
Error Message | Cause | Resolution
Input URL Invalid | The URL given by the user cannot be resolved | Re-enter the correct URL
Web Scraping Failed | The given webpage does not support scraping | Try another page, or use plain text data
Features out of bound | Too many features taken into consideration | Try updating the software, or change the input source
WHITE BOX TESTING:
WHITE BOX TESTING is a software testing method in which the internal
structure/design/implementation of the item being tested is known to the tester.

Module | Test Case | Input | Expected Output | Actual Output | Result
Data Modeling | Model Selection (enriches the data) | <<Enter value>> | <<Enter value>> | Selects the best model for the problem at hand | PASS
Black Box Testing
BLACK BOX TESTING, also known as Behavioral Testing, is a software testing method in
which the internal structure/design/implementation of the item being tested is not known to the
tester. These tests can be functional or non-functional, though usually functional.
This method is named so because the software program, in the eyes of the tester, is like a black
box that one cannot see inside. This method attempts to find errors in the following
categories: incorrect or missing functions; interface errors; errors in data structures or
external database access; behavior or performance errors; and initialization and termination
errors.
Module | Input | Expected Output | Actual Output | Result
Data Pre-processing | <<Enter Data>> | <<Enter Data>> | Takes raw data and returns a comprehensive dataset | PASS
7. Conclusion and Future Scope
The task of sentiment analysis, especially in the domain of finding the main entity of a review,
is still in the developing stage and far from complete. There are also some modifications that
could be applied to our classifier in order to get better accuracy, but they are out of the scope
of this project.
Right now we have worked with only the very simplest unigram models; we can improve those
models by adding extra information such as the closeness of a word to a negation word. We
could specify a window prior to the word under consideration (a window could, for example, be
2 or 3 words), and the effect of negation would be incorporated into the model if it lies
within that window. The closer the negation word is to the unigram word whose prior polarity
is to be calculated, the more it should affect the polarity. For example, if the negation is
right next to the word, it may simply reverse the polarity of that word, and the farther the
negation is from the word, the more its effect should be attenuated. Apart from this, we are
currently only focusing on unigrams, and the effect of bigrams and trigrams may be explored. As
reported in the literature review section, when bigrams are used along with unigrams this
usually enhances performance.
In this project we focused on general sentiment analysis. There is potential for work in
the field of sentiment analysis with partially known context. For example, we noticed that
users generally use our website for specific types of keywords, which can be divided into a
few distinct classes, namely: politics/politicians, celebrities, products/brands,
sports/sportsmen, and media/movies/music. So we could attempt to perform separate sentiment
analysis on reviews that belong to only one of these classes (i.e., the training data would
not be general but specific to one of these categories) and compare the results with those
obtained by applying general sentiment analysis instead.
In this project, we compared various approaches to tackling sentiment analysis. Among the
challenges we encountered, the imbalance of the data available online forced us to drop 20,000
reviews from Flipkart. This could be improved by using three different datasets (oversampled,
downsampled, and original) to see how different sampling techniques affect the learning of a
classifier. Neural network models such as an LSTM could also be applied.