Text Sentiment Analysis


Problem Statement and Team Architecture

1.1 Problem Statement

Given a customer review, classify whether the message is of positive, negative, or neutral
sentiment. For messages conveying both a positive and negative sentiment, whichever is the
stronger sentiment should be chosen.

1.2 Team Architecture

Prashant Chaturvedi (Project Leader, Designer and Coder)

Megha Teckchandani (Requirements Gatherer and System Analyst)

Naman Jain (System Analyst and Designer)

Parth Tripathi (Coder and Tester)

1.3 Project Description and Details

Sentiment analysis, also referred to as opinion mining, is a machine learning subtask in which we want
to determine the general sentiment of a given review. Using machine learning techniques and natural
language processing, we can extract the subjective information of a review and classify it according
to its polarity as positive, neutral or negative. It is a very useful analysis, since we could determine
the overall opinion about a product on sale, or predict stock markets for a given company: if most
people think positively about it, its stock price will likely increase, and so on. Sentiment analysis is
actually far from being solved, since language is very complex (objectivity/subjectivity, negation,
vocabulary, grammar), but that is also what makes it interesting to work on.

In this project we chose to classify reviews from Flipkart as "positive", "neutral" or "negative" by
building a model based on probabilities. Flipkart is one of India's leading e-commerce marketplaces.
Various techniques such as data processing, data filtering and feature extraction are applied to the
reviews before machine learning models such as Naive Bayes are used to find the sentiment. Data
processing involves tokenization, which is the process of splitting the reviews into individual words
called tokens. Tokens can be split on whitespace or punctuation characters, and can be unigrams or
bigrams depending on the classification model used. Tokens obtained after data processing still
contain a portion of raw information which we may or may not find useful for our application. Thus,
the reviews are further filtered by removing stop words, numbers and punctuation. Stop words are
extremely common words such as "is", "am" and "are" that hold no additional information; this
filtering is implemented using a CountVectorizer. TF-IDF is a feature vectorization method used in
text mining to find the importance of a term to a document in the corpus.
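
As a rough illustration of the vectorization step mentioned above, the following minimal sketch uses
scikit-learn's CountVectorizer (scikit-learn 1.x is assumed); the two example reviews are invented for
illustration:

    # Minimal sketch: turning raw reviews into count features with stop words removed.
    from sklearn.feature_extraction.text import CountVectorizer

    reviews = [
        "Nice camera and good gaming performance",
        "Battery is bad and the phone heats up",
    ]

    # stop_words="english" drops common words such as "is", "and", "the".
    vectorizer = CountVectorizer(stop_words="english", lowercase=True)
    X = vectorizer.fit_transform(reviews)

    print(vectorizer.get_feature_names_out())  # remaining vocabulary
    print(X.toarray())                         # per-review token counts

TF-IDF weighting can be obtained in the same way by swapping in TfidfVectorizer.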

Sentiment analysis is done using the Naive Bayes algorithm, which finds the polarity as follows.
Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and
Bayes' theorem to predict the tag of a text (such as a customer review). They are probabilistic,
which means that they calculate the probability of each tag for a given text, and then output the
tag with the highest probability. They obtain these probabilities using Bayes' theorem, which
describes the probability of a feature based on prior knowledge of conditions that might be
related to that feature.

Sentiment analysis output: the output for a given review will be +1, 0 or -1. "+1" indicates a
positive review, "0" a neutral review and "-1" a negative review.

1.4 Definition, Acronyms and Abbreviations

User Requirement Specification [URS]: The user requirements document (URD) or user
requirements specification (URS) is a document, usually used in software engineering, that
specifies what the user expects the software to be able to do.

Data Flow Diagram [DFD]: A data flow diagram (DFD) is a graphical representation of
the "flow" of data through an information system, modelling its process aspects. A
DFD is often used as a preliminary step to create an overview of the system without
going into detail, which can later be elaborated.

Software Requirement Specification [SRS]: A System Requirements Specification (SRS)
(also known as a Software Requirements Specification) is a document or set of
documentation that describes the features and behaviour of a system or software application.

Low Level Design [LLD]: Low-level design (LLD) is a process that follows a
systematic refinement process. This process can be used for designing data
structures, the required software architecture, source code and, ultimately, performance
algorithms.

Structure-Oriented Language: Structured programming is a design approach that focuses
first on the process/logical structure and then on the data required for that process. Structured
programming is also known as modular programming.

SYSTEM REQUIREMENTS SPECIFICATIONS

FUNCTIONAL REQUIREMENTS

1. Internal Interface Requirements


• Collect reviews in real time.
• Remove redundant information from the collected reviews.
• Store the formatted reviews in a database.
• Perform sentiment analysis on the reviews stored in the database to classify their nature, viz.
positive, negative or neutral.
• Use Naive Bayes to predict the 'mood' of the people.

2. External Interface Requirements


We classify external interfaces into four types:

User Interface:
Describe the logical characteristics of each interface. This includes sample screen images, GUI
standards, screen layout constraints, standard buttons and functions (e.g., help) that will appear
on every screen. Details of the user interface design should be documented in a separate user
interface specification.

Hardware interface:
Describe the logical and physical characteristics of each interface. This may include the
supported device types, the nature of the data and control interactions between the software and
the hardware.

Software Interface:
Describe the connections between this product and other specific software components (name
and version), including databases, operating systems, tools, libraries, and integrated commercial
components. Identify the data items or messages coming into the system and going out and
describe the purpose of each. Describe the services needed and the nature of communications.
Refer to documents that describe detailed application programming interface protocols. Identify
data that will be shared across software components.

Communication Interface:
Describe the requirements associated with any communications functions required by this
product, including e-mail, web browser, network server communications protocols, electronic
forms, and so on. Define any pertinent message formatting.

NON-FUNCTIONAL REQUIREMENTS

Performance Requirements:

If there are performance requirements for the product under various circumstances, state them here
and explain their rationale, to help the developers understand the intent and make suitable design
choices. Specify the timing relationships for real time systems. Make such requirements as specific
as possible. You may need to state performance requirements for individual functional
requirements or features.

Safety Requirements:
Specify those requirements that are concerned with possible loss, damage, or harm that could result
from the use of the product. Define any safeguards or actions that must be taken, as well as actions
that must be prevented. Refer to any external policies or regulations that state safety issues that
affect the product’s design or use. Define any safety certifications that must be satisfied.

Security Requirements:
Specify any requirements regarding security or privacy issues surrounding use of the product or
protection of the data used or created by the product. Define any user identity authentication
requirements. Refer to any external policies or regulations containing security issues that affect the
product. Define any security or privacy certifications that must be satisfied.

Software Quality Attributes:


Specify any additional quality characteristics for the product that will be important to either the
customers or the developers. Some to consider are: adaptability, availability, correctness, flexibility,
interoperability, maintainability, portability, reliability, reusability, robustness, testability, and
usability. Write these to be specific, quantitative, and verifiable when possible. At the least, clarify
the relative preferences for various attributes, such as ease of use over ease of learning.

3.1 Process Model (Data Flow Diagram (DFD))

Fig 3.1.1 Context Level DFD (Level 0)

Fig 3.1.2 Level 1 DFD

4.1 Screen Designs (User Interface):

This is the graphical user interface of the text sentiment analysis application, in which a customer
can write his/her review, which is then passed to the trained sentiment model.

Fig 4.2.1: User Interface

The user will get the following response when the entered review's sentiment is positive.

Entered Customer Review is Positive.

Fig 4.2.2: Positive output

The user will get the following response when the entered review's sentiment is neutral.

Entered Customer Review is Neutral.

Fig 4.2.3: Neutral Output

The user will get the following response when the entered review's sentiment is negative.

Entered Customer Review is Negative.

Fig 4.2.4: Negative Output

5.1 Functions Details Description and Prototype

5.1.1 Data collection:

Data in the form of raw reviews is retrieved by using Selenium for web scraping to get reviews
from e-commerce websites such as Flipkart. Web scraping is a technique for extracting information
from the internet automatically using a tool, such as Selenium, that simulates human web surfing.
Web scraping helps us extract large volumes of data about customers, products, people, stock
markets, etc. It is usually difficult to get this kind of information on a large scale using traditional
data collection methods. We can utilize data collected from websites such as e-commerce portals
and social media channels to understand customer behaviour and sentiment, buying patterns, and
brand attribute associations, which are critical insights for any business. Selenium, in particular,
uses a driver that opens up a version of your web browser that can be controlled by Python. This
has the advantage that the website you are visiting sees you essentially like any other human
surfer, allowing you to access information in the same way.
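
A minimal sketch of how this Selenium-based collection might look is given below; it assumes
Selenium 4 with a Chrome driver installed, and the URL and the CSS selector for review blocks are
placeholders rather than the project's actual values:

    # Sketch: open a browser, load a product reviews page, and pull the review text.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # needs a matching ChromeDriver available
    driver.get("https://www.flipkart.com/example-product/reviews")  # placeholder URL

    # Placeholder selector: the real class names on the site will differ.
    elements = driver.find_elements(By.CSS_SELECTOR, "div.review-text")
    reviews = [el.text for el in elements]

    driver.quit()
    print(len(reviews), "reviews collected")

In practice the scraper would also page through results and store the raw reviews in the database
described in the functional requirements.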

5.1.2. Data Processing:


Data processing involves tokenization, which is the process of splitting the review into individual
words called tokens. Tokens can be split on whitespace or punctuation characters, and can be
unigrams or bigrams depending on the classification model used. The bag-of-words model is one
of the most extensively used models for classification. It treats the text to be classified as a bag,
or collection, of individual words with no link or interdependence between them. The simplest way
to incorporate this model in our project is to use unigrams as features: the text is just a collection
of individual words, so we split each review on whitespace.

For example, the review "Nice Camera and Good Gaming Performance !!" is split at each
whitespace as follows:

{ Nice, Camera, and, Good, Gaming, Performance, !! }

The next step in data processing is normalization by converting the review to lowercase.
Reviews are normalized to lowercase, which makes comparison with a dictionary easier.
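
The following minimal sketch shows these two steps (whitespace tokenization and lowercase
normalization) in plain Python:

    # Whitespace tokenization followed by lowercase normalization.
    review = "Nice Camera and Good Gaming Performance !!"

    tokens = review.split()                # split on whitespace -> unigrams
    tokens = [t.lower() for t in tokens]   # normalize to lowercase

    print(tokens)
    # ['nice', 'camera', 'and', 'good', 'gaming', 'performance', '!!']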

5.1.3. Data Filtering:
A review acquired after data processing still contains a portion of raw information which we may
or may not find useful for our application. Thus, the reviews are further filtered by removing
stop words, numbers and punctuation.

Stop words: reviews contain stop words, which are extremely common words like "is", "am" and
"are" that hold no additional information. These words serve no purpose, and this filter is
implemented using a list stored in stopfile.dat. We then compare each word in a review with this
list and delete the words that match the stop list.

Removing non-alphabetical characters: symbols such as "#" and "@" and numbers hold no relevance
for sentiment analysis and are removed using pattern matching.

Stemming: stemming is the process of reducing a word to its word stem, i.e. the form obtained by
stripping affixes such as suffixes and prefixes, or to the root of the word, known as the lemma.
Stemming is important in natural language understanding (NLU) and natural language processing
(NLP). Stemming uses a number of approaches to reduce a word to its base from whatever
inflected form is encountered, and it is also a part of queries and internet search engines.

Lemmatization: lemmatization is the process of converting a word to its base form. The difference
between stemming and lemmatization is that lemmatization considers the context and converts the
word to its meaningful base form, whereas stemming just removes the last few characters, often
leading to incorrect meanings and spelling errors. Lemmatization and stemming are done with the
help of TextBlob.

For example, lemmatization would correctly map the base form of 'caring' to 'care', whereas
stemming would cut off the 'ing' part and convert it to 'car'.

'Caring' -> Lemmatization -> 'Care'

'Caring' -> Stemming -> 'Car'
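
A minimal sketch of the filtering steps is shown below; the small stop-word set stands in for the
project's stopfile.dat, and TextBlob (which the report names) is used for stemming and
lemmatization, assuming the required NLTK corpora are available (if the stem() helper is missing in
the installed version, NLTK's PorterStemmer can be used instead):

    # Stop-word removal, removal of non-alphabetical tokens, then stemming/lemmatization.
    from textblob import Word

    stop_words = {"is", "am", "are", "and", "the"}  # stand-in for stopfile.dat

    tokens = ["nice", "camera", "and", "good", "gaming", "performance", "!!"]
    filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
    print(filtered)  # ['nice', 'camera', 'good', 'gaming', 'performance']

    print(Word("caring").lemmatize("v"))  # 'care' -- lemmatization keeps a real word
    print(Word("caring").stem())          # Porter stem; aggressive stemmers crop to 'car'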

5.1.4. Feature Extraction:

TF-IDF: TF-IDF is a feature vectorization method used in text mining to find the importance of a
term to a document in the corpus. TF-IDF is another way to convert textual data to numeric form;
it is short for Term Frequency-Inverse Document Frequency, and the value it yields for a term is
the product of the two components, TF and IDF.

Let's first look at term frequency. Say we have two documents in our corpus:

1. I love dogs
2. I hate dogs and knitting

The relative term frequency of a term in a document is the number of times the term occurs in that
document divided by the total number of terms in the document. For example, the relative term
frequency of 'I' is 1/3 in document 1 and 1/5 in document 2.

Next, we need the inverse document frequency, which measures how important a word is for
differentiating the documents. It is computed from the ratio of the total number of documents to
the number of documents containing the term (usually on a logarithmic scale). For 'I', which
appears in both documents, the inverse document frequency is log(2/2) = 0, so 'I' does not help to
distinguish the documents.

Once we have the values for TF and IDF, the TF-IDF score of a term in a document is simply their
product.
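
The calculation above can be reproduced with a few lines of Python; note that libraries such as
scikit-learn apply a smoothed IDF and normalization, so their exact numbers differ from this plain
formula:

    # Plain TF-IDF for the two example documents.
    import math

    docs = [["i", "love", "dogs"],
            ["i", "hate", "dogs", "and", "knitting"]]

    def tf(term, doc):
        return doc.count(term) / len(doc)            # relative term frequency

    def idf(term, docs):
        containing = sum(term in doc for doc in docs)
        return math.log(len(docs) / containing)      # plain (unsmoothed) IDF

    for term in ("i", "love"):
        for i, doc in enumerate(docs, start=1):
            print(f"tfidf({term!r}, doc{i}) = {tf(term, doc) * idf(term, docs):.3f}")

    # 'i' scores 0 in both documents (it appears everywhere), while 'love'
    # receives a positive score only in document 1.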

Word2Vec: Word2vec is a two-layer neural net that processes text. Its input is a text corpus and
its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a
deep neural network, it turns text into a numerical form that deep nets can understand. The
purpose and usefulness of Word2vec is to group the vectors of similar words together in vector
space. That is, it detects similarities mathematically. Word2vec creates vectors that are
distributed numerical representations of word features, features such as the context of individual
words. It does so without human intervention.
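
A rough sketch of training such word vectors on tokenized reviews is shown below; it assumes the
gensim library (version 4.x, where the dimensionality parameter is called vector_size), which the
report does not name, and the tiny corpus is only for illustration:

    # Train word vectors on a toy corpus of tokenized reviews.
    from gensim.models import Word2Vec

    sentences = [
        ["nice", "camera", "good", "gaming", "performance"],
        ["battery", "drains", "fast", "bad", "camera"],
        ["good", "battery", "nice", "display"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    vector = model.wv["camera"]                     # 50-dimensional vector for 'camera'
    print(model.wv.most_similar("camera", topn=3))  # nearest words in vector space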

5.1.5 Sentiment Analysis

Sentiment analysis is done using the Naive Bayes algorithm, which finds the polarity as follows.
Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and
Bayes' theorem to predict the tag of a text (such as a customer review). They are probabilistic,
which means that they calculate the probability of each tag for a given text, and then output the
tag with the highest probability. They obtain these probabilities using Bayes' theorem, which
describes the probability of a feature based on prior knowledge of conditions that might be
related to that feature.
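
A minimal end-to-end sketch of this classification step with scikit-learn is shown below; the tiny
labelled set is invented for illustration, and the labels follow the report's convention of +1
(positive), 0 (neutral) and -1 (negative):

    # TF-IDF features feeding a multinomial Naive Bayes classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_reviews = [
        "nice camera and good gaming performance",
        "worst phone ever, battery drains fast",
        "delivery was on time, phone is okay",
    ]
    train_labels = [1, -1, 0]   # +1 positive, -1 negative, 0 neutral

    model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
    model.fit(train_reviews, train_labels)

    print(model.predict(["good camera and nice display"]))  # predicted sentiment label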

5.2 Data Types and Data Structures

Data Type:
There are several Python data types. Some built-in Python data types are:

Python Data Type – Numeric

Python numeric data types are used to hold numeric values:

1. int – holds signed integers of unlimited length.
2. long – holds long integers (exists in Python 2.x; removed in Python 3.x, where int covers
arbitrary precision).
3. float – holds floating-point numbers, accurate to roughly 15 decimal places.
4. complex – holds complex numbers.

In Python, unlike C or C++, we do not need to declare a data type when creating a variable; we
simply assign a value to it. If we want to see what type of value a variable currently holds, we can
use type().

Python Data Type – String

A string is a sequence of characters. Python supports Unicode characters. Generally, strings are
represented by either single or double quotes. Strings can be output to the screen using the print
function.

For example: print("hello").

Strings in Python are sequences of Unicode characters. Python does not have a separate character
data type; a single character is simply a string with a length of 1. Square brackets can be used to
access elements of the string.
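
A short illustration of these types and of type():

    count = 42              # int
    rating = 4.5            # float
    z = 2 + 3j              # complex
    review = "Nice phone"   # str

    print(type(count), type(rating), type(z), type(review))
    print(review[0])        # 'N' -- square brackets index into the string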

Data Structure:
There are several Python data structures. Some built-in Python data structures are:

List
A List is a data structure that holds an ordered collection of items i.e. you can store a sequence of
items in a list. This is easy to imagine if you can think of a shopping list where you have a list of
items to buy, except that you probably have each item on a separate line in your shopping list
whereas in Python you put commas in between them. The list of items should be enclosed in
square brackets so that Python understands that you are specifying a list. Once you have created
a list, you can add, remove or search for items in the list. Since we can add and remove items, we
say that a list is a mutable data type i.e. this type can be altered.

Tuples
Tuples are used to hold together multiple objects. Think of them as similar to lists, but without
the extensive functionality that the list class gives you. One major feature of tuples is that they
are immutable like strings i.e. you cannot modify tuples. Tuples are defined by specifying items
separated by commas within an optional pair of parentheses. Tuples are usually used in cases
where a statement or a user-defined function can safely assume that the collection of values (i.e.
the tuple of values used) will not change.

Dictionary
A dictionary is like an address-book where you can find the address or contact details of a person
by knowing only his/her name i.e. we associate keys (name) with values (details). Note that the
key must be unique just like you cannot find out the correct information if you have two persons
with the exact same name. Note that you can use only immutable objects (like strings) for the
keys of a dictionary but you can use either immutable or mutable objects for the values of the
dictionary. This basically translates to say that you should use only simple objects for keys.

Pairs of keys and values are specified in a dictionary using the notation
d = {key1: value1, key2: value2}. Notice that within each pair the key and value are separated by a
colon, the pairs are separated from each other by commas, and all of this is enclosed in a pair of
curly braces.
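
The snippet below illustrates the three structures in the context of this project; the counts are
invented for illustration:

    tokens = ["nice", "camera", "good", "performance"]   # list: ordered and mutable
    tokens.append("battery")

    sentiment_labels = (-1, 0, 1)                         # tuple: immutable

    sentiment_counts = {"positive": 120, "neutral": 45, "negative": 30}  # dict: key -> value
    sentiment_counts["positive"] += 1

    print(tokens, sentiment_labels, sentiment_counts)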

5.3 Data Visualization

Fig 5.3.1 Overall Review

Fig 5.3.2 Positive Review

Fig 5.3.3 Negative Review

Fig 5.3.4 Neutral Review

Fig 5.3.5 Balanced Data Set

5.4 Algorithm

The basis of the Naive Bayes algorithm is Bayes' theorem, alternatively known as Bayes' rule or
Bayes' law. It gives us a method to calculate the conditional probability, i.e., the probability of
an event based on previous knowledge available about related events. More formally, Bayes' theorem
is stated as the following equation:

P(A|B) = P(B|A) P(A) / P(B)

Let us understand the statement first and then we will look at the proof of the statement. The
components of the above statement are:

P(A|B): Probability (conditional probability) of occurrence of event A given the event B is true
P(A) and P(B): Probabilities of the occurrence of event A and B respectively.

P(B|A): probability of the occurrence of event B given the event A is true.

The terminology in the Bayesian method of probability (more commonly used) is as follows:

A is called the proposition and B is called the evidence.

P(A) is called the prior probability of proposition and P(B) is called the prior probability of
evidence.

P(A|B) is called the posterior.

P(B|A) is the likelihood.

This sums up Bayes' theorem as:

Posterior = (Likelihood × Prior probability of the proposition) / (Prior probability of the evidence)
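
As a worked toy example of the rule above applied to a single word (all counts are invented for
illustration):

    # Suppose 60 of 100 training reviews are positive, and the word "good"
    # appears in 30 of the positive reviews and 5 of the 40 negative ones.
    p_pos = 60 / 100                # prior P(positive)
    p_neg = 40 / 100                # prior P(negative)
    p_good_pos = 30 / 60            # likelihood P("good" | positive)
    p_good_neg = 5 / 40             # likelihood P("good" | negative)

    p_good = p_good_pos * p_pos + p_good_neg * p_neg   # evidence P("good")
    posterior = p_good_pos * p_pos / p_good            # P(positive | "good")

    print(round(posterior, 3))      # 0.857 -> a review containing "good" is likely positive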

5.5 UNIT TEST PLAN
Unit Testing is a level of software testing where individual units/ components of software are
tested. The purpose is to validate that each unit of the software performs as designed. A unit is
the smallest testable part of any software. It usually has one or a few inputs and usually a single
output. In procedural programming, a unit may be an individual program, function, procedure,
etc. In object-oriented programming, the smallest unit is a method, which may belong to a base/
super class, abstract class or derived/ child class.

It is performed by using the White Box Testing method.

When is it performed?

Unit testing is the first level of software testing and is performed prior to Integration
Testing.

Who performs it?

It is normally performed by software developers themselves or their peers. In rare cases, it


may also be performed by independent software testers.

Unit Testing Benefits


 Unit testing increases confidence in changing/maintaining code. If good unit tests are
written and run every time any code is changed, we will be able to promptly catch any
defects introduced by the change. Also, if the code is already made less interdependent to
make unit testing possible, the unintended impact of changes to any code is smaller.
 Code is more reusable. In order to make unit testing possible, code needs to be modular,
which means it is easier to reuse.
 Development is faster. The effort required to find and fix defects found during unit
testing is much less than the effort required to fix defects found during system testing or
acceptance testing.
 The cost of fixing a defect detected during unit testing is lower than that of defects
detected at higher levels.
 Debugging is easy. When a test fails, only the latest changes need to be debugged. With
testing at higher levels, changes made over the span of several days, weeks or months need
to be scanned.
 Code is more reliable.

1. Unit Test Plan Scope (In Scope – Out of Scope)

In Scope: Data Extraction, Tokenization, Stopword Removal, Stemming, Lemmatization, TF-IDF,
Word2vec, Naive Bayes Modeling, Support Vector Machine, Gradient Boosted Trees

Out of Scope: Library Accuracy, Stopword Definition, File Storage Capability, Tree Definition

2. Unit Test Cases

ID  | Test Case                         | Input Value     | Expected Output
1.1 | Data Extraction via Web Scraping  | URL             | Raw Data
2.1 | Creating Tokens                   | Raw String      | Tokens
2.2 | Filtering Stopwords from Tokens   | Tokens          | <<Enter Value>>
2.3 | Stemming the Data                 | <<Enter Value>> | <<Enter Value>>
2.4 | Lemmatizing the Data              | <<Enter Value>> | <<Enter Value>>
2.5 | Feature Extraction                | <<Enter Value>> | <<Enter Value>>
2.6 | Vector Form Validity              | <<Enter Value>> | DataFrame
3.1 | Training Machine Learning Model   | DataFrame       | M.L. Model
3.2 | Boosting the Model                | Model           | Boosted Accuracy Model
4.1 | Predicting Values                 | Data            | Sentiment Value
5.1 | Efficiency Management             | Model           | Efficiency Measure Parameters

5.6 Standard Error Messages
This document identifies some of the error codes and messages that the software returns.
Specifically, the errors listed here are in the global, or default, domain for the software.

Error Message         | Description                                   | Resolution
Input URL Invalid     | The URL given by the user cannot be resolved  | Re-enter the correct URL
Web Scraping Failed   | The given webpage does not support scraping   | Try another page or use text data
Features Out of Bound | Too many features taken into consideration    | Try updating the software or change the input source
Token Creation Failed | Tokenization failed                           | Contact the developer
Model Out of Bounds   | The model created is too large                | Contact the developer

WHITE BOX TESTING:

WHITE BOX TESTING is a software testing method in which the internal
structure/design/implementation of the item being tested is known to the tester. The tester
chooses inputs to exercise paths through the code and determines the appropriate outputs.
Programming know-how and implementation knowledge are essential.

Module             | Function/Method            | Return Type     | Parameters      | Description                                                            | Result
Data Extraction    | Main                       | Void            | (String)        | Takes the user URL as input and returns the web-scraped data           | PASS
Data Preprocessing | Tokenization               | <<Enter value>> | <<Enter value>> | Creates tokens from raw data                                           | PASS
Data Preprocessing | Stopword_Removal           | <<Enter value>> | <<Enter value>> | Removes common stopwords (enriches the data)                           | PASS
Data Preprocessing | Stemming and Lemmatization | <<Enter value>> | <<Enter value>> | Crops the tokens to remove unnecessary redundancies                    | PASS
Data Preprocessing | Tf-Idf                     | <<Enter value>> | <<Enter value>> | Term frequency-inverse document frequency, used for text mining        | PASS
Data Modeling      | Model Selection            | <<Enter value>> | <<Enter value>> | Selects the best model for the problem at hand                         | PASS
Data Modeling      | Naive Bayes Modeling       | <<Enter value>> | <<Enter value>> | Used for classification of the dataset into the required categories    | PASS
Data Modeling      | Gradient Boosting          | <<Enter value>> | <<Enter value>> | Used for boosting classification model accuracy via aggregate tree diagrams | PASS
Data Modeling      | Model Evaluation           | <<Enter value>> | <<Enter value>> | Used to determine the efficiency of the selected model                 | PASS
Data Prediction    | Sentiment Prediction       | <<Enter value>> | <<Enter value>> | Used to classify user data on the basis of the model created           | PASS
Results            | Display Sentiment Rating   | <<Enter value>> | <<Enter value>> | Displays the results of the process to the user                        | PASS

Black Box Testing
BLACK BOX TESTING, also known as Behavioral Testing, is a software testing method in
which the internal structure/design/implementation of the item being tested is not known to the
tester. These tests can be functional or non-functional, though usually functional.

This method is named so because, in the eyes of the tester, the software program is like a black
box, inside which one cannot see. This method attempts to find errors in the following
categories:

 Incorrect or missing functions


 Interface errors
 Errors in data structures or external database access
 Behavior or performance errors
 Initialization and termination errors

Module              | Return Type    | Parameters     | Description                                                   | Result
Data Extraction     | <<Enter Data>> | <<Enter Data>> | Takes the user URL as input and returns the web-scraped data  | PASS
Data Pre-processing | <<Enter Data>> | <<Enter Data>> | Takes raw data and returns a comprehensive dataset            | PASS
Data Modeling       | <<Enter Data>> | <<Enter Data>> | Creates an M.L. model on the dataset                          |
Data Prediction     | <<Enter Data>> | <<Enter Data>> | Predicts values on the basis of the model created             | PASS
Results             | <<Enter Data>> | <<Enter Data>> | Returns the results to the user                               | PASS

7. Conclusion and Future Scope

The task of sentiment analysis, especially in the domain of finding the main entity of a review, is
still in the developing stage and far from complete. There are also some modifications that could
be applied to our classifier in order to get better accuracy, but they are out of the scope of this
project.
Right now we have worked with only the very simplest unigram models; we can improve those
models by adding extra information such as the closeness of a word to a negation word. We could
specify a window prior to the word under consideration (a window of, for example, 2 or 3 words),
and incorporate the effect of negation into the model if it lies within that window. The closer the
negation word is to the unigram word whose prior polarity is being calculated, the more it should
affect the polarity; for example, if the negation is right next to the word it may simply reverse the
polarity of that word, and the farther the negation is from the word, the smaller its effect should
be. Apart from this, we are currently only focusing on unigrams, and the effect of bigrams and
trigrams may be explored. As reported in the literature review section, using bigrams along with
unigrams usually enhances performance.
In this project we are focusing on general sentiment analysis. There is potential for work in the
field of sentiment analysis with partially known context. For example, we noticed that users
generally use our website for specific types of keywords which can be divided into a few distinct
classes, namely politics/politicians, celebrities, products/brands, sports/sportsmen and
media/movies/music. So we could attempt to perform separate sentiment analysis on reviews that
belong to only one of these classes (i.e. the training data would not be general but specific to one
of these categories) and compare the results with those obtained by applying general sentiment
analysis instead.

In this project, we compared various approaches to tackling sentiment analysis. Among the
challenges we encountered, the lack of balanced data online forced us to drop 20,000 reviews from
Flipkart; this could be improved by using three different datasets (oversampled, downsampled and
original) to see how different sampling techniques affect the learning of a classifier. Neural
network models such as LSTM could also be applied.
