Sentiment Analysis
A Mini-Project Report
Submitted to
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
By
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
1.1 Statement of the problem
1.2 Objectives
1.3 Scope of project
1.4 Statement of the problem
1.5 Statement of the problem
2. LITERATURE SURVEY
2.1 Existing System
2.2 Proposed System
3. FEASIBILITY ANALYSIS
3.1 Technical Feasibility
3.2 Operational Feasibility
3.3 Economic Feasibility
3.4 Schedule Feasibility
3.5 Requirement Feasibility
3.5.1 Functional Requirements
3.5.2 Non-Functional Requirements
4. SYSTEM DESIGN AND ARCHITECTURE
4.1 Importance of Design
4.2 UML Diagrams
4.2.1 Use Case Diagram
4.2.2 Sequence Diagram
4.2.3 Activity Diagram
4.2.4 System Flow Diagram
4.3 Flowchart
5. METHODOLOGY
5.1 Machine Learning
5.1.1 Naïve Bayes Classifier (NB)
5.2 Natural Language Processing
5.3 Programming Tools
5.3.1 Python
5.3.2 Natural Language Toolkit (NLTK)
5.3.3 matplotlib
6. TESTING
6.1 Importance of Testing
6.2 Types of Testing
7. ANALYSIS AND RESULTS
7.1 Analysis
7.2 Results
8. LIMITATIONS AND FUTURE ENHANCEMENTS
8.1 Limitations
8.2 Future Enhancements
CONCLUSION
REFERENCES
ACKNOWLEDGEMENT
Last but not least, we extend our heartfelt gratitude to our parents and friends for
their continuous encouragement and blessings. Without their support, this work
would not have been possible.
ABSTRACT
Data analysis involves examining data supplied in different formats, most commonly
reviews. In this project, sentiment analysis is applied to the reviews given by a person,
and each review is classified as positive, negative or neutral. Sentiment analysis, or
opinion mining, is the computational study of people's opinions, sentiments, attitudes and
emotions expressed in written language. It has a wide range of applications because
opinions are central to almost all human activities and are key influences on our
behaviour. Whenever we make a decision, we want to hear others' opinions. Sentiment
analysis is the procedure by which information is extracted from the opinions and emotions
of people regarding entities, events and their attributes. In decision making, the opinions
of others have a significant effect on the choices customers make, for example when
shopping online or when choosing events, products and entities.
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
on certain products of companies or brands or performed by political leaders. In order to do
this, we analysed comments. Comments are a reliable source of information mainly
because people comment about anything and everything, including buying new
products and reviewing them. In addition, comments contain hashtags, which make
identifying relevant data a simple task. A number of research works have already been
carried out on such data, most of which demonstrate how useful this information is for
predicting various outcomes. Our current work deals with outcome prediction and explores
localized outcomes.
We collected data dynamically, through an interface that allows developers to enter data
programmatically. Because data is entered in a random and casual manner, the collected
data needs to be filtered to remove unnecessary information. Problematic entries, such as
redundant comments and comments with no proper sentences, were filtered out next.
Since pre-processing was carried out to a certain extent, it was reasonable to expect that
analysing these filtered comments would give reliable results. The project does not ask for
the user's gender when a comment is entered, and gender is not provided as a query
parameter, so it is not possible to obtain the gender of a user from his or her comments.
1.2 Objectives
The objectives of this project are:
• To implement an algorithm for automatic classification of text into positive and
negative.
• To use sentiment analysis to determine whether the attitude of the masses towards the
subject of interest is positive, negative or neutral.
• To represent the sentiment graphically in the form of a pie chart.
CHAPTER 2
LITERATURE SURVEY
They use polarity predictions from three websites as noisy labels to train a model, and use
1000 manually labelled items for tuning and another 1000 manually labelled items for testing.
They do not, however, mention how they collected their test data. They propose the use of
syntax features such as repetition, hashtags, links, punctuation and exclamation marks in
conjunction with features such as the prior polarity of words and their part-of-speech (POS)
tags. We extend their approach by using real-valued prior polarity, and by combining prior
polarity with POS. Our results show that the features that enhance the performance of our
classifiers the most are those that combine the prior polarity of words with their parts of
speech. The syntax features of the data help, but only marginally. Gamon (2004) performs
sentiment analysis on feedback data from a Global Support Services survey. One aim of that
paper is to analyse the role of linguistic features such as POS tags. The author performs
extensive feature analysis and feature selection and demonstrates that abstract linguistic
analysis features contribute to classifier accuracy. We perform extensive feature analysis
and show the output in a pie chart format.
In sentiment analysis, the NLP component analyses the sentiment of the collected data by
performing the following steps:
• First, it performs tokenization.
• Then it performs sentence splitting (the split step).
• Next, it parses each sentence for syntactic analysis.
• Finally, it decides the sentiment value of the tweet based on the results of the above
steps.
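The first two steps listed above can be sketched with the Python standard library alone.
This is a minimal illustration, not the project's implementation: the report relies on an NLP
toolkit for these steps, and the function names `tokenize` and `split_sentences` and the
regular expression are assumptions chosen for the example. Parsing and sentiment scoring
are not covered here.

```python
import re

SENTENCE_END = {".", "!", "?"}

def tokenize(text):
    """Step 1: lowercase word tokens plus sentence-ending punctuation marks."""
    return re.findall(r"[a-z0-9']+|[.!?]", text.lower())

def split_sentences(tokens):
    """Step 2: group tokens into sentences, closing one at each end mark."""
    sentences, current = [], []
    for tok in tokens:
        if tok in SENTENCE_END:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:                      # text that does not end with an end mark
        sentences.append(current)
    return sentences

tokens = tokenize("I loved the movie. The acting was great!")
print(split_sentences(tokens))
```

Keeping the end marks as tokens lets the split step run on the token list, which follows
the order of the steps given above.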
The final step is to design a web forum that presents the final results to users and suggests
a few other comments or results for the analysed text. The steps involved in this process are:
• Get the positive data from the sentiment analysis result.
• Develop a value comparator logic and apply it to the collected positive data, which
provides the list of suggestions given by a large number of users.
The basic architectural diagram of the implemented system consists of three main
steps:
• Collecting data
• Pre-processing the data
• Sentiment analysis
First, the data is collected from the database and a few external sources. The collected data
is stored as a data set, then pre-processed and parsed by removing common unwanted words,
symbols, characters and numbers and by converting upper-case letters to lower case. After
pre-processing, the sentiments are analysed using a natural language processing tool.
Each sentence is assigned a sentiment value, and based on this value the data is
catalogued as positive or negative. Both positive and negative data are analysed and similar
data are identified. The result is then displayed to users through a web application. In
addition, users are provided with a few suggestions.
2.2.3 Pre-processing:
In the pre-processing step, the parsed tweets are collected and unwanted words, numbers,
symbols and special characters are removed. The complete data is also changed to lower
case: any upper-case letters or words in the collected data are converted into lower-case
letters. The pre-processed output is cleaner and more readable than the collected data.
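A minimal sketch of this pre-processing step is given below. The function name `preprocess`
and the short stop-word list are hypothetical and serve only as an illustration; the report
does not state which unwanted words were removed.

```python
import re

# Hypothetical stop-word list for illustration; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to"}

def preprocess(text):
    """Lowercase the text, strip non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # numbers, symbols, special characters -> space
    words = text.split()
    return [w for w in words if w not in STOP_WORDS]

print(preprocess("The MOVIE was GREAT!!! 10/10 :)"))
```

Replacing unwanted characters with a space rather than deleting them keeps neighbouring
words from being glued together.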
CHAPTER 3
FEASIBILITY ANALYSIS
• Technical feasibility
• Operational feasibility
• Economic feasibility
• Schedule feasibility
3.2 Operational Feasibility
A proposed project is beneficial only if it can be turned into an information system that
meets the operating requirements. Simply stated, this test of feasibility asks whether the
system will work when it is developed and installed. Are there major barriers to
implementation? The proposal was to build a simplified application that analyses a given
text. It is simple to operate and can be used on any platform that runs Python. It is free to
use and not costly to operate.
categorized into the functional and non-functional requirements. These requirements are
listed below:
3.5.1 Functional Requirements
Functional requirements are the functions or features that must be included in a system
to satisfy the business needs and be acceptable to the users. Based on this, the functional
requirements of the system are as follows:
• The system should be able to process new tweets stored in the database after retrieval.
• The system should be able to analyse data and classify the polarity of each tweet.
CHAPTER 4
SYSTEM DESIGN
The purpose of the design phase is to plan a solution to the problem specified by the
requirements document. This phase is the first step in moving from the problem domain to the
solution domain. In other words, starting with what is needed, design takes us toward how to
satisfy the needs. The design of a system is perhaps the most critical factor affecting the
quality of the software; it has a major impact on the later phases, particularly testing and
maintenance. The output of this phase is the design document. The design activity is often
divided into two separate phases: System Design and Detailed Design.
System Design, also called top-level design, aims to identify the modules that should
be in the system, the specifications of these modules, and how they interact with each other to
produce the desired results. During this phase, the details of the data of a module are usually
specified in a high-level design description language, which is independent of the target
language in which the software will eventually be implemented.
In system design the focus is on identifying the modules, whereas during detailed
design the focus is on designing the logic for each of the modules. During the system design
activities, developers bridge the gap between the requirements specification, produced during
requirements elicitation and analysis, and the system that is delivered to the user.
4.2.1 Use Case Diagram
Description
A use case diagram describes the functionality provided by a system in terms of actors, their
goals represented as use cases, and any dependencies among those use cases. In this use case
diagram, the user and the customer who entered the reviews or text are the actors, and the
rest are the use cases.
4.2.2 Sequence Diagram
Description
4.2.3 Activity Diagram
Description
An activity diagram is another important UML diagram for describing the dynamic aspects of
the system. It is essentially a flowchart that represents the flow from one activity
to another, where an activity can be described as an operation of the system. The control
flow is drawn from one operation to another. An activity diagram contains activity states,
action states, transitions and objects, with control flowing from one state to another
through joins and forks.
4.2.4 System Flow Diagram
Description
A system flow diagram is a way to show relationships between a business and its
components, such as customers (according to IT Toolbox). System flow diagrams, also
known as process flow diagrams or data flow diagrams, are cousins to common
flowcharts.
4.3 Flowchart
Description
CHAPTER 5
METHODOLOGY
There are primarily two types of approaches for sentiment classification of opinionated texts:
• Using a machine learning based text classifier such as Naïve Bayes
• Using natural language processing
We use both machine learning and natural language processing for sentiment
analysis of tweets.
There are various training sets available on the Internet, such as the Movie Reviews data
set and Twitter data sets. A class can be positive or negative, and training data is needed
for both classes.
5.1.1 Naïve Bayes Classifier (NB)
The Naïve Bayes classifier is one of the simplest and most commonly used classifiers. The
Naïve Bayes classification model computes the posterior probability of a class based on the
distribution of the words in the document. The model works with bag-of-words (BOW)
feature extraction, which ignores the position of a word in the document. It uses Bayes'
theorem to predict the probability that a given feature set belongs to a particular label:
P(label | features) = P(label) × P(features | label) / P(features)
P(label) is the prior probability of a label, i.e. the likelihood that a random feature set
has the label. P(features | label) is the likelihood of observing a given feature set when the
label is known. P(features) is the prior probability that a given feature set occurs. Given
the naïve assumption that all features are independent, the equation can be rewritten as
follows:
P(label | features) = P(label) × P(f1 | label) × ... × P(fn | label) / P(features)
Algorithm:
i. Dictionary generation
Count the occurrences of all words in the whole data set and build a dictionary of
the most frequent words.
ii. Feature set generation
Each document is represented as a feature vector over the space of dictionary
words.
For each document, keep track of the dictionary words along with their number of
occurrences in that document.
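The two steps above can be sketched as follows. This is an illustrative example under
stated assumptions: the function names `build_dictionary` and `to_feature_vector`, the
`size` parameter and the toy documents are not from the report.

```python
from collections import Counter

def build_dictionary(documents, size):
    """Return the `size` most frequent words across all tokenized documents."""
    counts = Counter(word for doc in documents for word in doc)
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(document, dictionary):
    """Count how often each dictionary word occurs in one document."""
    counts = Counter(document)
    return [counts[word] for word in dictionary]

# Toy tokenized documents, for illustration only.
docs = [["i", "loved", "the", "movie"],
        ["a", "great", "movie", "good", "movie"]]
dictionary = build_dictionary(docs, size=3)
vectors = [to_feature_vector(d, dictionary) for d in docs]
print(dictionary, vectors)
```

Each vector has one entry per dictionary word, so word order within a document is
discarded, in line with the bag-of-words representation.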
Formula used for algorithms:
Training
In this phase we generate the training data (words with their probability of
occurrence in the positive and negative training files).
Calculate P(label = y) for each label.
Calculate φ(k | label = y), the probability of dictionary word k given the label, for each
dictionary word and store the result (here the label is either negative or positive).
We now have each word and its corresponding probability for each of the defined labels.
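A minimal sketch of this training phase is shown below. It assumes add-one (Laplace)
smoothing of the word probabilities; the function name `train`, the label names and the toy
data are hypothetical and used only for illustration.

```python
from collections import Counter

def train(labelled_docs, vocabulary):
    """Return (priors, word_probs) estimated with add-one (Laplace) smoothing."""
    doc_counts = Counter(label for _, label in labelled_docs)
    word_counts = {label: Counter() for label in doc_counts}
    for words, label in labelled_docs:
        word_counts[label].update(w for w in words if w in vocabulary)

    total_docs = len(labelled_docs)
    priors = {label: n / total_docs for label, n in doc_counts.items()}
    word_probs = {}
    for label, counts in word_counts.items():
        n = sum(counts.values())        # dictionary-word occurrences seen for this label
        word_probs[label] = {
            w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary
        }
    return priors, word_probs

# Toy training data, for illustration only.
data = [(["good", "movie"], "pos"),
        (["great", "acting"], "pos"),
        (["poor", "acting"], "neg")]
vocab = {"good", "great", "poor", "movie", "acting"}
priors, word_probs = train(data, vocab)
print(priors)
```

Adding one to every count keeps the probability of an unseen word above zero, so a single
missing word cannot eliminate a label during testing.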
Testing
Goal: find the sentiment of a given test data file.
- Generate the feature set (x) for the test data file.
- For each document in the test set, compute
Decision1 = log P(x | label = pos) + log P(label = pos)
Similarly, compute
Decision2 = log P(x | label = neg) + log P(label = neg)
The document is assigned the label with the larger decision value.
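The decision rule can be sketched as below. The probabilities are hypothetical values
chosen for illustration, and skipping words that are not in the dictionary is an assumption
of this sketch rather than something the report specifies.

```python
import math

def decision(words, prior, word_probs):
    """log P(x | label) + log P(label), summing over words found in word_probs."""
    score = math.log(prior)
    for w in words:
        if w in word_probs:             # words outside the dictionary are skipped
            score += math.log(word_probs[w])
    return score

def classify(words, priors, word_probs):
    """Return the label whose decision value is largest."""
    return max(priors, key=lambda label: decision(words, priors[label], word_probs[label]))

# Hypothetical probabilities, for illustration only.
priors = {"pos": 0.5, "neg": 0.5}
word_probs = {
    "pos": {"good": 0.4, "poor": 0.1, "movie": 0.5},
    "neg": {"good": 0.1, "poor": 0.4, "movie": 0.5},
}
print(classify(["good", "movie"], priors, word_probs))
```

Working with sums of logarithms instead of products of probabilities avoids numerical
underflow when a document contains many words.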
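Documents with positive outcomes:
DOC  I  loved  The  Movies  hated  a  great  poor  acting  good  Class
1    1  1      1    1       0      0  0      0     0       0     +
3    0  0      0    2       0      1  1      0     0       1     +
5    0  0      0    1       0      1  1      0     1       1     +
P(+) = 3/5 = 0.6
P(w_k | +) = (n_k + 1) / (n + |vocabulary|)
where n_k is the number of times word k occurs in the positive documents and n is the total
number of words in the positive documents.
Now, let's look at the negative examples:
DOC  I  loved  The  Movies  hated  a  great  poor  acting  good  Class
2    1  0      1    1       1      0  0      0     0       0     -
4    0  0      0    0       0      0  0      1     1       0     -
P(-) = 2/5 = 0.4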
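The arithmetic of this example can be checked with a short script. The word counts below
assume that the values in each table row line up with the header words as shown in the
code; that alignment is an assumption about the table layout, and the names `DOCS`,
`prior` and `word_prob` are illustrative.

```python
from fractions import Fraction

VOCAB = ["i", "loved", "the", "movies", "hated", "a", "great", "poor", "acting", "good"]
# Word counts per document; the column alignment is an assumption about the tables.
DOCS = {
    1: ({"i": 1, "loved": 1, "the": 1, "movies": 1}, "+"),
    2: ({"i": 1, "the": 1, "movies": 1, "hated": 1}, "-"),
    3: ({"movies": 2, "a": 1, "great": 1, "good": 1}, "+"),
    4: ({"poor": 1, "acting": 1}, "-"),
    5: ({"movies": 1, "a": 1, "great": 1, "acting": 1, "good": 1}, "+"),
}

def prior(label):
    """Fraction of documents carrying the label."""
    return Fraction(sum(1 for _, l in DOCS.values() if l == label), len(DOCS))

def word_prob(word, label):
    """(n_k + 1) / (n + |vocabulary|) for one word and one class."""
    rows = [counts for counts, l in DOCS.values() if l == label]
    n_k = sum(c.get(word, 0) for c in rows)
    n = sum(sum(c.values()) for c in rows)
    return Fraction(n_k + 1, n + len(VOCAB))

print(prior("+"), prior("-"), word_prob("movies", "+"), word_prob("movies", "-"))
```

Under this alignment the positive documents contain n = 14 words and the vocabulary has
10 entries, so every positive word probability has the denominator 24.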
5.2 Natural Language Processing
Natural language processing (NLP) is a field of computer science, artificial intelligence and
linguistics concerned with the interactions between computers and human (natural)
languages. This approach uses the publicly available SentiWordNet library, which
provides sentiment polarity values for every term occurring in the document. In this lexical
resource, each term t occurring in WordNet is associated with three numerical scores, obj(t),
pos(t) and neg(t), describing the objective, positive and negative polarities of the term,
respectively. These three scores are computed by combining the results produced by eight
ternary classifiers. WordNet is a large lexical database of English. Nouns, verbs, adjectives
and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct
concept.
WordNet is also freely and publicly available for download. Its structure makes it a
useful tool for computational linguistics and natural language processing. It groups words
together based on their meanings. A synset is simply a set of one or more synonyms. This
approach uses semantics to understand the language. Major tasks in NLP that help in
extracting sentiment from a sentence:
Positive and negative scores are obtained from SentiWordNet for each word according to its
part-of-speech tag. The total positive and negative scores are then summed, and the
sentiment polarity is determined by which class (positive or negative) has received the
higher score.
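The scoring rule can be sketched as follows. To keep the example self-contained, a small
hand-written dictionary stands in for SentiWordNet look-ups: the entries and their scores
are made up for illustration, and the names `LEXICON` and `polarity` are hypothetical.
Returning "neutral" on a tie is also an assumption, since the report does not say how ties
are handled.

```python
# Hypothetical (word, POS) -> (pos_score, neg_score) entries standing in for
# SentiWordNet look-ups; the numbers are made up for illustration.
LEXICON = {
    ("good", "a"): (0.75, 0.0),
    ("great", "a"): (0.5, 0.0),
    ("poor", "a"): (0.0, 0.625),
    ("hate", "v"): (0.0, 0.75),
}

def polarity(tagged_words):
    """Sum positive and negative scores and return the class with the higher total."""
    pos_total = neg_total = 0.0
    for word, tag in tagged_words:
        pos, neg = LEXICON.get((word, tag), (0.0, 0.0))   # unknown words score zero
        pos_total += pos
        neg_total += neg
    if pos_total == neg_total:
        return "neutral"      # tie handling is an assumption; the report does not specify it
    return "positive" if pos_total > neg_total else "negative"

print(polarity([("good", "a"), ("movie", "n")]))
```

Keying the lexicon on the part-of-speech tag as well as the word mirrors the description
above, where the score of a word depends on its tag.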
5.3.1 Python
5.3.2 NLTK
NLTK is a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as
WordNet, along with a suite of text processing libraries for classification, tokenization,
stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP
libraries, and an active discussion forum.
NLTK has been called “a wonderful tool for teaching, and working in, computational
linguistics using Python,” and “an amazing library to play with natural language.” NLTK is
suitable for linguists, engineers, students, educators, researchers, and industry users alike.
Natural Language Processing with Python provides a practical introduction to programming
for language processing. Written by the creators of NLTK, it guides the reader through the
fundamentals of writing Python programs, working with corpora, categorizing text, analyzing
linguistic structure, and more.
5.3.3 matplotlib
matplotlib.pyplot is a collection of command-style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure,
creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with
labels, etc. In matplotlib.pyplot various states are preserved across function calls, so that it
keeps track of things like the current figure and plotting area, and the plotting functions are
directed to the current axes (please note that "axes" here and in most places in the
documentation refers to the axes part of a figure and not the strict mathematical term for
more than one axis).
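As an illustration of how the sentiment results could be drawn as a pie chart, a minimal
sketch is given below. The function names `sentiment_shares` and `save_pie_chart`, the
chart title and the output file name are assumptions made for this example, and the label
list is assumed to be non-empty. matplotlib is imported inside the plotting function so that
the percentage calculation can be used without it.

```python
def sentiment_shares(labels):
    """Return {label: percentage} for a non-empty list of predicted labels."""
    total = len(labels)
    return {lab: 100.0 * labels.count(lab) / total for lab in sorted(set(labels))}

def save_pie_chart(shares, path):
    """Draw the shares as a pie chart and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")               # render off-screen, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.pie(list(shares.values()), labels=list(shares.keys()), autopct="%1.1f%%")
    ax.set_title("Sentiment distribution")
    fig.savefig(path)
    plt.close(fig)

shares = sentiment_shares(["positive", "positive", "negative", "positive"])
print(shares)
# save_pie_chart(shares, "sentiment.png")   # uncomment to write the image
```

The sketch uses the figure and axes objects returned by `plt.subplots()` rather than the
implicit current axes, which makes it explicit where the chart is drawn.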
CHAPTER 6
TESTING
Unit Testing
Unit testing involves the design of test cases that validate that the internal program
logic is functioning properly and that program inputs produce valid outputs. All decision
branches and internal code flow should be validated. It is the testing of individual software
units of the application, done after the completion of an individual unit and before
integration. This is structural testing that relies on knowledge of the unit's construction
and is invasive. Unit tests perform basic tests at component level and test a specific business
process, application and/or system configuration. Unit tests ensure that each unique path of
a business process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
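A unit test for one small unit of such a system might look like the sketch below, written
with Python's built-in `unittest` module. The function `clean_text` is a hypothetical unit
under test introduced for this example; it is not taken from the project code.

```python
import unittest

def clean_text(text):
    """Hypothetical unit under test: lowercase and keep alphabetic words only."""
    return [w for w in text.lower().split() if w.isalpha()]

class CleanTextTest(unittest.TestCase):
    def test_lowercases_input(self):
        self.assertEqual(clean_text("GREAT Movie"), ["great", "movie"])

    def test_drops_numbers_and_symbols(self):
        self.assertEqual(clean_text("good 10 !!"), ["good"])

    def test_empty_input(self):
        self.assertEqual(clean_text(""), [])
```

Saved in a file, the tests can be run with `python -m unittest`. Each test method states
one defined input and its expected result, as described above.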
Integration Testing
Functional Testing
Functional tests provide a systematic demonstration that functions tested are available
as specified by the business and technical requirements, system documentation, and user
manuals. Functional testing is centered on the following items:
System Testing
White Box Testing
White box testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is used to test
areas that cannot be reached from a black box level.
Black Box Testing
Black box testing is testing the software without any knowledge of the inner workings,
structure or language of the module being tested; the software under test is treated as a
black box. Black box tests, as most other kinds of tests, must be written from a definitive
source document, such as a specification or requirements document.
Performance Testing
Performance testing is done to test the run-time performance of the software within the
context of the integrated system. These tests are carried out throughout the testing process.
For example, the performance of an individual module is assessed during white box testing
under unit testing.
The testing process is part of the broader subject of verification and validation.
We have to acknowledge the system specifications and try to meet the customer's
requirements, and for this purpose we have to verify and validate the product to make
sure everything is in place. Verification and validation are two different things:
verification ensures that the software correctly implements a specific functionality, and
validation ensures that the end product properly meets the customer requirements.
Verification asks 'are we building the product right?' and validation asks 'are
we building the right product?'.
CHAPTER 7
ANALYSIS AND RESULTS
7.1 Analysis
We collected data sets containing positive and negative data. These data sets
served as training data and were classified using a Naïve Bayes classifier. Before training
the classifier, unnecessary words, punctuation and meaningless words were removed to
obtain clean data. To determine the positivity and negativity of data, we collected data from
different sources. The data was stored in a database and then retrieved so that unnecessary
words and punctuation could be removed. To check the polarity of the test data, we train
the classifier with the training data. The classifier is trained again whenever the program
is executed.
After encountering and eliminating a number of errors, we completed our project through
continuous effort. At the end of the project, the results can be summarized as:
• A user-friendly application.
• No expertise is required to use the application.
• Organizations can use the application to visualize product or brand reviews
graphically.
7.2 Results
7.2.1 Test case 1:
When the input consists entirely of positive data, i.e. when the data collected about the
product or subject is completely positive, the output of the sentiment analysis system is
as follows:
Input:
Output:
7.2.2 Test case 2:
When the input consists entirely of negative data, i.e. when the data collected about the
product or subject is completely negative, the output of the sentiment analysis system is
as follows:
Input:
Output:
7.2.3 Test case 3:
When the input is a combination of positive and negative data, i.e. when the data
collected contains both positive and negative comments about the product or subject, the
output of the sentiment analysis system is as follows:
Input:
Output:
7.2.4 Test case 4:
When the input given is not relevant to the analysis, the output is as follows:
When the data being analysed has not yet been commented on by anyone, the following
message is displayed on the screen:
CHAPTER 8
LIMITATIONS AND FUTURE ENHANCEMENTS
8.1 Limitations
The system we designed determines the opinion of people based on data entered
dynamically. We were able to determine only the positivity and negativity of data; for
neutral data, we were unable to merge a data set.
In addition, we currently analyse only a few data sets, which may not give reliable values
and results. The results are therefore not very accurate.
CONCLUSION
We completed our project using Python as the language, with different modules for
analysis and output presentation. Although integrating the different Python modules was a
problem at first, we were able to integrate them with the help of a number of tutorials.
We were able to determine the positivity and negativity of each item of data. Based on those
comments or data, we represented the results in a diagram such as a pie chart. All the
diagrams related to the outcome are shown in the results (Section 7.2). A short conclusion is
also shown during output presentation, based on the product or brand entered. The designed
system is user friendly.