Sentiment Analysis


SENTIMENT ANALYSIS

A Mini-Project Report
Submitted to

Jawaharlal Nehru Technological University, Hyderabad


In partial fulfilment of the requirements for the
award of the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
By

Sravya Gujjarlapudi (16VE1A0524)


Manjusha Kasturi (16VE1A0531)
N.ShivaCharan Kumar (16VE1A0542)

Under the Guidance of


Mrs. Joshi Padma

SREYAS INSTITUTE OF ENGINEERING AND TECHNOLOGY


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(Affiliated to JNTUH, Approved by A.I.C.T.E and Accredited by NAAC, New Delhi)
Bandlaguda, Beside Indu Aranya, Nagole,
Hyderabad-500068, Ranga Reddy Dist
(2016 – 2020)
SREYAS INSTITUTE OF ENGINEERING AND TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the Mini Project Report on “SENTIMENT ANALYSIS”


submitted by Sravya Gujjarlapudi, Manjusha Kasturi, Nathamgari Shiva
Charan Kumar bearing Hall ticket No.16VE1A0524, 16VE1A0531,
16VE1A0542 in partial fulfilment of the requirements for the award of the
degree of Bachelor of Technology in COMPUTER SCIENCE AND
ENGINEERING from Jawaharlal Nehru Technological University,
Kukatpally, Hyderabad, for the academic year 2019-20, is a record of bona fide
work carried out by them under our guidance and supervision.

Internal Guide Head of the Department-CSE


Mrs. Joshi Padma Dr.V.GOUTHAM
Associate Professor Professor

Project Co-Ordinator External Examiner


Mr. P.Nagaraj
Assistant Professor
SREYAS INSTITUTE OF ENGINEERING AND TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DECLARATION

We, Sravya Gujjarlapudi, Manjusha Kasturi, Nathamgari Shiva Charan


Kumar, bearing hall ticket numbers 16VE1A0524, 16VE1A0531, 16VE1A0542, hereby
declare that the Mini-Project titled “SENTIMENT ANALYSIS” done by us under the
guidance of Mrs. JOSHI PADMA, which is submitted in partial fulfilment of the
requirements for the award of the B.Tech degree in Computer Science and
Engineering at Sreyas Institute of Engineering and Technology for Jawaharlal
Nehru Technological University, Hyderabad, is our original work.

SRAVYA GUJJARLAPUDI (16VE1A0524)


MANJUSHA KASTURI (16VE1A0531)
NATHAMGARI SHIVA CHARAN KUMAR (16VE1A0542)
TABLE OF CONTENTS

ACKNOWLEDGEMENT ................................................................................. I
ABSTRACT ....................................................................................................... II
LIST OF FIGURES ........................................................................................ III
1. INTRODUCTION .......................................................................................... 1
1.1 Statement of the problem ................................................................................................ 2
1.2 Objectives ........................................................................................................................ 2
1.3 Scope of project ............................................................................................................... 3
1.4 System Overview ............................................................................................... 3
1.5 System Features ................................................................................................. 3
2. LITERATURE SURVEY .............................................................................. 4
2.1 Existing System ............................................................................................................... 4
2.2 Proposed System ............................................................................................................. 5
3. FEASIBILITY ANALYSIS........................................................................... 9
3.1 Technical Feasibility ....................................................................................................... 9
3.2 Operational Feasibility .................................................................................................. 10
3.3 Economic Feasibility ..................................................................................................... 10
3.4 Schedule Feasibility ...................................................................................................... 10
3.5 Requirement Definition .................................................................................. 10
3.5.1 Functional Requirements .................................................................................... 11
3.5.2 Non-Functional Requirements ........................................................................... 11
4. SYSTEM DESIGN AND ARCHITECTURE ........................................... 12
4.1 Importance of Design .................................................................................................... 12
4.2 UML Diagrams ............................................................................................................. 12
4.2.1 Use Case Diagram.............................................................................................. 13
4.2.2 Sequence Diagram ............................................................................................. 14
4.2.3 Activity Diagram ............................................................................................... 15
4.2.4 System Flow Diagram........................................................................................ 16
4.2.5 Flow Chart ......................................................................................... 17
5. METHODOLOGY....................................................................................... 18
5.1 Machine Learning ......................................................................................................... 18
5.1.1 Naïve Bayes Classifier (NB) .............................................................................. 19
5.2 Natural Language Processing ........................................................................................ 23
5.3 Programming Tools ....................................................................................................... 24
5.3.1 Python ................................................................................................................ 24
5.3.2 Natural Language Toolkit (NLTK) ..................................................................... 24
5.3.3 matplotlib ............................................................................................................ 24
6. TESTING ...................................................................................................... 26
6.1 Importance of Testing ................................................................................................... 26
6.2 Types of Testing ............................................................................................................ 26
7. ANALYSIS AND RESULTS ...................................................................... 29
7.1 Analysis ......................................................................................................................... 29
7.2 Result ............................................................................................................................. 30
8. LIMITATIONS AND FUTURE ENHANCEMENTS ............................. 34
8.1 Limitations .................................................................................................................... 34
8.2 Future Enhancements .................................................................................................... 34
CONCLUSION................................................................................................. 35
REFERENCES ................................................................................................. 36
ACKNOWLEDGEMENT

The successful completion of any task would be incomplete without mentioning
the people who made it possible; their guidance and encouragement crowned all
our efforts with success.

We take this opportunity to express our deep sense of gratitude to
Mrs. Joshi Padma (Associate Professor, Department of Computer Science and
Engineering) for her constant encouragement and valuable guidance during the
project work.

A special note of thanks to Dr. V. Goutham, who has been a source of
continuous motivation and support. He took time and effort to guide and
correct us throughout the span of this work.

We owe much to the Management, Principal and the Department faculty, who made
our time at Sreyas Institute of Engineering and Technology a stepping stone for
our careers. We treasure every moment we spent in our college.

Last but not least, our heartiest gratitude to our parents and friends for
their continuous encouragement and blessings. Without their support this work
would not have been possible.

SRAVYA GUJJARLAPUDI (16VE1A0524)

MANJUSHA KASTURI (16VE1A0531)

NATHAMGARI SHIVA CHARAN KUMAR (16VE1A0542)

ABSTRACT

Data analysis is concerned with analysing data given in different formats, mainly
reviews; here, sentiment analysis recognizes the reviews given by a person as
negative, positive or neutral. Sentiment analysis, or opinion mining, is the
computational study of people’s opinions, sentiments, attitudes and emotions
expressed in written language. It has a wide range of applications because opinions
are central to almost all human activities and are key influences on our behaviour.
Whenever we make a decision, we want to hear others’ opinions.

Sentiment analysis is the procedure by which information is extracted from people’s
opinions and emotions regarding entities, events and attributes. In decision making,
the opinions of others have a significant effect on customers’ ease in making
choices with regard to online shopping and choosing events, products and entities.

LIST OF FIGURES

S. No Figure No. Name Of Figure Page No.

1 2.1 Project Architecture 6

2 4.1 Use Case Diagram 13

3 4.2 Sequence Diagram 14

4 4.3 Activity diagram 15

5 4.4 System flow diagram 16

6 4.5 Flow chart diagram 17

7 5.1 List of documents 21

8 5.2 Feature Sets 21

9 5.3 Positive Vocabulary 22

10 5.4 Negative Vocabulary 23

CHAPTER 1

INTRODUCTION

Sentiment is an attitude, thought, or judgment prompted by feeling. Sentiment


analysis, which is also known as opinion mining, studies people’s sentiments towards certain
entities. The Internet is a resourceful place with respect to sentiment information.
From a user’s perspective, people are able to post their own content through various
social media, such as forums, micro-blogs, or online social networking sites. From a
researcher’s perspective, many social media sites release their application
programming interfaces (APIs), enabling data collection and analysis by researchers
and developers. We can also supply the data to be tested dynamically; here, the
system is retrained every time the program is executed. Hence, sentiment analysis
rests on a strong foundation, supported by massive amounts of data both online and
offline.
However, such online data has several flaws that potentially hinder sentiment
analysis. The first flaw is that, since people can freely post their own content,
the quality of their opinions cannot be guaranteed. For example, instead of sharing
topic-related opinions, online spammers post spam on forums. Some spam is entirely
meaningless, while other spam carries irrelevant opinions, also known as fake
opinions. The second flaw is that the ground truth of such online data is not
always available. A ground truth is essentially a tag on a given opinion,
indicating whether the opinion is positive or negative. In contrast, data given
dynamically can yield results with high accuracy, since the comments we supply are
relevant to the information being checked and do not contain spam.
Micro-blogging websites have evolved to become a source of varied kinds of
information. This is due to the nature of micro-blogs, on which people post
real-time messages about their opinions on a variety of topics, discuss current
issues, complain, and express positive sentiment for products they use in daily
life. In fact, companies manufacturing such products have started to poll these
micro-blogs to get a sense of the general sentiment towards their products. Many
times, these companies study user reactions and reply to users on micro-blogs. One
challenge is to build technology to detect and summarize the overall sentiment. Our
project, Sentiment Analysis, analyses data (in the form of comments) posted by
people about certain products of companies or brands, or about the performance of
political leaders. In order to do this, we analysed comments. Comments are a
reliable source of information, mainly because people comment about anything and
everything they do, including buying new products and reviewing them. Besides, many
comments also contain hash tags, which make identifying relevant data a simple
task. A number of research works have already been done on such data, most of which
mainly demonstrate how useful this information is for predicting various outcomes.
Our current research deals with outcome prediction and explores localized outcomes.
We collected data dynamically, which allows developers to enter data
programmatically. Because of the random and casual nature of how it is entered, the
collected data needs to be filtered to remove unnecessary information; filtering
out redundant entries and entries with no proper sentences was done next. As the
pre-processing phase was carried out to a certain extent, it was possible to
guarantee that analysing these filtered comments would give reliable results. We do
not provide gender as a query parameter, so it is not possible to obtain the gender
of a user from his or her comments; since our project does not ask for the user’s
gender at entry time, that information is simply unavailable.

1.1 Statement of the Problem


The problem at hand consists of two subtasks:
• Phrase Level Sentiment Analysis: Given a message containing a marked instance of
a word or a phrase, determine whether that instance is positive or negative in that
context.
• Sentence Level Sentiment Analysis: Given a message, decide whether the message
is of positive or negative sentiment. For messages conveying both a positive and
negative sentiment, whichever is the stronger sentiment should be chosen.
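To make the two subtasks concrete, the following is a minimal, illustrative sketch in Python; the word lists and the majority-count rule are our own assumptions for demonstration, not the classifier described later in this report:

```python
# Toy lexicons (illustrative assumptions, not the project's real vocabulary).
POSITIVE = {"good", "great", "excellent", "love", "nice"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}

def phrase_sentiment(word):
    """Phrase level: classify a single marked word/phrase in isolation."""
    w = word.lower()
    if w in POSITIVE:
        return "positive"
    if w in NEGATIVE:
        return "negative"
    return "neutral"

def sentence_sentiment(message):
    """Sentence level: when both polarities occur, the stronger one wins."""
    words = message.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

For example, `sentence_sentiment("good movie but terrible ending and awful plot")` returns `"negative"` because the negative evidence (two hits) outweighs the positive (one hit), matching the "whichever is the stronger sentiment" rule above.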

1.2 Objectives
The objectives of this project are:
• To implement an algorithm for the automatic classification of text as positive
or negative
• To determine whether the attitude of the mass towards the subject of interest is
positive, negative or neutral
• To represent the sentiment graphically in the form of a pie chart

1.3 Scope of project


This project will be helpful to companies and political parties as well as to
common people. It will help a political party review a programme that it is going
to undertake or one that it has already performed. Similarly, companies can get
reviews of a new product, or of newly released hardware or software. A movie maker
can also get reviews of a currently running movie. By analyzing the tweets, the
analyst can find out how positive, negative or neutral people are about it.

1.4 System Overview


This proposal, entitled “Sentiment Analysis”, is an application used to analyze
data. We perform sentiment analysis on comments and determine whether each is
positive or negative. This application can be used by any organization to review
its work, by political leaders, or by any other company to get reviews of its
products or brands.

1.5 System Features


The main feature of our web application is that it helps determine people’s
opinions on products, government work, politics or other subjects by analyzing the
data. Our system is capable of training on new data by taking reference from
previously trained data and related data. The computed or analyzed data is
represented in a pie-chart format.

CHAPTER 2

LITERATURE SURVEY

2.1 Existing System


Sentiment analysis has been handled as a Natural Language Processing task at many
levels of granularity. Starting from being a document level classification task (Turney, 2002;
Pang and Lee, 2004), it has been handled at the sentence level (Hu and Liu, 2004; Kim and
Hovy, 2004) and more recently at the phrase level (Wilson et al., 2005; Agarwal et al., 2009).
Microblog data, on which users post real time reactions to and opinions about “everything”,
poses newer and different challenges. Some of the early and recent results on sentiment
analysis are by Go et al. (2009), (Bermingham and Smeaton, 2010) and Pak and Paroubek
(2010) [3]. Go et al. (2009) use distant learning to acquire sentiment data: they
treat tweets ending in positive emoticons like “:)” and “:-)” as positive, and
tweets ending in negative emoticons like “:(” and “:-(” as negative. They build
models using Naive Bayes, MaxEnt and Support Vector Machines (SVM), and report
that SVM outperforms the other classifiers. In terms of feature space, they try
unigram and bigram models in conjunction with part-of-speech (POS) features. They
note that the unigram model outperforms all other models; specifically, bigrams
and POS features do not help. Pak and Paroubek (2010) [3] collect data following a
similar distant-learning paradigm. They perform a different classification task,
though: subjective versus objective. For subjective data they collect tweets
ending with emoticons, in the same manner as Go et al. (2009). For objective data
they crawl popular newspapers such as the “New York Times” and “Washington Post”.
They report that POS and bigrams both help (contrary to the results presented by
Go et al. (2009)). Both these approaches, however, are primarily based on n-gram
models. Moreover, the data they use for training and testing is collected by
search queries and is therefore biased. In contrast, we present features that
achieve a significant gain over a unigram baseline. In addition, we explore a
different method of data representation and report significant improvement over
the unigram models. Another contribution of this paper is that we report results
on manually annotated data that does not suffer from any known biases. Our data
will be a random sample of streaming tweets, unlike data collected by using
specific queries. The size of our hand-labelled data will allow us to perform
cross-validation experiments and check the variance in classifier performance
across folds. Another significant effort for sentiment classification on such data
is by Barbosa and Feng (2010).

They use polarity predictions from three websites as noisy labels to train a
model, use 1,000 manually labelled data points for tuning, and another 1,000
manually labelled data points for testing. They do not, however, mention how they
collect their test data. They propose the use of syntax features like repetition,
hashtags, links, punctuation and exclamation marks, in conjunction with features
like the prior polarity of words and the POS of words. We extend their approach by
using real-valued prior polarity, and by combining prior polarity with POS. Our
results show that the features that enhance the performance of our classifiers the
most are those that combine the prior polarity of words with their parts of
speech. The syntax features help, but only marginally. Gamon (2004) performs
sentiment analysis on feedback data from a Global Support Services survey. One aim
of that paper is to analyse the role of linguistic features like POS tags. They
perform extensive feature analysis and feature selection and demonstrate that
abstract linguistic analysis features contribute to classifier accuracy. We
likewise perform extensive feature analysis and show the output in a pie-chart
format.

2.2 Proposed System


In the proposed system, information is searched from the database based on
category and keywords. Searching keywords is one of the hardest tasks because of
the diversity of the language and the slang used by people. The first step of the
proposed system involves collecting data from different sources and forming a data
set; the second step is pre-processing the related data. In the third step,
sentiment analysis is performed using a Natural Language Processing (NLP)
algorithm, which is based on numerical statistics. The sentiment value assigned
using NLP is used as a weighting factor in sentiment analysis. In the fourth step,
similar data is identified and analyzed; then, through a web application, the
final results, which are suggestions for the issues that occurred in the specified
process, can be provided. The tweets are collected based on the combination of
keyword and category provided by the user. In the next step, all the data is
pre-processed to remove unwanted words, symbols and characters. Pre-processing
consists of three steps, which are as follows:
• Removing common stop words and misspelled words.
• Removing numbers, symbols and special characters.
• Converting upper case letters to lower case letters.
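The three steps above can be sketched in Python roughly as follows; the stop-word list is an illustrative subset, and misspelled-word correction is omitted for brevity:

```python
import re

# Illustrative subset of stop words; a real system would use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of"}

def preprocess(text):
    """Apply the three pre-processing steps described above."""
    text = text.lower()                    # step 3: upper case -> lower case
    text = re.sub(r"[^a-z\s]", " ", text)  # step 2: drop numbers, symbols, specials
    words = [w for w in text.split() if w not in STOP_WORDS]  # step 1: stop words
    return " ".join(words)
```

For example, `preprocess("The Movie was GREAT!!! 10/10 :)")` yields `"movie was great"`.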

In the sentiment analysis, NLP analyses the sentiment of the collected data by
performing the following steps:
• It first performs tokenization.
• Then it performs sentence splitting known as split.
• Next step is to parse the sentence for syntactic analysis.
• Finally, it decides the sentiment value of the tweet based on the results of the above
steps.
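A toy version of this pipeline might look like the following; the miniature lexicon is an assumption standing in for the real sentiment values, and the syntactic-parsing step is omitted from the sketch:

```python
import re

# Illustrative lexicon; real sentiment values come from the NLP tool.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def split_sentences(text):
    """Sentence splitting ("split")."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence):
    """Tokenization."""
    return re.findall(r"[a-z']+", sentence.lower())

def sentiment_value(text):
    """Decide a sentiment value by summing token scores across sentences.
    (Full syntactic parsing is omitted in this sketch.)"""
    return sum(LEXICON.get(tok, 0)
               for sentence in split_sentences(text)
               for tok in tokenize(sentence))
```

For example, `sentiment_value("The food was great. Service was bad.")` returns `1` (+2 for "great", -1 for "bad"), so the text would be catalogued as positive overall.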
The final step is to design a web forum that provides the final results to users
and suggests a few other comments or results for the analysed text. The steps
involved in this process are:
• Get the positive data from the sentiment analysis result.
• Develop a value-comparator logic and apply it to the collected positive data,
which provides the list of suggestions given by a large number of users.
Figure 2.1 shows the basic architectural diagram of the implemented system.
Basically, it consists of three main steps:
• Collecting data
• Pre-Processing the data
• Sentiment Analysis

Figure 2.1 Project Architecture

At first, the data is collected from the database and a few outside sources. The
collected data is stored as a data set and is pre-processed and parsed by removing
common unwanted words, symbols, characters and numbers, and by converting
upper-case letters to lower case. After pre-processing, the sentiments are
analyzed using a Natural Language Processing tool. Each sentence is assigned a
sentiment value; based on this value the data is catalogued as positive or
negative. Both positive and negative data are analyzed and similar data are
identified. Then, through a web application, the result is displayed to the users.
In addition, users are provided with a few suggestions.

2.2.1 Collecting Data


Once the code is developed, the developer can add a keyword and a category to the
application depending on the analysis to be done; for example, if we are analyzing
movie reviews, then the data being gathered must belong to the selected domain of
the dataset. Whenever a keyword or category is added, it gets updated in the
database. For a particular category, a user can add any number of keywords. When
users want to collect data, they need to select a file in which to store the
collected data. Once the file is selected, the user can start collecting the data.

2.2.2 Parsing Data


Parsing is nothing but syntactic analysis: the process of analysing a string of
symbols in natural language according to the rules of grammar. Once the data is
collected, the developer arranges it in a particular manner. Collected data may
start at the first line and end at any line number, so markers are used to
differentiate one piece of data from another; while parsing, the developer removes
those trailing marker words. The developer then takes data or comments that
contain many blank spaces or empty newlines and makes each of them single-line
data. The parser also replaces abusive words with “*****”, indicating that the
word is abusive, and then removes the “*****” tokens from the sentences.
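The masking-then-removal behaviour described above could be sketched as follows; the abusive-word list here is a harmless placeholder, since the report does not give the actual list:

```python
import re

# Placeholder list; the project's real abusive-word list is not given.
ABUSIVE = {"darn", "heck"}

def mask_and_drop(comment):
    """First replace abusive words with '*****', then remove the masked
    tokens and collapse whitespace into single-line data."""
    masked = " ".join("*****" if w.lower() in ABUSIVE else w
                      for w in comment.split())
    cleaned = " ".join(w for w in masked.split() if w != "*****")
    return re.sub(r"\s+", " ", cleaned).strip()
```

For example, `mask_and_drop("what the heck is this")` returns `"what the is this"`: the offending token is masked and then dropped, and multi-line input is flattened to a single line.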

2.2.3 Pre-processing:
In the pre-processing step, the parsed tweets are collected and unwanted words,
numbers, symbols and special characters are removed. In pre-processing, the
complete data is changed to lower-case letters: if there are any upper-case or
bold letters or words in the collected data, they are converted into lower case.
The pre-processed output is more meaningful and readable than the raw collected
data.

2.2.4 Natural Language Processing


Sentiment analysis is a process which determines the intended emotion of the data.
In sentiment analysis, the polarity of each sentence in the given data set is
identified as positive or negative. In this project, sentiment analysis is
performed using a Natural Language Processing (NLP) algorithm. Natural Language
Processing concerns the interaction between human languages and computers. The NLP
algorithm is based on statistical machine learning: the machine actually
understands the context and sentence arrangement, and focuses mainly on the
succession of a string of words. The NLP algorithm makes a probabilistic decision
based on the sentiment value of each input.
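The probabilistic decision described here is, in essence, what a Naive Bayes classifier (covered later in the methodology chapter) computes. The following is a self-contained toy version written from scratch for illustration, not the project's actual code; it uses word counts with Laplace smoothing and picks the label with the highest log-probability:

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: (text, label) pairs. Count label priors and per-label words."""
    priors = Counter(label for _, label in samples)
    words = defaultdict(Counter)
    for text, label in samples:
        words[label].update(text.lower().split())
    return priors, words

def classify(text, priors, words):
    """Return the label with the highest Laplace-smoothed log-probability."""
    total = sum(priors.values())
    vocab = {w for counts in words.values() for w in counts}
    scores = {}
    for label, n in priors.items():
        log_prob = math.log(n / total)  # prior P(label)
        denom = sum(words[label].values()) + len(vocab)
        for w in text.lower().split():
            log_prob += math.log((words[label][w] + 1) / denom)  # smoothed P(w|label)
        scores[label] = log_prob
    return max(scores, key=scores.get)
```

Trained on even two tiny labelled examples, such as `("good great film", "positive")` and `("bad terrible film", "negative")`, the classifier already assigns `"great film"` to the positive class, because "great" has a higher smoothed probability under the positive word counts.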

CHAPTER 3

FEASIBILITY ANALYSIS

A feasibility study is a preliminary study which investigates the information
needs of prospective users and determines the resource requirements, costs,
benefits and feasibility of the proposed system. A feasibility study takes into
account the various constraints within which the system should be implemented and
operated. In this stage, the resources needed for the implementation, such as
computing equipment, manpower and costs, are estimated. These estimates are
compared with the available resources, and a cost-benefit analysis of the system
is made. The feasibility analysis activity involves analysing the problem and
collecting all relevant information relating to the project. The main objective of
the feasibility study is to determine whether the project would be feasible in
terms of economic, technical, operational and schedule feasibility. It also makes
sure that the input data required for the project are available. Thus, we
evaluated the feasibility of the system in terms of the following categories:

• Technical feasibility
• Operational feasibility
• Economic feasibility
• Schedule feasibility

3.1 Technical Feasibility


Evaluating the technical feasibility is the trickiest part of a feasibility study.
This is because, at this point in time, there is no detailed design of the system,
making it difficult to assess issues like performance and costs (on account of the
kind of technology to be deployed). A number of issues have to be considered while
doing a technical analysis, chiefly understanding the different technologies
involved in the proposed system. Before commencing the project, we have to be very
clear about which technologies are required for the development of the new system,
and whether the required technology is available. Our system, "Sentiment
Analysis", is technically feasible since all the required tools are easily
available and Python can be easily handled. Although all tools seem to be easily
available, there are challenges too.

3.2 Operational Feasibility
A proposed project is beneficial only if it can be turned into an information
system that meets the operating requirements. Simply stated, this test of
feasibility asks whether the system will work when it is developed and installed,
and whether there are major barriers to implementation. The proposal was to make a
simplified application that analyses given text. It is simple to operate, can be
used on any Python platform, and is free and not costly to operate.

3.3 Economic Feasibility


Economic feasibility attempts to weigh the costs of developing and implementing a
new system against the benefits that would accrue from having the new system in
place. This feasibility study gives top management the economic justification for
the new system. A simple economic analysis giving an actual comparison of costs
and benefits is much more meaningful in this case. In addition, it proves to be a
useful point of reference for comparing actual costs as the project progresses.
There could be various types of intangible benefits on account of automation:
improvements in product quality, better decision making, timeliness of
information, expedited activities, improved accuracy of operations, better
documentation and record keeping, and faster retrieval of information. This
application gives accurate results, and its creation is not costly.

3.4 Schedule Feasibility


A project will fail if it takes too long to complete before it is useful.
Typically, this means estimating how long the system will take to develop and
whether it can be completed in a given period of time, using methods like the
payback period. Schedule feasibility is a measure of how reasonable the project
timetable is. Given our technical expertise, are the project deadlines reasonable?
Some projects are initiated with specific deadlines, and it is necessary to
determine whether those deadlines are mandatory or desirable. A minor deviation
from the original schedule decided at the beginning of the project may be
encountered. The application development is feasible in terms of schedule.

3.5 Requirement Definition


After the extensive analysis of the problems in the system, we familiarized
ourselves with the requirements of the current system. These requirements are
categorized into functional and non-functional requirements, listed below:
3.5.1 Functional Requirements
Functional requirements are the functions or features that must be included in any
system to satisfy the business needs and be acceptable to the users. Based on
this, the functional requirements that the system must satisfy are as follows:
• The system should be able to process new tweets stored in the database after
retrieval.
• The system should be able to analyse data and classify each tweet’s polarity.

3.5.2 Non-Functional Requirements


Non-functional requirements describe the features, characteristics and attributes
of the system, as well as any constraints that may limit the boundaries of the
proposed system. The non-functional requirements are essentially based on
performance, information, economy, control, security, efficiency and services.
Based on these, the non-functional requirements are as follows:
• The system should be user friendly.
• The system should provide good accuracy.
• The system should perform with efficient throughput and response time.

CHAPTER 4

SYSTEM DESIGN

4.1 Importance of Design

The purpose of the design phase is to plan a solution to the problem specified by
the requirements document. This phase is the first step in moving from the problem
domain to the solution domain. In other words, starting with what is needed,
design takes us toward how to satisfy the needs. The design of a system is perhaps
the most critical factor affecting the quality of the software; it has a major
impact on the later phases, particularly testing and maintenance. The output of
this phase is the design document. The design activity is often divided into two
separate phases: System Design and Detailed Design.

System Design, also called top-level design, aims to identify the modules that should
be in the system, the specifications of these modules, and how they interact with each other to
produce the desired results. During this phase, the details of a module's data are usually
specified in a high-level design description language that is independent of the target
language in which the software will eventually be implemented.

In system design the focus is on identifying the modules, whereas during detailed
design the focus is on designing the logic for each module. During the system design
activities, developers bridge the gap between the requirements specification, produced during
requirements elicitation and analysis, and the system that is delivered to the user.

4.2 UML Diagrams

The Unified Modelling Language (UML) is a standard language for specifying, visualizing,
constructing and documenting a system and its components. It is a graphical language that
provides a vocabulary and a set of semantics and rules. The UML focuses on the conceptual and
physical representation of the system. It is used to understand, design, configure and control
information about systems. UML is a pictorial language used to make software blueprints.

4.2.1 Use Case Diagram

Figure 4.1 Use Case diagram

Description

A use case diagram describes the functionality provided by a system in terms of actors, their
goals represented as use cases, and any dependencies among those use cases. In this use case
diagram, the user and the customer who entered the reviews/text are the actors, and the rest
are the use cases.

4.2.2 Sequence Diagram

Figure 4.2 Sequence Diagram

Description

A sequence diagram is an interaction diagram that emphasizes the time ordering of messages.
Sequence diagrams and collaboration diagrams are isomorphic, meaning that you can take
one and transform it into the other. A sequence diagram generally contains objects and
messages, emphasizing the time ordering of those messages. In this diagram the objects are
User, Customers, Testing Data, Feature Extract and Classifier, where a message is a
specification of a communication between objects that conveys information with the
expectation that activity will ensue.

4.2.3 Activity Diagram

Figure 4.3 Activity Diagram

Description

An activity diagram is another important diagram in UML, used to describe the dynamic
aspects of the system. It is basically a flowchart representing the flow from one activity
to another; an activity can be described as an operation of the system. The control
flow is drawn from one operation to another. An activity diagram contains activity states,
action states, transitions and objects, where control flows from one state to another,
passing through joins and forks.

4.2.4 System Flow Diagram

Figure 4.4 System Flow Diagram

Description

A system flow diagram is a way to show the relationships between a business and its
components, such as customers (according to IT Toolbox). System flow diagrams, also
known as process flow diagrams or data flow diagrams, are cousins of common
flowcharts.

4.3 Flowchart

Figure 4.5 Flowchart

Description

A flowchart is a graphical representation of an algorithm. Programmers often use it as a
program-planning tool to solve a problem. It makes use of symbols, connected to one
another, to indicate the flow of information and processing. Using a flowchart, we can
easily understand a program. A flowchart is not language specific.

The process of drawing a flowchart for an algorithm is known as “flowcharting”.

CHAPTER 5

METHODOLOGY

There are primarily two approaches to sentiment classification of opinionated texts:
• Using a machine learning based text classifier such as Naïve Bayes
• Using Natural Language Processing
We will be using both machine learning and natural language processing for sentiment
analysis of tweets.

5.1 Machine Learning


Machine learning based text classifiers are a kind of supervised machine learning
paradigm, where the classifier needs to be trained on some labelled training data before it can
be applied to the actual classification task. The training data is usually an extracted portion of
the original data, hand-labelled manually. After suitable training, the classifier can be used on
the actual test data. Naïve Bayes is a statistical classifier, whereas the Support Vector Machine
is a kind of vector space classifier. The statistical text classification scheme of Naïve Bayes (NB)
can be adapted to the sentiment classification problem, since it can be visualized as a two-class
text classification problem with positive and negative classes. The Support Vector Machine
(SVM) is a vector space model based classifier, which requires that text documents be
transformed into feature vectors before they are used for classification. Usually the text
documents are transformed into multidimensional vectors, and the classification problem then
becomes one of classifying every text document, represented as a vector, into a particular class.
The SVM is a type of large-margin classifier: the goal is to find a decision boundary between the
two classes that is maximally far from any document in the training data.

This approach needs:

• A good classifier such as Naïve Bayes

• A training set for each class

There are various training sets available on the Internet, such as the Movie Reviews data set,
Twitter data sets, etc. The classes can be positive and negative, and we need training data
for both classes.

5.1.1 Naïve Bayes Classifier (NB)

The Naïve Bayes classifier is the simplest and most commonly used classifier. The Naïve
Bayes classification model computes the posterior probability of a class based on the
distribution of the words in the document. The model works with bag-of-words (BOW)
feature extraction, which ignores the position of a word in the document. It uses Bayes'
theorem to predict the probability that a given feature set belongs to a particular label:

P(label | features) = P(label) * P(features | label) / P(features)

P(label) is the prior probability of a label, i.e. the likelihood that a random feature set has
that label. P(features | label) is the probability that a given feature set is observed for that
label. P(features) is the probability that the feature set occurs at all. Given the naïve
assumption, which states that all features are independent, the equation can be rewritten as
follows:

P(label | features) = P(label) * P(f1 | label) * … * P(fn | label) / P(features)

5.1.1.1 Multinomial Naïve Bayes Classifier

Accuracy: around 75%

Algorithm:

i. Dictionary generation
Count the occurrences of all words in the whole data set and make a dictionary of
the most frequent words.
ii. Feature set generation
Each document is represented as a feature vector over the space of dictionary
words. For each document, keep track of the dictionary words along with their
number of occurrences in that document.
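The two steps above can be sketched in Python as follows. The function names and toy documents are illustrative only, not the project's actual code:

```python
from collections import Counter

def build_dictionary(documents, max_words=10):
    # Count occurrences of all words in the whole data set and keep
    # the most frequent ones as the dictionary.
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return [word for word, _ in counts.most_common(max_words)]

def feature_vector(document, dictionary):
    # Represent a document as word counts over the dictionary space.
    counts = Counter(document.lower().split())
    return [counts[word] for word in dictionary]

docs = ["I loved the movie", "I hated the movie"]
dictionary = build_dictionary(docs)           # ['i', 'the', 'movie', 'loved', 'hated']
print(feature_vector("I loved the movie", dictionary))  # [1, 1, 1, 1, 0]
```

`Counter.most_common` sorts by frequency and, for ties, keeps insertion order, so the dictionary ordering is deterministic.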

Formulas used by the algorithm:

Training
In this phase we generate the training data (words with their probability of
occurrence in the positive/negative training data files).
Calculate the prior P(label = y) for each label y.
Calculate φ(k | label = y) for each dictionary word k and store the result (here the
labels are positive and negative).
Now we have, for each defined label, every word and its corresponding probability.

Testing
Goal: finding the sentiment of a given test data file.
- Generate the feature set x for the test data file.
- For each document in the test set, find
Decision1 = log P(x | label = pos) + log P(label = pos)
Similarly calculate
Decision2 = log P(x | label = neg) + log P(label = neg)

Compare Decision1 and Decision2 to decide whether the document has positive or
negative sentiment.
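These training and decision steps can be sketched as follows. The helper names are illustrative, the smoothing is the Laplace estimate (nk + 1) / (n + |vocabulary|) used in the worked example, and the toy training documents mirror that example:

```python
import math
from collections import Counter

def train_word_probs(documents, vocabulary):
    # Laplace-smoothed estimate P(w | label) = (n_k + 1) / (n + |vocabulary|),
    # where n is the total word count over this label's documents and n_k is
    # the count of word w.
    counts = Counter(w for doc in documents for w in doc.lower().split())
    n = sum(counts[w] for w in vocabulary)
    return {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}

def decision(words, word_probs, prior):
    # log P(x | label) + log P(label); logs are summed to avoid underflow.
    return math.log(prior) + sum(math.log(word_probs[w]) for w in words if w in word_probs)

vocab = ["i", "loved", "the", "movie", "hated", "a", "great", "poor", "acting", "good"]
pos_probs = train_word_probs(["i loved the movie",
                              "a great movie good movie",
                              "great acting a good movie"], vocab)
neg_probs = train_word_probs(["i hated the movie", "poor acting"], vocab)

test_words = "i hated the poor acting".split()
d_pos = decision(test_words, pos_probs, prior=3/5)
d_neg = decision(test_words, neg_probs, prior=2/5)
print("negative" if d_neg > d_pos else "positive")  # negative
```

Exponentiating the two decisions recovers the probabilities of the worked example (about 6.03 × 10⁻⁷ for positive and 1.22 × 10⁻⁵ for negative).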


The following diagrams and calculations show details of tweet data processing,
feature extraction, analysis and tweet polarity classification based on the Naïve Bayes
algorithm and classifier.
Suppose we have documents with known classifications:

DOC | TEXT                       | CLASS
 1  | I loved the movie          |   +
 2  | I hated the movie          |   -
 3  | A great movie, good movie  |   +
 4  | Poor acting                |   -
 5  | Great acting, a good movie |   +

Figure 5.1 List of documents


Ten unique words:
<I, loved, the, movie, hated, a, great, poor, acting, good>
Convert each document into a feature set, where the attributes are the possible words
and the values are the number of times a word occurs in the given document:

DOC | I | loved | the | movie | hated | a | great | poor | acting | good | CLASS
 1  | 1 |   1   |  1  |   1   |       |   |       |      |        |      |   +
 2  | 1 |       |  1  |   1   |   1   |   |       |      |        |      |   -
 3  |   |       |     |   2   |       | 1 |   1   |      |        |  1   |   +
 4  |   |       |     |       |       |   |       |  1   |   1    |      |   -
 5  |   |       |     |   1   |       | 1 |   1   |      |   1    |  1   |   +

Figure 5.2 List of feature sets

Documents with positive outcomes:

DOC | I | loved | the | movie | hated | a | great | poor | acting | good | CLASS
 1  | 1 |   1   |  1  |   1   |       |   |       |      |        |      |   +
 3  |   |       |     |   2   |       | 1 |   1   |      |        |  1   |   +
 5  |   |       |     |   1   |       | 1 |   1   |      |   1    |  1   |   +

Figure 5.3 Positive Vocabulary

P(+) = 3/5 = 0.6

Compute: P(i|+); P(loved|+); P(the|+); P(movie|+);
P(a|+); P(great|+); P(acting|+); P(good|+)

Let n be the total number of words in the (+) documents (here n = 14), and nk the number
of times word k occurs in those documents. With Laplace smoothing,

P(Wk|+) = (nk + 1) / (n + |Vocabulary|)


P(i|+) = (1+1)/(14+10) = 0.0833;     P(loved|+) = (1+1)/(14+10) = 0.0833;
P(the|+) = (1+1)/(14+10) = 0.0833;   P(movie|+) = (4+1)/(14+10) = 0.2083;
P(a|+) = (2+1)/(14+10) = 0.125;      P(great|+) = (2+1)/(14+10) = 0.125;
P(acting|+) = (1+1)/(14+10) = 0.0833;  P(good|+) = (2+1)/(14+10) = 0.125;
P(hated|+) = (0+1)/(14+10) = 0.0417;   P(poor|+) = (0+1)/(14+10) = 0.0417;

Now, let’s look at the negative examples:

DOC | I | loved | the | movie | hated | a | great | poor | acting | good | CLASS
 2  | 1 |       |  1  |   1   |   1   |   |       |      |        |      |   -
 4  |   |       |     |       |       |   |       |  1   |   1    |      |   -

Figure 5.4 Negative Vocabulary

P(-) = 2/5 = 0.4

P(i|-) = (1+1)/(6+10) = 0.125;      P(loved|-) = (0+1)/(6+10) = 0.0625;
P(the|-) = (1+1)/(6+10) = 0.125;    P(movie|-) = (1+1)/(6+10) = 0.125;
P(a|-) = (0+1)/(6+10) = 0.0625;     P(great|-) = (0+1)/(6+10) = 0.0625;
P(acting|-) = (1+1)/(6+10) = 0.125;   P(good|-) = (0+1)/(6+10) = 0.0625;
P(hated|-) = (1+1)/(6+10) = 0.125;    P(poor|-) = (1+1)/(6+10) = 0.125;

Now that we have trained our classifier, let’s classify a new sentence according to:

vNB = argmax_{vj ∈ V} P(vj) ∏_{w ∈ words} P(w | vj)

where v stands for “value” or “class”.

For the sentence “I hated the poor acting”:

If vj = +:  P(+) P(i|+) P(hated|+) P(the|+) P(poor|+) P(acting|+) = 6.03 × 10⁻⁷

If vj = −:  P(−) P(i|−) P(hated|−) P(the|−) P(poor|−) P(acting|−) = 1.22 × 10⁻⁵

Since 1.22 × 10⁻⁵ > 6.03 × 10⁻⁷, the classifier labels the sentence as negative.
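The two products can be checked with a few lines of arithmetic, plugging in the class priors and the smoothed word probabilities computed above:

```python
# P(+) = 0.6 and P(-) = 0.4; the factors are the smoothed probabilities of
# i, hated, the, poor, acting under each class, from the tables above.
p_pos = 0.6 * (2/24) * (1/24) * (2/24) * (1/24) * (2/24)
p_neg = 0.4 * (2/16) * (2/16) * (2/16) * (2/16) * (2/16)
print(f"{p_pos:.2e} {p_neg:.2e}")  # 6.03e-07 1.22e-05
```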

5.2 Natural Language Processing

Natural language processing (NLP) is a field of computer science, artificial intelligence and
linguistics concerned with the interactions between computers and human (natural)
languages. This approach utilizes the publicly available library SentiWordNet, which
provides sentiment polarity values for every term occurring in the document. In this lexical
resource, each term t occurring in WordNet is associated with three numerical scores obj(t),
pos(t) and neg(t), describing the objective, positive and negative polarities of the term,
respectively. These three scores are computed by combining the results produced by eight
ternary classifiers. WordNet is a large lexical database of English: nouns, verbs, adjectives
and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct
concept.

WordNet is also freely and publicly available for download. WordNet’s structure makes it a
useful tool for computational linguistics and natural language processing. It groups words
together based on their meanings; a synset is simply a set of one or more synonyms. This
approach uses semantics to understand the language. The major tasks in NLP that help in
extracting sentiment from a sentence are:

• Extracting the part of the sentence that reflects the sentiment

• Understanding the structure of the sentence
• Using the different tools that help process the textual data

Basically, positive and negative scores are obtained from SentiWordNet for each word
according to its part-of-speech tag; by totalling the positive and negative scores we determine
the sentiment polarity based on which class (positive or negative) has received the
highest score.
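This lexicon-based scoring can be sketched as follows; a tiny hand-made lexicon stands in for SentiWordNet here, so the scores and tags below are purely illustrative, not actual SentiWordNet values:

```python
# (word, POS tag) -> (positive score, negative score); illustrative values only,
# standing in for real SentiWordNet lookups.
LEXICON = {
    ("good", "adj"): (0.75, 0.00),
    ("great", "adj"): (0.75, 0.00),
    ("poor", "adj"): (0.00, 0.62),
    ("hated", "verb"): (0.00, 0.75),
    ("acting", "noun"): (0.00, 0.00),
}

def lexicon_polarity(tagged_words):
    # Sum positive and negative scores of each word according to its
    # part-of-speech tag; the class with the highest total wins.
    pos_total = sum(LEXICON.get(pair, (0.0, 0.0))[0] for pair in tagged_words)
    neg_total = sum(LEXICON.get(pair, (0.0, 0.0))[1] for pair in tagged_words)
    return "positive" if pos_total > neg_total else "negative"

print(lexicon_polarity([("poor", "adj"), ("acting", "noun")]))  # negative
```

A real implementation would tag the sentence with a POS tagger and look each (word, tag) pair up in SentiWordNet instead of this toy dictionary.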

5.3 Programming tools

5.3.1 Python

Python is a widely used high-level, general-purpose, interpreted, dynamic programming
language. Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be possible in languages
such as C or Java. The language provides constructs intended to enable writing clear programs
on both a small and a large scale.

5.3.2 NLTK

NLTK is a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as
WordNet, along with a suite of text processing libraries for classification, tokenization,
stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP
libraries, and an active discussion forum.

NLTK has been called “a wonderful tool for teaching, and working in, computational
linguistics using Python,” and “an amazing library to play with natural language.” NLTK is
suitable for linguists, engineers, students, educators, researchers, and industry users alike.
Natural Language Processing with Python provides a practical introduction to programming
for language processing. Written by the creators of NLTK, it guides the reader through the
fundamentals of writing Python programs, working with corpora, categorizing text, analyzing
linguistic structure, and more.

5.3.3 matplotlib

matplotlib.pyplot is a collection of command-style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure,
creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with
labels, etc. In matplotlib.pyplot, various states are preserved across function calls, so that it
keeps track of things like the current figure and plotting area, and the plotting functions are
directed to the current axes (note that "axes" here, as in most places in the matplotlib
documentation, refers to the axes part of a figure and not the strict mathematical term for
more than one axis).
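The pie charts shown in the results chapter could be produced with a few pyplot calls like these; the counts and output file name are hypothetical, not taken from the project's code:

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt

# Hypothetical polarity counts for one product's tweets.
counts = {"Positive": 64, "Negative": 36}

fig, ax = plt.subplots()
ax.pie(counts.values(), labels=counts.keys(), autopct="%1.1f%%")
ax.set_title("Sentiment polarity of collected tweets")
fig.savefig("sentiment_pie.png")
```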

CHAPTER 6

TESTING

6.1 Importance of Testing

The purpose of testing is to discover errors. Testing is the process of trying to
discover every conceivable fault or weakness in a work product. It provides a way to check
the functionality of components, sub-assemblies, assemblies and/or a finished product. It is
the process of exercising software with the intent of ensuring that the software system meets
its requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests, and each test type addresses a specific testing requirement.

6.2 Types of Testing

Unit Testing

Unit testing involves the design of test cases that validate that the internal program
logic is functioning properly and that program inputs produce valid outputs. All decision
branches and internal code flow should be validated. Unit testing is the testing of individual
software units of the application, done after the completion of an individual unit and before
integration. This is structural testing that relies on knowledge of the unit's construction and is
invasive. Unit tests perform basic tests at the component level and test a specific business
process, application, and/or system configuration. They ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.

Integration Testing

Integration tests are designed to test integrated software components to determine whether
they actually run as one program. Testing is event driven and is more concerned with the
basic outcome of screens or fields. Integration tests demonstrate that, although the
components were individually satisfactory, as shown by successful unit testing, the
combination of components is also correct and consistent. Integration testing is specifically
aimed at exposing the problems that arise from the combination of components.

Functional Testing

Functional tests provide a systematic demonstration that the functions tested are available
as specified by the business and technical requirements, system documentation, and user
manuals. Functional testing is centered on the following items:

• Valid Input – identified classes of valid input must be accepted.

• Invalid Input – identified classes of invalid input must be rejected.
• Functions – identified functions must be exercised.
• Output – identified classes of application outputs must be exercised.

Organization and preparation of functional tests are focused on requirements, key
functions, or special test cases. In addition, systematic coverage of identified business
process flows, data fields, predefined processes, and successive processes must be
considered for testing. Before functional testing is complete, additional tests are identified
and the effective value of the current tests is determined.

System Testing

We usually perform system testing to find errors resulting from unanticipated
interactions between the sub-systems and system components. Once the source code is
generated, the software must be tested to detect and rectify all possible errors before
delivering it to the customers. To find errors, a series of test cases must be developed which
ultimately uncover all the errors that may exist. Different software testing techniques can be
used for this process. These techniques provide systematic guidance for designing tests that:

• Exercise the internal logic of the software components.

• Exercise the input and output domains of a program to uncover errors in program
function, behavior and performance.

White Box Testing

White box testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is used to test areas
that cannot be reached from the black box level; unlike black box testing, the software under
test is not treated as an opaque unit.

Black Box Testing

Black box testing is testing the software without any knowledge of the inner workings,
structure or language of the module being tested. Black box tests, like most other kinds of
tests, must be written from a definitive source document, such as a specification or
requirements document.

Performance Testing

Performance testing is done to test the run-time performance of the software within the
context of an integrated system. These tests are carried out throughout the testing process; for
example, the performance of an individual module is assessed under white box testing during
unit testing.

Verification and Validation

The testing process is part of the broader subject of verification and validation.
We have to acknowledge the system specifications and try to meet the customer’s
requirements, and for this sole purpose we have to verify and validate the product to make
sure everything is in place. Verification and validation are two different things: one is
performed to ensure that the software correctly implements a specific functionality, and the
other is done to ensure that the customer's requirements are properly met by the end product.

Verification asks 'are we building the product right?' while validation asks 'are we building
the right product?'.

CHAPTER 7

ANALYSIS AND RESULTS

7.1 Analysis

We collected a dataset containing positive and negative examples. This dataset was
used as training data and classified using the Naïve Bayes classifier. Before training the
classifier, unnecessary words, punctuation and meaningless tokens were removed to obtain
clean data. To determine the positivity and negativity of text, we collected data from
different sources; the data was stored in a database and then retrieved, after which the
unnecessary words and punctuation were removed. To check the polarity of the test data, we
trained the classifier with the training data. The classifier is retrained on these results each
time the program is executed.
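The cleaning step described above might look like this in Python; the stop-word list is abbreviated for illustration, and a real run would use a fuller list:

```python
import re
import string

# Abbreviated stop-word list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "rt"}

def clean_tweet(text):
    # Strip URLs and @mentions, then punctuation, then stop words.
    text = re.sub(r"https?://\S+|@\w+", "", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOPWORDS]

print(clean_tweet("RT @user: The movie is great! http://example.com/x"))  # ['movie', 'great']
```

The surviving tokens are what the classifier actually sees, so cleaning directly affects the quality of the training data.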

After encountering, and successfully eliminating, a number of errors, we completed our
project with continuous effort. At the end of the project the results can be summarized as:

• A user-friendly application.
• No expertise is required for using the application.
• Organizations can use the application to visualize product or brand review
graphically.

7.2 Results

7.2.1 Test case 1:

When the input is completely positive data, i.e. when the data collected is
entirely positive regarding the product, the output of the sentiment analysis system
is as follows:

Input:

Output:

7.2.2 Test case 2:

When the input is completely negative data, i.e. when the data collected is
entirely negative regarding the product, the output of the sentiment analysis system
is as follows:

Input:

Output:

7.2.3 Test case 3:

When the input is a combination of both positive and negative data, i.e. when the
data collected contains both positive and negative comments regarding the product,
the output of the sentiment analysis system is as follows:

Input:

Output:

7.2.4 Test case 4:

When the input given is not relevant to the analysis, the output is as follows:

7.2.5 Test case 5:

When the data being analysed has not yet been commented on by anyone, the following
message is displayed on the screen:

CHAPTER 8

LIMITATION AND FUTURE ENHANCEMENT

8.1 Limitation

The system we designed determines the opinion of people based on data supplied
dynamically. We completed our project but were able to determine only the positivity and
negativity of data; for neutral data we were unable to merge a suitable dataset.

Also, we are currently analysing only a few datasets. This may not give proper values and
results, and the results are not very accurate.

8.2 Future Enhancement

• Analysing sentiments on emoji/smiley.


• Determining neutrality.
• Potential improvement can be made to our data collection and analysis method.
• Future research can be done with possible improvement such as more refined data
and more accurate algorithm.

CONCLUSION

We have completed our project using Python as the language, with different modules for
analysis and output presentation. Although there were problems in integrating the different
Python modules, through a number of tutorials we were able to integrate them.

We were able to determine the positivity and negativity of each piece of data. Based on those
comments or data, we represented the results in a diagram such as a pie chart; all the diagrams
related to the outcome are shown in the results (Section 7.2). A small conclusion is also shown
during output presentation, based on the product or brand entered. Our designed system is
user friendly.

All results are displayed as a pie chart.

REFERENCES
1. Kim S-M, Hovy E (2004) Determining the sentiment of opinions. In: Proceedings of the
20th International Conference on Computational Linguistics, page 1367. Association for
Computational Linguistics, Stroudsburg, PA, USA.

2. Liu B (2010) Sentiment analysis and subjectivity. In: Handbook of Natural Language
Processing, Second Edition. Taylor and Francis Group, Boca Raton.

3. Liu B, Hu M, Cheng J (2005) Opinion observer: Analysing and comparing opinions on
the web. In: Proceedings of the 14th International Conference on World Wide Web,
WWW ’05, 342–351. ACM, New York, NY, USA.

4. Pak A, Paroubek P (2010) Twitter as a corpus for sentiment analysis and opinion mining.
In: Proceedings of the Seventh Conference on International Language Resources and
Evaluation. European Language Resources Association, Valletta, Malta.

5. Pang B, Lee L (2004) A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts. In: Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics, ACL ’04. Association for Computational
Linguistics, Stroudsburg, PA, USA.

6. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Foundations and Trends in
Information Retrieval 2(1-2): 1–135.

7. Liu B (2014) The science of detecting fake reviews.
http://content26.com/blog/bingliu-the-science-of-detecting-fake-reviews/

8. Jahanbakhsh K, Moon Y (2014) The predictive power of social media: On the
predictability of U.S. presidential elections using Twitter.

9. Mukherjee A, Liu B, Glance N (2012) Spotting fake reviewer groups in consumer reviews.
In: Proceedings of the 21st International Conference on World Wide Web, WWW ’12,
191–200. ACM, New York, NY, USA.

10. Saif H, He Y, Alani H (2012) Semantic sentiment analysis of Twitter. In: The Semantic
Web (pp. 508–524). ISWC.

11. Tan LK-W, Na J-C, Theng Y-L, Chang K (2011) Sentence-level sentiment polarity
classification using a linguistic approach. In: Digital Libraries: For Cultural Heritage,
Knowledge Dissemination, and Future Creation, 77–87. Springer, Heidelberg, Germany.

12. Liu B (2012) Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human
Language Technologies. Morgan & Claypool Publishers.

13. Gann W-JK, Day J, Zhou S (2014) Twitter analytics for insider trading fraud detection
system. In: Proceedings of the Second ASE International Conference on Big Data. ASE.

14. Joachims T (1997) Probabilistic analysis of the Rocchio algorithm with TFIDF for text
categorization. In: Proceedings of the ICML Conference.

15. Li Y-M, Li T-Y. Deriving market intelligence from microblogs.