CHAPTER -1
INTRODUCTION
1.1 OVERVIEW
Sentiment is an attitude, thought, or judgment prompted by feeling. Sentiment analysis, which
is also known as opinion mining, studies people's sentiments towards certain entities. The
Internet is a resourceful place with respect to sentiment information. From a user's perspective,
people are able to post their own content through various social media, such as forums,
micro-blogs, or online social networking sites. From a researcher's perspective, many social
media sites release their application programming interfaces (APIs), promoting data collection
and analysis by researchers and developers. For instance, Twitter currently has three different
versions of APIs available, namely the REST API, the Search API, and the Streaming API.
With the REST API, developers are able to gather status data and user information; the Search
API allows developers to query specific Twitter content, whereas the Streaming API is able to
collect Twitter content in real time. Moreover, developers can mix those APIs to create their
own applications. Hence, sentiment analysis appears to have a strong foundation in the support
of massive online data.
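As a purely illustrative aside (not part of the original study), the snippet below sketches how the Search API might be queried through the third-party tweepy library, assuming tweepy 4.x; the credentials and the query string are placeholders.

```python
# Hedged sketch: querying the Twitter Search API via tweepy (4.x assumed).
# All credentials below are placeholders, not real keys.
import tweepy

auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET",
    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",
)
api = tweepy.API(auth)

# Pull recent tweets matching a query; each status carries the text and
# metadata that a sentiment analysis pipeline can consume.
for status in api.search_tweets(q="smartphone review", lang="en", count=10):
    print(status.user.screen_name, "->", status.text)
```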
However, these types of online data have several flaws that potentially hinder the process of
sentiment analysis. The first flaw is that, since people can freely post their own content, the
quality of their opinions cannot be guaranteed. For example, instead of sharing topic-related
opinions, online spammers post spam on forums. Some spam is entirely meaningless, while
other spam carries irrelevant opinions, also known as fake opinions. The second flaw is that
ground truth for such online data is not always available. Ground truth is essentially a tag on a
certain opinion, indicating whether the opinion is positive, negative, or neutral. The Stanford
Sentiment 140 Tweet Corpus is one of the datasets that has ground truth and is also publicly
available. The corpus contains 1.6 million machine-tagged Twitter messages. Each message is
tagged based on the emoticons (☺ as positive, ☹ as negative) discovered inside the message.
The data used in this paper is a set of product reviews collected from Amazon between
February and April 2014. The aforementioned flaws have been somewhat overcome in the
following two ways: First, each product review receives inspections before it can be posted.
Second, each review must carry a rating, which can be used as the ground truth.
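To make the rating-as-ground-truth idea concrete, the helper below maps star ratings to polarity labels; the thresholds (4-5 stars positive, 1-2 negative, 3 neutral) are a common convention assumed here for illustration, not a rule fixed by the dataset.

```python
def rating_to_label(stars: int) -> str:
    """Map a 1-5 star product rating to a polarity ground-truth label.

    Thresholds are an illustrative assumption: >= 4 positive,
    <= 2 negative, 3 neutral.
    """
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"

print(rating_to_label(5))  # positive
print(rating_to_label(3))  # neutral
print(rating_to_label(1))  # negative
```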
This paper tackles a fundamental problem of sentiment analysis, namely sentiment polarity
categorization. A flowchart depicting our proposed categorization process also serves as the
outline of this paper. Our contributions mainly fall into Phases 2 and 3. In Phase 2: 1) an
algorithm is proposed and implemented for negation phrase identification; 2) a mathematical
approach is proposed for sentiment score computation; 3) a feature vector generation method
is presented for sentiment polarity categorization. In Phase 3: 1) two sentiment polarity
categorization experiments are performed, at the sentence level and the review level
respectively; 2) the performance of three classification models is evaluated and compared
based on their experimental results.
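The paper's actual negation-phrase algorithm is not reproduced here; the sketch below only conveys the general idea of pairing a negation word with the sentiment-bearing word that follows it, using tiny stand-in word lists.

```python
# Illustrative sketch only: pair a negation word with the sentiment word
# immediately after it. The word sets are stand-ins, not the paper's lexicons.
NEGATIONS = {"not", "never", "no", "hardly"}
SENTIMENT_WORDS = {"good", "bad", "great", "terrible", "useful"}

def find_negation_phrases(tokens):
    phrases = []
    for i, tok in enumerate(tokens[:-1]):
        if tok.lower() in NEGATIONS and tokens[i + 1].lower() in SENTIMENT_WORDS:
            phrases.append((tok, tokens[i + 1]))
    return phrases

print(find_negation_phrases("this phone is not good at all".split()))
# -> [('not', 'good')]
```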
Every day, an enormous amount of data is created from reviews, blogs, and other media and
diffused into the World Wide Web. This huge body of data contains crucial opinion-related
information that can be used to benefit businesses and other commercial and scientific
industries. Manually tracking and extracting this useful information is not feasible; thus,
sentiment analysis is required. Sentiment analysis is the process of extracting sentiments or
opinions from reviews expressed by users about a particular subject, area, or product online. It
is an application of natural language processing, computational linguistics, and text analytics
to identify subjective information in source data. It groups the sentiments into categories such
as "positive" or "negative", and thus determines the general attitude of the speaker or writer
with respect to the topic in context.
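TextBlob, which this project's pipeline uses later for processing, already exposes such a polarity judgment; a minimal example follows (treating a score of exactly zero as "neutral" is our simplifying assumption):

```python
from textblob import TextBlob

def categorize(text: str) -> str:
    # TextBlob polarity ranges from -1.0 (most negative) to +1.0 (most
    # positive); treating exactly 0.0 as "neutral" is a simplification.
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

print(categorize("I love this product"))    # positive
print(categorize("The service was awful"))  # negative
```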
Natural language processing (NLP) is the technology dealing with our most ubiquitous product:
human language, as it appears in emails, web pages, tweets, product descriptions, newspaper
stories, social media, and scientific articles, in thousands of languages and varieties. In the past
decade, successful natural language processing applications have become part of our everyday
experience, from spelling and grammar correction in word processors to machine translation on
the web, from email spam detection to automatic question answering, from detecting people's
opinions about products or services to extracting appointments from your email. The greatest
challenge of sentiment analysis is to design application-specific algorithms and techniques that
can accurately analyze the linguistics of human language.
1.2 APPLICATIONS
The following are the major applications of sentiment analysis in real-world scenarios.
Product and Service reviews - The most common application of sentiment analysis is in the area
of reviews of consumer products and services. There are many websites that provide automated
summaries of reviews about products and about their specific aspects. A notable example of that
is “Google Product Search”.
Reputation Monitoring - Twitter and Facebook are a focal point of many sentiment analysis
applications. The most common application is monitoring the reputation of a specific brand on
Twitter and/or Facebook.
Result prediction - By analyzing sentiments from relevant sources, one can predict the probable
outcome of a particular event. For instance, sentiment analysis can provide substantial value to
candidates running for various positions. It enables campaign managers to track how voters feel
about different issues and how they relate to the speeches and actions of the candidates.
Decision making - Another important application is that sentiment analysis can be used as an
important factor in decision-making systems, for instance in financial market investment. There
are numerous news items, articles, blogs, and tweets about each public company. A sentiment
analysis system can use these various sources to find articles that discuss the companies and
aggregate the sentiment about them into a single score that can be used by an automated trading
system. One such system is The Stock Sonar.
1.3 OBJECTIVE
In Chapter 3, we discuss the project dependencies and interfaces required, such as the hardware
and software needed, with a detailed discussion of the user interface and communication
interface.
In Chapter 4, the system design and the system requirements are described along with the
architecture.
In Chapter 5, we discuss our modules, with screenshots of the project, the required platform,
and other dependencies.
In Chapter 6, we give our conclusions and suggest future work and enhancements that could be
made to the project.
CHAPTER -2
LITERATURE SURVEY
2.1 INTRODUCTION
Since reviews of much prior work on sentiment analysis are already available in the literature,
in this section we only review some previous work upon which our research is essentially
based. Hu and Liu summarized a list of positive words and a list of negative words based on
customer reviews. The positive list contains 2006 words and the negative list has 4783 words.
Both lists also include some misspelled words that are frequently present in social media
content. Sentiment categorization is essentially a classification problem, where features that
contain opinions or sentiment information should be identified before the classification. For
feature selection, Pang and Lee suggested removing objective sentences and extracting
subjective ones. They proposed a text-categorization technique that is able to identify
subjective content using a minimum cut. Gann et al. selected 6,799 tokens based on Twitter
data, where each token is assigned a sentiment score, namely the TSI (Total Sentiment Index),
featuring itself as a positive token or a negative token. Specifically, the TSI for a certain token
is computed as:

TSI = (p − (tp/tn) · n) / (p + (tp/tn) · n)

where p is the number of times the token appears in positive tweets, n is the number of times
the token appears in negative tweets, and tp/tn is the ratio of the total number of positive
tweets to the total number of negative tweets.
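Transcribing the TSI formula as reconstructed above into Python (the counts in the example are invented for illustration):

```python
def tsi(p: int, n: int, total_pos: int, total_neg: int) -> float:
    """Total Sentiment Index of a token.

    p, n: occurrences of the token in positive / negative tweets.
    total_pos, total_neg: corpus-wide tweet counts forming the tp/tn ratio.
    """
    ratio = total_pos / total_neg
    return (p - ratio * n) / (p + ratio * n)

# A token seen 80 times in positive tweets and 20 times in negative tweets,
# in a corpus with equally many positive and negative tweets overall:
print(tsi(80, 20, 10000, 10000))  # 0.6 -> the token leans positive
```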
2.2 EXISTING SYSTEM
The main difference between sentiment analysis of product reviews and of whole documents is
that review-based approaches are more specific towards determining the polarity of words
(mainly adjectives), whereas document-based approaches are specific towards determining
features in the text. There are three major approaches for Twitter-specific sentiment analysis:
the lexical analysis approach, the machine learning approach, and a hybrid of the two. One can
thus employ unsupervised techniques, supervised techniques, or a combination of them. We
first review the lexical approaches, which focus on building effective dictionaries; then the
machine learning approaches, which are primarily concerned with feature vectors; and finally
the combination of both, i.e. the hybrid approach.
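In its simplest form, a lexical approach just counts dictionary hits, as the sketch below shows; the two tiny word sets are stand-ins for full opinion lexicons such as Hu and Liu's lists.

```python
# Toy lexicon-based classifier: count positive vs. negative dictionary hits.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def lexical_polarity(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexical_polarity("great screen but terrible battery and poor support"))
# -> negative (one positive hit vs. two negative hits)
```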
Research reveals that sentiment analysis is more difficult than traditional topic-based text
classification, despite the fact that the number of classes in sentiment analysis is smaller than
in topic-based classification. In sentiment analysis, the classes to which a piece of text is
assigned are usually negative or positive. They can also be other binary classes or multi-valued
classes, such as classification into "positive", "negative", and "neutral", but these are still fewer
than the number of classes in topic-based classification. Sentiment analysis is tougher than
topic-based classification because the latter can rely on keywords, whereas in sentiment
analysis a variety of features beyond keywords have to be taken into account. Other reasons for
the difficulty are: sentiment can be expressed in subtle ways without any perceived use of
negative words; it is difficult to determine whether a given text is objective or subjective (there
is always a fine line between objective and subjective texts); it is difficult to determine the
opinion holder (for example, is it the opinion of the author or the opinion of the commenter);
and there are other factors such as dependency on domain and on word order. Further
challenges of sentiment analysis are dealing with sarcasm, irony, and negation.
It is clear from the analysis of the literature that machine learning approaches have so far
proved outstanding in delivering accurate results. Depending upon the area of application,
however, each approach has an edge: for a small number of features the lexical analysis
technique is respectable, whereas for a large number of features the machine learning technique
dominates. Both approaches have pros and cons. Lexical analysis is a ready-to-go technique
that does not require any prior classification or training of datasets; it can be applied directly on
live data, given that the dictionary is sufficiently large. In the machine learning technique, by
contrast, the classifier needs to be initially fed or "trained" with datasets and tuned to cluster the
sentiments into pre-defined classes. It works efficiently on large texts with large feature
support, but few features lead to lower accuracy. In this chapter we presented a detailed
literature review of the existing approaches and techniques. The next chapter describes the
detailed design and analysis of the proposed hybrid Naïve Bayes approach.
CHAPTER -3
METHODOLOGY
The software used for this study is scikit-learn, an open-source machine learning package in
Python. The classification models selected for categorization are: Naïve Bayesian, Random
Forest, and Support Vector Machine.
The Naïve Bayesian classifier assumes class-conditional independence, so that the probability
of observing a feature vector X = (x_1, ..., x_n) given a class C factorizes as:

P(X | C) = ∏_{k=1}^{n} P(x_k | C)
A random forest is an ensemble of decision trees in which the final class of a tuple is decided
by the majority of the trees' votes. The decision tree algorithm implemented in scikit-learn is
CART (Classification and Regression Trees). CART uses the Gini index for its tree induction.
For a training set D, the Gini index is computed as:
Gini(D) = 1 − ∑_{i=1}^{m} p_i²
where p_i is the probability that a tuple in D belongs to class C_i. The Gini index measures the
impurity of D: the lower the index value, the better D was partitioned. Detailed descriptions of
CART can be found in the literature.
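The Gini computation is easy to verify by hand; a two-class sketch:

```python
def gini(class_probs):
    """Gini index of a partition D, given the per-class probabilities p_i."""
    return 1.0 - sum(p * p for p in class_probs)

print(gini([0.5, 0.5]))  # 0.5 -> maximally impure two-class partition
print(gini([1.0, 0.0]))  # 0.0 -> pure partition
```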
A support vector machine (SVM) is a method for the classification of both linear and nonlinear
data. If the data is linearly separable, the SVM searches for the linear optimal separating
hyperplane (the linear kernel), which is a decision boundary that separates data of one class
from another. Mathematically, a separating hyperplane can be written as W · X + b = 0, where
W = {w_1, w_2, ..., w_n} is a weight vector, X is a training tuple, and b is a scalar. Optimizing
the hyperplane essentially reduces to the minimization of ∥W∥, whose solution can be written
as:

W = ∑_{i=1}^{n} α_i y_i X_i

where the α_i are numeric parameters and the y_i are the labels of the support vectors X_i. The
separating constraints are: if y_i = 1 then W · X_i + b ≥ 1; if y_i = −1 then W · X_i + b ≤ −1.
If the data is linearly inseparable, the SVM uses a nonlinear mapping to transform the data into
a higher dimension and then solves the problem by finding a linear hyperplane there. Functions
that perform such transformations are called kernel functions. The kernel function selected for
our experiment is the Gaussian Radial Basis Function (RBF):
K(X_i, X_j) = e^{−γ ∥X_i − X_j∥²}
where the X_i are support vectors, the X_j are testing tuples, and γ is a free parameter that uses
the default value from scikit-learn in our experiment. Figure 9 shows a classification example
of SVM based on the linear kernel and the RBF kernel.
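The three scikit-learn models named above can be trained side by side as sketched below; a bundled toy dataset stands in for the review features, so the accuracies printed are illustrative only, not this paper's results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in for review feature vectors
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (RBF)": SVC(kernel="rbf", gamma="scale"),  # scikit-learn's default gamma
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```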
CHAPTER -4
SYSTEM DESIGN
4.1 INTRODUCTION
The incorporation of adaptation methods and techniques allows the development of adaptive
e-learning systems, where each student receives personalized guidance during the learning
process (Broslowski, 2001). In order to provide personalization, it is necessary to store
information about each student in what is called the student model. The specific information to
be collected and stored depends on the goals of the adaptive learning system (e.g., preferences,
learning styles, personality, emotional state, context, previous actions, and so on). In particular,
affective and emotional factors, among other aspects, seem to affect student motivation and, in
general, the outcome of the learning process (Shen, Wang, & Shen, 2012). Therefore, in
learning contexts, being able to detect and manage information about the students' emotions at
a certain time can contribute to knowing their potential needs at that time. On the one hand,
adaptive e-learning environments can make use of this information to fulfill those needs at
runtime: they can provide the user with recommendations about activities to tackle or contents
to interact with, adapted to his/her emotional state at that time. On the other hand, information
about the student's emotions towards a course can act as feedback for the teacher. This is
especially useful for online courses, in which there is little (or no) face-to-face contact between
students and teachers and, therefore, fewer opportunities for teachers to get feedback from the
students. In general, in order for a system to be able to make decisions based on information
about its users, it must acquire and store information about them. One of the most traditional
procedures to obtain information about users consists of asking them to fill in questionnaires.
However, users can find this task too time-consuming. Recently, non-intrusive techniques have
been preferred, yet without compromising the reliability of the model built. There are different
ways to provide such information non-intrusively.
4.2 SYSTEM ARCHITECTURE
The definition and modelling of an architecture dedicated to the analysis of big data, such as
the data produced by social networks like Twitter, is currently still at an early stage of
development and consolidation. Unlike traditional data warehouse or business intelligence
systems, whose architecture is designed for structured data, systems dedicated to big data work
instead with semi-structured data, or so-called "raw data", i.e. data without a particular
structure. It should also be pointed out that such systems should be able to process and analyze
data not only in batch mode, but also in a real-time fashion. Nowadays a huge amount of data,
produced daily by social networks, can be processed and analyzed for different purposes. These
data come with several features, among them: dimension, peculiarities, source, and reliability.
Over time, the need to obtain information and the way this information must be processed have
changed. Until recently it was thought that data should be processed first and made available
subsequently, regardless of the time aspect. This type of processing is commonly called batch
processing. Nowadays the amount of data has increased exponentially, and real-time processing
is needed to get the most advantage from this data in different fields. Batch models do not
allow working with data in a real-time fashion, due to the long time required by processing
operations. On the other hand, implementing a purely real-time processing architecture could
lead to lower accuracy. One possible solution is to merge the two concepts into a single
architecture, capable of handling big data but also with scalable and fast processing features.
Such a solution is the so-called Lambda Architecture (Marz and Warren, 2015), a software
architecture made of 3 different levels: the batch layer, the speed layer, and the serving layer.
4.2.1 WORK FLOW DIAGRAM
Figure 4.2 below shows a simplified architecture of the sentiment analysis system. The figure
shows how data is collected from social networking sites and processed using libraries such as
Scrapy and TextBlob. The output is then displayed on a web interface built on the Python
platform.
4.2.2 DESCRIPTION
The data collected from social media is pushed to our web server, where it is processed through
NLP; the useful insights produced are then displayed on the web interface.
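As a hedged sketch of that server side (the report does not name the web framework, so Flask is assumed here for illustration):

```python
# Minimal sketch: receive text, score it with TextBlob, and return the
# insight as JSON for a web interface to display. Flask is an assumption.
from flask import Flask, jsonify, request
from textblob import TextBlob

app = Flask(__name__)

@app.route("/sentiment", methods=["POST"])
def sentiment():
    text = request.get_json(force=True).get("text", "")
    polarity = TextBlob(text).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    return jsonify({"text": text, "polarity": polarity, "label": label})

if __name__ == "__main__":
    app.run(debug=True)
```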
Preprocessing
The tweets gathered from Twitter are a mixture of URLs and other non-sentimental data such
as hashtags ("#"), annotations ("@"), and retweets ("RT"). To obtain n-gram features, we first
have to tokenize the text input. Tweets pose a problem for standard tokenizers designed for
formal, regular text. The following figure displays the sequence of intermediate preprocessing
steps taking place at this level; these steps yield the list of features to be taken account of by
the classifier. We discuss each feature deployed in brief.
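A minimal version of such tweet-specific cleaning and tokenization might look as follows; the regular expressions are our illustrative choices, not the project's exact patterns.

```python
import re

def clean_tweet(tweet: str) -> list:
    """Strip URLs, @annotations, RT markers, and '#' symbols, then tokenize."""
    tweet = re.sub(r"http\S+|www\.\S+", " ", tweet)  # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)              # remove @annotations
    tweet = re.sub(r"\bRT\b", " ", tweet)            # remove retweet markers
    tweet = tweet.replace("#", " ")                  # keep hashtag words, drop '#'
    return tweet.lower().split()

print(clean_tweet("RT @user Loving the new phone! #android http://t.co/xyz"))
# -> ['loving', 'the', 'new', 'phone!', 'android']
```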
Training Data
To precisely label the text into the respective classes and thus achieve the highest possible
accuracy, we plan to train the classifier using pre-labelled Twitter data itself. Pre-labelled
Twitter training data is not available freely, since this year Twitter changed its data privacy
policies and no longer allows open/free sharing of Twitter content. However, Twitter mentions
that using or downloading its content for individual research purposes is acceptable.
To test the proposed approach, we created a setup with the following system requirements. We
tested our approach on both Linux and Windows platforms. We used a Dell Optiplex 980
Windows 7 Core i5 (64-bit) machine equipped with 4 GB of RAM and a Linux server with a
quad-core processor equipped with 8 GB of RAM.
CHAPTER -5
SYSTEM IMPLEMENTATION
5.1 Environment:
The implementation environment was an HP ab522-TX laptop computer with an Intel Core
i5-4210U 1.70 GHz CPU, 8 GB of RAM, and an NVIDIA GeForce 940M GPU. The operating
system was Windows 10. The main software tool was Spyder, an interactive scientific Python
development environment. The object detection system and its related methods were
implemented as a combination of preexisting and self-programmed Python tools and libraries
within Spyder.
TensorFlow:
TensorFlow is a very important tool for neural networks. As the name suggests, this library is
focused on efficient work with tensors. It was originally developed for internal use at Google
for machine learning tasks, but was released as open source in 2015. TensorFlow computations
are expressed as stateful dataflow graphs, which enables efficient support for GPU-aided
computation. It is currently advertised as one of the fastest frameworks for deep learning needs.
Its disadvantage is that it is very low-level, and its direct use for implementing deep learning
models is not ideal.
Anaconda:
Anaconda is a freemium, open-source distribution of the Python and R programming languages
for large-scale data processing, predictive analytics, and scientific computing that aims to
simplify package management and deployment. Package versions are managed by the package
management system conda.
Spyder:
Spyder is the Scientific Python Development Environment: a powerful interactive development
environment for the Python language with advanced editing, interactive testing, debugging, and
introspection features, and a numerical computing environment thanks to the support of
IPython (an enhanced interactive Python interpreter) and popular Python libraries such as
NumPy (linear algebra), SciPy (signal and image processing), and matplotlib (interactive
2D/3D plotting).
We use an NVIDIA GTX 960 in our setup to benefit from faster training times in deep learning
frameworks through CUDA support for CNNs. We also chose this particular model due to
budget restrictions and CUDA compatibility; it meets the minimum requirements for basic
CNNs. For example, we find that its 4 GB of memory is almost completely used while training
the dataset model with a batch size of 64, which is another important reason for the model
choice: although there are better-performing models such as GoogleNet and VGG, adding more
layers would require even smaller batch sizes and would hit hardware limitations more quickly.
A 256 GB SSD is used for quick access to applications and for small data such as documents.
A 1 TB HDD is used for larger data such as stored, trained network parameters, which for
TensorFlow are around a few hundred MB each, and datasets, which quickly add up due to
copies and modifications. The CPU used is an Intel i5-1607 v3 with four cores at 1.7 GHz. The
system hardware architecture is shown in figure X.1, including the CPU, UART, tri-state
bridge, RAM, and I/O controls, which are all reusable. Such a design method not only makes
the system modular, but also greatly reduces the design cycle of the system.
5.4 Overview of Platform
The most popular development platforms for object recognition are Python and MATLAB,
together with the datasets they draw on. Here we look at each of them:
5.4.1 MATLAB
5.4.2 PYTHON
5.4.3 DATASETS
A data set (or dataset, although this spelling is not present in many contemporary
dictionaries like Merriam-Webster) is a collection of data. Most commonly a data set
corresponds to the contents of a single database table, or a single statistical data matrix, where
every column of the table represents a particular variable, and each row corresponds to a given
member of the data set in question. The data set lists values for each of the variables, such as
height and weight of an object, for each member of the data set. Each value is known as a datum.
The data set may comprise data for one or more members, corresponding to the number of rows.
The term data set may also be used more loosely, to refer to the data in a collection of closely
related tables, corresponding to a particular experiment or event. An example of this type is the
data sets collected by space agencies performing experiments with instruments aboard space
probes. Data sets that are so large that traditional data processing applications are inadequate to
deal with them are known as big data.[1]
In the open data discipline, the dataset is the unit used to measure the information released in a
public open data repository. The European Open Data portal aggregates more than half a
million datasets.
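In pandas terms, a data set in this sense is simply a DataFrame; the values below are made up purely to illustrate the rows-as-members, columns-as-variables layout.

```python
import pandas as pd

# Each column is a variable, each row a member of the data set.
data = pd.DataFrame({
    "height_cm": [172, 181, 165],
    "weight_kg": [68, 85, 59],
})
print(data.shape)         # (3, 2): three members, two variables
print(data["height_cm"])  # one variable across all members
```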
In this section we describe the algorithm and sample method used to implement the object
detection.
5.5.2 SCREENSHOT
For our work, we investigated the performance of CPD as a feature selection method together
with other popular feature selection methods: IG and χ2. Two datasets were used in this
research, and SentiWordNet was used to score the terms in each document. The SVM and
Naïve Bayes classifiers in the Weka data mining application were used to classify the datasets
into positive and negative sentiments. Experimental results show that CPD performs well as a
feature selection method for sentiment analysis and classification tasks, yielding the best
accuracy results in three out of four experiments.
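CPD itself is this work's contribution and is not reproduced here, but the χ2 baseline is available off the shelf in scikit-learn (a recent version is assumed); a toy run on four invented reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["great phone, love it", "terrible battery, hate it",
        "love the screen", "hate the lag"]  # invented toy reviews
labels = [1, 0, 1, 0]                       # 1 = positive, 0 = negative

vectorizer = CountVectorizer().fit(docs)
counts = vectorizer.transform(docs)
selector = SelectKBest(chi2, k=2).fit(counts, labels)
kept = [term for term, keep in
        zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
print(kept)  # the two terms most associated with the labels, e.g. love/hate
```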
We noticed, however, that the accuracy became constant after 50% of the features were
eliminated, due to the fact that most terms had similar CPD scores. As future work in this
research, we hope to study CPD in more detail by doing further work with the convote dataset
to eliminate any reviews which are not related, and possibly to group reviews based on the
topic being debated. This will involve using both supervised and unsupervised approaches, as it
has been noticed that the combination of these approaches yields better results. We would also
like to study the performance of CPD on other datasets described in the sentiment analysis
literature.
We also hope to use SentiWordNet with other scoring measures to arrive at better scores for
terms, which will make up for the inaccurate scores sometimes generated by SentiWordNet. In
the future, we hope to use F-measure values as cutoff values during feature selection and also
to improve the time taken by CPD to generate scores for terms. This will greatly enhance the
classification step and also improve accuracy. Finally, we would like to investigate the use of
unigrams and bigrams in this research to see whether accuracy can be improved.