Sentiment Analysis Using Feature Selection and Machine Learning Algorithms
Presented By
Shruti Pant
Under Guidance Of:
Supervised Learning
Classification
Regression
Unsupervised Learning
Reinforcement Learning
What people think?
What others think has always been an important piece of information
Post Web
I don't know who..but apparently it's a good phone. It has good battery life and
Blogs (google blogs, livejournal)
E-commerce sites (amazon, ebay)
Review sites (CNET, PC Magazine)
Discussion forums (forums.craigslist.org,
forums.macrumors.com)
Friends and Relatives (occasionally)
Basics Of Sentiments
Holder (source) of attitude
Target (aspect) of attitude
Type of attitude
From a set of types
Like, love, hate, value, desire, etc.
Or (more commonly) simple weighted polarity: positive or negative
Text containing the attitude
Sentence or entire document
Sentiment
A thought, view, or attitude, especially one based
mainly on emotion instead of reason
Sentiment Analysis
aka opinion mining
use of natural language processing (NLP)
and computational techniques to automate
the extraction or classification of
sentiment from typically unstructured text
Identify the orientation of opinion in a piece of text
Quiz:
This is a beautiful bracelet..
Is this sentence subjective/objective?
Is it positive or negative ?
B. Document (post/review) Level Classification
Assumption:
each document focuses on a single object
[2] extended the work using PMI and Latent Semantic Analysis
(LSA) and achieved an accuracy of 82.2%.
Feature Selection
Techniques: Lexicon Based or Machine Based
Design Flow
Dataset (Phase 1)
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002.
Thumbs up? Sentiment Classification using Machine Learning
Techniques. EMNLP-2002, 79-86.
Bo Pang and Lillian Lee. 2004. A Sentimental Education:
Sentiment Analysis Using Subjectivity Summarization Based
on Minimum Cuts. ACL, 271-278.
Polarity detection:
Is an IMDB movie review positive or
negative?
Data: Polarity Data 2.0:
http://www.cs.cornell.edu/people/pabo/movie-review-data
IMDB data in the Pang and Lee database
Review excerpt 1:
when _star wars_ came out some twenty years ago , the image of
traveling throughout the stars has become a commonplace image .
[...]
when han solo goes light speed , the stars change to bright lines ,
going towards the viewer in lines that converge at an invisible point .
cool .
_october sky_ offers a much simpler image - that of a single
white dot , traveling horizontally across the night sky . [...]

Review excerpt 2:
snake eyes is the most aggravating kind of movie :
the kind that shows so much potential then becomes
unbelievably disappointing .
it's not just because this is a brian depalma film , and since
he's a great director and one who's films are always greeted
with at least some fanfare .
and it's not even because this was a film starring nicolas
cage and since he gives a brauvara performance , this
film is hardly worth his talents
Pre-Processing (Phase 2)
Pre-processing removes the words that impede sentiment
analysis by increasing the number of false positives or
false negatives.
In our model, stop words are removed using TF-IDF. Term
Frequency-Inverse Document Frequency (TF-IDF) is used to
separate the important words in a document from the not so
important ones. NLTK also comes with a built-in list of 128
stop words, which our model uses as well to select the
irrelevant words. We do this by importing stopwords from the
NLTK corpus.
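The stop word step above can be sketched as follows. This is a minimal illustration, not the presented implementation: `BASE_STOP_WORDS` is a tiny hypothetical stand-in for NLTK's list, and treating words with an inverse document frequency at or below a threshold as stop words is an assumption about how TF-IDF is applied here.

```python
import math
from collections import Counter

# Hypothetical stand-in for NLTK's English stop word list (tiny subset).
BASE_STOP_WORDS = {"the", "a", "is", "and", "of", "it", "this"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip punctuation from token edges."""
    tokens = (tok.strip(".,!?:;\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def inverse_document_frequency(docs):
    """Return {word: idf} with idf = log(N / df) over tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def remove_stop_words(docs, idf_threshold=0.0):
    """Drop words from the base list or with IDF at or below the threshold."""
    idf = inverse_document_frequency(docs)
    low_info = {word for word, score in idf.items() if score <= idf_threshold}
    stop_words = BASE_STOP_WORDS | low_info
    return [[word for word in doc if word not in stop_words] for doc in docs]


# Made-up mini corpus for illustration only.
reviews = [
    "This movie is a great movie",
    "This movie is unbelievably disappointing",
    "The film is hardly worth it",
]
tokenized = [tokenize(review) for review in reviews]
cleaned = remove_stop_words(tokenized)
print(cleaned)
```

A word that appears in every document has an IDF of zero, so it carries no information for telling documents apart and is dropped along with the listed stop words.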
Stemming algorithms attempt to automatically remove
suffixes (and in some cases prefixes) in order to find the root
word, or stem, of a given word. NLTK provides several
stemmer interfaces. In our proposed method we use the
Porter stemmer to find the root words.
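To show the idea of suffix stripping, here is a deliberately simplified sketch. It is not the Porter algorithm, which applies ordered rule phases with conditions on the remaining stem. The suffix list and the `toy_stem` helper are hypothetical and chosen only for illustration.

```python
# Illustrative suffixes only, checked longest first.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["traveling", "greeted", "films", "stars", "disappointing", "cool"]
stems = [toy_stem(word) for word in words]
print(stems)
```

With NLTK installed, the real stemmer is available as `PorterStemmer().stem(word)` from `nltk.stem`, and it handles many cases this toy version gets wrong.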
Feature Selection (Phase 3)
Feature selection is used to increase the
effectiveness of the model. The important
features are selected and fed to the classifier.
In our proposed methodology we used chi-square
as a scoring function, which tests whether two
terms are associated with each other
(collocation: words that are more likely to
occur together).
It helps us understand whether a word is
informative. If a word occurs mainly in
positive reviews and rarely in negative reviews, it
can mean that the word is important. So we measure
how common a word is in a particular class
compared to the other classes.
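The word-versus-class scoring described above can be sketched with the standard chi-square statistic for a 2x2 contingency table. The function names and the tiny document sets below are hypothetical. The presented system may compute the score differently, for example through a library routine.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 word/class contingency table.

    n11: positive documents containing the word
    n10: negative documents containing the word
    n01: positive documents without the word
    n00: negative documents without the word
    """
    total = n11 + n10 + n01 + n00
    numerator = total * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0


def score_words(pos_docs, neg_docs):
    """Score every vocabulary word by its association with the class label."""
    vocab = set().union(*pos_docs, *neg_docs)
    scores = {}
    for word in vocab:
        n11 = sum(word in doc for doc in pos_docs)
        n10 = sum(word in doc for doc in neg_docs)
        scores[word] = chi_square(
            n11, n10, len(pos_docs) - n11, len(neg_docs) - n10
        )
    return scores


# Made-up documents, each represented as a set of tokens.
pos_docs = [{"great", "film"}, {"great", "cast"}, {"great", "film", "fun"}]
neg_docs = [{"dull", "film"}, {"dull", "plot"}, {"boring", "film"}]
scores = score_words(pos_docs, neg_docs)
top_words = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_words)
```

In this toy data "great" appears only in positive documents and scores highest, while "film" is equally common in both classes and scores zero, so it would not be selected as a feature.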
Classification (Phase 4)
In machine learning, naive Bayes
classifiers are a family of simple,
baseline probabilistic classifiers based
on Bayes' theorem with strong (naive)
independence assumptions.
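A minimal sketch of one member of that family, a multinomial naive Bayes classifier with add-one (Laplace) smoothing, is shown below. The class name, the smoothing choice, and the training data are assumptions made for illustration, not details taken from the presented system.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over tokenized documents (illustrative sketch)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            # Add-one smoothing: every vocabulary word gets one extra count.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc:
                if word in self.vocab:  # words unseen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data for illustration only.
train_docs = [["great", "film"], ["great", "fun"], ["dull", "film"], ["boring", "plot"]]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["great", "fun", "film"]))
print(model.predict(["boring", "plot"]))
```

The independence assumption shows up in `predict`: the log-probabilities of the words are simply summed, as if each word occurred independently of the others given the class.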
Experimental Results
Accuracy
[Figure: bar chart. Bar values: 93, 84.75, 84, 81.6, 75.25, 76.5]
Precision
[Figure: bar chart]
Recall
[Figure: bar chart. Bar values: 94, 88, 81.6, 81, 59.5, 56.5]
F-Measure
[Figure: bar chart. Bar values: 93.49, 83.96, 83.5, 83.23, 69.53, 71.68]
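For reference, the four metrics reported in this section can be computed from predicted and true labels as in the sketch below. The `evaluate` helper and the label lists are hypothetical and are not the experimental data.

```python
def evaluate(y_true, y_pred, positive="pos"):
    """Return accuracy, precision, recall and F-measure for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up labels for illustration only.
y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```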
Thank You