Business Analytics Using Python Sentiment Analytics: Cyrus Lentin

The document discusses sentiment analysis and classification using Python. It defines sentiment as feelings, attitudes, emotions and opinions that are subjective and differ between individuals. Sentiment analysis uses machine learning to characterize the sentiment in text as positive, negative or neutral. Some challenges in sentiment analysis are ambiguity in language and context. The document outlines common sentiment analysis tasks and classification methods like baseline, Bayes and support vector machines. It provides examples of how Bayes classification calculates sentiment probabilities using the Bayes theorem.

Uploaded by

Chirag Tharwani

Business Analytics Using Python

Sentiment Analytics

Cyrus Lentin
Business Analytics With R – Cyrus Lentin 1




What is Sentiment?
▪ Sentiment = Feelings

▪ Attitudes
▪ Emotions
▪ Opinions

▪ Subjective
▪ Not Rational
▪ Will Differ From Person To Person
▪ Not Facts

▪ Sentiment Analysis Is A Set Of Machine Learning Methods To Extract, Identify, Or Otherwise Characterize The
Sentiment Content Of A Text Unit
▪ Sometimes Also Referred To As Opinion Mining, Which Is The Computational Study Of Opinions
(Sentiments, Emotions) Expressed In Text



What is Sentiment?
▪ In Other Words, Determine Whether A Sentence Or A Document Expresses Positive, Negative, Or Neutral
Sentiment Towards Some Object

▪ The movie was fabulous!
[ Sentiment ]
▪ The movie stars Mr. X
[ Fact ]
▪ The movie was horrible!
[ Sentiment ]

Why Opinion Mining Now? Because The Web Contains Huge Volumes Of Opinionated Text



Applications
▪ Product Acceptance
▪ Brand Perception
▪ Reputation Management
▪ Customer Satisfaction
▪ Flame Detection (Bad Rants)
▪ Influencers
▪ Child-suitability Identification
▪ News Classification
▪ (In)appropriate Content Identification



Challenges
▪ How does a machine analyze polarity (negative/positive)?
▪ How does a machine define subjectivity & sentiment?
▪ How does a machine deal with subjective word senses?
▪ How does a machine assign an opinion rating?
▪ How does a machine know about sentiment intensity?



Language Is Ambiguous – 1
▪ The Watch Isn't Water Resistant
[ In A Product Review This Could Be Negative ]
▪ Hit The Nail On The Head
[ Idiomatic Phrases Have A Different Meaning ]
▪ Low Price / Low Quality
[ Sentiment Changes With Accompanying Word ]
▪ The Canon Camera Is Better Than The Fisher Price One
[ Comparisons Are Hard To Classify ]
▪ The Ice Cream Is Luuuvvvvveeeely
[ Slang ]
▪ IMHO …. / LOL / FO
[ Abbreviations ]
▪ That Won’t Do You No Good
[ Double Negative ]
▪ Not Good / Not Bad
[ Flipped Sentiment ]
▪ Got Up And Walked Out
[ No Sentiment From Word ]
Language Is Ambiguous – 2
▪ I Do Not Dislike Cabin Cruisers
[ Negation Handling ]
▪ Disliking Watercraft Is Not Really My Thing
[ Negation, Inverted Word Order ]
▪ Sometimes I Really Hate Ribs
[ Adverbial Modifies The Sentiment ]
▪ I'd Really Truly Love Going Out In This Weather!
[ Possibly Sarcastic ]
▪ Chris Craft Is Better Looking Than Limestone
[ Two Brand Names, Identifying Target Of Attitude Is Difficult ]
▪ Chris Craft Is Better Looking Than Limestone, But Limestone Projects Seaworthiness And Reliability
[ Two Attitudes, Two Brand Names ]
▪ The Movie Is Surprising With Plenty Of Unsettling Plot Twists
[ Negative Term Used In A Positive Sense In Certain Domains ]
▪ I Love My Mobile But Would Not Recommend It To Any Of My Colleagues
[ Qualified Positive Sentiment, Difficult To Categorize ]
▪ Next Week's Gig Will Be Right Koide9!
[ Newly Minted Terms Can Be Highly Attitudinal But Volatile In Polarity And Often Out Of Known
Vocabulary ]



Sentiment Analytics Tasks
▪ Document Pre-processing
• Tokenization
▪ Document Cleaning
• Special Texts
• Punctuation
• Digits
• Stemming
▪ Classification
• Polarity
• Emotions
• Topic
▪ Visualization
• Polarity Frequency Distribution
• Emotions Frequency Distribution
• Word Cloud
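
The pre-processing and cleaning tasks listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the function name `clean_and_tokenize`, the tiny `STOP_WORDS` set, and the sample sentence are assumptions made for the example, and stemming is left out because it normally comes from a dedicated library.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "and", "to", "of"}

def clean_and_tokenize(text):
    """Lowercase the text, strip special texts, punctuation and digits, then tokenize."""
    text = text.lower()                                               # consistent case
    text = re.sub(r"https?://\S+|@\w+|#\w+", " ", text)               # special texts (URLs, mentions, hashtags)
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\d+", " ", text)                                  # digits
    return [tok for tok in text.split() if tok not in STOP_WORDS]     # whitespace tokenization

tokens = clean_and_tokenize("The movie was FABULOUS!!! 10/10 http://example.com @friend")
print(tokens)            # ['movie', 'fabulous']
print(Counter(tokens))   # word frequencies, the raw input for a frequency plot or word cloud
```

The order of the steps matters in this sketch: URLs and mentions are removed before punctuation, otherwise their fragments would survive as ordinary tokens.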



Classification Methods
▪ Baseline Classification
▪ Bayes Classification
▪ Entropy Classification
▪ N-gram Classification
▪ Support Vector Machine



Baseline Classification
▪ Baseline Approach Is To Use A Dictionary (List) Of Positive And Negative Keywords
▪ You May Either Use A Publicly Available List Of Keywords Or Create Your Own
▪ For Each Record / Line / Unit Of Words, We Count The Number Of Negative Keywords And Positive
Keywords That Appear
▪ The Classifier Returns The Polarity With The Higher Count
▪ If There Is A Tie, Then Neutral Polarity (The Majority Class) Is Returned
Enhancements
▪ Rather Than Just Checking Count Of Words, We Assign Weights To Each Word Based On Past Training
▪ The Classifier Returns The Polarity With The Higher Weighted Score
▪ If There Is A Tie, Then Majority (As Per The Count) Polarity Is Returned



Baseline Classification – Simple
▪ Pos Dict: good, better, best, wonderful
▪ Neg Dict: bad, worse, worst, horrible

▪ Today is a good day


Pos Dict Score: 1
Neg Dict Score: 0
Positive
▪ Today is a bad day
Pos Dict Score: 0
Neg Dict Score: 1
Negative
▪ Today is a Monday
Pos Dict Score: 0
Neg Dict Score: 0
Neutral
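
A minimal sketch of the simple baseline classifier, using the two dictionaries from this slide. The function name `baseline_polarity` and the whitespace tokenization are assumptions made for the illustration.

```python
POS_DICT = {"good", "better", "best", "wonderful"}
NEG_DICT = {"bad", "worse", "worst", "horrible"}

def baseline_polarity(text):
    """Count positive and negative keywords and return the polarity with the higher count."""
    words = text.lower().split()
    pos_score = sum(word in POS_DICT for word in words)
    neg_score = sum(word in NEG_DICT for word in words)
    if pos_score > neg_score:
        return "Positive"
    if neg_score > pos_score:
        return "Negative"
    return "Neutral"  # tie, including the case where no keyword appears

for post in ("Today is a good day", "Today is a bad day", "Today is a Monday"):
    print(post, "->", baseline_polarity(post))
```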



Baseline Classification – Weighted
▪ Pos Dict: good(1), better (3), best (5), wonderful (5)
▪ Neg Dict: bad(1), worse(3), worst(5), horrible (5)

▪ Today was a wonderful day


Pos Dict Score: 5
Neg Dict Score: 0
Positive
▪ Today is a horrible day
Pos Dict Score: 0
Neg Dict Score: 5
Negative
▪ The movie was bad but the acting was wonderful
Pos Dict Score: 5
Neg Dict Score: 1
Positive
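
The weighted variant can be sketched in the same way. The weights are the ones shown on this slide; the function name `weighted_polarity`, the fallback to raw keyword counts on a tie, and the final "Neutral" result when the counts also tie are assumptions made for the illustration.

```python
POS_WEIGHTS = {"good": 1, "better": 3, "best": 5, "wonderful": 5}
NEG_WEIGHTS = {"bad": 1, "worse": 3, "worst": 5, "horrible": 5}

def weighted_polarity(text):
    """Sum keyword weights per polarity and return the polarity with the higher score."""
    words = text.lower().split()
    pos_score = sum(POS_WEIGHTS.get(word, 0) for word in words)
    neg_score = sum(NEG_WEIGHTS.get(word, 0) for word in words)
    if pos_score != neg_score:
        return "Positive" if pos_score > neg_score else "Negative"
    # Tie on weighted score: fall back to the plain keyword counts.
    pos_count = sum(word in POS_WEIGHTS for word in words)
    neg_count = sum(word in NEG_WEIGHTS for word in words)
    if pos_count != neg_count:
        return "Positive" if pos_count > neg_count else "Negative"
    return "Neutral"  # assumption: nothing left to break the tie

print(weighted_polarity("The movie was bad but the acting was wonderful"))  # Positive (5 vs 1)
```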



Bayes Classification
▪ Statistical method for classification
▪ Supervised Learning Method
▪ Assumes an underlying probabilistic model, the Bayes theorem
▪ Can solve problems involving both categorical and continuous valued attributes
▪ Named after Thomas Bayes, who proposed the Bayes Theorem
▪ Bayes approach is to use a dictionary (list) of positive and negative keywords
▪ You may either use a publicly available list of keywords or create your own
▪ For each record / line / unit of words, we compute the probability of each word appearing in the
negative keyword list and in the positive keyword list
▪ The classifier returns the polarity with the higher probability



Bayes Theorem
Given a hypothesis h and data D, the following are the probabilities:
▪ P(h): prior probability
the probability of hypothesis h before seeing the data
▪ P(D): evidence (marginal probability)
the probability of data D regardless of the hypothesis
▪ P(D|h): likelihood
the probability of the data D given hypothesis h
▪ P(h|D): posterior probability
the probability that hypothesis h is true, calculated in the light of the relevant data D
The following is the formula to calculate the posterior probability:
P(h|D) = P(D|h) * P(h) / P(D)
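
As a small numeric illustration of Bayes theorem, the sketch below computes a posterior with exact fractions. The function name `posterior` and all of the numbers are illustrative assumptions and do not come from the slides; the evidence P(D) is obtained by summing over the hypothesis and its complement.

```python
from fractions import Fraction

def posterior(prior, likelihood, evidence):
    """Bayes theorem: P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood * prior / evidence

# Illustrative numbers only (hypothetical, not taken from the slides).
p_h = Fraction(3, 4)              # prior P(h)
p_d_given_h = Fraction(1, 10)     # likelihood P(D|h)
p_d_given_not_h = Fraction(1, 2)  # likelihood under the alternative, P(D|not h)

# Evidence by the law of total probability: P(D) = P(D|h)P(h) + P(D|not h)P(not h)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_h, p_d_given_h, p_d)
print(p_d, p_h_given_d)  # 1/5 3/8
```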



Bayes Classification – How It Works – A Language Model
▪ In Order To Understand The Process, We Will Use An Example With Small Number Of Posts
Type      Post Text                   Class
Training  good happy good             Positive
Training  good good service           Positive
Training  good friendly               Positive
Training  lousy good cheat            Negative
Test      good good good cheat lousy  ???

What Was The Problem / Question?


▪ We Are Trying To Determine Whether The Class For Last Post Is Positive Or Negative.
▪ So In Effect, We Want To Compute:
▪ P(Pos|good good good cheat lousy)
▪ By Bayes Theorem, This Is Equal To:
P(Pos) * P(good good good cheat lousy|Pos) / P(good good good cheat lousy)



Bayes Classification – How It Works – A Language Model
▪ By Bayes Theorem, This Is Equal To:
P(Pos) * P(good good good cheat lousy|Pos) / P(good good good cheat lousy)
▪ Assuming The Words Are Independent Given The Class, This Becomes:
P(Pos) * P(good|Pos)^3 * P(cheat|Pos) * P(lousy|Pos) / P(good good good cheat lousy)
▪ P(Pos) = 3/4
▪ P(Neg) = 1/4
▪ The Positive Posts Contain 8 Words In Total, The Negative Post Contains 3
▪ P(good|Pos) = (5)/(8) = 5/8
▪ P(cheat|Pos) = (0)/(8) = 0
▪ P(lousy|Pos) = (0)/(8) = 0
▪ P(good|Neg) = (1)/(3) = 1/3
▪ P(cheat|Neg) = (1)/(3) = 1/3
▪ P(lousy|Neg) = (1)/(3) = 1/3

▪ However, this breaks as soon as we encounter a word that isn't in our training set
▪ For example, if “goood” is not in our training set but occurs in our test set, then
P(goood|Pos) = 0 and P(goood|Neg) = 0, so our product is zero for all classes
▪ We need nonzero probabilities for all words, even words that don't appear in the training set
▪ Introducing +1 Smoothing
▪ Just count every word one time more than it actually occurs
▪ Since we are only concerned with relative probabilities, this slight inaccuracy should not be a problem
P(word|C) = (count(word|C) + 1) / (count(C) + V)
▪ Where count(C) is the total number of words in class C and V is the vocabulary size (the number of
distinct words), so that our probabilities sum to 1



Bayes Classification – How It Works – A Language Model
▪ P(Pos) = 3/4
▪ P(Neg) = 1/4
▪ count(Pos) = 8, count(Neg) = 3, V = 6 (good, happy, service, friendly, lousy, cheat)

▪ P(good|Pos) = (5+1)/(8+6) = 3/7
▪ P(cheat|Pos) = (0+1)/(8+6) = 1/14
▪ P(lousy|Pos) = (0+1)/(8+6) = 1/14
▪ P(good|Neg) = (1+1)/(3+6) = 2/9
▪ P(cheat|Neg) = (1+1)/(3+6) = 2/9
▪ P(lousy|Neg) = (1+1)/(3+6) = 2/9

▪ For the test post D5 = good good good cheat lousy, the common denominator P(D5) can be dropped
▪ P(Pos|D5) ∝ P(Pos) * P(good|Pos)^3 * P(cheat|Pos) * P(lousy|Pos)
▪ P(Pos|D5) ∝ 3/4 * (3/7)^3 * (1/14) * (1/14) ≈ 0.0003
▪ P(Neg|D5) ∝ P(Neg) * P(good|Neg)^3 * P(cheat|Neg) * P(lousy|Neg)
▪ P(Neg|D5) ∝ 1/4 * (2/9)^3 * (2/9) * (2/9) ≈ 0.0001
▪ Prediction: Positive
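
The worked example above can be reproduced with a short sketch. It uses the four training posts and the test post from the slides; the function names (`train`, `word_prob`, `score`) and the use of exact fractions are assumptions made for the illustration, not the author's implementation.

```python
from collections import Counter
from fractions import Fraction

TRAINING = [
    ("good happy good", "Positive"),
    ("good good service", "Positive"),
    ("good friendly", "Positive"),
    ("lousy good cheat", "Negative"),
]

def train(posts):
    """Count posts per class, words per class, and collect the vocabulary."""
    class_docs = Counter()
    word_counts = {}
    for text, label in posts:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_docs, word_counts, vocab

def word_prob(word, label, word_counts, vocab):
    """+1 smoothing: (count(word|C) + 1) / (count(C) + V)."""
    counts = word_counts[label]
    return Fraction(counts[word] + 1, sum(counts.values()) + len(vocab))

def score(text, label, class_docs, word_counts, vocab):
    """Prior times the product of smoothed word probabilities (denominator P(D) dropped)."""
    result = Fraction(class_docs[label], sum(class_docs.values()))
    for word in text.split():
        result *= word_prob(word, label, word_counts, vocab)
    return result

class_docs, word_counts, vocab = train(TRAINING)
test_post = "good good good cheat lousy"
scores = {label: score(test_post, label, class_docs, word_counts, vocab) for label in class_docs}
prediction = max(scores, key=scores.get)

print({label: round(float(value), 4) for label, value in scores.items()})
print("Prediction:", prediction)
```

Because every unseen word receives a count of one, an out-of-vocabulary token such as "goood" lowers both scores without forcing either of them to zero.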



Issues – Tokenization
▪ Use Only Whitespace To Tokenize?
“Food”, “Food.”, “Food,” And “Food!” Are All Different Tokens.
▪ Use Whitespace And Punctuation To Tokenize?
“Won't” Tokenized To “Won” And “T”
▪ What About Emails? Urls? Phone Numbers?
▪ What About The Things We Haven't Thought About Yet?
▪ Don’t Re-Invent The Wheel; Use A Library!
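
The two pitfalls above are easy to reproduce. The sketch below only illustrates the problem with standard-library tools; the sample sentence and the final regular expression are assumptions made for the example, and in practice a library tokenizer is the better choice, as the slide recommends.

```python
import re

text = "Food, food. FOOD! I won't pay."

# Whitespace only: punctuation stays glued to the words,
# so "Food,", "food." and "FOOD!" are three different tokens.
whitespace_tokens = text.split()

# Letters only: punctuation is gone, but the contraction "won't"
# is split into "won" and "t".
letter_tokens = re.findall(r"[A-Za-z]+", text)

# Hypothetical compromise: lowercase first and allow one internal apostrophe.
word_tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

print(whitespace_tokens)
print(letter_tokens)
print(word_tokens)
```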

Tokenization Strategies
▪ Stop Words
▪ Sparse Words
▪ Profanity
▪ Remove Punctuation
▪ Consistent Case
▪ Stemming



Issues – Arithmetic
▪ What Happens When You Multiply A Large Number Of Small Numbers?
You Get A Very Small Number
▪ To Prevent Underflow, Use Sums Of Logs Instead Of Products Of True Probabilities.
▪ Key Properties Of Log:
log(ab) = log(a) + log(b)
x > y > 0 => log(x) > log(y)
▪ Turns Very Small Numbers Into Manageable Negative Numbers
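
A short sketch of the underflow problem and the log fix. The list of one hundred probabilities of 1e-5 is an illustrative assumption; the last lines reuse the smoothed probabilities from the worked Bayes example to show that comparing log scores gives the same ranking as comparing products.

```python
import math

# Illustrative input: one hundred small word probabilities.
probabilities = [1e-5] * 100

product = 1.0
for p in probabilities:
    product *= p                      # the true value 1e-500 is below the float range

log_sum = sum(math.log(p) for p in probabilities)

print(product)   # 0.0 (underflow)
print(log_sum)   # about -1151.29, a manageable negative number

# Log scores for the worked example: the ranking of the classes is unchanged.
pos_log = math.log(3 / 4) + 3 * math.log(3 / 7) + 2 * math.log(1 / 14)
neg_log = math.log(1 / 4) + 5 * math.log(2 / 9)
print(pos_log > neg_log)  # True, so the prediction is still Positive
```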



Training The Classifier
▪ Build The Vocabulary As The List Of All Distinct Words That Appear In All The Documents Of The
Training Set.
▪ Remove Stop Words And Markings (Punctuations)
▪ Decide & Deal With Sparse Words & Profanities
▪ The Words In The Vocabulary Become The Attributes, Assuming That Classification Is Independent Of
The Positions Of The Words
▪ Each Document In The Training Set Becomes A Record With Frequencies For Each Word In The
Vocabulary
▪ Train The Classifier On The Training Data Set By Computing The Prior Probability Of Each Class And
The Conditional Probability Of Each Attribute (Word) Given The Class
▪ Evaluate The Results On Test Data
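
The vocabulary and record-building steps above can be sketched as follows. The four documents are the training posts from the Bayes example; the variable names and the sorted column order are assumptions made for the illustration, and stop-word removal is skipped because these toy posts contain none.

```python
from collections import Counter

documents = [
    "good happy good",
    "good good service",
    "good friendly",
    "lousy good cheat",
]

# Vocabulary: all distinct words across the training documents (sorted for a stable column order).
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Each document becomes a record of word frequencies, one column per vocabulary word.
# Word positions are ignored, matching the independence assumption.
records = []
for doc in documents:
    counts = Counter(doc.split())
    records.append([counts[word] for word in vocabulary])

print(vocabulary)
for record in records:
    print(record)
```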



Wind Up
▪ Sentiment analysis is a difficult task
▪ The difficulty increases with the nuance and complexity of opinions expressed
▪ Product reviews etc. are relatively easy
▪ Books, movies, art, music are more difficult
▪ Policy discussions, indirect expressions of opinion more difficult still
▪ Non-binary sentiment (political leanings etc.) is extremely difficult
▪ Patterns of alliance and opposition between individuals become central



Thank you!
Contact:
Cyrus Lentin
[email protected]
+91-98200-94236
