Business Analytics Using Python Sentiment Analytics: Cyrus Lentin
Sentiment Analytics
Cyrus Lentin
What is Sentiment?
▪ Sentiment = Feelings
▪ Attitudes
▪ Emotions
▪ Opinions
▪ Subjective
▪ Not Rational
▪ Will Differ From Person To Person
▪ Not Facts
▪ Sentiment Analysis Is A Set Of Machine Learning Methods To Extract, Identify, Or Otherwise Characterize The
Sentiment Content Of A Text Unit
▪ Sometimes Also Referred To As Opinion Mining, The Computational Study Of Opinions
(Sentiments, Emotions) Expressed In Text
Why Opinion Mining Now? Because The Web Contains Huge Volumes Of Opinionated Text
▪ However, a product of per-word probabilities breaks as soon as we encounter a word that isn't in our training set
▪ For example, if “goood” is not in our training set but occurs in our test set, then P(goood|C) = 0 for every class C
(e.g. P(goood|Pos) = 0), so the product is zero for all classes
Bayes Classification – How It Works – A Language Model
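▪ By Bayes' Theorem, P(Pos | good good good lousy cheat) Is Equal To:
$$\frac{P(\text{Pos}) \cdot P(\text{good good good lousy cheat} \mid \text{Pos})}{P(\text{good good good lousy cheat})} = \frac{P(\text{Pos}) \cdot P(\text{good} \mid \text{Pos})^{3} \cdot P(\text{lousy} \mid \text{Pos}) \cdot P(\text{cheat} \mid \text{Pos})}{P(\text{good good good lousy cheat})}$$
▪ The second form assumes the words are conditionally independent given the class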
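As a minimal sketch of the factorised product above, the snippet below scores the example review under both classes. The priors and per-word probabilities are made-up illustrative values, not estimates from any dataset, and the names `priors`, `likelihoods`, and `score` are hypothetical.

```python
from collections import Counter
from math import prod

# Hypothetical priors and per-word likelihoods, chosen only for illustration.
priors = {"Pos": 0.5, "Neg": 0.5}
likelihoods = {
    "Pos": {"good": 0.4, "cheat": 0.05, "lousy": 0.05},
    "Neg": {"good": 0.1, "cheat": 0.3, "lousy": 0.3},
}

def score(words, label):
    """Return P(label) * product of P(word|label)^count (the Bayes numerator)."""
    counts = Counter(words)
    return priors[label] * prod(
        likelihoods[label][word] ** n for word, n in counts.items()
    )

doc = "good good good lousy cheat".split()
scores = {label: score(doc, label) for label in priors}

# The denominator P(doc) is identical for every class, so normalising the
# numerators over the two classes gives the posterior probabilities.
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
predicted = max(posterior, key=posterior.get)
print(scores, posterior, predicted)
```

Because the denominator is shared by all classes, comparing the numerators alone is enough to pick the predicted class.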
▪ We need nonzero probabilities for all words, even words that don't appear in the training set
▪ Introducing +1 Smoothing
▪ Just count every word one time more than it actually occurs
▪ Since we are only concerned with relative probabilities, this slight inaccuracy should not be a problem
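$$P(\text{word} \mid C) = \frac{\text{count}(\text{word} \mid C) + 1}{\text{count}(C) + V}$$
▪ Where count(C) is the total number of words in class C and V is the vocabulary size (the number of distinct words), so that our probabilities sum to 1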
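The +1 smoothing formula can be sketched as follows. The four training sentences are invented for illustration, and `p_unsmoothed` and `p_smoothed` are hypothetical helper names. The sketch takes V to be the number of distinct words seen in training across both classes, which is one common reading of "vocabulary".

```python
from collections import Counter

# Tiny hypothetical training set: (text, label) pairs made up for illustration.
train = [
    ("good good service", "Pos"),
    ("good value", "Pos"),
    ("lousy cheat", "Neg"),
    ("lousy service", "Neg"),
]

word_counts = {"Pos": Counter(), "Neg": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

# V: number of distinct words seen in training, across all classes.
vocabulary = set().union(*word_counts.values())
V = len(vocabulary)

def p_unsmoothed(word, label):
    """count(word|C) / count(C): zero for any word unseen in class C."""
    counts = word_counts[label]
    return counts[word] / sum(counts.values())

def p_smoothed(word, label):
    """(count(word|C) + 1) / (count(C) + V): never zero."""
    counts = word_counts[label]
    return (counts[word] + 1) / (sum(counts.values()) + V)

print(p_unsmoothed("goood", "Pos"), p_smoothed("goood", "Pos"))
print(sum(p_smoothed(w, "Pos") for w in vocabulary))
```

The unseen word “goood” gets probability zero without smoothing and a small positive probability with it, while the smoothed probabilities over the training vocabulary still sum to 1.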
Tokenization Strategies
▪ Stop Words
▪ Sparse Words
▪ Profanity
▪ Remove Punctuation
▪ Consistent Case
▪ Stemming
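The strategies listed above can be combined into a small preprocessing pipeline. This is only a sketch using the standard library: the stop-word and profanity sets are tiny placeholder lists, `crude_stem` is a deliberately rough stand-in for a real stemming algorithm, and all function names are hypothetical.

```python
import string
from collections import Counter

STOP_WORDS = {"the", "a", "is", "was", "and", "it"}  # illustrative subset only
PROFANITY = {"darn"}                                  # placeholder list

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, strip ASCII punctuation, drop stop words and profanity, stem."""
    text = text.lower()                                                # consistent case
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = [t for t in text.split()
              if t not in STOP_WORDS and t not in PROFANITY]
    return [crude_stem(t) for t in tokens]

def drop_sparse(docs, min_count=2):
    """Remove tokens that occur fewer than min_count times across all docs."""
    freq = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if freq[t] >= min_count] for doc in docs]

docs = [tokenize(t) for t in [
    "The service was GOOD, and it worked!",
    "Darn, the service failed.",
]]
print(docs)
print(drop_sparse(docs))
```

Sparse-word removal runs after tokenization because it needs frequencies over the whole corpus, whereas the other steps operate on one document at a time.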