4. Machine Learning for Text Understanding - 1
Tien-Lam Pham
Contents
Examples
◦ Is this spam?
◦ Positive or negative movie review?
◦ Why sentiment analysis?
Text Classification
Text Classification: definition
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output:
◦ a prediction y
◦ an action (0, 1)
Data processing
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
◦ a training set of m hand-labeled documents (d1,c1),…,(dm,cm)
Output:
◦ a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning (see the sketch after this list)
Any kind of classifier:
◦ Naïve Bayes
◦ Logistic regression
◦ Neural networks
◦ k-Nearest Neighbors
◦ …
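Any of these classifiers can fill the role of γ. As a concrete illustration, here is a minimal sketch assuming scikit-learn is available; the toy documents and labels are made up for illustration only:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# gamma: d -> c, learned from hand-labeled documents (toy data, illustrative only)
train_docs = ["just plain boring", "entirely predictable and lacks energy",
              "very powerful", "the most fun film of the summer"]
train_labels = ["neg", "neg", "pos", "pos"]        # C = {pos, neg}

gamma = make_pipeline(CountVectorizer(), MultinomialNB())
gamma.fit(train_docs, train_labels)                # supervised learning from (d_i, c_i) pairs

print(gamma.predict(["predictable with no fun"]))  # predicted class for a new document d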
Classification Evaluation
Naive Bayes Intuition
P(c | d) = P(d | c) P(c) / P(d)
Bag of Words
The Bag of Words Representation
[Figure: a movie review and its bag-of-words representation.
Review text: "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"
Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …]
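Concretely, a bag of words can be built by tokenizing and counting. A minimal sketch in Python, using the review text from the figure and a deliberately crude tokenizer:

from collections import Counter

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue is great and "
          "the adventure scenes are fun... It manages to be whimsical and romantic while laughing "
          "at the conventions of the fairy tale genre. I would recommend it to just about anyone. "
          "I've seen it several times, and I'm always happy to see it again whenever I have a "
          "friend who hasn't seen it yet!")

# crude tokenization: lowercase, strip punctuation, split on whitespace
tokens = [w.strip(".,!?") for w in review.lower().split()]
bag = Counter(tokens)

print(bag.most_common(5))   # word order is gone; only word identities and counts remain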
Bayesian Inference
Classification
ŷ = argmax_c P(y = c | x)
P(y = c | x) = P(x | y = c) P(y = c) / P(x)
[Figure 4.3: quadratic decision boundaries in 2D for a 2-class problem (discrimAnalysisDboundariesDemo).]
Bayesian Learning
P(h | D) = P(D | h) P(h) / P(D)
Bayesian Learning
P(θ | D) = P(D | θ) P(θ) / P(D)
Naive Bayes Classifier (I)
Document Classification
MAP is "maximum a posteriori" = the most likely class:
c_MAP = argmax_{c∈C} P(c | d)
      = argmax_{c∈C} P(d | c) P(c) / P(d)   (Bayes Rule)
      = argmax_{c∈C} P(d | c) P(c)          (dropping the denominator)
Naive Bayes Classifier (II)
Document Classification
c_MAP = argmax_{c∈C} P(d | c) P(c), where P(d | c) is the "Likelihood" and P(c) is the "Prior".
Multinomial Naive Bayes Independence Assumptions
Naive Bayes Independence Assumption:
P(x1, x2, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
Multinomial Naive Bayes Classifier:
c_NB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]
Notes:
1) Taking log doesn't change the ranking of classes!
The class with highest probability also has highest log probability!
2) It's a linear model:
just a max of a sum of weights, i.e. a linear function of the inputs,
so naive Bayes is a linear classifier (see the sketch below).
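Because each class score is just log P(c) plus a sum of per-word log likelihoods, scoring can be written directly. A minimal sketch, assuming hypothetical pre-computed log-probability tables (the numbers are illustrative):

import math

# hypothetical pre-computed model: log prior per class and log likelihood per word
log_prior = {"pos": math.log(2/5), "neg": math.log(3/5)}
log_likelihood = {
    "pos": {"predictable": math.log(1/29), "no": math.log(1/29), "fun": math.log(2/29)},
    "neg": {"predictable": math.log(2/34), "no": math.log(2/34), "fun": math.log(1/34)},
}

def classify(words):
    # score(c) = log P(c) + sum over positions of log P(x_i | c): a linear function of the inputs
    scores = {c: log_prior[c] + sum(log_likelihood[c].get(w, 0.0) for w in words)
              for c in log_prior}
    return max(scores, key=scores.get)

print(classify(["predictable", "no", "fun"]))   # "neg" with these illustrative numbers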
Learning the Multinomial Naive Bayes Model
Naive Bayes Learning
First attempt: maximum likelihood estimates
◦ simply use the frequencies in the data
P̂(cj) = N_cj / N_total
P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
Naive Bayes Learning: Parameter Estimation
P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
= fraction of times word wi appears among all words in documents of topic cj
Problem with maximum likelihood: a word that never occurs with a class in the training data gets zero probability, e.g.
P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0
Solution: add-1 (Laplace) smoothing
P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
           = (count(wi, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
Naive Bayes Learning
Multinomial Naïve Bayes: Learning — a worked sentiment example with add-1 smoothing
We'll walk through training and testing naive Bayes with add-one smoothing, using a sentiment analysis domain with the two classes positive (+) and negative (-), and a miniature training and test set drawn from actual movie reviews.

Cat  Documents
Training:
-    just plain boring
-    entirely predictable and lacks energy
-    no surprises and very few laughs
+    very powerful
+    the most fun film of the summer
Test:
?    predictable with no fun

1. Prior from training: P̂(c) = N_c / N_doc
   P(-) = 3/5   P(+) = 2/5

2. Drop "with": the word "with" doesn't occur in the training set, so we drop it completely (as mentioned above, we don't use unknown word models for naive Bayes).

3. Likelihoods from training, with add-1 smoothing, for the remaining three words "predictable", "no", and "fun" (computing the probabilities for the rest of the words in the training set is left as an exercise for the reader):
   P("predictable" | -) = (1 + 1) / (14 + 20)    P("predictable" | +) = (0 + 1) / (9 + 20)
   P("no" | -)          = (1 + 1) / (14 + 20)    P("no" | +)          = (0 + 1) / (9 + 20)
   P("fun" | -)         = (0 + 1) / (14 + 20)    P("fun" | +)         = (1 + 1) / (9 + 20)

4. Scoring the test set: for the test sentence S = "predictable with no fun", after removing the word "with", the chosen class is computed as follows:
   P(-) P(S | -) = 3/5 × (2 × 2 × 1) / 34³ = 6.1 × 10⁻⁵
   P(+) P(S | +) = 2/5 × (1 × 1 × 2) / 29³ = 3.2 × 10⁻⁵

The model thus predicts the class negative for the test sentence.
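As an arithmetic check, the following short script reproduces the priors, add-1 likelihoods, and final scores of this worked example (assuming whitespace tokenization and dropping test words unseen in training):

from collections import Counter

training = [("-", "just plain boring"),
            ("-", "entirely predictable and lacks energy"),
            ("-", "no surprises and very few laughs"),
            ("+", "very powerful"),
            ("+", "the most fun film of the summer")]
test = "predictable with no fun"

vocab = {w for _, doc in training for w in doc.split()}          # |V| = 20
counts = {"+": Counter(), "-": Counter()}
n_docs = {"+": 0, "-": 0}
for c, doc in training:
    counts[c].update(doc.split())
    n_docs[c] += 1

for c in ("-", "+"):
    score = n_docs[c] / len(training)                            # prior P(c)
    for w in test.split():
        if w not in vocab:                                       # drop "with" (unseen in training)
            continue
        score *= (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))   # add-1 likelihood
    print(c, score)   # "-" ~ 6.1e-05, "+" ~ 3.2e-05 -> the model predicts negative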
Activity: Naïve Bayes as a Language Model
Determine the class (positive / negative) of the sentence: "I love this fun film"
Binary Multinomial Naive Bayes
On a test document d:
◦ First remove all duplicate words from d
◦ Then compute NB using the same equation (see the sketch below)
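A minimal sketch of the deduplication step (the scoring itself is unchanged):

def binarize(tokens):
    # keep each word type once, preserving first-occurrence order
    seen, deduped = set(), []
    for w in tokens:
        if w not in seen:
            seen.add(w)
            deduped.append(w)
    return deduped

print(binarize("it was great great fun and it was fun".split()))
# ['it', 'was', 'great', 'fun', 'and'] -- then score with the usual NB equation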
Evaluation of Classifiers
Accuracy is the percentage of all the observations (for the spam or pie examples, that means all emails or tweets) our system labeled correctly. Although accuracy might seem a natural metric, we generally don't use it for text classification tasks. That's because accuracy doesn't work well when the classes are unbalanced (as indeed they are with spam, which is a large majority of email, or with tweets, which are mainly not about pie).
The 2-by-2 confusion matrix
[Figure 4.4: a confusion matrix for visualizing how well a binary classification system performs against gold standard labels.]
Classification: The Sigmoid
z = w · x + b    (5.3)
We represent such sums using the dot product notation. If this sum z is high, we say y = 1; if low, then y = 0. The sigmoid has a number of advantages: it takes a real-valued number and maps it into the range (0, 1).
Now we have an algorithm that, given an instance x, computes the probability P(y = 1|x). How do we make a decision? For a test instance x, we say yes if the probability P(y = 1|x) is more than .5, and no otherwise. We call .5 the decision boundary:
ŷ = 1 if P(y = 1|x) > 0.5, and 0 otherwise
Text Features
Example: sentiment classification
Let's have an example. Suppose we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class + or − to a review document doc. We'll represent each input observation by the 6 features x1…x6 of the input shown in the following table; Fig. 5.2 shows the features in a sample mini test document.
[Figure 5.2: a sample mini test document showing the extracted features in the vector x. Text: "It's hokey. There are virtually no surprises, and the writing is second-rate. So why was it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do the same to you." Annotated values: x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19.]
Var  Definition                                  Value in Fig. 5.2
x1   count(positive lexicon words ∈ doc)         3
x2   count(negative lexicon words ∈ doc)         2
x3   1 if "no" ∈ doc, 0 otherwise                1
x4   count(1st and 2nd person pronouns ∈ doc)    3
x5   1 if "!" ∈ doc, 0 otherwise                 0
x6   log(word count of doc)                      ln(66) = 4.19

Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:
p(+|x) = P(Y = 1|x) = σ(w · x + b)
       = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
       = σ(0.833)
       = 0.70    (5.6)
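The same computation as a minimal sketch in Python, using the weight vector, bias, and feature values shown above:

import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]     # weight vector
b = 0.1                                  # bias
x = [3, 2, 1, 3, 0, 4.19]                # feature values x1..x6 from Fig. 5.2

z = sum(wi * xi for wi, xi in zip(w, x)) + b      # z = w . x + b = 0.833
p_pos = 1 / (1 + math.exp(-z))                    # sigmoid(z) ~ 0.70

print(round(z, 3), round(p_pos, 2))
print("+" if p_pos > 0.5 else "-")                # decision boundary at 0.5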