
Natural Language Processing

Machine learning for Language Understanding

Tien-Lam Pham
Contents

Text classification problem


Machine learning
Naive Bayes
Logistic Regression
Neural network
Examples

Is this spam?

Examples

What is the subject of this medical article?

MEDLINE Article → MeSH Subject Category Hierarchy
Antagonists and Inhibitors
Blood Supply
Chemistry
Drug Therapy
Embryology
Epidemiology
Examples

Positive or negative movie review?

+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...
Examples

Why sentiment analysis?

Movie: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from sentiment
Text Classification

Text Classification: definition

Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c ∈ C

Intelligent Machine

INPUT: information (X)
◦ give the machine a set of rules → knowledge
◦ learning from data (machine learning) → intelligence
OUTPUT: prediction (y), action (0, 1), data processing

Teaching a computer how to learn: how can a computer learn?

Traditional ML vs DL
Classification Methods: Hand-coded Rules

Rule-based classification: hand-code the rules
Rules based on combinations of words or other features
◦ spam: black-list-address OR ("dollars" AND "you have been selected")
Accuracy can be high
◦ if the rules are carefully refined by an expert
But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning

ML-based classification: learn the rules from data

Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2, …, cJ}
◦ a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
Output:
◦ a learned classifier γ: d → c

Any kind of classifier can be used:
◦ Naïve Bayes
◦ Logistic regression
◦ Neural networks
◦ k-Nearest Neighbors
◦ …
Naive Bayes: Intuition

Simple ("naive") classification method based on Bayes' rule
Relies on a very simple representation of the document
◦ Bag of words

Bayes' Rule Applied to Documents and Classes

For a document d and a class c:

P(c | d) = P(d | c) P(c) / P(d)
Bag of Words

The Bag of Words Representation

Example review:
"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the
adventure scenes are fun... It manages to be whimsical and romantic while laughing at the
conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen
it several times, and I'm always happy to see it again whenever I have a friend who
hasn't seen it yet!"

Bag-of-words counts for this review:
it 6
I 5
the 4
to 3
and 3
seen 2
yet 1
would 1
whimsical 1
times 1
sweet 1
satirical 1
adventure 1
genre 1
fairy 1
humor 1
have 1
great 1
…
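A minimal sketch of building such a bag-of-words representation in Python (the
lowercase/regex tokenizer here is an assumption for illustration; real systems may
tokenize differently):

import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, split into word tokens, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun...")
print(bag_of_words(review).most_common(3))   # e.g. [('the', 2), ('and', 2), ...]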


Bayesian Inference

Classification:

ŷ = argmax_c P(y = c | x)

P(y = c | x) = P(x | y = c) P(y = c) / P(x)

(Figure: quadratic decision boundaries in 2D.)
Bayesian Learning

Prior: probability that a hypothesis is true
Likelihood: probability that the data is observed given a hypothesis
Posterior: probability that a hypothesis is true given the data

ĥ = argmax_h P(h | D)

P(h | D) = P(D | h) P(h) / P(D)
Bayesian Learning

The same prior/likelihood/posterior picture applies to parameter estimation, with
parameters θ in place of hypotheses h:

θ̂ = argmax_θ P(θ | D)

P(θ | D) = P(D | θ) P(θ) / P(D)
Naive Bayes Classifier (I): Document Classification

cMAP = argmax_{c∈C} P(c | d)              MAP is "maximum a posteriori" = most likely class
     = argmax_{c∈C} P(d | c) P(c) / P(d)  Bayes' rule
     = argmax_{c∈C} P(d | c) P(c)         dropping the denominator
Naive Bayes Classifier (II): Document Classification

cMAP = argmax_{c∈C} P(d | c) P(c)                 "likelihood" × "prior"
     = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)     document d represented as features x1…xn
Naïve Bayes Classifier (IV): Document Classification

cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters; could only be estimated if a very, very
large number of training examples was available.
P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
Multinomial Naive Bayes: Independence Assumptions

P(x1, x2, …, xn | c)

Bag of Words assumption: assume position doesn't matter
Conditional independence: assume the feature probabilities P(xi | cj) are independent
given the class c:

P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)

Multinomial Naive Bayes Classifier

cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

cNB = argmax_{c∈C} P(cj) ∏_{x∈X} P(x | c)
Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in the test document

cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
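A direct translation of this decision rule into Python, as a sketch; the model layout
(a dict of priors P(c) and nested dicts of word likelihoods P(w|c)) and the name
predict_nb are assumptions for illustration:

def predict_nb(doc_tokens, priors, likelihoods):
    """Return argmax_c P(c) * prod_i P(x_i | c) over the word positions of the document."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for w in doc_tokens:
            if w in likelihoods[c]:      # words outside the vocabulary are skipped
                score *= likelihoods[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

As the next slide points out, multiplying many small probabilities like this risks
floating-point underflow, so in practice the sum-of-logs form shown below is used instead.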
Problems with Multiplying Lots of Probabilities

There's a problem with this:

cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)

Multiplying lots of probabilities can result in floating-point underflow!
.0006 * .0007 * .0009 * .01 * .5 * .000008 * …
Idea: use logs, because log(ab) = log(a) + log(b)
We'll sum logs of probabilities instead of multiplying probabilities!
We actually do everything in log space

Instead of this:  cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)

This:             cNB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]

Notes:
1) Taking the log doesn't change the ranking of classes!
The class with highest probability also has highest log probability!
2) It's a linear model:
just a max of a sum of weights, i.e. a linear function of the inputs.
So naive Bayes is a linear classifier.
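The same decision in log space, a sketch under the same assumed model layout as above:

import math

def predict_nb_log(doc_tokens, priors, likelihoods):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    best_class, best_logp = None, float("-inf")
    for c, prior in priors.items():
        logp = math.log(prior)
        for w in doc_tokens:
            if w in likelihoods[c]:      # unknown words are dropped, as discussed later
                logp += math.log(likelihoods[c][w])
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class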
Learning the Multinomial Naive Bayes Model

First attempt: maximum likelihood estimates
◦ simply use the frequencies in the data

P̂(cj) = N_cj / N_total

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
Parameter Estimation

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)
            = fraction of times word wi appears among all words in documents of topic cj

Create a mega-document for topic j by concatenating all docs in this topic
◦ use the frequency of w in the mega-document
Problem with Maximum Likelihood

What if we have seen no training documents with the word fantastic and classified in the
topic positive (thumbs-up)?

P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0

Zero probabilities cannot be conditioned away, no matter the other evidence!

cMAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Laplace (add-1) Smoothing for Naïve Bayes

P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)

           = (count(wi, c) + 1) / ( (Σ_{w∈V} count(w, c)) + |V| )
Multinomial Naïve Bayes: Learning

From the training corpus, extract the Vocabulary

Calculate the P(cj) terms:
◦ for each cj in C do
  docsj ← all docs with class = cj
  P(cj) ← |docsj| / |total # documents|

Calculate the P(wk | cj) terms:
◦ Textj ← single doc containing all docsj
◦ for each word wk in Vocabulary
  nk ← # of occurrences of wk in Textj
  P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
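A compact sketch of this training procedure in Python with add-α smoothing (α = 1 gives
Laplace add-1 smoothing); the input layout, a list of (token_list, class) pairs, and the
name train_nb are assumptions for illustration:

from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, class). Returns priors P(c) and smoothed likelihoods P(w|c)."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)           # word_counts[c][w] = count(w, c)
    vocab = set()
    for tokens, c in docs:
        word_counts[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())     # n = total tokens in class c
        likelihoods[c] = {w: (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
                          for w in vocab}
    return priors, likelihoods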
Unknown Words

What about unknown words
◦ that appear in our test data
◦ but not in our training data or vocabulary?
We ignore them
◦ Remove them from the test document!
◦ Pretend they weren't there!
◦ Don't include any probability for them at all!
Why don't we build an unknown word model?
◦ It doesn't help: knowing which class has more unknown words is not generally helpful!
Stop Words

Some systems ignore stop words
◦ Stop words: very frequent words like the and a
◦ Sort the vocabulary by word frequency in the training set
◦ Call the top 10 or 50 words the stopword list
◦ Remove all stop words from both training and test sets
◦ As if they were never there!

But removing stop words doesn't usually help
• So in practice most NB algorithms use all words and don't use stopword lists
Activity: A Worked Sentiment Example with Add-1 Smoothing

We'll use a sentiment analysis domain with the two classes positive (+) and negative (-),
and take the following miniature training and test documents, simplified from actual
movie reviews.

Cat       Documents
Training
-         just plain boring
-         entirely predictable and lacks energy
-         no surprises and very few laughs
+         very powerful
+         the most fun film of the summer
Test
?         predictable with no fun

1. Prior from training: P(c) = Nc / Ndoc

P(-) = 3/5        P(+) = 2/5

2. Drop "with": the word with doesn't occur in the training set, so we drop it completely
(as mentioned above, we don't use unknown word models for naive Bayes).

3. Likelihoods from the training set for the remaining three words "predictable", "no",
and "fun", with add-1 smoothing (the negative class has 14 tokens, the positive class has
9 tokens, and |V| = 20):

P("predictable" | -) = (1 + 1) / (14 + 20)    P("predictable" | +) = (0 + 1) / (9 + 20)
P("no" | -)          = (1 + 1) / (14 + 20)    P("no" | +)          = (0 + 1) / (9 + 20)
P("fun" | -)         = (0 + 1) / (14 + 20)    P("fun" | +)         = (1 + 1) / (9 + 20)

(Computing the probabilities for the remainder of the words in the training set is left
as an exercise for the reader.)

4. Scoring the test sentence S = "predictable with no fun", after removing the word
"with":

P(-) P(S | -) = 3/5 × (2 × 2 × 1) / 34^3 = 6.1 × 10^-5
P(+) P(S | +) = 2/5 × (1 × 1 × 2) / 29^3 = 3.2 × 10^-5

The model thus predicts the class negative for the test sentence.
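The worked example can be reproduced with the hypothetical train_nb and predict_nb_log
sketches from the earlier slides:

train = [("just plain boring".split(), "-"),
         ("entirely predictable and lacks energy".split(), "-"),
         ("no surprises and very few laughs".split(), "-"),
         ("very powerful".split(), "+"),
         ("the most fun film of the summer".split(), "+")]
priors, likelihoods = train_nb(train, alpha=1.0)
print(priors)                                                        # {'-': 0.6, '+': 0.4}
print(likelihoods["-"]["predictable"], likelihoods["+"]["predictable"])  # 2/34 ≈ 0.059, 1/29 ≈ 0.034
print(predict_nb_log("predictable with no fun".split(), priors, likelihoods))   # '-'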
Activity: Naïve Bayes as a Language Model

Determine the class (positive / negative) of the sentence: "I love this fun film"

Which class assigns the higher probability to the sentence?

Model pos              Model neg
0.1    I               0.2     I
0.1    love            0.001   love
0.01   this            0.01    this
0.05   fun             0.005   fun
0.1    film            0.1     film

P(s | pos) > P(s | neg)
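A quick numerical check of this activity, with the per-word probabilities taken from the
two model tables above:

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}
p_pos, p_neg = 1.0, 1.0
for w in "I love this fun film".split():
    p_pos *= pos[w]
    p_neg *= neg[w]
print(p_pos, p_neg)      # ≈ 5e-07 vs ≈ 1e-09, so P(s|pos) > P(s|neg): choose positive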
Binary Multinomial Naive Bayes

On a test document d:
First remove all duplicate words from d
Then compute NB using the same equation:

cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(wi | cj)
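A one-line sketch of the binarization step: collapse duplicate word types in the document
before applying the same equation (and likewise clip each word's count to 1 per document
at training time):

def binarize(doc_tokens):
    """Keep each word type at most once, preserving first-occurrence order."""
    return list(dict.fromkeys(doc_tokens))

print(binarize("great great great plot and great acting".split()))
# ['great', 'plot', 'and', 'acting']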
Evaluation of Classifiers

Accuracy is the percentage of all the observations (for the spam or pie examples that
means all emails or tweets) our system labeled correctly. Although accuracy might seem a
natural metric, we generally don't use it for text classification tasks. That's because
accuracy doesn't work well when the classes are unbalanced (as indeed they are with spam,
which is a large majority of email, or with tweets, which are mainly not about pie).

The 2-by-2 confusion matrix

                         gold standard labels
                         gold positive      gold negative
system   system
output   positive        true positive      false positive    precision = tp / (tp + fp)
labels   system
         negative        false negative     true negative

recall = tp / (tp + fn)        accuracy = (tp + tn) / (tp + fp + tn + fn)

Figure 4.4: A confusion matrix for visualizing how well a binary classification system
performs against gold standard labels.
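These three quantities, computed directly from the confusion-matrix cells, as a small
sketch:

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy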

To make this more explicit, imagine that we looked at a million tweets, and that only a
tiny fraction of them are about the topic we care about: a classifier that never labels
anything positive can have very high accuracy while finding none of the things we are
supposed to be looking for. That is why we evaluate with precision and recall instead.

A combined measure: F-score

There are many ways to define a single metric that incorporates aspects of both precision
and recall. The simplest of these combinations is the F-measure (van Rijsbergen, 1975),
a single number that combines P and R, defined as:

F_β = (β² + 1) P R / (β² P + R)

The β parameter differentially weights the importance of recall and precision, based
perhaps on the needs of an application. Values of β > 1 favor recall, while values of
β < 1 favor precision. When β = 1, precision and recall are equally balanced; this is the
most frequently used metric, and is called F_{β=1} or just F1:

F1 = 2 P R / (P + R)

The F-measure comes from a weighted harmonic mean of precision and recall (the harmonic
mean of a set of numbers is the reciprocal of the arithmetic mean of their reciprocals).
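And the corresponding F-measure as a sketch, with β as a parameter (β = 1 gives F1, the
harmonic mean of precision and recall):

def f_measure(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(round(f_measure(0.8, 0.5), 3))    # 0.615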
Logistic Regression

Input observation: a vector x = [x1, x2, …, xn]
Weights: one per feature, W = [w1, w2, …, wn]
◦ sometimes we call the weights θ = [θ1, θ2, …, θn]
◦ for each feature xi, the weight wi tells us the importance of xi
◦ (plus we'll have a bias b)
Output: a predicted class ŷ ∈ {0, 1}
(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})

In a sentiment task, the word awesome would have a high positive weight, and a strongly
negative word a very negative weight. The bias term b is a real number that's added to
the weighted inputs.

For one observation x, we sum up all the weighted features and the bias:

z = Σ_{i=1..n} wi xi + b

Using the dot-product notation of linear algebra (the dot product of two vectors a and b,
written a · b, is the sum of the products of the corresponding elements of each vector),
this is equivalent to:

z = w · x + b

If this sum is high, we say y = 1; if low, then y = 0.

The problem: z isn't a probability, it's just a number! Nothing forces z to lie between
0 and 1; in fact, since weights are real-valued, z ranges from −∞ to +∞.

Solution: pass z through a function that maps it into (0, 1), the sigmoid (or logistic)
function, which gives logistic regression its name:

y = σ(z) = 1 / (1 + e^(−z))

The sigmoid takes a real value and squashes it into the range (0, 1); it is nearly linear
around 0, but outlier values get squashed toward 0 or 1 (Figure 5.1 shows the sigmoid
function graphically).
Now we have an algorithm that, given an instance x, computes the probability P(y = 1|x).
How do we make a decision? For a test instance x, we say yes if the probability
P(y = 1|x) is more than 0.5, and no otherwise. We call 0.5 the decision boundary:

ŷ = 1 if P(y = 1|x) > 0.5
    0 otherwise

Example: sentiment classification

Let's have an example. Suppose we are doing binary sentiment classification on movie
review text, and we would like to know whether to assign the sentiment class + or − to a
review document doc. We'll represent each input observation by the 6 input features
x1…x6 shown in the following table; Fig. 5.2 shows the features in a sample mini test
document:

"It's hokey. There are virtually no surprises, and the writing is second-rate. So why was
it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was
overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll
do the same to you."

Figure 5.2: A sample mini test document showing the extracted features in the vector x
(x1 = 3, x2 = 2, x3 = 1, x4 = 3, x5 = 0, x6 = 4.19).

Var   Definition                                  Value in Fig. 5.2
x1    count(positive lexicon words ∈ doc)         3
x2    count(negative lexicon words ∈ doc)         2
x3    1 if "no" ∈ doc, 0 otherwise                1
x4    count(1st and 2nd person pronouns ∈ doc)    3
x5    1 if "!" ∈ doc, 0 otherwise                 0
x6    log(word count of doc)                      ln(66) = 4.19

Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using
Eq. 5.5:

p(+|x) = P(Y = 1|x) = σ(w · x + b)
       = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
       = σ(0.833)
       = 0.70

p(−|x) = P(Y = 0|x) = 1 − σ(w · x + b) = 0.30
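The same computation in Python, a minimal sketch using the example's weights
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7] and bias b = 0.1:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
b = 0.1
x = [3, 2, 1, 3, 0, 4.19]                       # the six features extracted above
z = sum(wi * xi for wi, xi in zip(w, x)) + b    # z = w · x + b
print(round(z, 3), round(sigmoid(z), 2))        # ≈ 0.833 and ≈ 0.70, so P(+|x) ≈ 0.70 and P(-|x) ≈ 0.30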

