Logistic Regression
Discriminative Classifiers
Logistic Regression
Logistic Regression, by contrast, is a discriminative classifier.
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Logistic Regression computes the posterior P(c|d) directly.
Components of a probabilistic machine learning
classifier
Given m input/output pairs (x^(i), y^(i)):
1. A feature representation of the input: a vector of features [x1, x2, …, xn] for each input x^(i)
2. A classification function that computes ŷ, the estimated class, via p(y|x) (for logistic regression: the sigmoid)
3. An objective function for learning: cross-entropy loss
4. An algorithm for optimizing the objective function: stochastic gradient descent
Logistic Regression
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2, …, cJ}
Output:
◦ a predicted class ŷ ∈ C
y = σ(z) = 1 / (1 + e^(−z))
Idea of logistic regression
We'll compute w∙x + b, pass it through the sigmoid function σ(z), and treat the result as a probability.
Turning a probability into a classifier
P(y=1) = σ(w∙x + b)
ŷ = 1 if P(y=1|x) > 0.5, i.e. if w∙x + b > 0
ŷ = 0 if P(y=1|x) ≤ 0.5, i.e. if w∙x + b ≤ 0
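To make the decision rule concrete, here is a minimal Python sketch (helper names are my own, not from the slides) of the sigmoid and the resulting classifier:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x, b):
    """P(y=1 | x) = sigmoid(w·x + b) for weight vector w, feature vector x, bias b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def classify(w, x, b):
    """Decision rule: y-hat = 1 if P(y=1|x) > 0.5, i.e. if w·x + b > 0, else 0."""
    return 1 if predict_proba(w, x, b) > 0.5 else 0
```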
Classification in Logistic Regression
Logistic Regression
Logistic Regression: a text example
on sentiment classification
Logistic Regression
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
x1 = 3, x2 = 2, x3 = 1, x4 = 3, x5 = 0, x6 = 4.19
Figure 5.2: A sample mini test document showing the extracted features in the vector x.
Classifying sentiment for input x
Suppose w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7] and b = 0.1
Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:
p(+|x) = P(Y = 1|x) = σ(w·x + b)
= σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
= σ(0.833)
= 0.70    (5.6)
p(−|x) = P(Y = 0|x) = 1 − σ(w·x + b)
= 0.30
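As a check, a short Python sketch (my own helper names) that reproduces this computation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # weights from the example
b = 0.1
x = [3, 2, 1, 3, 0, 4.19]              # features x1..x6 for the review

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w·x + b ≈ 0.833
print(round(z, 3), round(sigmoid(z), 2), round(1 - sigmoid(z), 2))  # 0.833 0.7 0.3
```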
Logistic regression is commonly applied to all sorts of NLP tasks, and any property of the input can be a feature. Consider the task of period disambiguation: deciding if a period is the end of a sentence or part of a word, by classifying each period into one of two classes, EOS (end-of-sentence) and not-EOS. We might use features like x1 below, expressing that the current word is lower case and the class is EOS (perhaps with a positive weight), or that the current word is in an abbreviation dictionary (perhaps with a negative weight).
We can build features for logistic regression for any classification task: period disambiguation
"This ends in a period."  →  End of sentence
"The house at 465 Main St. is new."  →  Not end of sentence
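As an illustration (not from the slides), a small Python sketch of what such features might look like for a single period; the feature set, abbreviation list, and function name are assumptions chosen to fit the two examples above:

```python
# Illustrative sketch: extract features for classifying one period as
# EOS (end of sentence) vs. not-EOS. The feature choices loosely follow the
# text; the abbreviation list and function name are assumptions.
ABBREVIATIONS = {"st.", "mr.", "mrs.", "dr.", "prof."}

def period_features(word, next_word):
    """Features for the period that follows `word` (e.g. 'St' in 'Main St. is')."""
    x1 = 1 if word.islower() else 0                       # current word is lower case
    x2 = 1 if word.lower() + "." in ABBREVIATIONS else 0  # word is a known abbreviation
    x3 = 1 if next_word[:1].isupper() else 0              # next word is capitalized
    return [x1, x2, x3]

print(period_features("period", "The"))  # "This ends in a period. The ..." -> [1, 0, 1]
print(period_features("St", "is"))       # "The house at 465 Main St. is new." -> [0, 1, 0]
```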
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment, − sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1 = count("awesome")
◦ x2 = log(number of words in review)
◦ A vector w of weights [w1, w2, …, wn]
◦ wi for each feature fi
Logistic Regression: a text example
on sentiment classification
Logistic Regression
Learning: Cross-Entropy Loss
Logistic Regression
Wait, where did the W’s come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ
We want to set w and b to minimize the distance between our
estimate ŷ^(i) and the true y^(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss.
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y
p(y|x) = ŷ^y (1 − ŷ)^(1−y)
noting:
if y=1, this simplifies to ŷ
if y=0, this simplifies to 1 − ŷ
Deriving cross-entropy loss for a single observation x
Goal: maximize the probability of the correct label p(y|x)
Maximize: p(y|x) = ŷ^y (1 − ŷ)^(1−y)
Now take the log of both sides (mathematically handy)
Maximize: log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)
Finally, flip the sign to turn this into something to minimize: the cross-entropy loss
L_CE(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]
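A minimal Python sketch of this loss for a single observation (natural log; the function name is my own):

```python
import math

def cross_entropy_loss(y_hat, y):
    """L_CE(y_hat, y) = -[y*log(y_hat) + (1-y)*log(1-y_hat)], for y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# The loss is small when the model gives high probability to the correct label:
print(cross_entropy_loss(0.9, 1))  # ≈ 0.11
print(cross_entropy_loss(0.9, 0))  # ≈ 2.30
```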
Let's see if this works for our sentiment example
True value is y=1. How well is our model doing?
Recall the features from Figure 5.2, x = [3, 2, 1, 3, 0, 4.19], and that p(+|x) = σ(w·x + b) = 0.70.
Let's see if this works for our sentiment example
Suppose the true value instead was y=0. What's the loss?
p(+|x) = P(Y = 1|x) = σ(w·x + b) = σ(0.833) = 0.70
p(−|x) = P(Y = 0|x) = 1 − σ(w·x + b) = 0.30
Let's see if this works for our sentiment example
The loss when the model is right (true y=1), L_CE = −log(0.70) ≈ 0.36,
is lower than the loss when the model is wrong (true y=0), L_CE = −log(0.30) ≈ 1.2.
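A quick numeric check of these two values, using the p(+|x) ≈ 0.70 computed earlier:

```python
import math

p_pos = 0.70                           # P(+|x) from the worked example
print(round(-math.log(p_pos), 2))      # loss if the true label is positive: 0.36
print(round(-math.log(1 - p_pos), 2))  # loss if the true label is negative: 1.2
```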
Logistic Regression
Stochastic Gradient Descent
Logistic Regression
Our goal: minimize the loss
Let's make explicit that the loss function is parameterized by weights θ = (w, b)
• And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious
We want the weights that minimize the loss, averaged over all examples:
θ̂ = argmin_θ (1/m) Σ_{i=1..m} L_CE(f(x^(i); θ), y^(i))
Intuition of gradient descent
How do I get to the bottom of this river canyon?
Let's first visualize for a single scalar w
Q: Given the current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
(Figure: the loss as a function of w; at the current value w1 the slope of the loss is negative, so one step of gradient descent moves w toward the minimum wmin, our goal.)
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
How much do we move? By the value of the gradient (the slope, in our single-variable example), d/dw L(f(x; w), y), weighted by a learning rate η
• A higher learning rate means we move w faster
The update for a single scalar w is:
w^(t+1) = w^t − η (d/dw) L(f(x; w), y)
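A toy Python sketch of this update rule in one dimension; the quadratic loss and starting point are illustrative assumptions, not from the slides:

```python
def loss(w):
    """Toy convex loss with its minimum at w = 2 (illustrative only)."""
    return (w - 2.0) ** 2

def dloss_dw(w):
    """Slope (derivative) of the toy loss at w."""
    return 2.0 * (w - 2.0)

w = -1.0    # arbitrary starting point w^0
eta = 0.1   # learning rate
for t in range(25):
    w = w - eta * dloss_dw(w)   # w^(t+1) = w^t - eta * dL/dw
print(round(w, 3))  # close to the minimum at w = 2.0
```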
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
Imagine 2 dimensions, w and b
(Figure: the surface Cost(w, b) over the w and b axes, visualizing the gradient vector at the red point; its two components are shown in the x-y plane.)
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i
tells us the slope with respect to that variable.
◦ “How much would a small change in wi influence the
total loss function L?”
◦ We express the slope as a partial derivative of the loss: ∂L/∂wi
The gradient is then defined as a vector of these partials: ∇L = [∂L/∂w1, ∂L/∂w2, …, ∂L/∂wn]
The gradient
We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:
L_CE(ŷ, y) = −[y log σ(w·x + b) + (1 − y) log(1 − σ(w·x + b))]
The elegant derivative of this function (see textbook 5.8 for the derivation):
∂L_CE(ŷ, y) / ∂w_j = (σ(w·x + b) − y) x_j = (ŷ − y) x_j
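A short Python sketch of this gradient for one example, returning the partials with respect to the weights and the bias (function names are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_gradient(w, b, x, y):
    """dL/dw_j = (sigmoid(w·x + b) - y) * x_j, and dL/db = sigmoid(w·x + b) - y."""
    err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y  # y_hat - y
    return [err * xi for xi in x], err

print(ce_gradient([0.0, 0.0], 0.0, [3.0, 2.0], 1))  # ([-1.5, -1.0], -0.5)
```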
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
Stochastic Gradient Descent
Logistic Regression
Stochastic Gradient Descent: An example and more details
Logistic Regression
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent: the update step
w1 = w2 = b = 0;   x1 = 3;   x2 = 2;   η = 0.1
The general equation for updating θ based on the gradient is:
θ^(t+1) = θ^t − η ∇_θ L(f(x; θ), y)
where, for binary logistic regression, the gradient has one component per parameter:
∇_θ L = [ ∂L/∂w1, ∂L/∂w2, ∂L/∂b ] = [ (σ(w·x + b) − y) x1, (σ(w·x + b) − y) x2, σ(w·x + b) − y ]
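Putting the pieces together, a Python sketch of this single update (natural-log loss, gradient (ŷ − y)·x as above); under these assumptions it yields w1 = 0.15, w2 = 0.1, b = 0.05 after one step:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Mini-sentiment example: true y = 1, features x1 = 3, x2 = 2,
# initial parameters w1 = w2 = b = 0, learning rate eta = 0.1.
x, y = [3.0, 2.0], 1
w, b = [0.0, 0.0], 0.0
eta = 0.1

y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)  # sigmoid(0) = 0.5
err = y_hat - y                                            # -0.5
grad_w = [err * xi for xi in x]                            # [-1.5, -1.0]
grad_b = err                                               # -0.5

w = [wi - eta * gi for wi, gi in zip(w, grad_w)]           # theta^(t+1) = theta^t - eta * gradient
b = b - eta * grad_b
print([round(wi, 2) for wi in w], round(b, 2))             # [0.15, 0.1] 0.05
```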