Logistic Regression

The document discusses logistic regression as a key tool in supervised machine learning, highlighting its role as a discriminative classifier compared to generative classifiers like Naive Bayes. It explains the components of a probabilistic classifier, the training and testing phases, and the use of cross-entropy loss for optimization. Additionally, it illustrates the application of logistic regression in text classification tasks, such as sentiment analysis, by detailing how features and weights contribute to classification outcomes.

Uploaded by

Amad irfan


Background: Generative and Discriminative Classifiers

Logistic Regression

Important analytic tool in natural and social sciences
Baseline supervised machine learning tool for classification
Is also the foundation of neural networks
Generative and Discriminative Classifiers
Naive Bayes is a generative classifier

by contrast:

Logistic regression is a discriminative classifier
Generative and Discriminative Classifiers
Suppose we're distinguishing cat from dog images

Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?

Also build a model for dog images

Now given a new image:

Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats

Oh look, dogs have collars!

Let's ignore everything else
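Finding the correct class c from a document d in Generative vs Discriminative Classifiers

Naive Bayes (likelihood times prior):

$$\hat{c} = \operatorname{argmax}_{c \in C} P(d|c)\,P(c)$$

Logistic Regression (posterior):

$$\hat{c} = \operatorname{argmax}_{c \in C} P(c|d)$$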
Components of a probabilistic machine learning classifier
Given m input/output pairs (x(i), y(i)):

1. A feature representation of the input. For each input observation x(i), a vector of features [x1, x2, ... , xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic gradient descent.
The two phases of logistic regression

Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss.

Test: Given a test example x we compute p(y|x) using learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability
Classification in Logistic Regression

Classification Reminder

Positive/negative sentiment
Spam/not spam
Authorship attribution (Hamilton or Madison?)
Text Classification: definition

Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}

Output: a predicted class ŷ ∈ C

Binary Classification in Logistic Regression

Given a series of input/output pairs:

◦ (x(i), y(i))
For each observation x(i)
◦ We represent x(i) by a feature vector [x1, x2,…, xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0,1}
Features in logistic regression
• For feature xi, weight wi tells us how important xi is
• xi = "review contains 'awesome'": wi = +10
• xj = "review contains 'abysmal'": wj = -10
• xk = "review contains 'mediocre'": wk = -2
Logistic Regression for one observation x

Input observation: vector x = [x1, x2,…, xn]

Weights: one per feature: W = [w1, w2,…, wn]
◦ Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class ŷ ∈ {0,1}

(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})

How to do classification
For each feature xi, weight wi tells us importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias

If this sum is high, we say y=1; if low, then y=0
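The weighted sum described above can be sketched in a few lines of Python. The function name `weighted_sum` and the example feature values are illustrative assumptions; the weights simply echo the "awesome", "abysmal" and "mediocre" features mentioned earlier.

```python
def weighted_sum(weights, features, bias):
    """Return z = w . x + b for one observation."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical weights for 'awesome', 'abysmal', 'mediocre' (assumed values).
w = [10.0, -10.0, -2.0]
x = [1.0, 0.0, 1.0]   # this review contains 'awesome' and 'mediocre'
b = 0.0               # assumed bias
z = weighted_sum(w, x, b)
print(z)  # 8.0
```

A high positive z points toward y=1 and a low one toward y=0, but z itself is not yet a probability.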

But we want a probabilistic classifier

We need to formalize “sum is high”.

We’d like a principled classifier that gives us a
probability, just like Naive Bayes did
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
The problem: z isn't a probability, it's just a number!

$$z = w \cdot x + b$$

In fact, since weights are real-valued, z ranges from −∞ to ∞.

Solution: use a function of z that goes from 0 to 1

$$y = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + \exp(-z)}$$

The very useful sigmoid or logistic function
The sigmoid (named because it looks like an s) is also called the logistic function, and gives logistic regression its name. It takes a real-valued number and maps it into the range between 0 and 1. It is nearly linear around 0, but outlier values get squashed toward 0 or 1.

[Fig. 5.1: the sigmoid function σ(z)]
Idea of logistic regression

We’ll compute w∙x+b

And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
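As a minimal sketch of this idea, the following Python computes σ(w∙x+b). The names `sigmoid` and `predict_proba` are chosen for illustration, and the sign-based branching is an assumed implementation detail to avoid overflow, not something the slides prescribe.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a value between 0 and 1."""
    # Branch on the sign of z so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, features, bias):
    """Treat sigmoid(w . x + b) as P(y = 1 | x)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

print(sigmoid(0.0))  # 0.5
```

A score of exactly 0 maps to 0.5, and large positive or negative scores saturate toward 1 or 0.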
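Making probabilities with sigmoids

$$P(y=1) = \sigma(w \cdot x + b) = \frac{1}{1 + \exp(-(w \cdot x + b))}$$

$$P(y=0) = 1 - \sigma(w \cdot x + b) = \frac{\exp(-(w \cdot x + b))}{1 + \exp(-(w \cdot x + b))}$$

By the way:

$$P(y=0) = \sigma(-(w \cdot x + b))$$

Because

$$1 - \sigma(x) = \sigma(-x)$$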
Turning a probability into a classifier

$$\hat{y} = \begin{cases} 1 & \text{if } P(y=1|x) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$

0.5 here is called the decision boundary
The probabilistic classifier

[Figure: the sigmoid curve, with P(y=1) on the vertical axis and w∙x+b on the horizontal axis]
Turning a probability into a classifier

$$\hat{y} = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{if } w \cdot x + b \le 0 \end{cases}$$
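The two forms of the decision rule (threshold the probability at 0.5, or threshold the score at 0) can be compared with a small sketch. The function names are hypothetical, and the check uses a handful of moderate scores rather than claiming equivalence at the limits of floating-point precision.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify_by_probability(z):
    """Return 1 if P(y=1|x) = sigmoid(z) exceeds the 0.5 decision boundary."""
    return 1 if sigmoid(z) > 0.5 else 0

def classify_by_score(z):
    """Same rule applied directly to the score z = w . x + b."""
    return 1 if z > 0 else 0

# Both rules agree on these sample scores.
for z in [-3.0, -0.1, 0.0, 0.1, 3.0]:
    assert classify_by_probability(z) == classify_by_score(z)
```

Working with the score directly avoids computing the exponential when only the label is needed.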
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .

Extracted features: x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19

Figure 5.2 A sample mini test document showing the extracted features in the vector x.

Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:

$$p(+|x) = P(Y=1|x) = \sigma(w \cdot x + b)$$
$$= \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.19] + 0.1)$$
$$= \sigma(0.833)$$
$$= 0.70 \quad (5.6)$$
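The arithmetic of this example can be reproduced with a short script. The weights, features and bias are the ones given in the example; the variable names and the use of Python are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # weights from the example
x = [3, 2, 1, 3, 0, 4.19]              # features x1..x6 from Figure 5.2
b = 0.1                                # bias from the example

z = sum(wi * xi for wi, xi in zip(w, x)) + b
p_pos = sigmoid(z)        # P(+|x)
p_neg = 1.0 - p_pos       # P(-|x)

print(round(z, 3), round(p_pos, 2), round(p_neg, 2))  # 0.833 0.7 0.3
```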
Classifying sentiment for input x

Suppose w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7]
b = 0.1
$$p(+|x) = P(Y=1|x) = \sigma(w \cdot x + b) = \sigma(0.833) = 0.70$$

$$p(-|x) = P(Y=0|x) = 1 - \sigma(w \cdot x + b) = 0.30$$

Logistic regression is commonly applied to all sorts of NLP tasks, and any property of the input can be a feature. Consider the task of period disambiguation: deciding if a period is the end of a sentence or part of a word, by classifying each period into one of two classes, EOS (end-of-sentence) and not-EOS. We might use a feature expressing that the current word is lower case and the class is EOS (perhaps with a positive weight).
We can build features for logistic regression for any classification task: period disambiguation
End of sentence: "This ends in a period."
Not end: the period after "St" in "The house at 465 Main St. is new."
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1= count( "awesome")
◦ x2 = log(number of words in review)
◦ A vector w of weights [w1, w2, …, wn]
◦ wi for each feature fi
Learning: Cross-Entropy Loss

Wait, where did the W’s come from?

Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ
We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function or a cost function
• We need an optimization algorithm to update w and b to minimize the loss.
Learning components

A loss function:
◦ cross-entropy loss

An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y

We want to know how far the classifier output:

ŷ = σ(w∙x+b)

is from the true output:

y [= either 0 or 1]

We'll call this difference:

L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood loss = cross-entropy loss

A case of conditional maximum likelihood estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x

Goal: maximize probability of the correct label p(y|x)

Since there are only 2 discrete outcomes (0 or 1) we can express the probability p(y|x) from our classifier (the thing we want to maximize) as

$$p(y|x) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y}$$

noting:
if y=1, this simplifies to ŷ
if y=0, this simplifies to 1 − ŷ

Now take the log of both sides (mathematically handy)
Maximize:

$$\log p(y|x) = \log\left[\hat{y}^{\,y}\,(1-\hat{y})^{1-y}\right] = y \log \hat{y} + (1-y)\log(1-\hat{y})$$

Whatever values maximize log p(y|x) will also maximize p(y|x)

Now flip sign to turn this into a loss: something to minimize

Cross-entropy loss (because this is the formula for cross-entropy(y, ŷ))
Minimize:

$$L_{CE}(\hat{y}, y) = -\log p(y|x) = -\left[y \log \hat{y} + (1-y)\log(1-\hat{y})\right]$$

Or, plugging in the definition of ŷ:

$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1-y)\log(1-\sigma(w \cdot x + b))\right]$$
Let's see if this works for our sentiment example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
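This desired behavior can be checked with a small sketch of binary cross-entropy loss, L = −[y log ŷ + (1 − y) log(1 − ŷ)]. The function name and the probability 0.9 are illustrative, and the clipping constant `eps` is an added assumption to keep `log` away from 0; it is not part of the derivation.

```python
import math

def cross_entropy_loss(y_hat, y):
    """Binary cross-entropy for a predicted probability y_hat and label y in {0, 1}."""
    eps = 1e-12  # assumption: clip to avoid log(0)
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# A confident prediction of y=1 scored against each possible true label.
print(round(cross_entropy_loss(0.9, 1), 3))  # 0.105
print(round(cross_entropy_loss(0.9, 0), 3))  # 2.303
```

The logarithms here are natural logs, so the loss is measured in nats.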
Let's first suppose the true label of this is y=1 (positive)

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .
True value is y=1. How well is our model doing?

$$p(+|x) = P(Y=1|x) = \sigma(w \cdot x + b) = \sigma(0.833) = 0.70$$

Pretty well! What's the loss?

$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1-y)\log(1-\sigma(w \cdot x + b))\right]$$
$$= -\log \sigma(w \cdot x + b)$$
$$= -\log(0.70)$$
$$= 0.36$$

Suppose true value instead was y=0.

$$p(-|x) = P(Y=0|x) = 1 - \sigma(w \cdot x + b) = 0.30$$

What's the loss?

$$L_{CE}(\hat{y}, y) = -\log(1-\sigma(w \cdot x + b))$$
$$= -\log(0.30)$$
$$= 1.2$$
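Let's see if this works for our sentiment example
The loss when model was right (if true y=1):

$$L_{CE} = -\log \sigma(w \cdot x + b) = -\log(0.70) = 0.36$$

is lower than the loss when model was wrong (if true y=0):

$$L_{CE} = -\log(1-\sigma(w \cdot x + b)) = -\log(0.30) = 1.2$$

Sure enough, loss was bigger when model was wrong!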
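The comparison can be reproduced numerically, starting from the score w∙x+b = 0.833 of the sentiment example. This is only a sketch with illustrative variable names. Because it uses the unrounded probability, the second loss comes out near 1.19; rounding P(−|x) to 0.30 first gives about 1.2.

```python
import math

z = 0.833                               # w . x + b from the sentiment example
y_hat = 1.0 / (1.0 + math.exp(-z))      # P(y=1|x), about 0.70

loss_if_positive = -math.log(y_hat)         # true label y = 1
loss_if_negative = -math.log(1.0 - y_hat)   # true label y = 0

print(round(loss_if_positive, 2), round(loss_if_negative, 2))  # 0.36 1.19
```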

Stochastic Gradient Descent

Our goal: minimize the loss
Let's make explicit that the loss function is parameterized by weights θ = (w, b)
• And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious
We want the weights that minimize the loss, averaged over all examples:

$$\hat{\theta} = \operatorname{argmin}_{\theta} \frac{1}{m} \sum_{i=1}^{m} L_{CE}(f(x^{(i)}; \theta), y^{(i)})$$
Intuition of gradient descent
How do I get to the bottom of this river canyon?

Look around me 360∘

Find the direction of steepest slope down

Go that way
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function

[Figure: loss plotted against w, with the current point w1 to the left of the minimum wmin (the goal)]

Should we move right or left from here?
The slope of the loss at w1 is negative, so we'll move positive.
One step of gradient descent moves w1 toward wmin.
Gradients
The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in a function.

Gradient Descent: Find the gradient of the loss function at the current point and move in the opposite direction.
How much do we move in that direction?

• The value of the gradient (slope in our example) $\frac{d}{dw} L(f(x; w), y)$ weighted by a learning rate η
• Higher learning rate means move w faster

$$w^{t+1} = w^{t} - \eta \frac{d}{dw} L(f(x; w), y)$$
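The scalar update rule w(t+1) = w(t) − η · dL/dw can be tried on a toy problem. The quadratic loss below is a made-up convex function with its minimum at w = 3, not the logistic regression loss; it only illustrates how a negative slope pushes w in the positive direction.

```python
def loss(w):
    # Hypothetical convex loss with its minimum at w = 3 (not from the slides).
    return (w - 3.0) ** 2

def dloss_dw(w):
    # Derivative of the toy loss above.
    return 2.0 * (w - 3.0)

eta = 0.1   # learning rate
w = 0.0     # start left of the minimum, where the slope is negative
for _ in range(50):
    w = w - eta * dloss_dw(w)   # w^{t+1} = w^t - eta * dL/dw

print(round(w, 3))  # 3.0
```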
Now let's consider N dimensions
We want to know where in the N-dimensional space (of the N parameters that make up θ) we should move.
The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of the N dimensions.
Imagine 2 dimensions, w and b

[Figure: the surface Cost(w,b) over the axes w and b, visualizing the gradient vector at the red point. It has two dimensions, shown in the x-y plane.]
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i tells us the slope with respect to that variable.
◦ "How much would a small change in wi influence the total loss function L?"
◦ We express the slope as a partial derivative ∂/∂wi of the loss
The gradient is then defined as a vector of these partials.
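The gradient
We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:

$$\nabla_{\theta} L(f(x;\theta), y) = \begin{bmatrix} \frac{\partial}{\partial w_1} L(f(x;\theta), y) \\ \frac{\partial}{\partial w_2} L(f(x;\theta), y) \\ \vdots \\ \frac{\partial}{\partial w_n} L(f(x;\theta), y) \end{bmatrix} \quad (5.15)$$

The final equation for updating θ based on the gradient is thus

$$\theta^{t+1} = \theta^{t} - \eta \nabla L(f(x;\theta), y) \quad (5.16)$$

What are these partial derivatives for logistic regression?

The loss function

$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1-y)\log(1-\sigma(w \cdot x + b))\right]$$

The elegant derivative of this function (see textbook 5.8 for derivation)

$$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \left[\sigma(w \cdot x + b) - y\right] x_j = (\hat{y} - y)\, x_j$$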
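The derivative of the cross-entropy loss with respect to each weight is (ŷ − y)·xj, and with respect to the bias it is (ŷ − y). As a sanity check, the sketch below compares that closed form with a finite-difference estimate. The function names and the numbers in the check are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss for one example."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def gradient(w, b, x, y):
    """Return (dL/dw, dL/db) of the cross-entropy loss for one example."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    error = y_hat - y                    # sigma(w . x + b) - y
    grad_w = [error * xi for xi in x]    # dL/dw_j = (y_hat - y) * x_j
    grad_b = error                       # dL/db   = (y_hat - y)
    return grad_w, grad_b

# Finite-difference check on made-up numbers.
w, b, x, y = [0.2, -0.4], 0.1, [1.5, 2.0], 1
h = 1e-6
numeric = (loss([w[0] + h, w[1]], b, x, y) - loss([w[0] - h, w[1]], b, x, y)) / (2 * h)
analytic = gradient(w, b, x, y)[0][0]
print(abs(numeric - analytic) < 1e-6)  # True
```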
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
Stochastic Gradient Descent: An example and more details
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent
w1 = w2 = b = 0; x1 = 3; x2 = 2; η = 0.1

Update step for θ is:

$$\theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f(x;\theta), y)$$

where

$$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \left[\sigma(w \cdot x + b) - y\right] x_j \qquad \frac{\partial L_{CE}(\hat{y}, y)}{\partial b} = \sigma(w \cdot x + b) - y$$

Gradient vector has 3 dimensions:

$$\nabla_{w,b} L = \begin{bmatrix} \frac{\partial L_{CE}(\hat{y},y)}{\partial w_1} \\ \frac{\partial L_{CE}(\hat{y},y)}{\partial w_2} \\ \frac{\partial L_{CE}(\hat{y},y)}{\partial b} \end{bmatrix} = \begin{bmatrix} (\sigma(w \cdot x + b) - y)\,x_1 \\ (\sigma(w \cdot x + b) - y)\,x_2 \\ \sigma(w \cdot x + b) - y \end{bmatrix} = \begin{bmatrix} (\sigma(0) - 1)\,x_1 \\ (\sigma(0) - 1)\,x_2 \\ \sigma(0) - 1 \end{bmatrix} = \begin{bmatrix} -0.5\,x_1 \\ -0.5\,x_2 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}$$

Now that we have a gradient, we compute the new parameter vector θ1 by moving θ0 in the opposite direction from the gradient:

$$\theta^{1} = \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} - \eta \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} - 0.1 \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.15 \\ 0.1 \\ 0.05 \end{bmatrix}$$
Note that enough negative examples would eventually make w2 negative
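This single step can be reproduced in code. The inputs (x1 = 3, x2 = 2, y = 1, all parameters initially 0, η = 0.1) are the ones stated in the example; the gradient formulas (ŷ − y)·xj and (ŷ − y) are the standard cross-entropy derivatives, and the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [3.0, 2.0]            # x1, x2 from the example
y = 1                     # true label (positive)
w, b = [0.0, 0.0], 0.0    # initial parameters theta^0
eta = 0.1                 # learning rate

y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)   # sigma(0) = 0.5
grad_w = [(y_hat - y) * xi for xi in x]                     # [-1.5, -1.0]
grad_b = y_hat - y                                          # -0.5

# Move against the gradient: theta^1 = theta^0 - eta * gradient
w = [wi - eta * gi for wi, gi in zip(w, grad_w)]
b = b - eta * grad_b

print([round(wi, 2) for wi in w], round(b, 2))  # [0.15, 0.1] 0.05
```

After this one positive example w2 is positive even though x2 counts negative lexicon words, which is why further negative examples are needed to pull it below zero.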

Mini-batch training
Stochastic gradient descent chooses a single random example at a time.
That can result in choppy movements
More common to compute gradient over batches of training instances.
Batch training: entire dataset
Mini-batch training: m examples (512, or 1024)
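A mini-batch gradient can be sketched as the average of the per-example gradients over the batch. The toy data, the batch size of 2, and the function name `minibatch_gradient` are assumptions for illustration; real batches would be far larger, as noted above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_gradient(w, b, batch):
    """Average the per-example cross-entropy gradients over a batch."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in batch:
        error = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        for j, xj in enumerate(x):
            grad_w[j] += error * xj
        grad_b += error
    m = len(batch)
    return [g / m for g in grad_w], grad_b / m

# Made-up toy data: (features, label) pairs.
data = [([3.0, 2.0], 1), ([1.0, 4.0], 0), ([2.0, 0.0], 1), ([0.0, 3.0], 0)]
random.seed(0)
batch = random.sample(data, 2)   # a mini-batch of m = 2 examples
grad_w, grad_b = minibatch_gradient([0.0, 0.0], 0.0, batch)
```

Passing the whole of `data` as the batch gives batch training, and a batch of one example gives plain stochastic gradient descent.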