0% found this document useful (0 votes)

7 views79 pages

Ch03 LogisticRegression

Uploaded by

Bashar Almuntaser

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views79 pages

Ch03 LogisticRegression

Uploaded by

Bashar Almuntaser

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 79

Background: Generative and

Logistic Discriminative Classifiers

Regressi
on

1
Logistic Regression
 Important analytic tool in natural and
social sciences
 Baseline supervised machine learning tool
for classification
 It is also the foundation of neural
networks

2
Generative and
Discriminative Classifiers
Naive Bayes is a generative classifier

by contrast:

Logistic regression is a discriminative

classifier
3
Generative and
Discriminative Classifiers
Suppose we're distinguishing cat from dog images

imagenet imagenet 4
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?

Also build a model for dog images

Now given a new image:

Run both models and see which one fits better 5
Discriminative Classifier
Just try to distinguish dogs from cats

Oh look, dogs have collars!

Let's ignore everything else 6
Finding the correct class c from a document d in
Generative vs Discriminative Classifiers

Naive Bayes

Logistic Regression
posterior
P(c|d)
7
Components of a probabilistic machine learning
classifier
Given m input/output pairs (x(i),y(i)):

1. A feature representation of the input. For each input

observation x(i), a vector of features [x1, x2, ... , xn]. Feature j
for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes , the estimated class,
via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic
gradient descent.
8
The two phases of logistic
regression

Training: we learn weights w and b using stochastic

gradient descent and cross-entropy loss.

Test: Given a test example x we compute p(y|x)

using learned weights w and b, and return
whichever label (y = 1 or y = 0) is higher probability

9
Classification in Logistic Regression
Logistic
Regressi
on

10
Classification Reminder

Positive/negative sentiment

Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton

11
Text Classification: definition

Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}

Output: a predicted class  C

12
Binary Classification in Logistic Regression

Given a series of input/output pairs:

◦ (x(i), y(i))
For each observation x(i)
◦ We represent x(i) by a feature vector [x1, x2,…, xn]
◦ We compute an output: a predicted class (i)  {0,1}

13
Features in logistic regression
• For feature xi, weight wi tells is how important is xi
• xi ="review contains ‘awesome’": wi = +10
• xj ="review contains ‘abysmal’": wj = -10
• xk =“review contains ‘mediocre’": wk = -2

14
Logistic Regression for one
observation x

Input observation: vector x = [x1, x2,…, xn]

Weights: one per feature: W = [w1, w2,…, wn]
◦ Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class  {0,1}
(multinomial logistic regression:  {0, 1, 2, 3, 4})

15
How to do classification
For each feature xi, weight wi tells us importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias

If this sum is high, we say y=1; if low, then y=0

16
But we want a probabilistic
classifier

We need to formalize “sum is high”.

We’d like a principled classifier that gives us a
probability
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)

17
The problem: z isn't a probability, it's just a
number!

Solution: use a function of z that goes from 0 to 1

18
The very useful sigmoid or logistic function

19
Idea of logistic regression

We’ll compute w∙x+b

And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability

20
Making probabilities with sigmoids

21
Turning a probability into a classifier

0.5 here is called the decision boundary

22
The probabilistic classifier

P(y=1)

wx + b 23
Turning a probability into a classifier

if w∙x+b > 0
if w∙x+b ≤ 0

24
Logistic Regression: a text example
Logistic on sentiment classification
Regressi
on

25
Sentiment example: does y=1
or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .

26
27
Classifying sentiment for input x

Suppose w =
b = 0.1 28
Classifying sentiment for input x

29
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1= count( "awesome")
◦ x2 = log(number of words in review)
◦ A vector w of weights [w1, w2, …, wn]
◦ w for each feature f
i i

30
Logistic Regression: a text example
Logistic on sentiment classification
Regressi
on

31
Learning: Cross-Entropy Loss
Logistic
Regressi
on

32
Wait, where did the W’s come
from?

Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate,
We want to set w and b to minimize the distance between our
estimate (i) and the true y(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss. 33
Learning components

A loss function:
◦ cross-entropy loss

An optimization algorithm:
◦ stochastic gradient descent

34
The distance between and y

We want to know how far is the classifier output:

= σ(w∙x+b)

from the true output:

y [= either 0 or 1]

We'll call this difference:

L(,y) = how much differs from the true y
35
Intuition of negative log likelihood loss = cross-
entropy loss

A case of conditional maximum likelihood

estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x

36
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Since there are only 2 discrete outcomes (0 or 1) we can
express the probability p(y|x) from our classifier (the
thing we want to maximize) as

noting:
if y=1, this simplifies to
if y=0, this simplifies to 1-
37
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Maximize:
Now take the log of both sides (mathematically handy)
Maximize:

Whatever values maximize log p(y|x) will also maximize

p(y|x)

38
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Maximize:

Now flip sign to turn this into a loss: something to minimize

Cross-entropy loss (because is formula for cross-entropy(y, ))
Minimize:
Or, plugging in definition of
Let's see if this works for our sentiment example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
Let's first suppose the true label of this is y=1 (positive)

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another
nice touch is the music . I was overcome with the urge to get off the couch
and start dancing . It sucked me in , and it'll do the same to you .

40
Let's see if this works for our sentiment example

True value is y=1. How well is our model doing?

Pretty well! What's the loss?

41
Let's see if this works for our sentiment example
Suppose true value instead was y=0.

What's the loss?

42
Let's see if this works for our sentiment example
The loss when model was right (if true y=1)

Is lower than the loss when model was wrong (if true y=0):

Sure enough, loss was bigger when model was wrong! 43

Cross-Entropy Loss
Logistic
Regressi
on

44
Stochastic Gradient Descent
Logistic
Regressi
on

45
Our goal: minimize the loss

by weights 𝛳=(w,b)
Let's make explicit that the loss function is parameterized

• And we’ll represent as f (x; θ ) to make the

dependence on θ more obvious
We want the weights that minimize the loss, averaged
over all examples:

46
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)

47
Let's first visualize for a single
scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function

48
Let's first visualize for a single
scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function

So we'll move positive

49
Let's first visualize for a single
scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function

So we'll move positive

50
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.

Gradient Descent: Find the gradient of the loss

function at the current point and move in the
opposite direction.

51
How much do we move in that direction ?

• The value of the gradient (slope in our example)

weighted by a learning rate η
• Higher learning rate means move w faster

52
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.

53
Imagine 2 dimensions, w and b

Visualizing the
gradient vector at
the red point
It has two
dimensions shown
in the x-y plane

54
The gradient
We’ll represent as f (x; θ ) to make the dependence on θ more
obvious:

The final equation for updating θ based on the gradient is thus

55
What are these partial derivatives for logistic regression?

The loss function

The elegant derivative of this function

56
57
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by algorithm from
supervision (like regular parameters), they are
chosen by algorithm designer.
58
Stochastic Gradient Descent
Logistic
Regressi
on

59
Stochastic Gradient Descent:
Logistic An example and more details
Regressi
on

60
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
61
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2

where
Gradient vector has 3 dimensions:

62
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2

where
Gradient vector has 3 dimensions:

63
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2

where
Gradient vector has 3 dimensions:

64
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2

where
Gradient vector has 3 dimensions:

65
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2

where
Gradient vector:

66
Example of gradient descent

Now that we have a gradient, we compute the new parameter vector

θ1 by moving θ0 in the opposite direction from the gradient:
η = 0.1;

67
Example of gradient descent

Now that we have a gradient, we compute the new parameter vector

θ1 by moving θ0 in the opposite direction from the gradient:
η = 0.1;

68
Example of gradient descent

Now that we have a gradient, we compute the new parameter vector

θ1 by moving θ0 in the opposite direction from the gradient:
η = 0.1;

69
Example of gradient descent

Now that we have a gradient, we compute the new parameter vector

θ1 by moving θ0 in the opposite direction from the gradient:
η = 0.1;

Note that enough negative examples would eventually make w2 negative 70

Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
More common to compute gradient over batches of
training instances.
Batch training: entire dataset
Mini-batch training: m examples
71
Stochastic Gradient Descent:
Logistic An example and more details
Regressi
on

72
Regularization
Logistic
Regressi
on

73
Overfitting
A model that perfectly match the training data has a
problem.
It will also overfit to the data, modeling noise
◦ A random word that perfectly predicts y (it happens to
only occur in one class) will get a very high weight.
◦ Failing to generalize to a test set without this word.
A good model should be able to generalize

74
Overfitting Useful or harmless features

+ X1 = "this"
X2 = "movie
This movie drew me in, and it'll X3 = "hated"
do the same to you. X4 = "drew me in"

-
I can't tell you how much I
4gram features that just
"memorize" training set and
might cause problems
hated this movie. It sucked. X5 = "the same to you"
X7 = "tell you how much"
75
Overfitting
4-gram model on tiny data will just memorize the data
◦ 100% accuracy on the training set
But it will be surprised by the novel 4-grams in the test data
◦ Low accuracy on test set
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the
model doesn't generalize well to the test set
◦ How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
76
Regularization
A solution for overfitting
Add a regularization term R(θ) to the loss function (for
now written as maximizing logprob rather than minimizing loss)

Idea: choose an R(θ) that penalizes large weights

◦ fitting the data well with lots of big weights not as good as
fitting the data a little less well, with small weights
77
L2 Regularization (= ridge
regression)
The sum of the squares of the weights
The name is because this is the (square of the)
L2 norm ||θ||2, = Euclidean distance of θ to the origin.

L2 regularized objective function:

78
L1 Regularization (= lasso
regression)
The sum of the (absolute value of the) weights
Named after the L1 norm ||W||1, = sum of the absolute
values of the weights, = Manhattan distance

L1 regularized objective function:

EC2303 Additional Practice Questions
No ratings yet
EC2303 Additional Practice Questions
7 pages
The Unscrambler Method References
No ratings yet
The Unscrambler Method References
45 pages
Logisticregression 2021
No ratings yet
Logisticregression 2021
78 pages
Multimedia Application L9
No ratings yet
Multimedia Application L9
43 pages
Logistic Regression
No ratings yet
Logistic Regression
78 pages
Text Classification Using Logistics Regression
No ratings yet
Text Classification Using Logistics Regression
64 pages
5 LR Apr 7 2021
No ratings yet
5 LR Apr 7 2021
94 pages
5 LR Apr 7 2021
No ratings yet
5 LR Apr 7 2021
93 pages
Logistic Regression
No ratings yet
Logistic Regression
51 pages
Lecture 5 - Logistic Regression
No ratings yet
Lecture 5 - Logistic Regression
28 pages
CS229 Supplemental Lecture Notes: 1 Binary Classification
No ratings yet
CS229 Supplemental Lecture Notes: 1 Binary Classification
7 pages
L3 Cse256 Fa24 FFN
No ratings yet
L3 Cse256 Fa24 FFN
64 pages
Binary Classification and Logistic Regression
No ratings yet
Binary Classification and Logistic Regression
7 pages
Introduction To Machine Learning: 2 Linear Classifiers
No ratings yet
Introduction To Machine Learning: 2 Linear Classifiers
4 pages
Generalized Linear Model
No ratings yet
Generalized Linear Model
67 pages
Week 7
No ratings yet
Week 7
53 pages
3-LG Eval
No ratings yet
3-LG Eval
52 pages
Lec12 Logreg
No ratings yet
Lec12 Logreg
41 pages
7 Logistic-Regression
No ratings yet
7 Logistic-Regression
63 pages
Week 4 Logistic
No ratings yet
Week 4 Logistic
21 pages
Lecture 1, Part 3: Training A Classifier: Roger Grosse
No ratings yet
Lecture 1, Part 3: Training A Classifier: Roger Grosse
11 pages
Lec 4
No ratings yet
Lec 4
24 pages
Deep Learning Week 204-4
No ratings yet
Deep Learning Week 204-4
1 page
Neural Network
No ratings yet
Neural Network
14 pages
04 - Linear-Classification-2024
No ratings yet
04 - Linear-Classification-2024
65 pages
Logistic Regression: Some Slides Adapted From Dan Jurfasky and Brendan O'Connor
No ratings yet
Logistic Regression: Some Slides Adapted From Dan Jurfasky and Brendan O'Connor
53 pages
Lec 02 LogisticReg
No ratings yet
Lec 02 LogisticReg
33 pages
W2 Ann
No ratings yet
W2 Ann
12 pages
Logistic Regression: "And How Do You Know That These Fine Begonias Are Not of Equal Importance?"
No ratings yet
Logistic Regression: "And How Do You Know That These Fine Begonias Are Not of Equal Importance?"
25 pages
Logistic Regression
No ratings yet
Logistic Regression
8 pages
4.logistic Regression
No ratings yet
4.logistic Regression
16 pages
04 LogisticRegression
No ratings yet
04 LogisticRegression
29 pages
Logistic Regression - Byimran
No ratings yet
Logistic Regression - Byimran
35 pages
09 23ECE216 LogisticRegression
No ratings yet
09 23ECE216 LogisticRegression
40 pages
Logistic Regression: "And How Do You Know That These Fine Begonias Are Not of Equal Importance?"
No ratings yet
Logistic Regression: "And How Do You Know That These Fine Begonias Are Not of Equal Importance?"
21 pages
23 LogisticRegression
No ratings yet
23 LogisticRegression
67 pages
06 LogisticRegression
No ratings yet
06 LogisticRegression
29 pages
Logistic Regression: Adapted From: Tom Mitchell's Machine Learning Book Evan Wei Xiang and Qiang Yang
No ratings yet
Logistic Regression: Adapted From: Tom Mitchell's Machine Learning Book Evan Wei Xiang and Qiang Yang
15 pages
Lecture 0.3 - Linear Classifiers, Logistic Regression, Multiclass Classification
No ratings yet
Lecture 0.3 - Linear Classifiers, Logistic Regression, Multiclass Classification
48 pages
M02Logistic Regression Logistic RegressioLogistic Regressionn
No ratings yet
M02Logistic Regression Logistic RegressioLogistic Regressionn
19 pages
Lecture Notes 6 Logistic Regression
No ratings yet
Lecture Notes 6 Logistic Regression
8 pages
Logistic Regression Notes
No ratings yet
Logistic Regression Notes
25 pages
Logistic Regression: "And How Do You Know That These Fine Begonias Are Not of Equal Importance?"
No ratings yet
Logistic Regression: "And How Do You Know That These Fine Begonias Are Not of Equal Importance?"
25 pages
Logistic Regression
No ratings yet
Logistic Regression
26 pages
Machine Learning - Logistic Regression
No ratings yet
Machine Learning - Logistic Regression
16 pages
ML - MU - Unit - 2 - Supervised Learning-Classification Techniques
No ratings yet
ML - MU - Unit - 2 - Supervised Learning-Classification Techniques
153 pages
Ed3book - Jan72023 87 110
No ratings yet
Ed3book - Jan72023 87 110
24 pages
Logistic Regression
No ratings yet
Logistic Regression
34 pages
Logistic Regressions
No ratings yet
Logistic Regressions
11 pages
Week 3 - Lecture Slides - Logistic Regression
No ratings yet
Week 3 - Lecture Slides - Logistic Regression
54 pages
Lecture 8 Logistic Regression
No ratings yet
Lecture 8 Logistic Regression
34 pages
Computation Graph - 1
No ratings yet
Computation Graph - 1
65 pages
Lec18 Logistic Regression
No ratings yet
Lec18 Logistic Regression
17 pages
Machine Learning: Probabilistic View of Linear Regression Logistic Regression Hyperplane Based Classifiers and Perceptron
No ratings yet
Machine Learning: Probabilistic View of Linear Regression Logistic Regression Hyperplane Based Classifiers and Perceptron
67 pages
Lecture 3. Classification
No ratings yet
Lecture 3. Classification
60 pages
Mathematical Foundations of Computational Linguistics: Manfred Klenner and Jannis Vamvas
No ratings yet
Mathematical Foundations of Computational Linguistics: Manfred Klenner and Jannis Vamvas
32 pages
Logistic Regression
No ratings yet
Logistic Regression
10 pages
2021 Logistic Regression
No ratings yet
2021 Logistic Regression
33 pages
Yousef ML Washin Classification
100% (1)
Yousef ML Washin Classification
333 pages
Unit II
100% (1)
Unit II
13 pages
Lec 05
No ratings yet
Lec 05
53 pages
Calculus I Essentials
From Everand
Calculus I Essentials
Editors of REA
1/5 (1)
Sem1 Statistics1-6
No ratings yet
Sem1 Statistics1-6
10 pages
Ch6 Slides Ed3 Feb2021
No ratings yet
Ch6 Slides Ed3 Feb2021
63 pages
Questions 161261
100% (2)
Questions 161261
3 pages
LAS 3.4 Basic Statistics in Experimental Research No Answer Key
No ratings yet
LAS 3.4 Basic Statistics in Experimental Research No Answer Key
6 pages
Econometricstrix Meeting 2020 December
No ratings yet
Econometricstrix Meeting 2020 December
5 pages
DLL Engineering Simple Reliability Calculations: Empirical Failure Rate
No ratings yet
DLL Engineering Simple Reliability Calculations: Empirical Failure Rate
8 pages
Answer Key Variability
No ratings yet
Answer Key Variability
2 pages
Random Forest Algorithm 1
No ratings yet
Random Forest Algorithm 1
14 pages
Math Viva Sem 3
50% (2)
Math Viva Sem 3
21 pages
Answers Chapter 4 in
No ratings yet
Answers Chapter 4 in
5 pages
Nonparametric Statistics
No ratings yet
Nonparametric Statistics
5 pages
Comparingfootballteams
No ratings yet
Comparingfootballteams
5 pages
LSS - Group 3
No ratings yet
LSS - Group 3
10 pages
Cbsnews 20250327 Flying
No ratings yet
Cbsnews 20250327 Flying
8 pages
Geometric Mean
No ratings yet
Geometric Mean
3 pages
WILP ASM Mid-Sem Makeup Solutions
No ratings yet
WILP ASM Mid-Sem Makeup Solutions
4 pages
Factor Analysis
No ratings yet
Factor Analysis
23 pages
Iranian Churn
No ratings yet
Iranian Churn
16 pages
Using R With Multivariate Statistics by Randall E. Schumacker
No ratings yet
Using R With Multivariate Statistics by Randall E. Schumacker
471 pages
BTTM Question Bank
No ratings yet
BTTM Question Bank
2 pages
The Analysis of Economic Data Using Multivariate and Time Series Technique
No ratings yet
The Analysis of Economic Data Using Multivariate and Time Series Technique
5 pages
Stats Exam QnA
No ratings yet
Stats Exam QnA
9 pages
Introduction To Stata Software, MaU, 2022
No ratings yet
Introduction To Stata Software, MaU, 2022
93 pages
Tugas Rista Bria
No ratings yet
Tugas Rista Bria
10 pages
Course Outline
No ratings yet
Course Outline
1 page
General Entry With Prediction of Population Between India and China
No ratings yet
General Entry With Prediction of Population Between India and China
4 pages
The Mann Whitney or Wilcoxon Rank
No ratings yet
The Mann Whitney or Wilcoxon Rank
6 pages
Tugas Statistik
No ratings yet
Tugas Statistik
6 pages

Ch03 LogisticRegression

Uploaded by

Ch03 LogisticRegression

Uploaded by

Background: Generative and

Logistic Discriminative Classifiers

Logistic regression is a discriminative

Also build a model for dog images

Now given a new image:

Oh look, dogs have collars!

1. A feature representation of the input. For each input

Training: we learn weights w and b using stochastic

Test: Given a test example x we compute p(y|x)

Output: a predicted class  C

Given a series of input/output pairs:

Input observation: vector x = [x1, x2,…, xn]

If this sum is high, we say y=1; if low, then y=0

We need to formalize “sum is high”.

Solution: use a function of z that goes from 0 to 1

We’ll compute w∙x+b

0.5 here is called the decision boundary

We want to know how far is the classifier output:

from the true output:

We'll call this difference:

A case of conditional maximum likelihood

Whatever values maximize log p(y|x) will also maximize

Now flip sign to turn this into a loss: something to minimize

True value is y=1. How well is our model doing?

Pretty well! What's the loss?

What's the loss?

Sure enough, loss was bigger when model was wrong! 43

• And we’ll represent as f (x; θ ) to make the

So we'll move positive

So we'll move positive

Gradient Descent: Find the gradient of the loss

• The value of the gradient (slope in our example)

The final equation for updating θ based on the gradient is thus

The loss function

The elegant derivative of this function

Now that we have a gradient, we compute the new parameter vector

Now that we have a gradient, we compute the new parameter vector

Now that we have a gradient, we compute the new parameter vector

Now that we have a gradient, we compute the new parameter vector

Note that enough negative examples would eventually make w2 negative 70

Idea: choose an R(θ) that penalizes large weights

L2 regularized objective function:

L1 regularized objective function:

You might also like