6.867 Machine learning, lecture 4 (Jaakkola)
Figure 1: a) The hinge loss $(1 - z)^+$ as a function of $z$. b) The logistic loss $\log[1 + \exp(-z)]$ as a function of $z$.
To turn the relaxed optimization problem into a regularization problem we define a loss function that corresponds to individually optimized $\xi_t$ values and specifies the cost of violating each of the margin constraints. We are effectively solving the optimization problem with respect to the $\xi$ values for a fixed $\theta$ and $\theta_0$. This will lead to an expression of $C \sum_t \xi_t$ as a function of $\theta$ and $\theta_0$.
The loss function we need for this purpose is based on the hinge loss $\text{Loss}_h(z)$, defined as the positive part of $1 - z$ and written $(1 - z)^+$ (see Figure 1a). The relaxed optimization problem can be written using the hinge loss as
$$\text{minimize} \quad \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{n} \underbrace{\left(1 - y_t(\theta^T x_t + \theta_0)\right)^+}_{=\,\hat{\xi}_t} \tag{3}$$
Here $\|\theta\|^2/2$, the inverse squared geometric margin, is viewed as a regularization penalty that helps stabilize the objective

$$C \sum_{t=1}^{n} \left(1 - y_t(\theta^T x_t + \theta_0)\right)^+ \tag{4}$$
In other words, when no margin constraints are violated (zero loss), the regularization
penalty helps us select the solution with the largest geometric margin.
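To make the objective in Eq. (3) concrete, here is a minimal numpy sketch (ours, not part of the original notes; the names `hinge_loss` and `svm_objective` are illustrative) that evaluates the relaxed objective for given parameters:

```python
import numpy as np

def hinge_loss(z):
    # (1 - z)^+ : zero when the margin z = y_t (theta^T x_t + theta0)
    # is at least 1, and linear in the violation otherwise
    return np.maximum(0.0, 1.0 - z)

def svm_objective(theta, theta0, X, y, C):
    # Eq. (3): ||theta||^2 / 2 + C * sum_t (1 - y_t (theta^T x_t + theta0))^+
    margins = y * (X @ theta + theta0)
    return 0.5 * np.dot(theta, theta) + C * hinge_loss(margins).sum()
```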
Figure 2: The logistic function $g(z) = (1 + \exp(-z))^{-1}$.
Another way of dealing with noisy labels in linear classification is to model how the noisy labels are generated. For example, human-assigned labels tend to be very good for "typical examples" but exhibit some variation in more difficult cases. One simple model of noisy labels in linear classification is the logistic regression model. In this model we assign a
probability distribution over the two labels in such a way that the labels for examples
further away from the decision boundary are more likely to be correct. More precisely, we
say that
$$P(y = 1 \mid x, \theta, \theta_0) = g(\theta^T x + \theta_0) \tag{5}$$
where $g(z) = (1 + \exp(-z))^{-1}$ is known as the logistic function (Figure 2). One way to derive the form of the logistic function is to say that the log-odds of the predicted class probabilities should be a linear function of the inputs:
$$\log \frac{P(y = 1 \mid x, \theta, \theta_0)}{P(y = -1 \mid x, \theta, \theta_0)} = \theta^T x + \theta_0 \tag{6}$$
So, for example, when we predict the same probability ($1/2$) for both classes, the log-odds term is zero and we recover the decision boundary $\theta^T x + \theta_0 = 0$. The precise functional form of the logistic function, or, equivalently, the fact that we chose to model the log-odds with a linear prediction, may seem a little arbitrary (but perhaps not more so than the hinge loss used with the SVM classifier). We will derive the form of the logistic function later on in the course based on certain assumptions about the class-conditional distributions $P(x \mid y = 1)$ and $P(x \mid y = -1)$.
In order to better compare the logistic regression model with the SVM we will write the conditional probability $P(y \mid x, \theta, \theta_0)$ a bit more succinctly. Specifically, since $1 - g(z) = g(-z)$, we get

$$P(y = -1 \mid x, \theta, \theta_0) = 1 - g(\theta^T x + \theta_0) = g\big(-(\theta^T x + \theta_0)\big) \tag{7}$$

and therefore

$$P(y \mid x, \theta, \theta_0) = g\big(y(\theta^T x + \theta_0)\big) \tag{8}$$
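As a quick illustration of Eqs. (5)-(8), a small sketch (ours; `label_probability` is an illustrative name) of the resulting predictive distribution:

```python
import numpy as np

def g(z):
    # logistic function g(z) = 1 / (1 + exp(-z)); note 1 - g(z) = g(-z)
    return 1.0 / (1.0 + np.exp(-z))

def label_probability(y, x, theta, theta0):
    # Eq. (8): P(y | x, theta, theta0) = g(y (theta^T x + theta0)), y in {+1, -1}
    return g(y * (np.dot(theta, x) + theta0))
```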
So now we have a linear classifier that makes probabilistic predictions about the labels.
How should we train such models? A sensible criterion would seem to be to maximize the probability that we predict the correct label in response to each example. Assuming each example is labeled independently of the others, this probability of assigning correct labels to the examples is given by the product
$$L(\theta, \theta_0) = \prod_{t=1}^{n} P(y_t \mid x_t, \theta, \theta_0) \tag{9}$$
Maximizing this likelihood is equivalent to minimizing the negative log-likelihood, which decomposes into a sum of per-example log-losses:

$$-l(\theta, \theta_0) = \sum_{t=1}^{n} \underbrace{-\log P(y_t \mid x_t, \theta, \theta_0)}_{\text{log-loss}} \tag{11}$$

$$= \sum_{t=1}^{n} -\log g\big(y_t(\theta^T x_t + \theta_0)\big) \tag{12}$$

$$= \sum_{t=1}^{n} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] \tag{13}$$
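Eq. (13) translates directly into code; a sketch (ours) using `np.logaddexp`, which computes $\log(e^a + e^b)$ and avoids overflow when the margins are large and negative:

```python
import numpy as np

def neg_log_likelihood(theta, theta0, X, y):
    # Eq. (13): sum_t log(1 + exp(-y_t (theta^T x_t + theta0)))
    margins = y * (X @ theta + theta0)
    return np.logaddexp(0.0, -margins).sum()
```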
We can interpret this similarly to the sum of the hinge losses in the SVM approach. As before, we have a base loss function, here $\log[1 + \exp(-z)]$ (Figure 1b), similar to the hinge loss (Figure 1a), and the loss depends only on the value of the "margin" $y_t(\theta^T x_t + \theta_0)$ for each example. The difference here is that we have a clear probabilistic interpretation of the "strength" of the prediction, i.e., how high $P(y_t \mid x_t, \theta, \theta_0)$ is for any particular example.
Having a probabilistic interpretation does not, however, mean that the probability values are in any way sensible or calibrated. Predicted probabilities are calibrated when they correspond to observed frequencies. So, for example, if we group together all the examples for which we predict a positive label with probability 0.5, then roughly half of them should be labeled $+1$. Probability estimates are rarely well-calibrated but can nevertheless be useful.

¹An estimator is a function that maps data to parameter values. An estimate is the value obtained in response to specific data.
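The binning argument above suggests a simple empirical check; here is a hypothetical sketch (ours) that groups predicted probabilities into bins and compares each bin's average prediction with the observed fraction of $+1$ labels:

```python
import numpy as np

def calibration_table(p_pos, labels, n_bins=10):
    # p_pos[t] is the predicted P(y = +1 | x_t); labels are in {+1, -1}.
    # For calibrated probabilities the two entries of each row roughly agree.
    bins = np.clip((p_pos * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((p_pos[mask].mean(), (labels[mask] == 1).mean()))
    return rows
```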
The minimization problem we have defined above is convex, and a number of optimization methods are available for finding the minimizing $\hat{\theta}$ and $\hat{\theta}_0$, including simple gradient descent. In a simple (stochastic) gradient descent, we would modify the parameters in response to each term in the sum (based on each training example). To specify the updates we need the following derivatives
$$\frac{d}{d\theta_0} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] = -y_t\, \frac{\exp\big(-y_t(\theta^T x_t + \theta_0)\big)}{1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)} \tag{14}$$

$$= -y_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] \tag{15}$$
and

$$\frac{d}{d\theta} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] = -y_t x_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] \tag{16}$$
The parameters are then updated by selecting training examples at random and moving the parameters in the opposite direction of the derivatives:

$$\theta_0 \leftarrow \theta_0 + \eta\, y_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] \tag{17}$$

$$\theta \leftarrow \theta + \eta\, y_t x_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] \tag{18}$$

where $\eta$ is a small (positive) learning rate. Note that $P(y_t \mid x_t, \theta, \theta_0)$ is the probability that
we predict the training label correctly and $[1 - P(y_t \mid x_t, \theta, \theta_0)]$ is the probability of making a mistake. The stochastic gradient descent updates in the logistic regression context therefore strongly resemble the perceptron's mistake-driven updates. The key difference here is that the updates are graded, made in proportion to the probability of making a mistake.
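A minimal implementation of these updates (ours; the hyper-parameters `eta`, `epochs`, and `seed` are illustrative defaults, not values from the lecture):

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, epochs=100, seed=0):
    # Stochastic gradient descent implementing the updates above: each step
    # moves the parameters in proportion to the probability of a mistake.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(epochs):
        for t in rng.permutation(n):
            p_correct = 1.0 / (1.0 + np.exp(-y[t] * (X[t] @ theta + theta0)))
            step = eta * y[t] * (1.0 - p_correct)   # graded, mistake-driven
            theta += step * X[t]
            theta0 += step
    return theta, theta0
```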
The stochastic gradient descent algorithm leads to no significant change on average when
the gradient of the full objective equals zero. Setting the gradient to zero is also a necessary
condition of optimality:
$$\frac{d}{d\theta_0}\big(-l(\theta, \theta_0)\big) = -\sum_{t=1}^{n} y_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] = 0 \tag{19}$$

$$\frac{d}{d\theta}\big(-l(\theta, \theta_0)\big) = -\sum_{t=1}^{n} y_t x_t\,\big[1 - P(y_t \mid x_t, \theta, \theta_0)\big] = 0 \tag{20}$$
The sum in Eq. (19) is the difference between the mistake probabilities associated with positively and negatively labeled examples. The optimality of $\theta_0$ therefore ensures that the mistakes are balanced in this (soft) sense. Another way of understanding this is that the vector of mistake probabilities is orthogonal to the vector of labels. Similarly, the optimal setting of $\theta$ is characterized by mistake probabilities that are orthogonal to all rows of the label-example matrix $\tilde{X} = [y_1 x_1, \ldots, y_n x_n]$. In other words, for each dimension $j$ of the example vectors, $[y_1 x_{1j}, \ldots, y_n x_{nj}]$ is orthogonal to the mistake probabilities. Taken together, these orthogonality conditions ensure that there is no further linearly available information in the examples to improve the predicted probabilities (or mistake probabilities). This is perhaps a bit easier to see if we first map the $\pm 1$ labels into $0/1$ labels, $\tilde{y}_t = (1 + y_t)/2$, so that $\tilde{y}_t \in \{0, 1\}$. Then the above optimality conditions can be rewritten in terms of the prediction errors $[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)]$ rather than mistake probabilities as
$$\sum_{t=1}^{n} \big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] = 0 \tag{21}$$

$$\sum_{t=1}^{n} x_t\,\big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] = 0 \tag{22}$$
and, consequently, for any parameters $\theta'$ and $\theta_0'$,

$$\theta_0' \sum_{t=1}^{n} \big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] + \theta'^T \sum_{t=1}^{n} x_t\,\big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] \tag{23}$$

$$= \sum_{t=1}^{n} \big(\theta'^T x_t + \theta_0'\big)\big[\tilde{y}_t - P(y = 1 \mid x_t, \theta, \theta_0)\big] = 0 \tag{24}$$
meaning that the prediction errors are orthogonal to any linear function of the inputs.
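These optimality conditions are easy to verify numerically; a sketch (ours) that returns the residuals of Eqs. (21) and (22), both of which should vanish at the unregularized optimum:

```python
import numpy as np

def optimality_residuals(theta, theta0, X, y):
    # Prediction errors tilde{y}_t - P(y = 1 | x_t, theta, theta0);
    # at the optimum they sum to zero (Eq. 21) and are orthogonal to
    # every input dimension (Eq. 22).
    y01 = (1 + y) / 2                       # map +/-1 labels to 0/1
    p_pos = 1.0 / (1.0 + np.exp(-(X @ theta + theta0)))
    errors = y01 - p_pos
    return errors.sum(), X.T @ errors
```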
Let's briefly consider the type of predictions we could obtain via maximum likelihood estimation of the logistic regression model. Suppose the training examples are linearly separable. In this case we can find parameter values such that $y_t(\theta^T x_t + \theta_0)$ is positive for all training examples. By scaling up the parameters, we make these values larger and larger. This is beneficial as far as the likelihood is concerned, since the log of the logistic function is strictly increasing as a function of $y_t(\theta^T x_t + \theta_0)$ (the loss $\log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big]$ is strictly decreasing). As a result, the maximum likelihood parameter values would become unbounded: infinitely scaling up any parameters corresponding to a perfect linear classifier would attain the highest likelihood (likelihood exactly one, or loss exactly zero). The resulting probability values, predicting each training label correctly with probability one, are hardly accurate in the sense of reflecting
our uncertainty about what the labels might be. So, when the number of training examples is small, we would need to add the regularizer $\|\theta\|^2/2$, just as in the SVM model. The regularizer helps select reasonable parameters when the available training data fails to sufficiently constrain the linear classifier.
To estimate the parameters of the logistic regression model with regularization we would
minimize instead
$$\frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{n} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] \tag{25}$$
where the constant $C$ again specifies the trade-off between correct classification (the objective) and the regularization penalty. The regularization problem is typically written (equivalently) as
$$\frac{\lambda}{2}\|\theta\|^2 + \sum_{t=1}^{n} \log\big[1 + \exp\big(-y_t(\theta^T x_t + \theta_0)\big)\big] \tag{26}$$
since it seems more natural to vary the strength of regularization with $\lambda$ while keeping the objective the same (dividing (25) through by $C$ shows the two forms have the same minimizers with $\lambda = 1/C$).
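For completeness, a sketch (ours) of the regularized objective in the form of Eq. (26); as in the SVM case, the bias term $\theta_0$ is left out of the penalty:

```python
import numpy as np

def regularized_objective(theta, theta0, X, y, lam):
    # Eq. (26): (lambda / 2) ||theta||^2
    #           + sum_t log(1 + exp(-y_t (theta^T x_t + theta0)))
    margins = y * (X @ theta + theta0)
    return 0.5 * lam * (theta @ theta) + np.logaddexp(0.0, -margins).sum()
```

This objective is convex in $(\theta, \theta_0)$, so any standard gradient-based optimizer can be used to minimize it.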
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].