Notes Chapter Logistic Regression
Logistic regression
The loss tells us how unhappy we are about the prediction h(x^{(i)}; Θ) that Θ makes for
(x^{(i)}, y^{(i)}). A common example is the 0-1 loss, introduced in chapter 1:

L_{01}(h(x; \Theta), y) = \begin{cases} 0 & \text{if } y = h(x; \Theta) \\ 1 & \text{otherwise,} \end{cases}
which gives a value of 0 for a correct prediction, and a 1 for an incorrect prediction. In the
case of linear separators, this becomes:
L_{01}(h(x; \theta, \theta_0), y) = \begin{cases} 0 & \text{if } y(\theta^T x + \theta_0) > 0 \\ 1 & \text{otherwise.} \end{cases}
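As a quick sketch of the linear-separator case (the function name is ours, not from the notes), the 0-1 loss can be written directly from the definition:

```python
import numpy as np

def zero_one_loss(x, y, theta, theta_0):
    """0-1 loss for a linear separator: 0 if the sign of
    theta^T x + theta_0 agrees with the label y in {+1, -1}, else 1."""
    return 0 if y * (np.dot(theta, x) + theta_0) > 0 else 1

# A point on the positive side of the separator, correctly labeled:
print(zero_one_loss(np.array([1.0, 2.0]), +1, np.array([1.0, 0.0]), 0.0))  # 0
# The same point with the wrong label:
print(zero_one_loss(np.array([1.0, 2.0]), -1, np.array([1.0, 0.0]), 0.0))  # 1
```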
MIT 6.036 Fall 2019 28
2 Regularization
If all we cared about was finding a hypothesis with small loss on the training data, we
would have no need for regularization, and could simply omit the second term in the
objective. But remember that our ultimate goal is to perform well on input values that we
haven’t trained on! It may seem that this is an impossible task, but humans and machine-
learning methods do this successfully all the time. What allows generalization to new input
values is a belief that there is an underlying regularity that governs both the training and
testing data. We have already discussed one way to describe an assumption about such
a regularity, which is by choosing a limited class of possible hypotheses. Another way to
do this is to provide smoother guidance, saying that, within a hypothesis class, we prefer
some hypotheses to others. The regularizer articulates this preference and the constant λ
says how much we are willing to trade off loss on the training data versus preference over
hypotheses.
This trade-off is illustrated in the figure below. Hypothesis h1 has 0 training loss, but is
very complicated. Hypothesis h2 mis-classifies two points, but is very simple. In the absence
of other beliefs about the solution, it is often better to prefer that the solution be "simpler,"
and so we might prefer h2 over h1, expecting it to perform better on future examples drawn
from this same distribution. (To establish some vocabulary, we say that h1 is overfit to the
training data.) Another nice way of thinking about regularization is that we would like to
prevent our hypothesis from being too dependent on the particular training data that we
were given: we would like for it to be the case that if the training data were changed
slightly, the hypothesis would not change by much.
[Figure: two separators on the same training data — a complicated h1 with zero training loss, and a simple h2 that misclassifies two points.]
when we have some idea in advance that θ ought to be near some value Θ_prior. In the
absence of such knowledge, a default is to regularize toward zero:

R(\Theta) = \|\Theta\|^2.

(Learn about Bayesian methods in machine learning to see the theory behind this and cool
results!)
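As a minimal sketch (the helper name is ours), the squared-norm regularizer is just the dot product of the parameter vector with itself:

```python
import numpy as np

def squared_norm_regularizer(theta):
    """R(theta) = ||theta||^2, penalizing hypotheses with large parameters."""
    return float(np.dot(theta, theta))

print(squared_norm_regularizer(np.array([3.0, 4.0])))  # 25.0
```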
This problem is NP-hard, which probably implies that solving the most difficult instances
of this problem would require computation time exponential in the number of training
examples, n. (The "probably" here is not because we're too lazy to look it up, but actually
because of a fundamental unsolved problem in computer-science theory, known as "P vs. NP.")
What makes this a difficult optimization problem is its lack of "smoothness":
• There can be two hypotheses, (θ, θ_0) and (θ′, θ′_0), where one is closer in parameter
space to the optimal parameter values (θ*, θ*_0), but they make the same number of
misclassifications and so they have the same J value.
• All predictions are categorical: the classifier can’t express a degree of certainty about
whether a particular input x should have an associated value y.
For these reasons, if we are considering a hypothesis θ, θ0 that makes five incorrect predic-
tions, it is difficult to see how we might change θ, θ0 so that it will perform better, which
makes it difficult to design an algorithm that searches through the space of hypotheses for
a good one.
For these reasons, we are going to investigate a new hypothesis class: linear logistic
classifiers. These hypotheses are still parameterized by a d-dimensional vector θ and a
scalar θ0 , but instead of making predictions in {+1, −1}, they generate real-valued outputs
in the interval (0, 1). A linear logistic classifier has the form
h(x; \theta, \theta_0) = \sigma(\theta^T x + \theta_0).
This looks familiar! What’s new?
The logistic function, also known as the sigmoid function, is defined as
\sigma(z) = \frac{1}{1 + e^{-z}},
and plotted below, as a function of its input z. Its output can be interpreted as a probability,
because for any value of z the output is in (0, 1).
[Plot of σ(z) as a function of z, for z from −4 to 4.]
Study Question: Convince yourself the output of σ is always in the interval (0, 1).
Why can’t it equal 0 or equal 1? For what value of z does σ(z) = 0.5?
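A direct sketch of the sigmoid in Python (the function name is ours, not from the notes; numpy is used for convenience):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z}); the output is always strictly in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5
print(sigmoid(4.0))  # approximately 0.982, close to but never equal to 1
```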
What does a linear logistic classifier (LLC) look like? Let’s consider the simple case
where d = 1, so our input points simply lie along the x axis. The plot below shows LLCs
for three different parameter settings: σ(10x + 1), σ(−2x + 1), and σ(2x − 3).
[Plot of the three LLCs σ(10x + 1), σ(−2x + 1), and σ(2x − 3), as functions of x from −4 to 4.]
Study Question: Which plot is which? What governs the steepness of the curve?
What governs the x value where the output is equal to 0.5?
But wait! Remember that the definition of a classifier from chapter 2 is that it’s a map-
ping from Rd → {−1, +1} or to some other discrete set. So, then, it seems like an LLC is
actually not a classifier!
Given an LLC, with an output value in (0, 1), what should we do if we are forced to
make a prediction in {+1, −1}? A default answer is to predict +1 if σ(θT x + θ0 ) > 0.5 and
−1 otherwise. The value 0.5 is sometimes called a prediction threshold.
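This default rule is easy to sketch in code (the function name is ours; the example uses σ(2x − 3), one of the parameter settings from the figure above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def llc_predict(x, theta, theta_0, threshold=0.5):
    """Turn the LLC's real-valued output in (0, 1) into a hard
    prediction in {+1, -1} using a prediction threshold."""
    return +1 if sigmoid(np.dot(theta, x) + theta_0) > threshold else -1

# sigma(2*2 - 3) = sigma(1) is about 0.73 > 0.5, so predict +1:
print(llc_predict(np.array([2.0]), np.array([2.0]), -3.0))  # 1
# sigma(2*0 - 3) = sigma(-3) is about 0.05 < 0.5, so predict -1:
print(llc_predict(np.array([0.0]), np.array([2.0]), -3.0))  # -1
```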
In fact, for different problem settings, we might prefer to pick a different prediction
threshold. The field of decision theory considers how to make this choice from the perspec-
tive of Bayesian reasoning. For example, if the consequences of predicting +1 when the
answer should be −1 are much worse than the consequences of predicting −1 when the
answer should be +1, then we might set the prediction threshold to be greater than 0.5.
Study Question: Using a prediction threshold of 0.5, for what values of x do each of
the LLCs shown in the figure above predict +1?
When d = 2, then our inputs x lie in a two-dimensional space with axes x1 and x2 , and
the output of the LLC is a surface, as shown below, for θ = (1, 1), θ0 = 2.
Study Question: Convince yourself that the set of points for which σ(θT x + θ0 ) =
0.5, that is, the separator between positive and negative predictions with prediction
threshold 0.5 is a line in (x1 , x2 ) space. What particular line is it for the case in the
figure above? How would the plot change for θ = (1, 1), but now with θ0 = −2? For
θ = (−1, −1), θ0 = 2?
under the assumption that our predictions are independent. This can be cleverly rewritten,
when y^{(i)} ∈ {0, 1}, as

\prod_{i=1}^{n} \left(g^{(i)}\right)^{y^{(i)}} \left(1 - g^{(i)}\right)^{1 - y^{(i)}}.
Study Question: Be sure you can see why these two expressions are the same.
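One way to check the equivalence numerically (the label and guess values below are made-up example data):

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # example labels y^{(i)} in {0, 1}
g = np.array([0.9, 0.2, 0.7, 0.6])  # example guesses g^{(i)} in (0, 1)

# Product form of the likelihood:
likelihood = np.prod(g**y * (1 - g)**(1 - y))
# Sum-of-logs form:
log_likelihood = np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

# The log of the product matches the sum of the logs:
print(np.isclose(np.log(likelihood), log_likelihood))  # True
```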
Now, because products are kind of hard to deal with, and because the log function is
monotonic, the θ, θ0 that maximize the log of this quantity will be the same as the θ, θ0 that
maximize the original, so we can try to maximize
\sum_{i=1}^{n} \left( y^{(i)} \log g^{(i)} + (1 - y^{(i)}) \log\left(1 - g^{(i)}\right) \right).
We can turn the maximization problem above into a minimization problem by taking the
negative of the above expression, and write it in terms of minimizing a loss:

\sum_{i=1}^{n} L_{\mathrm{nll}}(g^{(i)}, y^{(i)}),

where L_{\mathrm{nll}}(\text{guess}, \text{actual}) = -\left(\text{actual} \cdot \log(\text{guess}) + (1 - \text{actual}) \cdot \log(1 - \text{guess})\right).
This loss function is also sometimes referred to as the log loss or cross entropy. (You can
use any base for the logarithm and it won't make any real difference. If we ask you for
numbers, use log base e.)
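A per-example sketch of this loss (the function name is ours; natural log, as the notes suggest):

```python
import numpy as np

def nll_loss(g, y):
    """Negative log-likelihood (log loss / cross entropy) for a single
    guess g in (0, 1) and label y in {0, 1}, using log base e."""
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

print(nll_loss(0.9, 1))  # about 0.105: confident and correct, small loss
print(nll_loss(0.9, 0))  # about 2.303: confident and wrong, large loss
```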
Last Updated: 12/18/19 11:56:05
J_{\mathrm{lr}}(\theta, \theta_0; \mathcal{D}) = \left( \frac{1}{n} \sum_{i=1}^{n} L_{\mathrm{nll}}\left(\sigma(\theta^T x^{(i)} + \theta_0), y^{(i)}\right) \right) + \lambda \|\theta\|^2.
Study Question: Consider the case of linearly separable data. What will the θ values
that optimize this objective be like if λ = 0? What will they be like if λ is very big?
Try to work out an example in one dimension with two data points.
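To explore the study question, here is a sketch of the objective on a two-point, one-dimensional, linearly separable dataset (the function names and the data are ours, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss(g, y):
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def j_lr(theta, theta_0, X, Y, lam):
    """J_lr: average NLL loss over the data plus lam * ||theta||^2.
    X has one example per row; Y holds labels in {0, 1}."""
    g = sigmoid(X @ theta + theta_0)
    return float(np.mean(nll_loss(g, Y)) + lam * np.dot(theta, theta))

# Two linearly separable points in one dimension:
X = np.array([[-1.0], [1.0]])
Y = np.array([0.0, 1.0])

# With lam = 0, making theta larger keeps driving the objective down,
# so the optimizer pushes ||theta|| toward infinity:
print(j_lr(np.array([1.0]), 0.0, X, Y, 0.0)
      > j_lr(np.array([10.0]), 0.0, X, Y, 0.0))  # True
```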