CS229 Supplemental Lecture Notes
John Duchi
1 Binary classification
In binary classification problems, the target y can take on only two
values. In this set of notes, we show how to model this problem by letting
y ∈ {−1, +1}, where we say that y = +1 if the example is a member of the
positive class and y = −1 if the example is a member of the negative class.
We assume, as usual, that we have input features x ∈ R^n.
As in our standard approach to supervised learning problems, we first
pick a representation for our hypothesis class (what we are trying to learn),
and after that we pick a loss function that we will minimize. In binary
classification problems, it is often convenient to use a hypothesis class of the
form h_θ(x) = θᵀx, and, when presented with a new example x, we classify it
as positive or negative depending on the sign of θᵀx; that is, our predicted
label is

  sign(h_θ(x)) = sign(θᵀx),  where  sign(t) = 1 if t > 0, 0 if t = 0, −1 if t < 0.    (1)
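As a quick sketch of this decision rule in Python (the helper name `predict` and the example vectors are hypothetical, not from the notes):

```python
def predict(theta, x):
    """Predict a label in {-1, 0, +1} as sign(theta^T x), matching the
    sign convention above (sign(0) = 0)."""
    score = sum(t_j * x_j for t_j, x_j in zip(theta, x))  # theta^T x
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

theta = [2.0, -1.0]
print(predict(theta, [1.0, 0.5]))  # -> 1  (theta^T x = 1.5 > 0)
print(predict(theta, [0.0, 3.0]))  # -> -1 (theta^T x = -3.0 < 0)
```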
We interpret θᵀx as a measure of the confidence that the parameter
vector θ assigns to labels for the point x: if θᵀx is very negative (or very
positive), then we more strongly believe the label y is negative (or positive).
Now that we have chosen a representation for our data, we must choose a
loss function. Intuitively, we would like to choose some loss function so that
for our training data {(x^(i), y^(i))}_{i=1}^m, the chosen θ makes the margin y^(i) θᵀx^(i)
very large for each training example. Let us fix a hypothetical example (x, y),
let z = yθᵀx denote the margin, and let φ : R → R be the loss function; that
is, the loss for the example (x, y) with margin z = yθᵀx is φ(z) = φ(yθᵀx).
For any particular loss function φ, the empirical risk that we minimize is then

  J(θ) = (1/m) ∑_{i=1}^m φ(y^(i) θᵀx^(i)).    (2)
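Eq. (2) translates directly into code; a minimal sketch, where `phi` is any margin-based loss and the tiny data set is made up for illustration:

```python
def empirical_risk(phi, theta, X, y):
    """J(theta) = (1/m) * sum_i phi(y_i * theta^T x_i) for a margin-based loss phi."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i))  # margin of example i
        total += phi(z)
    return total / m

X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
theta = [2.0, -2.0]
# Illustrative loss phi(z) = max(0, -z), which penalizes only negative margins.
print(empirical_risk(lambda z: max(0.0, -z), theta, X, y))  # -> 0.0 (both margins are +2)
```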
Consider our desired behavior: we wish to have y^(i) θᵀx^(i) positive for each
training example i = 1, . . . , m, and we should penalize those θ for which
y^(i) θᵀx^(i) < 0 frequently in the training data. Thus, an intuitive choice for
our loss would be one with φ(z) small if z > 0 (the margin is positive), while
φ(z) is large if z < 0 (the margin is negative). Perhaps the most natural
such loss is the zero-one loss, given by

  φ_zo(z) = 1 if z ≤ 0,  0 if z > 0.
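Under the zero-one loss, the risk is just the fraction of training mistakes, which we can check on a made-up data set (note that z = 0 counts as a mistake under the convention above):

```python
def zero_one_loss(z):
    """phi_zo(z): 1 if the margin z <= 0, else 0."""
    return 1.0 if z <= 0 else 0.0

def misclassification_rate(theta, X, y):
    """J(theta) under the zero-one loss: the average number of mistakes."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i))
        total += zero_one_loss(z)
    return total / len(X)

X = [[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0]]
y = [1, -1, 1]
print(misclassification_rate([1.0, 1.0], X, y))  # -> 0.666... (2 of 3 examples wrong)
```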
In this case, the risk J(θ) is simply the average number of mistakes (misclassifications)
the parameter θ makes on the training data. Unfortunately, the loss φ_zo is
discontinuous, non-convex (why this matters is a bit beyond the scope of
the course), and, perhaps even more vexingly, NP-hard to minimize. So we
prefer to choose losses that have the shape given in Figure 1. That is, we
will essentially always use losses φ that satisfy φ(z) → 0 as z → ∞, while
φ(z) → ∞ as z → −∞.
As a few different examples, here are three loss functions that we will see
either now or later in the class, all of which are commonly used in machine
learning.
Figure 1: The rough shape of loss we desire: the loss is convex and continuous,
and tends to zero as the margin z = yθᵀx → ∞.
  the logistic loss      φ_logistic(z) = log(1 + e^(−z)),
  the hinge loss         φ_hinge(z) = [1 − z]₊ = max{1 − z, 0},
  the exponential loss   φ_exp(z) = e^(−z).
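A sketch of these three losses in Python, assuming the standard definitions above; the two-branch form of `logistic_loss` is just a numerically stable way to evaluate log(1 + e^(−z)) without overflow:

```python
import math

def logistic_loss(z):
    # log(1 + e^{-z}), evaluated stably for both large positive and negative z
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def hinge_loss(z):
    # [1 - z]_+ = max{1 - z, 0}
    return max(1.0 - z, 0.0)

def exp_loss(z):
    # e^{-z}
    return math.exp(-z)

# All three are small for large positive margins and large for negative ones.
for z in (-2.0, 0.0, 2.0):
    print(z, logistic_loss(z), hinge_loss(z), exp_loss(z))
```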
2 Logistic regression
With this general background in place, we now give a complementary
view of logistic regression to that in Andrew Ng's lecture notes. When we
Figure 2: The three margin-based loss functions: logistic loss, hinge loss, and
exponential loss.
use binary labels y ∈ {−1, 1}, it is possible to write logistic regression more
compactly. In particular, we use the logistic loss, so that the risk is

  J(θ) = (1/m) ∑_{i=1}^m φ_logistic(y^(i) θᵀx^(i)) = (1/m) ∑_{i=1}^m log(1 + e^(−y^(i) θᵀx^(i))).    (3)

Roughly, we hope that choosing θ to minimize the average logistic loss will
yield a θ for which y^(i) θᵀx^(i) > 0 for most (or even all!) of the training
examples.
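To see this concretely, here is a small batch gradient-descent sketch on the risk in Eq. (3); the toy data set, step size `alpha`, and iteration count are arbitrary choices for illustration:

```python
import math

def logistic_risk_grad(theta, X, y):
    """Gradient of J(theta) = (1/m) sum_i log(1 + e^{-y_i theta^T x_i}).

    Uses d/dz log(1 + e^{-z}) = -1 / (1 + e^{z}) with z = y * theta^T x.
    """
    m = len(X)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        z = y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i))
        coeff = -y_i / (1.0 + math.exp(z))
        for j, x_j in enumerate(x_i):
            grad[j] += coeff * x_j / m
    return grad

# Toy linearly separable data; alpha is an assumed step size.
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
theta, alpha = [0.0, 0.0], 0.5
for _ in range(200):
    g = logistic_risk_grad(theta, X, y)
    theta = [t_j - alpha * g_j for t_j, g_j in zip(theta, g)]

margins = [y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i)) for x_i, y_i in zip(X, y)]
print(all(z > 0 for z in margins))  # -> True: every training margin ends up positive
```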
If we define the sigmoid function h_θ(x) = 1/(1 + e^(−θᵀx)) and choose the
probabilistic model

  p(Y = y | x; θ) = h_θ(yx),    (4)
then we see that the likelihood of the training data is

  L(θ) = ∏_{i=1}^m p(Y = y^(i) | x^(i); θ) = ∏_{i=1}^m h_θ(y^(i) x^(i)),

and the log-likelihood is

  ℓ(θ) = ∑_{i=1}^m log h_θ(y^(i) x^(i)) = −∑_{i=1}^m log(1 + e^(−y^(i) θᵀx^(i))) = −m J(θ),

where J(θ) is exactly the logistic regression risk from Eq. (3). That is,
maximum likelihood in the logistic model (4) is the same as minimizing the
average logistic loss, and we arrive at logistic regression again.
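We can sanity-check the identity log L(θ) = −m J(θ) numerically; the particular θ and data below are arbitrary:

```python
import math

def h(theta, x):
    """Sigmoid hypothesis h_theta(x) = 1 / (1 + e^{-theta^T x})."""
    s = sum(t_j * x_j for t_j, x_j in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-s))

X = [[1.0, 0.5], [-1.0, 2.0], [0.5, -1.5]]
y = [1, -1, 1]
theta = [0.3, -0.7]
m = len(X)

# log L(theta) = sum_i log h_theta(y_i * x_i)
log_lik = sum(math.log(h(theta, [y_i * x_j for x_j in x_i])) for x_i, y_i in zip(X, y))

# J(theta) = (1/m) sum_i log(1 + e^{-y_i theta^T x_i})
J = sum(math.log1p(math.exp(-y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x_i))))
        for x_i, y_i in zip(X, y)) / m

print(abs(log_lik + m * J) < 1e-9)  # -> True: log-likelihood equals -m * J(theta)
```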
This update is intuitive: if our current hypothesis h_{θ^(t)} assigns probability
close to 1 to the incorrect label −y^(i), then we try to reduce the loss by
moving θ in the direction of y^(i) x^(i). Conversely, if our current hypothesis
assigns probability close to 0 to the incorrect label −y^(i), the update
essentially does nothing.
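The update being described can be sketched as follows; the update formula in the docstring is a reconstruction consistent with the intuition above (step size `alpha` is an assumed value), not a verbatim quote of the notes:

```python
import math

def sgd_step(theta, x, y_i, alpha=0.1):
    """One stochastic-gradient step on the logistic loss for example (x, y_i):

        theta <- theta + alpha * (1 / (1 + e^{y_i * theta^T x})) * y_i * x

    The scalar 1/(1 + e^{z}) is the modeled probability of the incorrect
    label -y_i, so confident correct predictions yield a near-zero step.
    """
    z = y_i * sum(t_j * x_j for t_j, x_j in zip(theta, x))
    coeff = alpha * y_i / (1.0 + math.exp(z))
    return [t_j + coeff * x_j for t_j, x_j in zip(theta, x)]

theta = [5.0, 5.0]
t_correct = sgd_step(theta, [1.0, 1.0], 1)   # large positive margin: barely moves
t_wrong = sgd_step(theta, [1.0, 1.0], -1)    # large negative margin: near-full step
print(max(abs(a - b) for a, b in zip(t_correct, theta)))  # tiny (about 4.5e-6)
print(max(abs(a - b) for a, b in zip(t_wrong, theta)))    # about alpha = 0.1
```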