
Introduction to Machine Learning
2 Linear Classifiers: Logistic Regression

Varun Chandola

February 13, 2019

Contents

1 Classification
2 Linear Classifiers
2.1 Linear Classification via Hyperplanes
3 Logistic Regression
3.1 Using Gradient Descent for Learning Weights
3.2 Using Newton's Method

1 Classification

Supervised Learning - Classification

• Target y is categorical
• e.g., y ∈ {−1, +1} (binary classification)
• A possible problem formulation: Learn f such that y = f(x)

2 Linear Classifiers

[Figure: a perceptron-style diagram. Inputs x_1, ..., x_d with weights w_1, ..., w_d and a bias w_0 feed an activation function that outputs a label in {−1, +1} according to whether w_0 + \sum_{j=1}^{d} w_j x_j ≥ 0.]

Decision Rule

y_i = \begin{cases} -1 & \text{if } w_0 + w^T x_i < 0 \\ +1 & \text{if } w_0 + w^T x_i \ge 0 \end{cases}

Geometric Interpretation

[Figure: the hyperplane w^T x = −w_0 in the (x_1, x_2) plane separating the +1 and −1 half-spaces, with unit normal ŵ = w / |w| and signed offset −w_0 / |w| from the origin.]

2.1 Linear Classification via Hyperplanes

• Separates a D-dimensional space into two half-spaces
• Defined by w ∈ ℝ^D

– Orthogonal to the hyperplane
– This w goes through the origin
– How do you check if a point lies “above” or “below” w?
– What happens for points on w?

[Figure: a hyperplane w · x + w_0 = 0 with normal vector w; points x with w · x + w_0 > 0 lie on one side, points with w · x + w_0 < 0 on the other, and the hyperplane sits at distance w_0 / |w| from the origin.]
For a hyperplane that passes through the origin, a point x lies above the hyperplane if w^T x > 0 and below it if w^T x < 0. This can be further understood by noting that w^T x is equal to |w||x| cos θ, where θ is the angle between w and x.

• Add a bias w_0
– w_0 > 0 - move along w
– w_0 < 0 - move opposite to w
• How to check if a point lies above or below w?
– If w^T x + w_0 > 0 then x is above
– Else, below
• Decision boundary represented by the hyperplane w
• For binary classification, w points towards the positive class

Decision Rule

y = sign(w^T x + w_0)

• w^T x + w_0 ≥ 0 ⇒ y = +1
• w^T x + w_0 < 0 ⇒ y = −1

• Find a hyperplane that separates the data
– . . . if the data is linearly separable
• But there can be many choices!
• Find the one with lowest error

Learning w

• What is an appropriate loss function?

0-1 Loss

• Number of mistakes in training data

J(w) = \min_{w, w_0} \sum_{i=1}^{n} I(y_i (w^T x_i + w_0) < 0)
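As a concrete illustration of the decision rule and the 0-1 loss, here is a minimal numpy sketch; the weights and data points are made up for the example.

    import numpy as np

    def predict(X, w, w0):
        # Linear classifier: +1 if w^T x + w0 >= 0, else -1
        return np.where(X @ w + w0 >= 0, 1, -1)

    def zero_one_loss(X, y, w, w0):
        # Number of mistakes: sum of I(y_i (w^T x_i + w0) < 0)
        return int(np.sum(y * (X @ w + w0) < 0))

    # Hypothetical 2-D points and weights, for illustration only
    X = np.array([[2.0, 1.0], [-1.0, -2.0], [-0.5, -1.0]])
    y = np.array([1, -1, 1])
    w, w0 = np.array([1.0, 1.0]), 0.0
    print(predict(X, w, w0))           # [ 1 -1 -1]
    print(zero_one_loss(X, y, w, w0))  # 1 (the third point is misclassified)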
• Hard to optimize
• Solution - replace it with a mathematically manageable loss

Different Loss Functions

Note
From now on, assume that the intercept w_0 is absorbed into w and a constant feature 1 is appended to each x_i.

• Squared Loss - Perceptron

J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^T x_i)^2    (1)

• Logistic Loss - Logistic Regression

J(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i))    (2)

• Hinge Loss - Support Vector Machine

J(w) = \sum_{i=1}^{n} \max(0, 1 - y_i w^T x_i)    (3)
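To see how the three surrogate losses (1)-(3) behave, the sketch below evaluates each on the same margins m_i = y_i w^T x_i; the function names and margin values are illustrative, not from the notes.

    import numpy as np

    def squared_loss(m):
        # (1/2)(y - w^T x)^2; since y in {-1, +1}, this equals (1/2)(1 - m)^2
        return 0.5 * (1 - m) ** 2

    def logistic_loss(m):
        # log(1 + exp(-m)); log1p improves numerical behaviour
        return np.log1p(np.exp(-m))

    def hinge_loss(m):
        # max(0, 1 - m)
        return np.maximum(0.0, 1 - m)

    margins = np.array([-2.0, 0.0, 1.0, 3.0])  # example values of y_i w^T x_i
    for name, loss in [("squared", squared_loss),
                       ("logistic", logistic_loss),
                       ("hinge", hinge_loss)]:
        print(name, loss(margins))

Note that a correctly classified point with a large margin (m = 3) is still penalized by the squared loss but not by the hinge loss, which is one reason squared loss is a poor surrogate for classification.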
3 Logistic Regression

Geometric Interpretation

• Use regression to predict discrete values
• Squash the output to [0, 1] using the sigmoid function
• Output less than 0.5 is one class and greater than 0.5 is the other

Probabilistic Interpretation

• Probability of x to belong to class +1

Logistic Loss Function

• For one training observation,
– if y_i = +1, the probability of the predicted value being +1 is

p_i = \frac{1}{1 + \exp(-w^T x_i)}

– if y_i = −1, the probability of the predicted value being −1 is

p_i = 1 - \frac{1}{1 + \exp(-w^T x_i)} = \frac{1}{1 + \exp(w^T x_i)}

– In general,

p_i = \frac{1}{1 + \exp(-y_i w^T x_i)}

• For logistic regression, the objective is to minimize the negative of the log probability:

J(w) = -\sum_{i=1}^{n} \log(p_i) = \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i))
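A small sketch of these probabilities and the resulting objective, assuming (as the note above states) that the intercept is already folded into w; the helper names and toy data are mine.

    import numpy as np

    def sigmoid(z):
        # Logistic (sigmoid) function: 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def prob_true_label(X, y, w):
        # p_i = 1 / (1 + exp(-y_i w^T x_i)), the probability the model
        # assigns to the true label y_i in {-1, +1}
        return sigmoid(y * (X @ w))

    def J(X, y, w):
        # J(w) = sum_i log(1 + exp(-y_i w^T x_i)) = -sum_i log(p_i)
        return np.sum(np.log1p(np.exp(-y * (X @ w))))

    # Toy data: first column of X is the constant feature (intercept)
    X = np.array([[1.0, 2.0], [1.0, -1.0]])
    y = np.array([1, -1])
    w = np.array([0.1, 0.5])
    print(prob_true_label(X, y, w))  # per-observation p_i
    print(J(X, y, w))                # the objective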
Learning Logistic Regression Model

• Direct minimization??
– No closed form solution for minimizing error
• Gradient Descent
• Newton's Method

To understand why there is no closed form solution for maximizing the log-likelihood, we first differentiate J(w) with respect to w:

\nabla J(w) = \frac{d}{dw} \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i)) = -\frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{1 + \exp(y_i w^T x_i)} x_i

Since \nabla J(w) is a non-linear function of w, setting it to zero yields no closed form solution.
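One way to convince yourself the derivative is right is to compare it against a central finite-difference approximation; a minimal sketch with randomly generated data.

    import numpy as np

    def J(X, y, w):
        # (1/n) sum_i log(1 + exp(-y_i w^T x_i)), as in eq. (2)
        return np.mean(np.log1p(np.exp(-y * (X @ w))))

    def grad_J(X, y, w):
        # -(1/n) sum_i y_i x_i / (1 + exp(y_i w^T x_i))
        coeff = -y / (1.0 + np.exp(y * (X @ w)))
        return (coeff[:, None] * X).mean(axis=0)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    y = rng.choice([-1, 1], size=5)
    w = rng.normal(size=3)

    # Central finite differences, coordinate by coordinate
    eps = 1e-6
    num = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        num[j] = (J(X, y, w + e) - J(X, y, w - e)) / (2 * eps)
    print(np.allclose(num, grad_J(X, y, w), atol=1e-6))  # True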
3.1 Using Gradient Descent for Learning Weights

• Compute the gradient of J(w) with respect to w
• J(w) is a convex function of w with a unique global minimum

\nabla J(w) = -\frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{1 + \exp(y_i w^T x_i)} x_i

• Update rule:

w_{k+1} = w_k - \eta \nabla J(w_k)

[Figure: surface plot of the convex objective J(w) over two weight coordinates, showing a single global minimum.]
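Putting the pieces together, a minimal gradient descent loop for this objective; the step size η = 0.5, iteration count, and toy data are arbitrary choices for the example.

    import numpy as np

    def grad_J(X, y, w):
        # -(1/n) sum_i y_i x_i / (1 + exp(y_i w^T x_i))
        coeff = -y / (1.0 + np.exp(y * (X @ w)))
        return (coeff[:, None] * X).mean(axis=0)

    def gradient_descent(X, y, eta=0.5, iters=500):
        # w_{k+1} = w_k - eta * grad J(w_k), starting from w = 0
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            w -= eta * grad_J(X, y, w)
        return w

    # Separable toy data; the constant feature is the first column
    X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 0.5],
                  [1.0, -1.0, -1.5], [1.0, -2.0, -0.5]])
    y = np.array([1, 1, -1, -1])
    w = gradient_descent(X, y)
    print(np.sign(X @ w))  # [ 1.  1. -1. -1.], matching y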

3.2 Using Newton's Method

• Setting η is sometimes tricky
• Too large – incorrect results
• Too small – slow convergence
• Another way to speed up convergence:

Newton's Method

w_{k+1} = w_k - \eta H_k^{-1} \nabla J(w_k)

Hessian

H(w) = \frac{1}{n} \sum_{i=1}^{n} \frac{\exp(y_i w^T x_i)}{(1 + \exp(y_i w^T x_i))^2} x_i x_i^T
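A sketch of one possible implementation of this update; rather than forming H^{-1} explicitly, it solves the linear system H d = ∇J at each step. The toy data is deliberately non-separable so the optimum is finite; the step size and iteration count are again illustrative.

    import numpy as np

    def grad_J(X, y, w):
        # -(1/n) sum_i y_i x_i / (1 + exp(y_i w^T x_i))
        coeff = -y / (1.0 + np.exp(y * (X @ w)))
        return (coeff[:, None] * X).mean(axis=0)

    def hessian_J(X, y, w):
        # (1/n) sum_i s_i x_i x_i^T, with s_i = exp(m_i) / (1 + exp(m_i))^2
        m = y * (X @ w)
        s = np.exp(m) / (1.0 + np.exp(m)) ** 2
        return (X * s[:, None]).T @ X / X.shape[0]

    def newton(X, y, eta=1.0, iters=10):
        # w_{k+1} = w_k - eta * H_k^{-1} grad J(w_k)
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            step = np.linalg.solve(hessian_J(X, y, w), grad_J(X, y, w))
            w -= eta * step
        return w

    # Non-separable toy data (finite optimum); first column is the constant
    X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
                  [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([-1, -1, 1, -1, 1, 1])
    w = newton(X, y)
    print(w, np.sign(X @ w))  # the two overlapping points remain misclassified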
