Introduction To Machine Learning: 2 Linear Classifiers

Varun Chandola

Contents

1 Classification

Linear Classifiers

Logistic Regression

[Figure: a linear classifier unit. Inputs x_1, x_2, ..., x_d with weights w_1, w_2, ..., w_d and a bias w_0 are summed and passed through an activation function that outputs a label in {-1, +1} according to whether w_0 + Σ_{j=1}^{d} w_j x_j ≥ 0.]

Decision Rule

y_i = −1 if w_0 + w^T x_i < 0
y_i = +1 if w_0 + w^T x_i ≥ 0

The unit vector along w is ŵ = w / |w|.

1 Classification
• A possible problem formulation: Learn f such that y = f(x)

2.1 Linear Classification via Hyperplanes

• Separates a D-dimensional space into two half-spaces
• Defined by w ∈ ℝ^D
[Figure: a hyperplane w · x + w_0 = 0 separating the space into the half-spaces w · x + w_0 > 0 and w · x + w_0 < 0, with the normal vector w shown and the hyperplane at offset w_0/‖w‖ from the origin along w.]
  – Orthogonal to the hyperplane
  – This w goes through the origin
  – How do you check if a point lies “above” or “below” w?
  – What happens for points on w?

For a hyperplane that passes through the origin, a point x will lie above the hyperplane if w^T x > 0 and below it if w^T x < 0. This can be understood by noting that w^T x is equal to |w||x| cos θ, where θ is the angle between w and x.
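To make the check concrete, the following is a minimal NumPy sketch of the side test, both for a hyperplane through the origin and, anticipating the bias term w_0 from the decision rule, for the shifted hyperplane. The weight vector, bias, and points are made up for illustration.

```python
import numpy as np

# Illustrative values only: a 2-D weight vector, a bias, and a few points.
w = np.array([2.0, 1.0])           # normal vector of the hyperplane
w0 = -1.0                          # bias (used once the hyperplane is shifted off the origin)
X = np.array([[1.0, 1.0],
              [-1.0, 0.5],
              [-0.5, 1.0]])        # each row is one point x

# Hyperplane through the origin: the side is the sign of w^T x
# (equivalently the sign of |w||x| cos(theta)); 0 means x lies on the hyperplane.
side = np.sign(X @ w)

# With a bias w0, the decision rule becomes the sign of w^T x + w0.
y_pred = np.where(X @ w + w0 >= 0, +1, -1)

print(side)    # e.g. [ 1. -1.  0.]
print(y_pred)  # labels in {-1, +1}
```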
• Add a bias w_0
  – w_0 > 0: move along w
  – w_0 < 0: move opposite to w
• How to check if a point lies above or below w?
  – If w^T x + w_0 > 0 then x is above
  – Else, below
• For binary classification, w points towards the positive class
  – w^T x + w_0 ≥ 0 ⇒ y = +1
  – w^T x + w_0 < 0 ⇒ y = −1

• Find a hyperplane that separates the data
  – . . . if the data is linearly separable
• But there can be many choices!
• Find the one with the lowest error

Learning w

0-1 Loss
• Hard to optimize
• Solution: replace it with a mathematically manageable loss

Different Loss Functions

Note
From now on, it is assumed that the intercept term and a constant feature are included in w and x_i, respectively.

• Squared Loss - Perceptron

J(w) = (1/2) Σ_{i=1}^{N} (y_i − w^T x_i)^2        (1)

• Logistic Loss - Logistic Regression

J(w) = (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i w^T x_i))        (2)

Logistic Loss Function

• For one training observation:
  – if y_i = +1, the probability of the predicted value being +1 is

    p_i = 1 / (1 + exp(−w^T x_i))

  – if y_i = −1, the probability of the predicted value being −1 is

    p_i = 1 − 1 / (1 + exp(−w^T x_i)) = 1 / (1 + exp(w^T x_i))

  – in general,

    p_i = 1 / (1 + exp(−y_i w^T x_i))

• For logistic regression, the objective is to minimize the negative of the log probability:

J(w) = − Σ_{i=1}^{n} log(p_i) = Σ_{i=1}^{n} log(1 + exp(−y_i w^T x_i))
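As a concrete reference, the squared loss (1), the logistic loss (2), and the probability p_i can be written in a few lines of NumPy. This is only a sketch under the convention above that a constant feature is appended to each x_i so the intercept is part of w; the function names are illustrative, not from the notes.

```python
import numpy as np

def squared_loss(w, X, y):
    """Squared loss (1): 0.5 * sum_i (y_i - w^T x_i)^2, with the x_i as rows of X."""
    return 0.5 * np.sum((y - X @ w) ** 2)

def logistic_loss(w, X, y):
    """Logistic loss (2): (1/n) * sum_i log(1 + exp(-y_i w^T x_i)), labels in {-1, +1}."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def prob_of_label(w, X, y):
    """p_i = 1 / (1 + exp(-y_i w^T x_i)): probability assigned to the observed label y_i."""
    return 1.0 / (1.0 + np.exp(-y * (X @ w)))
```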
Geometric Interpretation

• Use regression to predict discrete values
• Squash the output to [0, 1] using the sigmoid function
• Output less than 0.5 is one class and greater than 0.5 is the other

Probabilistic Interpretation

• Probability of x belonging to class +1:

  P(y = +1 | x) = 1 / (1 + exp(−w^T x))

To understand why there is no closed form solution for maximizing the log-likelihood, we first differentiate J(w), as defined in (2), with respect to w:

∇J(w) = (d/dw) (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i w^T x_i))
      = −(1/n) Σ_{i=1}^{n} [y_i / (1 + exp(y_i w^T x_i))] x_i

Given that ∇J(w) is a non-linear function of w, a closed form solution is not possible.
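The gradient above translates directly into code. The sketch below computes ∇J(w) for labels y_i ∈ {−1, +1}; the finite-difference checker is an addition of this sketch, not part of the notes, and logistic_loss is reused from the earlier sketch.

```python
import numpy as np

def grad_logistic_loss(w, X, y):
    """Gradient of (2): -(1/n) * sum_i [y_i / (1 + exp(y_i w^T x_i))] x_i."""
    n = X.shape[0]
    coef = -y / (1.0 + np.exp(y * (X @ w)))    # one scalar coefficient per observation
    return (X.T @ coef) / n                     # weighted sum of the x_i, shape (d,)

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference gradient, handy for checking the formula above."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2.0 * eps)
    return g
```

Comparing grad_logistic_loss(w, X, y) against numerical_grad(lambda v: logistic_loss(v, X, y), w) is a quick way to confirm the derivation.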
3.1 Using Gradient Descent for Learning Weights

• Compute the gradient of J(w) with respect to w
• A convex function of w with a unique global minimum

∇J(w) = −(1/n) Σ_{i=1}^{n} [y_i / (1 + exp(y_i w^T x_i))] x_i

• Update rule:

w_{k+1} = w_k − η ∇J(w_k)
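A minimal sketch of the resulting gradient-descent loop is shown below. The step size η, the iteration count, and the toy data are assumptions chosen only for illustration, and grad_logistic_loss comes from the earlier sketch.

```python
import numpy as np

def fit_logistic_gd(X, y, eta=0.1, n_iters=1000):
    """Minimize J(w) by repeating w_{k+1} = w_k - eta * grad J(w_k)."""
    w = np.zeros(X.shape[1])                   # start from w = 0
    for _ in range(n_iters):
        w = w - eta * grad_logistic_loss(w, X, y)
    return w

# Tiny illustrative example: two Gaussian blobs labelled {-1, +1}.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(50, 2))
X_neg = rng.normal(loc=-2.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
X = np.hstack([X, np.ones((X.shape[0], 1))])   # append a constant feature for the intercept
y = np.concatenate([np.ones(50), -np.ones(50)])

w_hat = fit_logistic_gd(X, y)
train_acc = np.mean(np.where(X @ w_hat >= 0, +1, -1) == y)
```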
Newton's Method

w_{k+1} = w_k − η H_k^{-1} ∇J(w_k)

Hessian

H(w) = (1/n) Σ_{i=1}^{n} [exp(y_i w^T x_i) / (1 + exp(y_i w^T x_i))^2] x_i x_i^T
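For comparison, here is a sketch of the Newton update using the Hessian above. It reuses grad_logistic_loss from the earlier sketch; the damping factor η, the iteration count, and the small ridge term added before solving the linear system are choices made here for numerical stability, not part of the notes.

```python
import numpy as np

def hessian_logistic_loss(w, X, y):
    """H(w) = (1/n) * sum_i [exp(y_i w^T x_i) / (1 + exp(y_i w^T x_i))^2] x_i x_i^T."""
    n = X.shape[0]
    m = y * (X @ w)
    p = 1.0 / (1.0 + np.exp(-m))
    s = p * (1.0 - p)                          # equals exp(m) / (1 + exp(m))^2
    return (X.T * s) @ X / n                   # sum_i s_i x_i x_i^T, divided by n

def fit_logistic_newton(X, y, eta=1.0, n_iters=20, ridge=1e-8):
    """Repeat w_{k+1} = w_k - eta * H_k^{-1} grad J(w_k)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        H = hessian_logistic_loss(w, X, y) + ridge * np.eye(X.shape[1])
        w = w - eta * np.linalg.solve(H, grad_logistic_loss(w, X, y))
    return w
```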