Optimization For ML (2)
CS771: Introduction to Machine Learning
Piyush Rai
Optimization Problems in ML
The general form of an optimization problem in ML will usually be $\hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathcal{C}} L(\mathbf{w})$
Here $L(\mathbf{w})$ denotes the loss function to be optimized, usually a sum of the training error + regularizer
$\mathcal{C}$ is the constraint set that the solution must belong to. It is possible to have constraints on the solution (e.g., non-negativity, sparsity, or even both), e.g.,
Non-negativity constraint: all entries in $\mathbf{w}$ must be non-negative
Sparsity constraint: $\mathbf{w}$ is a sparse vector with at most a given number of non-zeros
Linear and ridge regression that we saw were unconstrained ($\mathbf{w}$ was a real-valued vector)
Constrained opt. problems can be converted into unconstrained opt. problems (will see later)
For now, assume we have an unconstrained optimization problem
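As a concrete sketch of the "training error + regularizer" structure, the snippet below evaluates the unconstrained ridge regression objective $L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$; the names X, y, w, and lam are placeholders, not from the slide.

```python
import numpy as np

def ridge_objective(w, X, y, lam=0.1):
    # Training error: sum of squared prediction errors over the N training examples
    train_error = np.sum((y - X @ w) ** 2)
    # Regularizer: squared L2 norm of the weights, weighted by lam
    regularizer = lam * np.sum(w ** 2)
    return train_error + regularizer
```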
Method 1: Using First-Order Optimality
Very simple. Already used this approach for linear and ridge regression
Called "first order" since only the gradient is used, and the gradient provides the first-order information about the function being optimized
The approach works only for very simple problems where the objective is convex and there are no constraints on the values $\mathbf{w}$ can take
First-order optimality: the gradient must be equal to zero at the optima, $\nabla_{\mathbf{w}} L(\mathbf{w}) = \mathbf{0}$
Sometimes, setting $\nabla_{\mathbf{w}} L(\mathbf{w}) = \mathbf{0}$ and solving for $\mathbf{w}$ gives a closed-form solution
If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algorithms, like gradient descent
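As an illustration of first-order optimality giving a closed-form solution, the sketch below solves ridge regression by setting the gradient of $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$ to zero, which yields $\mathbf{w} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$; the function name and arguments are placeholders.

```python
import numpy as np

def ridge_closed_form(X, y, lam=0.1):
    # First-order optimality: 2 X^T (X w - y) + 2 lam w = 0
    # => (X^T X + lam I) w = X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```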
Method 2: Iterative Optimization via Gradient Descent
Iterative since it requires several steps/iterations to find the optimal solution
Can this approach be used to solve maximization problems? Yes: for maximization problems we can use gradient ascent, which moves in the direction of the gradient
Fact: the gradient gives the direction of steepest change in the function's value
For convex functions, GD will converge to the global minima; good initialization is needed for non-convex functions
The learning rate $\eta_t$ is very important and should be set carefully (fixed or chosen adaptively); will discuss some strategies later
Sometimes it may be tricky to assess convergence; will see some methods later
Gradient Descent (will see the justification shortly):
Initialize $\mathbf{w}$ as $\mathbf{w}^{(0)}$
For iteration $t = 0, 1, 2, \ldots$ (or until convergence)
Calculate the gradient $\mathbf{g}^{(t)}$ using the current iterate $\mathbf{w}^{(t)}$
Set the learning rate $\eta_t$
Move in the opposite direction of the gradient: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t\,\mathbf{g}^{(t)}$
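The update rule above can be written as a short sketch; grad_fn, w0, eta, and num_iters are placeholder names, and a fixed learning rate is assumed for simplicity (adaptive schedules are discussed later).

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta=0.1, num_iters=100):
    """Gradient descent: w^(t+1) = w^(t) - eta * g^(t)."""
    w = np.asarray(w0, dtype=float)
    for t in range(num_iters):
        g = grad_fn(w)       # gradient at the current iterate w^(t)
        w = w - eta * g      # move in the opposite direction of the gradient
    return w
```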
Gradient Descent: An Illustration
[Figure: a convex loss curve $L(\mathbf{w})$ with iterates $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \mathbf{w}^{(3)}$ approaching the minimum $\mathbf{w}^*$. Where the gradient is negative, GD moves in the positive direction; where the gradient is positive, GD moves in the negative direction. The learning rate is very important.]
[Figure: a non-convex loss curve with two different initializations. One initialization leads GD to the global minima ("Woohoo! Global minima found! GD thanks you for the good initialization"); the other gets stuck at a local minima. Good initialization is very important.]
GD: An Example
Let's apply GD for least squares linear regression: $L(\mathbf{w}) = \sum_{n=1}^N (y_n - \mathbf{w}^\top \mathbf{x}_n)^2$
The gradient: $\mathbf{g} = -2\sum_{n=1}^N (y_n - \mathbf{w}^\top \mathbf{x}_n)\,\mathbf{x}_n$, where $y_n - \mathbf{w}^\top \mathbf{x}_n$ is the prediction error of the current model on the $n$-th training example
Each GD update will be of the form $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + 2\eta_t \sum_{n=1}^N \left(y_n - \mathbf{w}^{(t)\top} \mathbf{x}_n\right)\mathbf{x}_n$
Training examples on which the current model's error is large contribute more to the update
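A minimal sketch of these updates, assuming training inputs X of shape (N, D) and targets y of shape (N,); the function name, step size, and iteration count are placeholders.

```python
import numpy as np

def least_squares_gd(X, y, eta=0.01, num_iters=500):
    """GD for L(w) = sum_n (y_n - w^T x_n)^2."""
    w = np.zeros(X.shape[1])
    for t in range(num_iters):
        errors = y - X @ w       # prediction errors of the current model
        g = -2 * X.T @ errors    # gradient of the squared-error loss
        w = w - eta * g          # examples with large errors contribute more to this step
    return w
```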
Some examples that we have already seen: linear regression with the absolute loss, or the Huber loss, or the $\epsilon$-insensitive loss; even the $\ell_1$ norm regularizer is non-differentiable
[Figure: plots of the absolute loss $|y_n - f(\mathbf{x}_n)|$, the Huber loss (quadratic inside $[-\delta, \delta]$, linear outside), and the $\epsilon$-insensitive loss (zero inside $[-\epsilon, \epsilon]$), with the points marked where each is not differentiable.]
Basically, any function with "kink" points is non-differentiable
At such points, the gradient is not defined
Reason: we can't define a unique tangent at such points
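For reference, a small sketch of these loss functions as a function of the residual $t = y_n - f(\mathbf{x}_n)$; the function names and default parameters are placeholders, and the exact scaling conventions (e.g., the 1/2 factors in the Huber loss) follow one common definition, which may differ from the course's.

```python
import numpy as np

def absolute_loss(t):
    return np.abs(t)                         # kink at t = 0

def epsilon_insensitive_loss(t, eps=0.1):
    return np.maximum(0.0, np.abs(t) - eps)  # kinks at t = -eps and t = +eps

def huber_loss(t, delta=1.0):
    # quadratic inside [-delta, delta], linear outside
    return np.where(np.abs(t) <= delta,
                    0.5 * t ** 2,
                    delta * (np.abs(t) - 0.5 * delta))
```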
Sub-gradients
For a convex non-differentiable function, we can define sub-gradients at the point(s) of non-differentiability
A vector $\mathbf{g}$ is a sub-gradient of a convex function $f$ at $\mathbf{x}$ if $f(\mathbf{y}) \ge f(\mathbf{x}) + \mathbf{g}^\top (\mathbf{y} - \mathbf{x})$ for all $\mathbf{y}$
[Figure: a convex function $f(x)$, which lies above all its tangents. At a differentiable point $x_1$ there is a unique tangent. At a non-differentiable point $x_2$ there are two extreme tangents, and the region between them contains all the sub-gradients.]
$\mathbf{w}$ is a stationary point for a non-differentiable function $f$ if the zero vector belongs to the sub-differential (the set of all sub-gradients) at $\mathbf{w}$, i.e., $\mathbf{0} \in \partial f(\mathbf{w})$
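As a quick worked example of this stationarity condition, consider the scalar function $f(x) = |x|$:
$$\partial f(x) = \begin{cases} \{+1\} & \text{if } x > 0 \\ \{-1\} & \text{if } x < 0 \\ [-1, +1] & \text{if } x = 0 \end{cases}$$
Since $0 \in \partial f(0) = [-1, +1]$, the point $x = 0$ is a stationary point of $|x|$ (and is in fact its minimum).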
Sub-Gradient For Absolute Loss Regression
[Figure: plot of $|y_n - \mathbf{w}^\top \mathbf{x}_n|$ as a function of $t = y_n - \mathbf{w}^\top \mathbf{x}_n$, with a kink at $t = 0$.]
The loss function for linear regression with the absolute loss: $L_n(\mathbf{w}) = |y_n - \mathbf{w}^\top \mathbf{x}_n|$
Non-differentiable at $y_n - \mathbf{w}^\top \mathbf{x}_n = 0$
Can use the affine transform rule of sub-differential calculus
Assume $t = y_n - \mathbf{w}^\top \mathbf{x}_n$. Then a sub-gradient of $L_n(\mathbf{w})$ is
$-\mathbf{x}_n$ if $t > 0$
$+\mathbf{x}_n$ if $t < 0$
$-\alpha\,\mathbf{x}_n$ where $\alpha \in [-1, +1]$ if $t = 0$
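Summed over all training examples, these sub-gradients can be computed in a few lines; the sketch below picks $\alpha = 0$ at the kinks, and the function name is a placeholder.

```python
import numpy as np

def abs_loss_subgradient(w, X, y):
    """A sub-gradient of L(w) = sum_n |y_n - w^T x_n|."""
    t = y - X @ w        # residuals t_n = y_n - w^T x_n
    s = np.sign(t)       # +1, -1, or 0 at the kink (0 corresponds to choosing alpha = 0)
    return -X.T @ s      # sum over n of -sign(t_n) * x_n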
Sub-Gradient Descent
Suppose we have a non-differentiable (but convex) function to minimize. Sub-gradient descent proceeds just like gradient descent, except that at each iteration a sub-gradient $\mathbf{g}^{(t)} \in \partial L(\mathbf{w}^{(t)})$ is used in place of the gradient: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t\, \mathbf{g}^{(t)}$
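A minimal sketch of this procedure, reusing the absolute-loss sub-gradient above; the decaying step size $\eta_t = \eta_0/\sqrt{t}$ is a common choice for sub-gradient methods, not something specified on the slide, and the names are placeholders.

```python
import numpy as np

def subgradient_descent(subgrad_fn, w0, eta0=0.1, num_iters=500):
    """Like gradient descent, but uses a sub-gradient at each iteration."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, num_iters + 1):
        g = subgrad_fn(w)                 # any valid sub-gradient at w^(t)
        w = w - (eta0 / np.sqrt(t)) * g   # decaying step size eta_t = eta0 / sqrt(t)
    return w

# Hypothetical usage with the absolute-loss sub-gradient defined earlier:
# w_hat = subgradient_descent(lambda w: abs_loss_subgradient(w, X, y), np.zeros(X.shape[1]))
```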