Optimization For ML (2) : CS771: Introduction To Machine Learning Piyush Rai


The Plan
 Some basic techniques for solving optimization problems
 First-order optimality
 Gradient descent
 Dealing with non-differentiable functions
 Sub-gradients and sub-differential

Optimization Problems in ML
 The general form of an optimization problem in ML is usually
$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w} \in \mathcal{C}} L(\boldsymbol{w})$
 Here $L$ denotes the loss function to be optimized, usually a sum of the training error and a regularizer
 $\mathcal{C}$ is the constraint set that the solution must belong to, e.g.,
 Non-negativity constraint: all entries in $\boldsymbol{w}$ must be non-negative
 Sparsity constraint: $\boldsymbol{w}$ is a sparse vector with at most a specified number of non-zeros
 If no $\mathcal{C}$ is specified, it is an unconstrained optimization problem
 The linear and ridge regression problems we saw earlier were unconstrained ($\boldsymbol{w}$ was a real-valued vector), but it is also possible to have problems where the solution must satisfy constraints (e.g., non-negativity, sparsity, or even both)
 Constrained optimization problems can be converted into unconstrained ones (will see later)
 For now, assume we have an unconstrained optimization problem

Methods for Solving Optimization Problems
Method 1: Using First-Order Optimality
 Very simple; we already used this approach for linear and ridge regression
 Called "first-order" since only the gradient is used, and the gradient provides first-order information about the function being optimized
 First-order optimality: the gradient must be equal to zero at the optima
$\nabla L(\boldsymbol{w}) = 0$
 This approach works only for very simple problems where the objective is convex and there are no constraints on the values $\boldsymbol{w}$ can take
 Sometimes, setting $\nabla L(\boldsymbol{w}) = 0$ and solving for $\boldsymbol{w}$ gives a closed-form solution
 If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algorithms, like gradient descent
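As a concrete sketch of the first-order-optimality approach, consider ridge regression: setting the gradient of the regularized squared loss to zero yields a closed-form solution. The data, regularization weight, and variable names below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Ridge regression objective: L(w) = ||y - Xw||^2 + lam * ||w||^2
# Gradient: dL/dw = -2 X^T (y - Xw) + 2 lam w
# First-order optimality (gradient = 0) gives: (X^T X + lam I) w = X^T y
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

lam = 0.1
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Verify that the gradient is (numerically) zero at the solution
grad = -2 * X.T @ (y - X @ w_hat) + 2 * lam * w_hat
print(np.allclose(grad, 0.0, atol=1e-8))  # True
```

The same pattern (write down the gradient, set it to zero, solve) is what gave the closed-form solutions for linear and ridge regression earlier in the course.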
Method 2: Iterative Optimization via Gradient Descent
 Called "iterative" since it requires several steps/iterations to find the optimal solution
 Fact: the gradient gives the direction of steepest change in the function's value
 For maximization problems, we can use gradient ascent instead (move along the gradient rather than against it)
 For convex functions, GD will converge to the global minima; for non-convex functions, a good initialization is needed

Gradient Descent
 Initialize $\boldsymbol{w}$ as $\boldsymbol{w}^{(0)}$
 For iteration $t = 0, 1, 2, \ldots$ (or until convergence)
 Calculate the gradient $\boldsymbol{g}^{(t)}$ using the current iterate $\boldsymbol{w}^{(t)}$
 Set the learning rate $\eta_t$
 Move in the opposite direction of the gradient (justification shortly):
$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t\, \boldsymbol{g}^{(t)}$

 The learning rate is very important and should be set carefully (fixed, or chosen adaptively); will discuss some strategies later
 Sometimes it may be tricky to assess convergence; will see some methods later
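The GD loop above can be sketched in a few lines. The quadratic objective, fixed learning rate, and iteration count here are illustrative assumptions for the sketch.

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, num_iters=100):
    """Minimize a function via fixed-step gradient descent.

    grad_fn: returns the gradient g at the current iterate w
    w0: initial point; lr: learning rate eta; num_iters: iterations
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        g = grad_fn(w)      # gradient at the current iterate
        w = w - lr * g      # move opposite to the gradient
    return w

# Minimize the convex quadratic L(w) = ||w - (3, -1)||^2, gradient 2(w - target)
target = np.array([3.0, -1.0])
w_star = gradient_descent(lambda w: 2 * (w - target), w0=np.zeros(2))
print(w_star)  # close to [3, -1]
```

For this convex objective any reasonable learning rate converges to the global minimum; the non-convex case discussed next is where initialization starts to matter.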
Gradient Descent: An Illustration
 [Figure: GD on a convex function $L(\boldsymbol{w})$. Where the gradient is negative, we move in the positive direction; where it is positive, we move in the negative direction. The iterates $\boldsymbol{w}^{(0)}, \boldsymbol{w}^{(1)}, \boldsymbol{w}^{(2)}, \boldsymbol{w}^{(3)}, \ldots$ reach the global minimum $\boldsymbol{w}^*$]
 [Figure: GD on a non-convex function. With a good initialization the iterates find the global minimum; with a bad one they get stuck at a local minimum]
 The learning rate and (for non-convex functions) a good initialization are very important
GD: An Example
 Let's apply GD to least squares linear regression
$L(\boldsymbol{w}) = \sum_{n=1}^N (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2$
 The gradient: $\boldsymbol{g} = -2 \sum_{n=1}^N (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)\, \boldsymbol{x}_n$
 Each GD update will be of the form
$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} + 2 \eta_t \sum_{n=1}^N (y_n - \boldsymbol{w}^{(t)\top} \boldsymbol{x}_n)\, \boldsymbol{x}_n$
 The term $(y_n - \boldsymbol{w}^{(t)\top} \boldsymbol{x}_n)$ is the prediction error of the current model on the $n$-th training example: training examples on which the current model's error is large contribute more to the update
 Exercise: assume a single training example $(\boldsymbol{x}_n, y_n)$ and a suitably small learning rate, and show that the GD update improves the prediction on this input, i.e., $\boldsymbol{w}^{(t+1)\top} \boldsymbol{x}_n$ is closer to $y_n$ than $\boldsymbol{w}^{(t)\top} \boldsymbol{x}_n$ is
 This is sort of a proof that GD updates are "corrective" in nature (and this is true not just for linear regression but can also be shown for various other ML models)
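These updates can be sketched as runnable code. The data, learning rate, and iteration count are illustrative assumptions, and the final check numerically verifies the exercise's "corrective" claim for one training example.

```python
import numpy as np

# GD for least squares: L(w) = sum_n (y_n - w^T x_n)^2
# Gradient: g = -2 * sum_n (y_n - w^T x_n) x_n
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))            # 100 training examples, 4 features
w_true = np.array([1.0, -0.5, 2.0, 0.0])
y = X @ w_true                           # noise-free targets for simplicity

w, eta = np.zeros(4), 0.001
for t in range(500):
    g = -2 * X.T @ (y - X @ w)           # large-error examples contribute more
    w = w - eta * g                      # move against the gradient
print(np.allclose(w, w_true, atol=1e-3))  # True

# The exercise's claim, checked for a single training example:
x_n, y_n = np.array([1.0, 2.0, 0.5, -1.0]), 3.0
w_old = np.zeros(4)
w_new = w_old - 0.01 * (-2 * (y_n - w_old @ x_n) * x_n)   # one GD step
print(abs(w_new @ x_n - y_n) < abs(w_old @ x_n - y_n))    # True
```
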
Dealing with Non-differentiable Functions
 In many ML problems, the objective function will be non-differentiable
 Some examples that we have already seen: linear regression with absolute loss, Huber loss, or $\epsilon$-insensitive loss; even some norm regularizers (e.g., the $\ell_1$ norm) are non-differentiable
 [Figure: plots of the absolute loss $|y_n - f(\boldsymbol{x}_n)|$, the Huber loss, and the $\epsilon$-insensitive loss $|y_n - f(\boldsymbol{x}_n)| - \epsilon$, with the points of non-differentiability marked at $0$, $\pm\delta$, and $\pm\epsilon$, respectively]
 Basically, any function with "kink" points is non-differentiable
 At such points, the function has no well-defined gradient
 Reason: we can't define a unique tangent at such points
Sub-gradients
 For a convex non-differentiable function, we can define sub-gradients at its point(s) of non-differentiability
 [Figure: a convex function $f(x)$ with a unique tangent at a differentiable point $x_1$, and two extreme tangents at a non-differentiable point $x_2$; the region between the extreme tangents contains all the sub-gradients. Being convex, $f$ lies above all its tangents]
 For a convex, non-differentiable function $f$, a sub-gradient at $\boldsymbol{x}^*$ is any vector $\boldsymbol{g}$ such that, for all $\boldsymbol{x}$,
$f(\boldsymbol{x}) \ge f(\boldsymbol{x}^*) + \boldsymbol{g}^\top (\boldsymbol{x} - \boldsymbol{x}^*)$
Sub-gradients, Sub-differential, and Some Rules
 The set of all sub-gradients at a non-differentiable point is called the sub-differential
$\partial f(\boldsymbol{x}^*) \triangleq \{\boldsymbol{g} : f(\boldsymbol{x}) \ge f(\boldsymbol{x}^*) + \boldsymbol{g}^\top(\boldsymbol{x} - \boldsymbol{x}^*) \;\; \forall \boldsymbol{x}\}$
 Some basic rules of sub-differential calculus to keep in mind
 Scaling rule: $\partial(\alpha f) = \alpha\, \partial f$ for $\alpha > 0$
 Sum rule: $\partial(f_1 + f_2) = \partial f_1 + \partial f_2$
 Affine transformation rule: if $h(\boldsymbol{x}) = f(\boldsymbol{A}\boldsymbol{x} + \boldsymbol{b})$, then $\partial h(\boldsymbol{x}) = \boldsymbol{A}^\top \partial f(\boldsymbol{A}\boldsymbol{x} + \boldsymbol{b})$; this is a special case of the more general chain rule
 Max rule: if $f = \max(f_1, f_2)$, then the sub-differential of $f$ at $\boldsymbol{x}$ is
 $\partial f_1(\boldsymbol{x})$ if $f_1(\boldsymbol{x}) > f_2(\boldsymbol{x})$, and $\partial f_2(\boldsymbol{x})$ if $f_2(\boldsymbol{x}) > f_1(\boldsymbol{x})$
 any convex combination of sub-gradients of $f_1$ and $f_2$ if $f_1(\boldsymbol{x}) = f_2(\boldsymbol{x})$
 $\boldsymbol{x}^*$ is a stationary point of a non-differentiable function $f$ if the zero vector belongs to the sub-differential at $\boldsymbol{x}^*$, i.e., $\boldsymbol{0} \in \partial f(\boldsymbol{x}^*)$
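The max rule and the stationarity condition can be sketched together on the 1-D function $|x| = \max(x, -x)$; representing the sub-differential as an interval is an illustrative choice for this sketch.

```python
def subdifferential_abs(x):
    """Sub-differential of |x| = max(x, -x) as an interval (lo, hi),
    via the max rule: gradient of the strictly-larger piece away from
    the tie, and all convex combinations of {+1, -1} at the tie x = 0."""
    if x > 0:
        return (1.0, 1.0)    # f1(x) = x is active: slope +1
    if x < 0:
        return (-1.0, -1.0)  # f2(x) = -x is active: slope -1
    return (-1.0, 1.0)       # tie at x = 0: every g in [-1, 1]

# Stationarity: x* is a stationary point iff 0 lies in the sub-differential
lo, hi = subdifferential_abs(0.0)
print(lo <= 0.0 <= hi)           # True: x = 0 minimizes |x|
print(subdifferential_abs(2.5))  # (1.0, 1.0)
```

Note that $0 \in [-1, 1] = \partial|x|$ at $x = 0$, which certifies $x = 0$ as the minimizer even though no gradient exists there.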
Sub-Gradient for Absolute Loss Regression
 The loss function for linear regression with absolute loss: $L(\boldsymbol{w}) = \sum_{n=1}^N |y_n - \boldsymbol{w}^\top \boldsymbol{x}_n|$
 Non-differentiable wherever $y_n - \boldsymbol{w}^\top \boldsymbol{x}_n = 0$
 Can use the affine transformation rule of sub-differential calculus
 Let $t = y_n - \boldsymbol{w}^\top \boldsymbol{x}_n$. The sub-differential of $|t|$ is $1$ if $t > 0$, $-1$ if $t < 0$, and any $g \in [-1, 1]$ if $t = 0$
 By the affine transformation rule, a sub-gradient of each term with respect to $\boldsymbol{w}$ is $-\operatorname{sign}(t)\, \boldsymbol{x}_n$ (with any value in $[-1, 1]$ replacing the sign at $t = 0$)
 [Figure: the plot of $|t|$ with its kink at $t = 0$, i.e., at $y_n = \boldsymbol{w}^\top \boldsymbol{x}_n$, where every slope in $[-1, 1]$ is a valid sub-gradient]
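This sub-gradient computation can be sketched for the full absolute loss; picking $g = 0$ at exact fits is one valid choice, and the tiny check below uses illustrative data.

```python
import numpy as np

def abs_loss_subgradient(w, X, y):
    """A sub-gradient of L(w) = sum_n |y_n - w^T x_n| with respect to w.
    By the affine-transform rule each term contributes -sign(t) * x_n with
    t = y_n - w^T x_n; np.sign returns 0 at t = 0, a valid choice there."""
    t = y - X @ w
    return -X.T @ np.sign(t)

# Small check: w = 0, X = I, y = (1, -1) gives residuals t = (1, -1),
# so the sub-gradient is -(x_1 - x_2) = (-1, 1)
g = abs_loss_subgradient(np.zeros(2), np.eye(2), np.array([1.0, -1.0]))
print(g)  # [-1.  1.]
```
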
Sub-Gradient Descent
 Suppose we have a non-differentiable function
 Sub-gradient descent is almost identical to GD, except we use sub-gradients

Sub-Gradient Descent
 Initialize $\boldsymbol{w}$ as $\boldsymbol{w}^{(0)}$
 For iteration $t = 0, 1, 2, \ldots$ (or until convergence)
 Calculate a sub-gradient $\boldsymbol{g}^{(t)}$ at the current iterate $\boldsymbol{w}^{(t)}$
 Set the learning rate $\eta_t$
 Move in the opposite direction of the sub-gradient:
$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t\, \boldsymbol{g}^{(t)}$
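Putting the pieces together, here is a sub-gradient descent sketch for absolute-loss regression; the data, the decaying step size $\eta_t = c/\sqrt{t}$ (a common choice for sub-gradient methods), and the iteration budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true                       # noise-free targets

loss = lambda w: np.abs(y - X @ w).sum()
w = np.zeros(3)
init_loss = best_loss = loss(w)
for t in range(1, 2001):
    g = -X.T @ np.sign(y - X @ w)    # a sub-gradient of the absolute loss
    w = w - (0.01 / np.sqrt(t)) * g  # decaying learning rate eta_t
    best_loss = min(best_loss, loss(w))

print(best_loss < init_loss)  # True: the best iterate improves on w = 0
```

Tracking the best iterate reflects how sub-gradient methods are usually analyzed: individual steps need not decrease the loss, but the best loss seen so far converges toward the optimum.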
Coming up next
 Making GD faster: Stochastic gradient descent
 Constrained optimization
 Co-ordinate descent
 Alternating optimization
 Practical issues in optimization for ML
