Machine Learning

(521289S)
Lecture 2

Prof. Tapio Seppänen


Center for Machine Vision and Signal Analysis
University of Oulu

FROM PREVIOUS LECTURE:

Optimizing a model for linear regression: cost
function minimization
• Parameters to be optimized are the Slope and Intercept of the fitting line
– Parameter space is 2-dimensional with axes Slope and Intercept
• Different values are tried for the parameters until an optimal solution is found
• The goodness of a model is evaluated by computing the cost function value g()
– A smaller value of g() indicates a better model

• An example with two model candidates:
– Left: an optimal solution (g() has the smallest value)
– Right: a non-optimal solution
– The red dot shows the parameter values in the parameter space
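To make the cost function concrete, below is a minimal sketch in Python (my own illustration, not from the slides; the function names and the use of mean squared error as g() are assumptions):

import numpy as np

def g(slope, intercept, x, y):
    # Mean squared error of the line y_hat = slope * x + intercept.
    # A smaller value indicates a better model (assumed form of the cost g()).
    y_hat = slope * x + intercept
    return np.mean((y - y_hat) ** 2)

# Two candidate parameter points in the 2-D (Slope, Intercept) parameter space:
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])        # data generated by y = 2x + 1
print(g(2.0, 1.0, x, y))                  # optimal candidate: g() = 0.0
print(g(0.5, 2.0, x, y))                  # non-optimal candidate: larger g()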
Zero-order optimality condition
• Our aim is to search for a minimum in the cost function g()
• The minimization problem can be stated as: minimize g(w_1, w_2, …, w_N) over the parameters w_1, …, w_N
– The function arguments w_i are the model parameters


• By stacking all the parameters into an N-dimensional vector w, we can express the minimization problem compactly as: minimize g(w) over w
– The vector w represents a point in the parameter space

• Global minimum: search for a point w* in the parameter space such that g(w*) ≤ g(w) for all w
• Local minimum: sometimes the cost function has several minima, and we may be satisfied with any of them, or with one that gives sufficient accuracy / low cost; then g(w*) ≤ g(w) is required only for all w in a small neighborhood of w* in the parameter space
Local optimization techniques
• Aim: take a randomly selected initial point w0 in the parameter space and start refining it sequentially (= iteration) to search for a minimum in the cost function
– In each iteration round k, always select the next point wk such that we go downwards on the cost function
– While iterating K rounds we get a sequence w0, w1, w2, w3, …, wK such that: g(w0) ≥ g(w1) ≥ g(w2) ≥ … ≥ g(wK)
• With complex-shaped cost functions, one may find a local minimum instead of the global minimum
– If that local optimum has too high a cost for the application, then restart from a different initial point and repeat the procedure
CHAPTER 3:
FIRST-ORDER OPTIMIZATION TECHNIQUES

First-order optimality condition
• At a minimum of a 1-dimensional function g(w), the derivative is zero: dg/dw (v) = 0
– The slope of the tangent line is zero at the minimum point w = v
• For an N-dimensional function, the gradient is zero at a minimum of the function: ∇g(v) = 0
– All partial derivatives are zero: ∂g/∂w_i (v) = 0 for i = 1, …, N
– The slope of the tangent hyperplane is zero at the minimum point w = v
Solving the minimum point analytically
• The first-order system of equations can be very difficult or virtually impossible to solve algebraically in the general case when the cost function is very complex
– Solvable in some special cases only
– The function may include several minima, maxima and saddle points, which are all solutions to the first-order system of equations
• Therefore, in practice the solution is searched for with iterative techniques
Steepest ascent and descent directions
• The gradient computed at a particular point w of the cost function yields a vector that points in the direction of steepest ascent of the cost function: the direction of fastest increase
• When searching for the minimum point of the cost function, we need to move in the steepest descent direction instead, i.e. in the opposite direction!
• Note: the ascent/descent direction is defined in the parameter space, as we are searching for good increments to the parameter values which will eventually minimize the cost function

Figure: the steepest ascent ("up") and steepest descent ("down") directions
Gradient descent algorithm
• The basic structure of the algorithm is iterative:
– Set an initial point for the parameter vector: w0
– Iterate the parameter values until a halting condition is met: wk = wk-1 + α dk-1
• The direction vector dk-1 is now the negative gradient at the current point in the parameter space: dk-1 = -∇g(wk-1)
• The update rule is thus: wk = wk-1 - α ∇g(wk-1)
Figure: in the (w1, w2) parameter space, the gradient points uphill and the direction d = -gradient points downhill
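A minimal sketch of the loop in Python (my own illustration; the function names, the fixed steplength and the toy quadratic cost are assumptions, not from the slides):

import numpy as np

def gradient_descent(grad_g, w0, alpha=0.1, K=100):
    # Basic gradient descent: wk = wk-1 - alpha * grad_g(wk-1)
    w = np.asarray(w0, dtype=float)
    history = [w.copy()]
    for k in range(1, K + 1):
        w = w - alpha * grad_g(w)         # step in the steepest descent direction
        history.append(w.copy())
    return w, history

# Example: g(w) = w1^2 + 2*w2^2 has gradient [2*w1, 4*w2] and minimum at [0, 0]
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])
w_final, hist = gradient_descent(grad_g, w0=[3.0, -2.0], alpha=0.1, K=50)
print(w_final)                            # close to the minimum [0, 0]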
Gradient descent in nonconvex cost
functions
• If the cost function has a complex shape with more than one extremum point (minimum, maximum, saddle point), which of them the algorithm finds depends on chance
– Such cost functions are nonconvex (convex = an ’upwards opening bowl’)
• Thus, the algorithm must be run several times with different initial points w0 to find a satisfactory solution (see the sketch below)
• With a convex cost function, the algorithm will find the optimum if it keeps iterating long enough
Figure: examples of a nonconvex function and a convex function
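A sketch of the restart strategy in Python (my own illustration; the helper name, the search range and the plain gradient descent inner loop are assumptions):

import numpy as np

def best_of_restarts(g, grad_g, n_restarts=10, dim=2, alpha=0.1, K=200, seed=0):
    # Run gradient descent from several random initial points and keep the lowest-cost result.
    rng = np.random.default_rng(seed)
    best_w, best_cost = None, np.inf
    for _ in range(n_restarts):
        w = rng.uniform(-5.0, 5.0, size=dim)   # random initial point w0 in the parameter space
        for _ in range(K):
            w = w - alpha * grad_g(w)          # plain gradient descent from this initial point
        if g(w) < best_cost:
            best_w, best_cost = w, g(w)
    return best_w, best_cost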
Steplength α control
• The previously shown choices can be applied: a fixed steplength or a diminishing steplength (α = 1/k)
• A fixed steplength must be chosen experimentally
• The cost function history plots below demonstrate that if α is not chosen properly, the search can be slow or the optimum may not be found

Figure: cost function optimization with three α values, with cost function history plots showing the convergence of the search
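A small sketch of the two steplength rules in Python (my own illustration; the names are assumptions):

def fixed_steplength(k, alpha0=0.1):
    # Fixed steplength: the same alpha on every iteration (chosen experimentally).
    return alpha0

def diminishing_steplength(k, alpha0=1.0):
    # Diminishing steplength: alpha = 1/k, shrinking as the iteration count k grows.
    return alpha0 / k

# Inside the gradient descent loop, for iteration k = 1, 2, ...:
#   alpha = diminishing_steplength(k)
#   w = w - alpha * grad_g(w)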
An example of comparing the fixed
and diminishing steplength control
• The cost function history plot may indicate that the search misses the minimum or even oscillates around it
An example of nonconvex cost
function optimization
• Depending on the α value, the cost function history plot may show fast convergence, slow convergence, oscillation, or halting at a local minimum
Halting conditions
• A halting condition should be tested at each iteration to decide when to stop the search
– One should halt near a stationary point (minimum, maximum, saddle point) where the cost function no longer changes much
• Alternative implementations of halting conditions at iteration k (see the sketch below):
– The magnitude of the gradient is sufficiently small: ‖∇g(wk)‖ ≤ ε
– The change in the parameter vector is sufficiently small: ‖wk − wk−1‖ ≤ ε
– The change in the cost function value is sufficiently small: |g(wk) − g(wk−1)| ≤ ε
– A fixed maximum number of iterations Kmax has been completed
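A sketch of these checks inside an iteration loop in Python (my own illustration; the tolerance values and names are assumptions):

import numpy as np

def should_halt(w, w_prev, g, grad_g, k, eps=1e-6, K_max=1000):
    # Return True if any of the halting conditions above is met at iteration k.
    if np.linalg.norm(grad_g(w)) <= eps:       # gradient magnitude sufficiently small
        return True
    if np.linalg.norm(w - w_prev) <= eps:      # parameter change sufficiently small
        return True
    if abs(g(w) - g(w_prev)) <= eps:           # cost change sufficiently small
        return True
    return k >= K_max                          # iteration budget exhausted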
About gradient direction
• The gradient is always perpendicular to the local contour of the cost function
• Thus, the search algorithm always proceeds downhill towards the closest minimum
• Thus, if there are many local minima, the search will end at one of them
Zig-zagging behavior of gradient
descent (1/2)
• Very often in real applications, a minimum is located in an area of parameter space in which the cost function is a slowly varying, long and narrow valley surrounded by steep walls
• Because the gradient direction is always perpendicular to the local contour of the cost function, the trajectory may not proceed directly towards the minimum but ’zig-zags’ as it makes update steps during the iteration
Figure: optimization trajectories on three different cost functions
Zig-zagging behavior of gradient
descent (2/2)
• A popular solution to the problem is to apply momentum-accelerated gradient descent with a momentum parameter 0 < β < 1
– Basically, it smooths the direction trajectory by exponential smoothing (see the sketch below)
Figure: cost function optimization with basic gradient descent (β = 0.0) and with momentum-accelerated gradient descent (β = 0.2 and β = 0.7)
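A sketch of one common momentum formulation in Python, smoothing the negative-gradient directions exponentially (my own illustration; the exact update convention used on the slides may differ):

import numpy as np

def momentum_gradient_descent(grad_g, w0, alpha=0.1, beta=0.7, K=100):
    # The descent direction is an exponentially smoothed average of past negative gradients:
    #   d = beta * d - (1 - beta) * grad, so beta = 0 reduces to basic gradient descent.
    w = np.asarray(w0, dtype=float)
    d = np.zeros_like(w)
    for k in range(K):
        d = beta * d - (1.0 - beta) * grad_g(w)   # smooth the direction trajectory
        w = w + alpha * d                          # step along the smoothed direction
    return w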
Slow-crawling behavior of gradient
descent (1/2)
• As the gradient magnitude starts to vanish close to a stationary point, the iteration becomes very slow and little progress is made
Slow-crawling behavior of gradient
descent (2/2)
• A popular solution to the problem is to apply normalized gradient descent
• Let’s keep just the gradient direction and normalize the gradient length to unity: wk = wk-1 - α ∇g(wk-1) / ‖∇g(wk-1)‖
• Here, only α determines the steplength (fixed length, diminishing length, …)
• Often, the following modification is used instead: wk = wk-1 - α ∇g(wk-1) / (‖∇g(wk-1)‖ + ε)
– A small ε > 0 is used to avoid division by zero
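A sketch of the modified update step in Python (my own illustration; the names and the ε value are assumptions):

import numpy as np

def normalized_gd_step(w, grad_g, alpha=0.1, eps=1e-8):
    # Keep only the gradient direction (unit length), so alpha alone determines the
    # steplength; the small eps in the denominator avoids division by zero.
    grad = grad_g(w)
    return w - alpha * grad / (np.linalg.norm(grad) + eps)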
CHAPTER 4:
SECOND-ORDER OPTIMIZATION TECHNIQUES

The basic idea
• The local curvature (2nd derivative) of the cost function at the point wk is also considered when deciding on the best direction and steplength to proceed with in the optimization process
• Often, Newton’s method is used (see the sketch below)
– 1-D case (1 parameter only): wk = wk-1 - g'(wk-1) / g''(wk-1)
– N-D case (multiple parameters): wk = wk-1 - [∇²g(wk-1)]⁻¹ ∇g(wk-1)
– The second-order derivative ∇²g(wk-1) is the N×N Hessian matrix
• The Hessian appears inverted in the update rule!
– Fast convergence in many problems
– If the cost function g() is a polynomial of order 2, the algorithm solves the optimization in one step
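A minimal N-dimensional sketch of Newton's method in Python (my own illustration; the small regularization term eps anticipates the stabilized variant discussed under weakness 1 below):

import numpy as np

def newtons_method(grad_g, hess_g, w0, eps=1e-7, K=20):
    # Newton update: wk = wk-1 - (H + eps*I)^(-1) * gradient, where H is the Hessian.
    # Solving the linear system avoids forming the matrix inverse explicitly.
    w = np.asarray(w0, dtype=float)
    I = np.eye(w.size)
    for k in range(K):
        H = hess_g(w)                                   # N x N Hessian matrix
        step = np.linalg.solve(H + eps * I, grad_g(w))  # (H + eps*I) step = gradient
        w = w - step
    return w

# Example: the order-2 polynomial g(w) = w1^2 + 2*w2^2 is solved in essentially one step
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])
hess_g = lambda w: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newtons_method(grad_g, hess_g, [3.0, -2.0]))      # approximately [0, 0]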
Second-order optimality condition
• An iterative search algorithm halts at a stationary point, which represents a minimum, a maximum or an inflection point of the cost function
– The gradient is then equal to the zero vector!
• If the cost function is convex, the algorithm will eventually halt at the minimum
• However, in nonconvex cases, the algorithm can also halt at a local maximum or an inflection point
Second-order optimality condition
• One can test which case holds by checking the second derivative (1-D case): g''(v) > 0 indicates a minimum, g''(v) < 0 a maximum, and g''(v) = 0 a possible inflection point
• In the N-dimensional case, the condition is defined with the second-order derivative (the Hessian matrix) instead
– The eigenvalues of the Hessian matrix are considered: all positive indicates a minimum, all negative a maximum, and mixed signs a saddle point (see the sketch below)
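A small sketch of the eigenvalue test at a stationary point v in Python (my own illustration; the tolerance is an assumption):

import numpy as np

def classify_stationary_point(hess_g, v, tol=1e-10):
    # Classify a stationary point by the eigenvalues of the (symmetric) Hessian at v.
    eigenvalues = np.linalg.eigvalsh(hess_g(v))
    if np.all(eigenvalues > tol):
        return "minimum"
    if np.all(eigenvalues < -tol):
        return "maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    return "inconclusive (some eigenvalues are near zero)"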
Weakness 1 of Newton’s method
• The basic form of Newton’s algorithm may halt at a local maximum or an inflection
point if the cost function is nonconvex

Weakness 1 of Newton’s method
• However, by using the version with a regularization term ε > 0, the cost function is locally transformed into a convex shape (proof skipped here): wk = wk-1 - (∇²g(wk-1) + εI)⁻¹ ∇g(wk-1)
– Then, Newton’s method will continue towards the minimum point!
– ε > 0 is a regularization term which also stabilizes the update rule by avoiding division by zero
– I is the identity matrix, with all 1’s on the main diagonal
Figure: convexification of a cost function
Weakness 2 of Newton’s method
• The method suffers from a scaling limitation: the computational complexity grows very fast with the number of parameters N
– The problem is the computation of the inverse of the N×N Hessian matrix of second derivatives
• Hessian-free optimization methods have been developed to solve the issue
– The Hessian matrix is replaced with a close approximation that does not suffer from the issue
– Subsampling the Hessian: uses only a fraction of the matrix entries
– Quasi-Newton methods: the Hessian is replaced with a low-rank approximation that can be computed efficiently
• The details of these methods are omitted here (an interested reader can check the textbook appendix for more details)
About laboratory assignments
• Currently, software is available for automatic differentiation of functions
• The functions are parsed into elementary functions automatically, which enables automatic differentiation
• The numerical methods used have been developed by highly skilled professional programmers and mathematicians
• The latest Matlab version includes functionality for this
• In the laboratory exercises, automatic differentiation will be used in the machine learning assignments
– The programmer defines the cost function, and automatic differentiation yields the gradients etc. (see the sketch below)
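As a small illustration of the idea in Python using the JAX library (the laboratory exercises use Matlab's own automatic differentiation instead; this is only an assumed, roughly equivalent sketch):

import jax
import jax.numpy as jnp

def g(w):
    # A cost function defined by the programmer (a simple example).
    return w[0] ** 2 + 2.0 * w[1] ** 2

grad_g = jax.grad(g)                       # automatic differentiation yields the gradient function
print(grad_g(jnp.array([3.0, -2.0])))      # [ 6. -8.]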
