Machine Learning

(521289S)
Lecture 2

Prof. Tapio Seppänen


Center for Machine Vision and Signal Analysis
University of Oulu

FROM PREVIOUS LECTURE:

Optimizing a model for linear regression: cost
function minimization
• Parameters to be optimized are the Slope and Intercept of the fitting line
– Parameter space is 2-dimensional with axes Slope and Intercept
• Different values are tried for the parameters until an optimal solution is found
• The goodness of a model is evaluated by computing the cost function value g()
– A smaller value of g() indicates a better model

• An example with two model candidates:
– Left: an optimal solution (g() has the smallest value)
– Right: a non-optimal solution
– The red dot shows the parameter values in the parameter space
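To make the cost function concrete, below is a minimal sketch in Python (my own illustration, not from the slides; the function names and the use of mean squared error as g() are assumptions):

import numpy as np

def g(slope, intercept, x, y):
    # Mean squared error of the line y_hat = slope * x + intercept.
    # A smaller value indicates a better model (assumed form of the cost g()).
    y_hat = slope * x + intercept
    return np.mean((y - y_hat) ** 2)

# Two candidate parameter points in the 2-D (Slope, Intercept) parameter space:
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])        # data generated by y = 2x + 1
print(g(2.0, 1.0, x, y))                  # optimal candidate: g() = 0.0
print(g(0.5, 2.0, x, y))                  # non-optimal candidate: larger g()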
Zero-order optimality condition
• Our aim is to search for a minimum in the cost function g()
• The minimization problem can be stated as: minimize g(w_1, w_2, …, w_N) over the parameters w_1, …, w_N
– The function arguments w_i are the model parameters


• By stacking all the parameters into an N-dimensional vector w, we can express the minimization problem compactly as: minimize g(w) over w
– The vector w represents a point in the parameter space

• Global minimum: search for a point w* in the parameter space such that g(w*) ≤ g(w) for all w
• Local minimum: sometimes the cost function has several minima, and we may be satisfied with any of them, or with one that gives sufficient accuracy / low cost; then g(w*) ≤ g(w) is required only for all w in a small neighborhood of w* in the parameter space
Local optimization techniques
• Aim: take a randomly selected initial point w0 in the parameter space and start refining it sequentially (= iteration) to search for a minimum in the cost function
– In each iteration round k, always select the next point wk such that we go downwards on the cost function
– While iterating K rounds we get a sequence w0, w1, w2, w3, …, wK such that: g(w0) ≥ g(w1) ≥ g(w2) ≥ … ≥ g(wK)
• With complex-shaped cost functions, one may find a local minimum instead of the global minimum
– If that local optimum has too high a cost for the application, then restart from a different initial point and repeat the procedure
CHAPTER 3:
FIRST-ORDER OPTIMIZATION TECHNIQUES

First-order optimality condition
• At a minimum of a 1-dimensional function g(w), the derivative is zero: dg/dw (v) = 0
– The slope of the tangent line is zero at the minimum point w = v
• For an N-dimensional function, the gradient is zero at a minimum of the function: ∇g(v) = 0
– All partial derivatives are zero: ∂g/∂w_i (v) = 0 for i = 1, …, N
– The slope of the tangent hyperplane is zero at the minimum point w = v
Solving the minimum point analytically
• The first-order system of equations can be very difficult or virtually impossible to solve algebraically in the general case when the cost function is very complex
– Solvable in some special cases only
– The function may include several minima, maxima and saddle points, which are all solutions to the first-order system of equations
• Therefore, in practice the solution is searched for with iterative techniques
Steepest ascent and descent directions
• The gradient computed at a particular point w of the cost function yields a vector that points in the direction of steepest ascent of the cost function: the direction of fastest increase
• When searching for the minimum point of the cost function, we need to move in the steepest descent direction instead, i.e. in the opposite direction!
• Note: the ascent/descent direction is defined in the parameter space, as we are searching for good increments to the parameter values which will eventually minimize the cost function

Figure: the steepest ascent ("up") and steepest descent ("down") directions
Gradient descent algorithm
• The basic structure of the algorithm is iterative:
– Set an initial point for the parameter vector: w0
– Iterate the parameter values until a halting condition is met: wk = wk-1 + α dk-1
• The direction vector dk-1 is now the negative gradient at the current point in the parameter space: dk-1 = -∇g(wk-1)
• The update rule is thus: wk = wk-1 - α ∇g(wk-1)
Figure: in the (w1, w2) parameter space, the gradient points uphill and the direction d = -gradient points downhill
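A minimal sketch of the loop in Python (my own illustration; the function names, the fixed steplength and the toy quadratic cost are assumptions, not from the slides):

import numpy as np

def gradient_descent(grad_g, w0, alpha=0.1, K=100):
    # Basic gradient descent: wk = wk-1 - alpha * grad_g(wk-1)
    w = np.asarray(w0, dtype=float)
    history = [w.copy()]
    for k in range(1, K + 1):
        w = w - alpha * grad_g(w)         # step in the steepest descent direction
        history.append(w.copy())
    return w, history

# Example: g(w) = w1^2 + 2*w2^2 has gradient [2*w1, 4*w2] and minimum at [0, 0]
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])
w_final, hist = gradient_descent(grad_g, w0=[3.0, -2.0], alpha=0.1, K=50)
print(w_final)                            # close to the minimum [0, 0]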
Gradient descent in nonconvex cost
functions
• If the cost function has a complex shape with more than one extremum point (minimum, maximum, saddle point), which of them the algorithm finds depends on chance
– Such cost functions are nonconvex (convex = an ’upwards opening bowl’)
• Thus, the algorithm must be run several times with different initial points w0 to find a satisfactory solution (see the sketch below)
• With a convex cost function, the algorithm will find the optimum if it keeps iterating long enough
Figure: examples of a nonconvex function and a convex function
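A sketch of the restart strategy in Python (my own illustration; the helper name, the search range and the plain gradient descent inner loop are assumptions):

import numpy as np

def best_of_restarts(g, grad_g, n_restarts=10, dim=2, alpha=0.1, K=200, seed=0):
    # Run gradient descent from several random initial points and keep the lowest-cost result.
    rng = np.random.default_rng(seed)
    best_w, best_cost = None, np.inf
    for _ in range(n_restarts):
        w = rng.uniform(-5.0, 5.0, size=dim)   # random initial point w0 in the parameter space
        for _ in range(K):
            w = w - alpha * grad_g(w)          # plain gradient descent from this initial point
        if g(w) < best_cost:
            best_w, best_cost = w, g(w)
    return best_w, best_cost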
Steplength α control
• The previously shown choices can be applied: a fixed steplength or a diminishing steplength (α = 1/k)
• A fixed steplength must be chosen experimentally
• The cost function history plots below demonstrate that if α is not chosen properly, the search can be slow or the optimum may not be found

Figure: cost function optimization with three α values, with cost function history plots showing the convergence of the search
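A small sketch of the two steplength rules in Python (my own illustration; the names are assumptions):

def fixed_steplength(k, alpha0=0.1):
    # Fixed steplength: the same alpha on every iteration (chosen experimentally).
    return alpha0

def diminishing_steplength(k, alpha0=1.0):
    # Diminishing steplength: alpha = 1/k, shrinking as the iteration count k grows.
    return alpha0 / k

# Inside the gradient descent loop, for iteration k = 1, 2, ...:
#   alpha = diminishing_steplength(k)
#   w = w - alpha * grad_g(w)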
An example of comparing the fixed
and diminishing steplength control
• The cost function history plot may indicate that the search misses the minimum or even oscillates around it
An example of nonconvex cost
function optimization
• Depending on the α value, the cost function history plot may show fast convergence, slow convergence, oscillation, or halting at a local minimum
Halting conditions
• A halting condition should be tested at each iteration to decide when to stop the search
– One should halt near a stationary point (minimum, maximum, saddle point) where the cost function no longer changes much
• Alternative implementations of halting conditions at iteration k (see the sketch below):
– The magnitude of the gradient is sufficiently small: ‖∇g(wk)‖ ≤ ε
– The change in the parameter vector is sufficiently small: ‖wk − wk−1‖ ≤ ε
– The change in the cost function value is sufficiently small: |g(wk) − g(wk−1)| ≤ ε
– A fixed maximum number of iterations Kmax has been completed
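A sketch of these checks inside an iteration loop in Python (my own illustration; the tolerance values and names are assumptions):

import numpy as np

def should_halt(w, w_prev, g, grad_g, k, eps=1e-6, K_max=1000):
    # Return True if any of the halting conditions above is met at iteration k.
    if np.linalg.norm(grad_g(w)) <= eps:       # gradient magnitude sufficiently small
        return True
    if np.linalg.norm(w - w_prev) <= eps:      # parameter change sufficiently small
        return True
    if abs(g(w) - g(w_prev)) <= eps:           # cost change sufficiently small
        return True
    return k >= K_max                          # iteration budget exhausted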
About gradient direction
• The gradient is always perpendicular to the local contour of the cost function
• Thus, the search algorithm always proceeds downhill towards the closest minimum
• Thus, if there are many local minima, the search will end at one of them
Zig-zagging behavior of gradient
descent (1/2)
• Very often in real applications, a minimum is located in an area of parameter space in which the cost function is a slowly varying, long and narrow valley surrounded by steep walls
• Because the gradient direction is always perpendicular to the local contour of the cost function, the trajectory may not proceed directly towards the minimum but ’zig-zags’ as it makes update steps during the iteration
Figure: optimization trajectories on three different cost functions
Zig-zagging behavior of gradient
descent (2/2)
• A popular solution to the problem is to apply momentum-accelerated gradient descent with a momentum parameter 0 < β < 1
– Basically, it smooths the direction trajectory by exponential smoothing (see the sketch below)
Figure: cost function optimization with basic gradient descent (β = 0.0) and with momentum-accelerated gradient descent (β = 0.2 and β = 0.7)
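A sketch of one common momentum formulation in Python, smoothing the negative-gradient directions exponentially (my own illustration; the exact update convention used on the slides may differ):

import numpy as np

def momentum_gradient_descent(grad_g, w0, alpha=0.1, beta=0.7, K=100):
    # The descent direction is an exponentially smoothed average of past negative gradients:
    #   d = beta * d - (1 - beta) * grad, so beta = 0 reduces to basic gradient descent.
    w = np.asarray(w0, dtype=float)
    d = np.zeros_like(w)
    for k in range(K):
        d = beta * d - (1.0 - beta) * grad_g(w)   # smooth the direction trajectory
        w = w + alpha * d                          # step along the smoothed direction
    return w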
Slow-crawling behavior of gradient
descent (1/2)
• As the gradient magnitude starts to vanish close to a stationary point, the iteration becomes very slow and little progress is made
Slow-crawling behavior of gradient
descent (2/2)
• A popular solution to the problem is to apply normalized gradient descent
• Let’s keep just the gradient direction and normalize the gradient length to unity: wk = wk-1 - α ∇g(wk-1) / ‖∇g(wk-1)‖
• Here, only α determines the steplength (fixed length, diminishing length, …)
• Often, the following modification is used instead: wk = wk-1 - α ∇g(wk-1) / (‖∇g(wk-1)‖ + ε)
– A small ε > 0 is used to avoid division by zero
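A sketch of the modified update step in Python (my own illustration; the names and the ε value are assumptions):

import numpy as np

def normalized_gd_step(w, grad_g, alpha=0.1, eps=1e-8):
    # Keep only the gradient direction (unit length), so alpha alone determines the
    # steplength; the small eps in the denominator avoids division by zero.
    grad = grad_g(w)
    return w - alpha * grad / (np.linalg.norm(grad) + eps)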
CHAPTER 4:
SECOND-ORDER OPTIMIZATION TECHNIQUES

The basic idea
• The local curvature (2nd derivative) of the cost function at the point wk is also considered when deciding on the best direction and steplength to proceed with in the optimization process
• Often, Newton’s method is used (see the sketch below)
– 1-D case (1 parameter only): wk = wk-1 - g'(wk-1) / g''(wk-1)
– N-D case (multiple parameters): wk = wk-1 - [∇²g(wk-1)]⁻¹ ∇g(wk-1)
– The second-order derivative ∇²g(wk-1) is the N×N Hessian matrix
• The Hessian appears inverted in the update rule!
– Fast convergence in many problems
– If the cost function g() is a polynomial of order 2, the algorithm solves the optimization in one step
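A minimal N-dimensional sketch of Newton's method in Python (my own illustration; the small regularization term eps anticipates the stabilized variant discussed under weakness 1 below):

import numpy as np

def newtons_method(grad_g, hess_g, w0, eps=1e-7, K=20):
    # Newton update: wk = wk-1 - (H + eps*I)^(-1) * gradient, where H is the Hessian.
    # Solving the linear system avoids forming the matrix inverse explicitly.
    w = np.asarray(w0, dtype=float)
    I = np.eye(w.size)
    for k in range(K):
        H = hess_g(w)                                   # N x N Hessian matrix
        step = np.linalg.solve(H + eps * I, grad_g(w))  # (H + eps*I) step = gradient
        w = w - step
    return w

# Example: the order-2 polynomial g(w) = w1^2 + 2*w2^2 is solved in essentially one step
grad_g = lambda w: np.array([2.0 * w[0], 4.0 * w[1]])
hess_g = lambda w: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newtons_method(grad_g, hess_g, [3.0, -2.0]))      # approximately [0, 0]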
Second-order optimality condition
• An iterative search algorithm halts at a stationary point, which represents a minimum, a maximum or an inflection point of the cost function
– The gradient is then equal to the zero vector!
• If the cost function is convex, the algorithm will eventually halt at the minimum
• However, in nonconvex cases, the algorithm can also halt at a local maximum or an inflection point
Second-order optimality condition
• One can test which case holds by checking the second derivative (1-D case): g''(v) > 0 indicates a minimum, g''(v) < 0 a maximum, and g''(v) = 0 a possible inflection point
• In the N-dimensional case, the condition is defined with the second-order derivative (the Hessian matrix) instead
– The eigenvalues of the Hessian matrix are considered: all positive indicates a minimum, all negative a maximum, and mixed signs a saddle point (see the sketch below)
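A small sketch of the eigenvalue test at a stationary point v in Python (my own illustration; the tolerance is an assumption):

import numpy as np

def classify_stationary_point(hess_g, v, tol=1e-10):
    # Classify a stationary point by the eigenvalues of the (symmetric) Hessian at v.
    eigenvalues = np.linalg.eigvalsh(hess_g(v))
    if np.all(eigenvalues > tol):
        return "minimum"
    if np.all(eigenvalues < -tol):
        return "maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    return "inconclusive (some eigenvalues are near zero)"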
Weakness 1 of Newton’s method
• The basic form of Newton’s algorithm may halt at a local maximum or an inflection
point if the cost function is nonconvex

Weakness 1 of Newton’s method
• However, by using the version with a regularization term ε > 0, the cost function is locally transformed into a convex shape (proof skipped here): wk = wk-1 - (∇²g(wk-1) + εI)⁻¹ ∇g(wk-1)
– Then, Newton’s method will continue towards the minimum point!
– ε > 0 is a regularization term which also stabilizes the update rule by avoiding division by zero
– I is the identity matrix, with all 1’s on the main diagonal
Figure: convexification of a cost function
Weakness 2 of Newton’s method
• The method suffers from a scaling limitation: the computational complexity grows very fast with the number of parameters N
– The problem is the computation of the inverse of the N×N Hessian matrix of second derivatives
• Hessian-free optimization methods have been developed to solve the issue
– The Hessian matrix is replaced with a close approximation that does not suffer from the issue
– Subsampling the Hessian: uses only a fraction of the matrix entries
– Quasi-Newton methods: the Hessian is replaced with a low-rank approximation that can be computed efficiently
• The details of these methods are omitted here (an interested reader can check the textbook appendix for more details)
About laboratory assignments
• Currently, software is available for automatic differentiation of functions
• The functions are parsed into elementary functions automatically, which enables automatic differentiation
• The numerical methods used have been developed by highly skilled professional programmers and mathematicians
• The latest Matlab version includes functionality for this
• In the laboratory exercises, automatic differentiation will be used in the machine learning assignments
– The programmer defines the cost function, and automatic differentiation yields the gradients etc. (see the sketch below)
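As a small illustration of the idea in Python using the JAX library (the laboratory exercises use Matlab's own automatic differentiation instead; this is only an assumed, roughly equivalent sketch):

import jax
import jax.numpy as jnp

def g(w):
    # A cost function defined by the programmer (a simple example).
    return w[0] ** 2 + 2.0 * w[1] ** 2

grad_g = jax.grad(g)                       # automatic differentiation yields the gradient function
print(grad_g(jnp.array([3.0, -2.0])))      # [ 6. -8.]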
