
CSC 311 Intro to Machine Learning

Gradient Descent

Alice Gao
Learning Objectives
By the end of this lecture, you should be able to

• Explain the rationale behind the gradient descent update rule (direction and step size).
• Derive the gradient descent updates for linear regression with multiple features.
• Compare and contrast the direct solution and gradient descent.



Outline
1. Gradient Descent



GRADIENT DESCENT



Why Use Gradient Descent?
• A general method for optimizing a function

• Easier to implement than direct solution

• More efficient than direct solution


• Each GD update costs 𝑂(𝑁𝐷)
• rather than 𝑂(𝐷³) for matrix inversion in the direct solution
• much cheaper if 𝐷 is large (i.e., high-dimensional data); see the sketch below
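To make the cost comparison concrete, here is a minimal NumPy sketch (not from the slides) contrasting one gradient descent update with the direct least-squares solution; the data, dimensions, and variable names are made-up illustrations.

import numpy as np

# Toy data: N examples, D features (illustrative values only).
N, D = 1000, 50
X = np.random.randn(N, D)
t = np.random.randn(N)
w = np.zeros(D)
alpha = 0.1

# One gradient descent update: dominated by two matrix-vector
# products with X, i.e., O(N D) work per update.
w = w - alpha * (X.T @ (X @ w - t)) / N

# Direct solution via the normal equations: forming X^T X costs
# O(N D^2) and solving the D-by-D system costs O(D^3).
w_direct = np.linalg.solve(X.T @ X, X.T @ t)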



What is Gradient Descent?
An iterative method for finding a (local) minimum of a function.

Consider a scalar function 𝐹: ℝ → ℝ.


We want to minimize the function 𝐹(𝑤).

Gradient descent procedure:


1. Start with a random point 𝑤₀.
2. Apply an update rule iteratively
until a stopping condition is met (see the sketch below).
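As a minimal sketch (not part of the original slides), the procedure for a scalar function might look like the following; the example function 𝐹(𝑤) = (𝑤 − 3)² and all names are illustrative assumptions.

import numpy as np

def gradient_descent(dF, w0, alpha=0.1, tol=1e-8, max_iters=10_000):
    # Minimize a scalar function given its derivative dF, starting from w0.
    w = w0
    for _ in range(max_iters):
        step = alpha * dF(w)      # move opposite to the gradient
        w = w - step
        if abs(step) < tol:       # stop once the updates become tiny
            break
    return w

# Example: F(w) = (w - 3)^2, so dF/dw = 2(w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=np.random.randn())
print(w_min)  # approximately 3.0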



In which direction should we move?
Sign of the update should be the same as / opposite of the sign of the gradient.

[Figure: a curve 𝐹(𝑤) with two points 𝑤₁ and 𝑤₂ marked on the 𝑤-axis, annotated 𝑑𝐹(𝑤₁)/𝑑𝑤 < 0 and 𝑑𝐹(𝑤₂)/𝑑𝑤 > 0.]

At 𝑤 = 𝑤₁, the derivative is positive/negative, and we want to increase/decrease 𝑤.

At 𝑤 = 𝑤₂, the derivative is positive/negative, and we want to increase/decrease 𝑤.



What is the size of each update?
Each update’s size should be _______________________ the gradient’s magnitude.

[Figure: a curve 𝐹(𝑤) with steep and flat regions.]

When the curve is steep, we are likely close to/far from the minimum, and the gradient’s magnitude is small/large.

When the curve is flat, we are likely close to/far from the minimum, and the gradient’s magnitude is small/large.



Gradient Descent Update Rule
To minimize the function 𝐹(w), we use the update rule:

w ← w − 𝛼 ∇w 𝐹(w)
or

𝑤₁ ← 𝑤₁ − 𝛼 ∂𝐹(w)/∂𝑤₁   ⋯   𝑤_D ← 𝑤_D − 𝛼 ∂𝐹(w)/∂𝑤_D

Each update:
• Direction: negative of the gradient’s sign
• Size: proportional to the gradient’s magnitude



Gradient Descent: When Do We Stop?
In theory:
• Stop when w stops changing (convergence).

In practice (sketched below):
• Stop when the change in 𝐹(w) is small enough.
• Stop when we are tired of waiting.
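A small sketch of the practical stopping rules (the function and parameter names are illustrative assumptions): stop when the decrease in 𝐹(w) falls below a tolerance, or after a fixed iteration budget.

def minimize(F, grad_F, w, alpha=0.1, tol=1e-6, max_iters=10_000):
    # Run gradient descent until the change in F(w) is small enough
    # or the iteration budget runs out.
    F_old = F(w)
    for _ in range(max_iters):
        w = w - alpha * grad_F(w)
        F_new = F(w)
        if abs(F_old - F_new) < tol:   # change in F(w) is small enough
            break
        F_old = F_new
    return w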



Choosing a Learning Rate
Too small: takes too long to converge.
Too large: may diverge (see the example below).
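A tiny numerical illustration (not from the slides), using 𝐹(𝑤) = 𝑤² so 𝑑𝐹/𝑑𝑤 = 2𝑤: a learning rate of 0.1 shrinks 𝑤 toward the minimum at 0, while 1.1 makes |𝑤| grow every step.

w_small, w_large = 1.0, 1.0
for _ in range(20):
    w_small -= 0.1 * 2 * w_small   # alpha = 0.1: w -> 0.8 w each step, converges to 0
    w_large -= 1.1 * 2 * w_large   # alpha = 1.1: w -> -1.2 w each step, magnitude diverges
print(w_small, w_large)            # roughly 0.01 vs. roughly 38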



Gradient Descent Update for Linear Regression
The general update rule:
w ← w − 𝛼 ∇w ℰ(𝐰)

Update rule for linear regression:

w ← w − (𝛼/𝑁) Xᵀ(Xw − t)

or

w ← w − (𝛼/𝑁) ∑ᵢ₌₁ᴺ x⁽ⁱ⁾ (wᵀx⁽ⁱ⁾ − t⁽ⁱ⁾)
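The vectorized update can be implemented directly; the following is a minimal sketch assuming a squared-error cost ℰ(w) = (1/2𝑁) ∑ᵢ (wᵀx⁽ⁱ⁾ − t⁽ⁱ⁾)², with an illustrative function name and default settings.

import numpy as np

def gd_linear_regression(X, t, alpha=0.01, num_iters=1000):
    # X: N x D design matrix (include a column of ones for a bias term);
    # t: length-N target vector. Applies w <- w - (alpha / N) X^T (X w - t).
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        w = w - (alpha / N) * (X.T @ (X @ w - t))
    return w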



Summary: Gradient Descent
w ← w − 𝛼 ∇w ℰ(𝐰)

Update rule
• Direction: negative of the gradient.
• Magnitude: proportional to the gradient’s magnitude.

Terminating conditions
• When the change in ℰ is small enough.
• After a fixed number of iterations.

Learning rate
• Too small: takes too long to converge.
• Too large: may diverge.



A Modular Approach to ML

Model: describes relationships between variables.

Loss function: quantifies how badly the model fits the data.

Optimization algorithm: fits a model that minimizes the loss.

