Linear Regression With One Variable: Gradient Descent

1) Gradient descent is an algorithm for finding the minimum of a cost function by taking iterative steps proportional to the negative gradient. 2) For linear regression, gradient descent simultaneously updates the parameters θ0 and θ1 to minimize the cost function J(θ0, θ1), which is convex. 3) Two variants are covered: batch gradient descent, which uses all training examples in each step, and stochastic gradient descent, which uses a single example (or a mini-batch), making it faster on large datasets.

Linear regression with one variable
Gradient Descent
Machine Learning

Slides from CS-229 by Andrew Ng
Recap

Hypothesis: h_θ(x) = θ0 + θ1·x

Parameters: θ0, θ1

Cost Function: J(θ0, θ1) = (1/2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²

Goal: minimize J(θ0, θ1) over θ0, θ1
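As a concrete reference for the recap, here is a minimal NumPy sketch of the hypothesis and the cost function; the tiny dataset is made up purely for illustration.

import numpy as np

# Toy training set (m = 3 examples), chosen only for illustration.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def hypothesis(theta0, theta1, x):
    # h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = (1/2m) * sum over i of (h_theta(x^(i)) - y^(i))^2
    errors = hypothesis(theta0, theta1, x) - y
    return (1.0 / (2 * m)) * np.sum(errors ** 2)

print(cost(0.0, 1.0, x, y))  # perfect fit: J = 0
print(cost(0.0, 0.0, x, y))  # flat line at zero: J = (1 + 4 + 9) / 6 ≈ 2.33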
Recap

(for fixed θ1, h_θ(x) is a function of x; J(θ1) is a function of the parameter θ1)
[Figure: left, the hypothesis h_θ(x) plotted against x over the training data; right, the cost J(θ1) plotted against θ1, with the cost of the plotted fit, J(θ1) ≈ 2.3, marked.]
Idea

Have some function J(θ0, θ1)
Want min over θ0, θ1 of J(θ0, θ1)

Outline:
• Start with some θ0, θ1 (say, θ0 = 0, θ1 = 0)
• Keep changing θ0, θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum
Intuitively

[Figure: surface plot of J(θ0, θ1) over θ0 and θ1. Starting from one point and repeatedly stepping in the downhill direction, gradient descent arrives at a local minimum; starting from a slightly different point, it can end up at a different local minimum.]
Gradient descent algorithm

repeat until convergence {
  θ_j := θ_j − α · ∂/∂θ_j J(θ0, θ1)   (simultaneously update for j = 0 and j = 1)
}

Notation: "a := b" is an assignment (set a to the value of b), not the truth assertion a = b. α is the learning rate.

Correct (simultaneous update):
  temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
  temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)
  θ0 := temp0
  θ1 := temp1

Incorrect:
  temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
  θ0 := temp0
  temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)   (evaluated with the already-updated θ0)
  θ1 := temp1
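To make the simultaneous update concrete, here is a minimal sketch that runs the rule on a made-up bowl-shaped cost (not yet the linear regression cost), approximating the partial derivatives numerically; the point to notice is that both temporary values are computed before either parameter is overwritten.

def J(theta0, theta1):
    # A made-up bowl-shaped cost standing in for J(theta0, theta1).
    return (theta0 - 1.0) ** 2 + (theta1 - 2.0) ** 2

def partial(f, theta0, theta1, wrt, eps=1e-6):
    # Numerical partial derivative of f with respect to theta0 (wrt=0) or theta1 (wrt=1).
    if wrt == 0:
        return (f(theta0 + eps, theta1) - f(theta0 - eps, theta1)) / (2 * eps)
    return (f(theta0, theta1 + eps) - f(theta0, theta1 - eps)) / (2 * eps)

alpha = 0.1
theta0, theta1 = 0.0, 0.0

for _ in range(200):
    temp0 = theta0 - alpha * partial(J, theta0, theta1, wrt=0)   # uses current theta0, theta1
    temp1 = theta1 - alpha * partial(J, theta0, theta1, wrt=1)   # also uses current theta0, theta1
    theta0, theta1 = temp0, temp1                                # assign together

print(theta0, theta1)   # approaches the minimizer (1.0, 2.0)

The "incorrect" version would assign θ0 before computing temp1, so the second derivative would be evaluated at a mixture of old and new parameters.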
Linear regression with one variable
Gradient descent intuition
Machine Learning
Gradient descent algorithm

repeat until convergence {
  θ_j := θ_j − α · ∂/∂θ_j J(θ0, θ1)
}

Here α is the learning rate and ∂/∂θ_j J(θ0, θ1) is the derivative term.
Consider the simpler problem of minimizing J(θ1) over a single parameter θ1:

  θ1 := θ1 − α · d/dθ1 J(θ1)

If the slope d/dθ1 J(θ1) ≥ 0 (positive slope), then θ1 := θ1 − α · (positive number), so θ1 decreases, moving toward the minimum.

If the slope d/dθ1 J(θ1) ≤ 0 (negative slope), then θ1 := θ1 − α · (negative number), so θ1 increases, again moving toward the minimum.
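A tiny numeric check of the sign argument, using a made-up one-parameter cost J(θ1) = (θ1 − 1)² whose derivative is 2(θ1 − 1): to the right of the minimum the slope is positive and one step moves θ1 left; to the left the slope is negative and one step moves θ1 right.

alpha = 0.3

def dJ(theta1):
    # Derivative of the made-up cost J(theta1) = (theta1 - 1)^2.
    return 2.0 * (theta1 - 1.0)

theta1 = 3.0   # right of the minimum: slope is positive
print(dJ(theta1), theta1 - alpha * dJ(theta1))   # 4.0, 1.8  -> theta1 decreased

theta1 = -1.0  # left of the minimum: slope is negative
print(dJ(theta1), theta1 - alpha * dJ(theta1))   # -4.0, 0.2 -> theta1 increased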
If α is too small, gradient descent can be slow.

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
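The same made-up quadratic cost, J(θ1) = (θ1 − 1)², shows both behaviours; the three learning rates below are arbitrary, chosen only to illustrate "too small", "reasonable", and "too large".

def step(theta1, alpha):
    # One gradient descent step on J(theta1) = (theta1 - 1)^2 (derivative 2*(theta1 - 1)).
    return theta1 - alpha * 2.0 * (theta1 - 1.0)

for alpha in (0.001, 0.4, 1.5):
    theta1 = 10.0
    for _ in range(50):
        theta1 = step(theta1, alpha)
    print(alpha, theta1)
# alpha = 0.001 -> still far from the minimum at 1.0 (slow)
# alpha = 0.4   -> essentially 1.0 (converged)
# alpha = 1.5   -> astronomically large (overshoots and diverges)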
Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local minimum, the derivative term gets smaller, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.
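This can be seen numerically on the same made-up quadratic: with α held fixed, the step α · dJ/dθ1 shrinks on its own as θ1 approaches the minimum, because the derivative itself shrinks.

alpha = 0.2
theta1 = 5.0

for i in range(6):
    grad = 2.0 * (theta1 - 1.0)   # derivative of the made-up cost (theta1 - 1)^2
    step = alpha * grad           # the step actually taken this iteration
    theta1 = theta1 - step
    print(i, round(step, 4), round(theta1, 4))
# The printed steps shrink every iteration even though alpha never changes.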
Linear regression with one variable
Gradient descent for linear regression
Machine Learning
Gradient descent algorithm:
  repeat until convergence {
    θ_j := θ_j − α · ∂/∂θ_j J(θ0, θ1)   (for j = 0 and j = 1)
  }

Linear Regression Model:
  h_θ(x) = θ0 + θ1·x
  J(θ0, θ1) = (1/2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
Gradient descent algorithm (Linear Regression)

repeat until convergence {
  θ0 := θ0 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))
  θ1 := θ1 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
(update θ0 and θ1 simultaneously)
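Here is a minimal NumPy sketch of these two update rules run as batch gradient descent; the toy data, learning rate, and iteration count are arbitrary illustrative choices.

import numpy as np

# Made-up training set generated from y = 1 + 2x, so the answer is known.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
m = len(x)

alpha = 0.05
theta0, theta1 = 0.0, 0.0

for _ in range(2000):
    errors = (theta0 + theta1 * x) - y        # h_theta(x^(i)) - y^(i) for all i
    grad0 = (1.0 / m) * np.sum(errors)        # partial derivative w.r.t. theta0
    grad1 = (1.0 / m) * np.sum(errors * x)    # partial derivative w.r.t. theta1
    # Simultaneous update: both gradients use the pre-update parameters.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)   # approaches (1.0, 2.0)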
Hypothesis: h_θ(x) = θ0 + θ1·x

Cost Function: J(θ0, θ1) = (1/2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²

J(θ0, θ1) is a convex function (bowl-shaped): it has a single global minimum and no other local optima, so gradient descent converges to the global minimum for a suitable learning rate α.
(for fixed θ0, θ1, h_θ(x) is a function of x; J(θ0, θ1) is a function of the parameters)

[Figure sequence: left, the current hypothesis h_θ(x) plotted over the training data; right, a contour plot of J(θ0, θ1). Each successive slide shows one more gradient descent step, with the parameters moving toward the minimum of J and the fitted line matching the data better and better.]
(Batch) Gradient Descent algorithm

repeat until convergence {
  θ0 := θ0 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))
  θ1 := θ1 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
(update θ0 and θ1 simultaneously)
“Batch” Gradient Descent (BGD)
“Batch”: Each step of gradient descent uses
all the ‘m’ training examples.

(Stochastic) Gradient Descent algorithm

repeat until convergence {
  for i = 1 to m {
    θ0 := θ0 − α · (h_θ(x^(i)) − y^(i))
    θ1 := θ1 − α · (h_θ(x^(i)) − y^(i)) · x^(i)
    (update θ0 and θ1 simultaneously)
  }
}
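A matching NumPy sketch of the stochastic loop above, using the same made-up dataset as the batch sketch; each update uses a single training example instead of the full sum over m.

import numpy as np

# Same made-up training set as the batch sketch (generated from y = 1 + 2x).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
m = len(x)

alpha = 0.02
theta0, theta1 = 0.0, 0.0

for epoch in range(500):              # "repeat until convergence", capped for the sketch
    for i in range(m):                # one update per training example
        error = (theta0 + theta1 * x[i]) - y[i]    # h_theta(x^(i)) - y^(i)
        # Simultaneous update based on this single example.
        theta0, theta1 = theta0 - alpha * error, theta1 - alpha * error * x[i]

print(theta0, theta1)   # approaches (1.0, 2.0); on noisy data the iterates would
                        # hover near the minimum rather than settle exactly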
“Stochastic” Gradient Descent (SGD)
“Stochastic”: Each step of gradient descent
uses one training example or a mini-batch of
data (especially in deep learning).

Batch vs. Stochastic Gradient Descent

• BGD is computationally expensive on large datasets, since every step uses all m training examples.

• SGD often gets close to the minimum much faster than BGD.

• SGD is typically used in practice on large datasets.

• Because its updates are noisy, SGD has a chance of escaping local minima, though it may also jump away from the global minimum instead of settling exactly at it.
