
Gradient (descent) methods and linear regression
DSA5103 Lecture 2

Yangjing Zhang
22-Aug-2023
NUS
Today’s content

1. Gradient (descent) methods


2. Linear regression with one variable
3. Linear regression with multiple variables
4. Gradient descent for linear regression
5. Normal equation for linear regression

lecture2 1/55
Gradient methods
Unconstrained problem

To minimize a differentiable function f

    min_{x∈R^n} f(x)

Recall that a global minimizer is a local minimizer, and a local
minimizer is a stationary point.

• We may try to find stationary points x, i.e., ∇f(x) = 0, for solving
an unconstrained problem
• When it is difficult to solve ∇f(x) = 0, we look for an approximate
solution via iterative methods
lecture2 2/55
Algorithmic framework

A general algorithmic framework: choose x(0) and repeat

x(k+1) = x(k) + αk p(k) , k = 0, 1, 2, . . .

until some stopping criterion is satisfied

• x(0) is an initial guess of the solution
• αk > 0 is called the step length/step size/learning rate
• p(k) is a search direction
(we hope the search direction can “improve” the iterative point in
some sense)

lecture2 3/55
Descent direction

The search direction p(k) should be a descent direction at x(k)

• We say p(k) is a descent direction at x(k) if

∇f (x(k) )T p(k) < 0

• The function value f can be reduced along this descent direction


with “appropriate” step length

∃ δ > 0 such that f (x(k) + αk p(k) ) < f (x(k) ) ∀ αk ∈ (0, δ)

lecture2 4/55
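Why the sign condition guarantees descent: a one-line first-order Taylor argument (a standard justification, added here for completeness) gives

    f(x^{(k)} + \alpha_k p^{(k)}) = f(x^{(k)}) + \alpha_k \nabla f(x^{(k)})^T p^{(k)} + o(\alpha_k),

so if \nabla f(x^{(k)})^T p^{(k)} < 0, the negative linear term dominates the remainder for all sufficiently small \alpha_k > 0, and hence f(x^{(k)} + \alpha_k p^{(k)}) < f(x^{(k)}).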
Example

Example 1:
    min_{x∈R} f(x) = ½ x²

At x(k) = −1, p(k) = 1 is a descent direction since

    ∇f(x(k)) = x(k) = −1,   ∇f(x(k))T p(k) = (−1) × 1 = −1 < 0

We observe that for αk ∈ (0, 2)

    f(x(k) + αk p(k)) < f(x(k))

lecture2 5/55
Example

Example 2:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2.

At x(0) = [0; 0], show that p(0) = −∇f(x(0)) is a descent direction.

Solution. Compute the gradient ∇f(x) = [2x1 − 2x2; 4x2 − 2x1 − 2].

Then ∇f(x(0)) = ∇f(0; 0) = [0; −2] and p(0) = −∇f(x(0)) = [0; 2].

Since ∇f(x(0))T p(0) = [0, −2][0; 2] = −4 < 0, p(0) is a descent direction
at x(0).
lecture2 6/55
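A minimal NumPy check of this computation (a sketch; the gradient formula is copied from the solution above, and the function name grad_f is my own):

import numpy as np

def grad_f(x):
    # gradient of f(x) = x1^2 + 2*x2^2 - 2*x1*x2 - 2*x2 (Example 2)
    x1, x2 = x
    return np.array([2*x1 - 2*x2, 4*x2 - 2*x1 - 2])

x0 = np.array([0.0, 0.0])
p0 = -grad_f(x0)            # steepest descent direction [0, 2]
print(grad_f(x0) @ p0)      # -4.0 < 0, so p0 is a descent direction at x0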
Example

Example 2:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2.

At x(0) = [0; 0], show that p(0) = [1; 1] and p(0) = [−2; 0.1] are descent
directions.

Solution. ∇f(x(0)) = [0; −2]

    ∇f(x(0))T p(0) = [0, −2][1; 1] = −2 < 0

    ∇f(x(0))T p(0) = [0, −2][−2; 0.1] = −0.2 < 0

lecture2 7/55
Example

Example 2:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2.

At x(0) = [0; 0], show that p(0) = [−1; −1] is not a descent direction.

Solution. ∇f(x(0)) = [0; −2]

Since ∇f(x(0))T p(0) = [0, −2][−1; −1] = 2 > 0, p(0) is not a descent
direction at x(0).

[Exercise] Construct another descent direction.

lecture2 8/55
Example

Example 2: at x(0) = [0; 0]

Descent directions: [0; 2], [1; 1], [−2; 0.1]
Not descent directions: [−1; −1]

• There can be infinitely many descent directions
• The “angle” between a descent direction and −∇f(·) is less than 90°
• Among all directions, the value of f decreases most rapidly along
the direction −∇f(·)
• The direction −∇f(·) is known as the steepest descent direction

lecture2 9/55
Steepest descent method

Algorithm (Steepest descent method)


Choose x(0) and ε > 0; Set k ← 0
while ‖∇f(x(k))‖ > ε do
    find the step length αk (e.g., by certain line search rule)
    x(k+1) = x(k) − αk ∇f(x(k))
    k ← k + 1
end(while)
return x(k)

One may choose to use a constant step length (say αk = 0.1), or find it
via line search rules:
• Exact line search
• Backtracking line search
lecture2 10/55
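A minimal Python sketch of this algorithm with a constant step length (the objective is Example 3 from the next slides, f(x) = ½x² with ∇f(x) = x; the function name and iteration cap are my own choices):

import numpy as np

def steepest_descent(grad, x0, alpha=0.1, eps=1e-4, max_iter=10_000):
    # steepest descent with a constant step length alpha
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion ||grad f(x)|| <= eps
            return x, k
        x = x - alpha * g              # x^(k+1) = x^(k) - alpha_k * grad f(x^(k))
    return x, max_iter

x_final, iters = steepest_descent(lambda x: x, x0=[-1.0], alpha=0.1, eps=1e-4)
print(x_final, iters)                  # approaches 0; compare with the iteration counts on the next slides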
Constant step length

Example 3:  min_{x∈R} f(x) = ½ x²

Apply steepest descent method with x(0) = −1, ε = 10^{−4}, and constant
step length αk = 0.1

[Figure: iterates of steepest descent on f(x) = ½ x² with αk = 0.1]

• Return x(k) = −9.40 × 10^{−5} at iteration k = 89
• When the step length is too small, the method can be slow
• As it approaches the minimizer, the method will automatically take
smaller steps

lecture2 11/55
Constant step length

Example 3: constant step length αk = 0.5

[Figure: iterates of steepest descent on f(x) = ½ x² with αk = 0.5]

• Return x(k) = −6.10 × 10^{−5} at iteration k = 15
• In this particular example, αk = 0.5 (converge in 15 steps) is better
than αk = 0.1 (converge in 89 steps)
lecture2 12/55
Constant step length

Example 3: constant step length αk = 1.8

[Figure: iterates of steepest descent on f(x) = ½ x² with αk = 1.8]

• Return x(k) = −8.51 × 10^{−5} at iteration k = 43
• Due to the big step length, the iterates are oscillating around the
solution but still converge
lecture2 13/55
Constant step length

Example 3: constant step length αk = 2.1

[Figure: iterates of steepest descent on f(x) = ½ x² with αk = 2.1]

• The step length is too large, and the method diverges

lecture2 14/55
Exact line search

Exact line search tries to find αk by solving the one-dimensional problem

    min_{α>0} φ(α) := f(x(k) + αp(k))

• In general, exact line search is the most difficult part of the steepest
descent method
• If f is a simple function, it may be possible to obtain an analytical
solution for αk by solving φ′(α) = 0

lecture2 16/55
Example

Example 4:  min_{x∈R} f(x) = ½ x²

Apply steepest descent method with exact line search, given x(0) = −1.

Solution. k = 0, p(0) = −∇f(x(0)) = −∇f(−1) = 1. Find α0 by solving

    min_{α>0} φ(α) = f(x(0) + αp(0)) = ½ (α − 1)².

Obviously, α0 = 1, and x(1) = x(0) + α0 p(0) = 0 is actually the global
minimizer.

lecture2 17/55
Example

Example 5:  min_{x=(x1;x2)∈R²} f(x) = x1² + x2² + 2x1 + 4   (A convex problem)

Apply steepest descent method with exact line search, given x(0) = [2; 1].

Solution. Calculate the gradient ∇f(x) = [2x1 + 2; 2x2], so ∇f(x(0)) = [6; 2].
Then p(0) = −∇f(x(0)) = [−6; −2], and find α0 by solving

    min_{α>0} φ(α) = f(x(0) + αp(0)) = f(2 − 6α; 1 − 2α)
                   = (2 − 6α)² + (1 − 2α)² + 2(2 − 6α) + 4

φ is quadratic (convex). By setting φ′(α) = 0, we obtain α0 = 0.5. And
x(1) = x(0) + α0 p(0) = [−1; 0]. The steepest descent method will
terminate since ∇f(x(1)) = [0; 0].

[Exercise] Verify that [−1; 0] is a global minimizer (by sufficient condition
and convexity).
lecture2 18/55
Example

Example 6:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2   (convex)

Apply steepest descent method with exact line search, given x(0) = [0; 0].
Compute x(1), x(2), and x(3).

Solution. Compute the gradient ∇f(x) = [2x1 − 2x2; 4x2 − 2x1 − 2].

k = 0: p(0) = −∇f(x(0)) = [0; 2]

    min_{α>0} φ(α) = f(x(0) + αp(0)) = f(0; 2α) = 8α² − 4α,   φ is convex.

We let 0 = φ′(α) = 16α − 4, and obtain that α0 = ¼.

    x(1) = x(0) + α0 p(0) = [0; ½]
lecture2 19/55
Example

Example 6:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2   (convex)

Apply steepest descent method with exact line search, x(0) = [0; 0].
Compute x(1), x(2), and x(3).

Solution. ∇f(x) = [2x1 − 2x2; 4x2 − 2x1 − 2],   x(1) = [0; ½].

k = 1: p(1) = −∇f(x(1)) = [1; 0]

    min_{α>0} φ(α) = f(x(1) + αp(1)) = f(α; ½) = α² − α − ½,   φ is convex.

We let 0 = φ′(α) = 2α − 1, and obtain that α1 = ½.

    x(2) = x(1) + α1 p(1) = [½; ½]

k = 2: x(3) = [½; ¾]   [Exercise]

In fact, the global minimizer (x* = [1; 1]) can be found by solving
∇f(x) = 0.
lecture2 20/55
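A short NumPy sketch that reproduces these iterates. It writes Example 6 as f(x) = ½xᵀQx + cᵀx and uses the closed-form exact step α = gᵀg / gᵀQg for quadratics; that formula is my addition (obtained by solving φ′(α) = 0), not something stated on the slides:

import numpy as np

Q = np.array([[2.0, -2.0], [-2.0, 4.0]])   # f(x) = 0.5*x^T Q x + c^T x matches Example 6
c = np.array([0.0, -2.0])

x = np.array([0.0, 0.0])                   # x(0)
for k in range(3):
    g = Q @ x + c                          # gradient at the current iterate
    alpha = (g @ g) / (g @ Q @ g)          # exact line search step for a quadratic
    x = x - alpha * g
    print(k + 1, x)                        # expected: [0, 0.5], [0.5, 0.5], [0.5, 0.75]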
Contour plot

A contour is a fixed height f(x1, x2) = c.

Left: plot of z = f(x1, x2) = x1² + x2²
Right: contour plot of f(x1, x2) = 1, f(x1, x2) = 8, f(x1, x2) = 18

[Figure: surface plot of f and its contour plot]

lecture2 21/55
Contour plot

Contour plot of f(x) = x1² + 2x2² − 2x1x2 − 2x2 in Example 6

[Figure: contour plot of f with the steepest descent iterates]

    x(0) = [0; 0]
    x(1) = [0; ½]
    x(2) = [½; ½]
    x(3) = [½; ¾]
    ...
Steepest descent method with exact line search follows a zig-zag path
towards the solution. This zigzagging makes the method inherently slow.
lecture2 22/55
Steepest descent method with exact line search

Algorithm (Steepest descent method with exact line search)


Choose x(0) and ε > 0; Set k ← 0
while ‖∇f(x(k))‖ > ε do
    p(k) = −∇f(x(k))
    αk = arg min_{α>0} φ(α) = f(x(k) + αp(k))
    x(k+1) = x(k) + αk p(k)
    k ← k + 1
end(while)
return x(k)

lecture2 23/55
Properties of steepest descent method with exact line search*

Let {x(k)} be the sequence generated by steepest descent method with
exact line search

• The steepest descent method with exact line search moves in
perpendicular steps. More precisely, x(k) − x(k+1) is orthogonal
(perpendicular) to x(k+1) − x(k+2).
• Monotonic decreasing property:

    f(x(k+1)) < f(x(k)) if ∇f(x(k)) ≠ 0.

• Suppose f is a coercive¹ function with continuous first order
derivatives on Rn. Then some subsequence of {x(k)} converges.
The limit of any convergent subsequence of {x(k)} is a stationary
point of f.

¹ A continuous function f : Rn → R is said to be coercive if lim_{‖x‖→∞} f(x) = +∞
lecture2 24/55
Backtracking line search

Backtracking line search starts with a relatively large step length and
iteratively shrinks it (i.e., “backtracking”) until the Armijo condition
holds.

Algorithm (Backtracking line search)


Choose ᾱ > 0, ρ ∈ (0, 1), c1 ∈ (0, 1); Set α ← ᾱ
repeat until f(x(k) + αp(k)) ≤ f(x(k)) + c1 α∇f(x(k))T p(k)   (Armijo condition)
    α ← ρα
end(repeat)
return αk = α

lecture2 25/55
Backtracking line search

• Note that p(k) is a descent direction

∇f (x(k) )T p(k) < 0

The Armijo condition

f (x(k) + αp(k) ) ≤ f (x(k) ) + c1 α∇f (x(k) )T p(k)

requires a reasonable amount of decrease in the objective function,
rather than finding the best step length (as in exact line search).
• For example, one can set

    ᾱ = 1, ρ = 0.9, c1 = 10^{−4}

in practice. Namely, we start with step length 1 and continue with
0.9, 0.9², 0.9³, . . . until the Armijo condition is satisfied.

lecture2 26/55
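A minimal Python sketch of backtracking line search with the Armijo condition, using the parameter values suggested above (the function names are my own; the demo uses the f and x(0) of Example 7, which is worked by hand on the next slides):

import numpy as np

def backtracking(f, grad, x, p, alpha_bar=1.0, rho=0.9, c1=1e-4):
    # shrink alpha until the Armijo condition holds
    alpha = alpha_bar
    fx, slope = f(x), grad(x) @ p          # slope = grad f(x)^T p < 0 for a descent direction
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

f = lambda x: x[0]**2 + x[1]**2 + 2*x[0] + 4
grad = lambda x: np.array([2*x[0] + 2, 2*x[1]])
x0 = np.array([2.0, 1.0])
print(backtracking(f, grad, x0, -grad(x0)))   # 0.9, matching the worked Example 7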
Example

Example 7:  min_{x=(x1;x2)∈R²} f(x) = x1² + x2² + 2x1 + 4

Apply steepest descent method with backtracking line search, given
x(0) = [2; 1], ᾱ = 1, ρ = 0.9, c1 = 10^{−4}. Compute x(1).

Solution. Calculate the gradient ∇f(x) = [2x1 + 2; 2x2], so ∇f(x(0)) = [6; 2].
k = 0. p(0) = −∇f(x(0)) = [−6; −2]. Do backtracking line search:

1. α = ᾱ = 1. Check the Armijo condition

    f(x(0) + αp(0)) ≤ f(x(0)) + c1 α∇f(x(0))T p(0)
    LHS = f(−4; −1) = (−4)² + (−1)² + 2(−4) + 4 = 13
    RHS = (2² + 1² + 2 · 2 + 4) + 10^{−4} · 1 · [6, 2][−6; −2] = 12.996

The Armijo condition fails.

lecture2 27/55
Example

Example 7:  min_{x=(x1;x2)∈R²} f(x) = x1² + x2² + 2x1 + 4

Apply steepest descent method with backtracking line search, given
x(0) = [2; 1], ᾱ = 1, ρ = 0.9, c1 = 10^{−4}. Compute x(1).

Solution.
k = 0. p(0) = −∇f(x(0)) = [−6; −2]. Do backtracking line search:

2. α = ρᾱ = 0.9. Check the Armijo condition

    f(x(0) + αp(0)) ≤ f(x(0)) + c1 α∇f(x(0))T p(0)
    LHS = f(−3.4; −0.8) = (−3.4)² + (−0.8)² + 2(−3.4) + 4 = 9.4
    RHS = (2² + 1² + 2 · 2 + 4) + 10^{−4} · 0.9 · [6, 2][−6; −2] = 12.9964

The Armijo condition holds. Set α0 = 0.9.

New iterate x(1) = [−3.4; −0.8].
lecture2 28/55
Steepest descent method with backtracking line search

Algorithm (Steepest descent method with backtracking line search)


Choose x(0), ε > 0, ᾱ > 0, ρ ∈ (0, 1), c1 ∈ (0, 1); Set k ← 0
while ‖∇f(x(k))‖ > ε do
    p(k) = −∇f(x(k))
    α ← ᾱ
    repeat until f(x(k) + αp(k)) ≤ f(x(k)) + c1 α∇f(x(k))T p(k)
        α ← ρα
    end(repeat)
    αk ← α
    x(k+1) = x(k) + αk p(k)
    k ← k + 1
end(while)
return x(k)

lecture2 29/55
Linear regression with one
variable
Housing prices data

• Predict the price for a house with area 3500

[Figure 1: Housing prices data² — scatter plot of Price vs. Area]

² https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

lecture2 30/55
Housing prices data

Training set of housing prices

area (x)    price (y)
7420        13300000
8960        12250000
...         ...

• Predictor/feature/“input” vector x
• Response/target/“output” variable y
• i-th training example (xi , yi )
• n: number of samples/training examples

lecture2 31/55
Housing prices data

Learn a linear function f(x) = fβ(x) = β1 x + β0

Input a new house area x̂  ⇒  predict its price f(x̂)

• Parameters β1 , β0
• How to choose β1 , β0 ?

lecture2 32/55
Illustration

f(x) = fβ(x) = β1 x + β0

[Figure: three fitted lines]   β1 = 1, β0 = 2;   β1 = 0, β0 = 3;   β1 = 2, β0 = 2

lecture2 33/55
Linear regression with one variable

• Want to choose β0, β1 such that f(xi) is close to yi for every
training example (xi, yi)
• Want to find β0, β1 to

    minimize_{β0,β1}  (1/2) Σ_{i=1}^n (β1 xi + β0 − yi)²   (squared error)

• Objective/cost/loss function

    L(β0, β1) = (1/2) Σ_{i=1}^n (β1 xi + β0 − yi)²

• Squared error function is probably the most commonly used cost
function in linear regression

lecture2 34/55
Understand the cost function

• Assume β0 = 0 for simplicity


• Consider linear model f (x) = β1 x

• Optimization model

    minimize_{β1}  L(β1) = (1/2) Σ_{i=1}^n (β1 xi − yi)²

lecture2 35/55
Understand the cost function
Toy example
Given training data (1, 1), (2, 2.5), (4, 4) and linear model f(x) = β1 x
(assume β0 = 0). Compute the values of L(β1) = (1/2) Σ_{i=1}^n (β1 xi − yi)²
at β1 = 0.5, β1 = 1, β1 = 1.1.

β1 = 0.5
[Figure: data points and the fitted line f(x) = 0.5x]
lecture2 36/55
Understand the cost function

β1 = 1
[Figure: data points and the fitted line f(x) = x]
lecture2 37/55
Understand the cost function

β1 = 1.1
[Figure: data points and the fitted line f(x) = 1.1x]
lecture2 38/55
Understand the cost function

[Figure: the three fitted lines and the cost curve L(β1)]

• β1 = 0.5   L(β1) = 3.25
• β1 = 1     L(β1) = 0.125
• β1 = 1.1   L(β1) = 0.13
lecture2 39/55
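A quick check of these numbers (a sketch; x and y hold the toy example's three data points):

import numpy as np

x = np.array([1.0, 2.0, 4.0])
y = np.array([1.0, 2.5, 4.0])

L = lambda b1: 0.5 * np.sum((b1 * x - y) ** 2)   # cost with beta0 = 0
print(L(0.5), L(1.0), L(1.1))                    # 3.25  0.125  0.13 (up to rounding)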
Steepest descent method for univariate linear regression

• Linear regression with one variable aims to find β0, β1 that

    minimize_{β0,β1}  L(β0, β1) = (1/2) Σ_{i=1}^n (β1 xi + β0 − yi)²

• In steepest descent method, we repeat

    [β0^(k+1); β1^(k+1)] = [β0^(k); β1^(k)] − αk [∂L/∂β0 (β0^(k), β1^(k)); ∂L/∂β1 (β0^(k), β1^(k))]
                         = [β0^(k); β1^(k)] − αk ∇L(β0^(k), β1^(k))

• Calculate

    ∂L/∂β0 (β0, β1) = Σ_{i=1}^n (β1 xi + β0 − yi)
    ∂L/∂β1 (β0, β1) = Σ_{i=1}^n (β1 xi + β0 − yi) xi
lecture2 40/55
Steepest descent method for univariate linear regression

Algorithm (Steepest descent method for univariate linear regression)


Choose β0^(0), β1^(0) and ε > 0; Set k ← 0
while ‖∇L(β0^(k), β1^(k))‖ > ε do
    determine the step length αk
    β0^(k+1) = β0^(k) − αk Σ_{i=1}^n (β1^(k) xi + β0^(k) − yi)
    β1^(k+1) = β1^(k) − αk Σ_{i=1}^n (β1^(k) xi + β0^(k) − yi) xi
    k ← k + 1
end(while)
return β0^(k), β1^(k)

lecture2 41/55
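A minimal Python sketch of this algorithm with a constant step length (the toy data from the earlier slides are reused; the step length 0.01, tolerance, and iteration cap are my own choices):

import numpy as np

def univariate_lr_gd(x, y, alpha=0.01, eps=1e-6, max_iter=100_000):
    # gradient descent for L(b0, b1) = 0.5 * sum((b1*x + b0 - y)^2)
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        r = b1 * x + b0 - y                 # residuals b1*xi + b0 - yi
        g0, g1 = r.sum(), (r * x).sum()     # dL/db0 and dL/db1
        if np.hypot(g0, g1) <= eps:
            break
        b0, b1 = b0 - alpha * g0, b1 - alpha * g1
    return b0, b1

x = np.array([1.0, 2.0, 4.0]); y = np.array([1.0, 2.5, 4.0])
print(univariate_lr_gd(x, y))               # should approach the least-squares fit of the toy data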
Linear regression with multiple
variables
Multiple features

Training set of housing prices

area #bedrooms #bathrooms stories ··· price


7420 4 2 2 ··· 13300000
8960 4 4 4 ··· 12250000
...

• Predictor/feature/“input” vector x
• Response/target/“output” variable y
• i-th training example (xi , yi )
• j-th feature in i-th training example xij
• n: number of samples/training examples
• p: number of features/variables

lecture2 43/55
One feature vs. multiple features

One feature:
    Fit linear function f(x) = β1 x + β0,   β1 ∈ R, β0 ∈ R
    Cost function L(β0, β1) = (1/2) Σ_{i=1}^n (β1 xi + β0 − yi)²
    Optimization: minimize_{β0,β1} L(β0, β1)

Multiple features:
    Fit linear function f(x) = β^T x + β0,   β ∈ R^p, β0 ∈ R
    Cost function L(β0, β) = L(β0, β1, β2, . . . , βp) = (1/2) Σ_{i=1}^n (β^T xi + β0 − yi)²
    Optimization: minimize_{β0,β1,...,βp} L(β0, β1, . . . , βp)
lecture2 44/55
Steepest descent method for multivariate linear regression

• Linear regression with multiple variables aims to find β0 , β1 , . . . , βp


    minimize_{β0,β1,...,βp}  L(β0, β1, . . . , βp) = (1/2) Σ_{i=1}^n (β^T xi + β0 − yi)²

    where β^T xi = β1 xi1 + β2 xi2 + · · · + βp xip

• Calculate

    ∂L/∂β0 = Σ_{i=1}^n (β^T xi + β0 − yi)
    ∂L/∂β1 = Σ_{i=1}^n (β^T xi + β0 − yi) xi1
    ∂L/∂β2 = Σ_{i=1}^n (β^T xi + β0 − yi) xi2
    ...
    ∂L/∂βp = Σ_{i=1}^n (β^T xi + β0 − yi) xip

lecture2 45/55
One feature vs. multiple features

One feature (steepest descent):

    β0^(k+1) = β0^(k) − αk Σ_{i=1}^n (β1^(k) xi + β0^(k) − yi)
    β1^(k+1) = β1^(k) − αk Σ_{i=1}^n (β1^(k) xi + β0^(k) − yi) xi

Multiple features (steepest descent):

    β0^(k+1) = β0^(k) − αk Σ_{i=1}^n ((β^(k))^T xi + β0^(k) − yi)
    βj^(k+1) = βj^(k) − αk Σ_{i=1}^n ((β^(k))^T xi + β0^(k) − yi) xij,   for j = 1, 2, . . . , p

lecture2 46/55
Steepest descent method for multivariate linear regression

Algorithm (Steepest descent method for multivariate linear regression)


Choose β0^(0), β^(0) = (β1^(0), . . . , βp^(0))^T and ε > 0; Set k ← 0
while ‖∇L(β0^(k), β^(k))‖ > ε do
    determine the step length αk
    β0^(k+1) = β0^(k) − αk Σ_{i=1}^n ((β^(k))^T xi + β0^(k) − yi)
    for j = 1, 2, . . . , p
        βj^(k+1) = βj^(k) − αk Σ_{i=1}^n ((β^(k))^T xi + β0^(k) − yi) xij
    end(for)
    k ← k + 1
end(while)
return β0^(k), β^(k) = (β1^(k), . . . , βp^(k))^T

lecture2 47/55
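A vectorized NumPy sketch of this algorithm, where X is the n × p feature matrix and Y the response vector (as defined on the next slide). The constant step length is my own choice and works best when the features are standardized; the demo data are synthetic:

import numpy as np

def multivariate_lr_gd(X, Y, alpha=0.005, eps=1e-6, max_iter=100_000):
    # gradient descent for L(b0, beta) = 0.5 * sum((X @ beta + b0 - Y)^2)
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    for _ in range(max_iter):
        r = X @ beta + b0 - Y               # residuals beta^T x_i + b0 - y_i
        g0, g = r.sum(), X.T @ r            # dL/db0 and (dL/dbeta_1, ..., dL/dbeta_p)
        if np.sqrt(g0**2 + g @ g) <= eps:
            break
        b0, beta = b0 - alpha * g0, beta - alpha * g
    return b0, beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
print(multivariate_lr_gd(X, Y))             # should recover roughly b0 = 3, beta = [1, -2, 0.5]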
Standardization: feature scaling

area #bedrooms #bathrooms stories ··· price


7420 4 2 2 ··· 13300000
8960 4 4 4 ··· 12250000
...

• Feature matrix: an n × p matrix X, each row is a sample, each
column is a feature
• Response vector: Y = (y1, y2, . . . , yn)^T
• Standardization of each column of X (feature scaling) — transform
all features to the same scale

    X·j = (X·j − mean(X·j)) / standard deviation(X·j)

• You may also scale the response vector

    Y = (Y − mean(Y)) / standard deviation(Y)
lecture2 48/55
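A one-function NumPy sketch of this feature scaling (column-wise z-scores; NumPy's default population standard deviation is used, since the slides do not specify the convention, and the demo data are synthetic):

import numpy as np

def standardize(X):
    # column-wise: subtract the mean, divide by the standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(loc=[7000.0, 3.0], scale=[2000.0, 1.0], size=(5, 2))   # synthetic "area" and "#bedrooms" columns
Z = standardize(X)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))                # each column now has mean ~0 and std 1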
Step length

In practice, you may use backtracking line search or simply a constant


step length.

lecture2 49/55
Normal equation

The normal equation solves the linear regression problem analytically

    minimize_{β0,β1,...,βp}  L(β0, β1, . . . , βp) = (1/2) Σ_{i=1}^n (β^T xi + β0 − yi)²

• The cost function L is convex
• β̂ is a global minimizer of L if and only if ∇L(β̂) = 0
• ∇L(β̂) = 0 can be written equivalently as

    X̂^T X̂ β̂ = X̂^T Y   (normal equation)

Notation. Recall the feature matrix X ∈ R^{n×p} and response vector Y ∈ R^n.
Define

    X = [x1^T; x2^T; . . . ; xn^T],   Y = [y1; y2; . . . ; yn],
    X̂ = [1, x1^T; 1, x2^T; . . . ; 1, xn^T],   β̂ = [β0; β1; . . . ; βp]
lecture2 50/55
Normal equation

Derivation*.

    L(β̂) = (1/2) [ (β^T x1 + β0 − y1)² + (β^T x2 + β0 − y2)² + · · · + (β^T xn + β0 − yn)² ]
          = (1/2) [ (X̂1· β̂ − y1)² + (X̂2· β̂ − y2)² + · · · + (X̂n· β̂ − yn)² ]
          = (1/2) ‖X̂ β̂ − Y‖²

    ∇L(β̂) = X̂^T X̂ β̂ − X̂^T Y
lecture2 51/55
Normal equation
How to solve the normal equation X̂^T X̂ β̂ = X̂^T Y?

Case 1. When X̂^T X̂ is invertible, the normal equation implies that

    β̂ = (X̂^T X̂)^{−1} X̂^T Y

is the unique solution of linear regression.

This often happens when we face an over-determined system — the number
of training examples n is much larger than the number of features p.
We have many training samples to fit but relatively few degrees of freedom.

lecture2 52/55
Normal equation
How to solve the normal equation X̂^T X̂ β̂ = X̂^T Y?

Case 2. When X̂^T X̂ is not invertible, the normal equation will have an
infinite number of solutions.

X̂^T X̂ is not invertible when we face an under-determined problem —
n < p.
We have too many degrees of freedom and do not have enough training
samples.
We can apply any method for solving a linear system (e.g., Gaussian
elimination) to obtain a solution.

lecture2 53/55
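In code, one rarely forms (X̂^T X̂)^{−1} explicitly; a least-squares solver covers both cases. The sketch below uses NumPy's np.linalg.lstsq, which is my suggestion rather than something from the slides (in the rank-deficient Case 2 it returns the minimum-norm solution):

import numpy as np

def fit_linear_regression(X, Y):
    # solve the normal equation X_hat^T X_hat beta_hat = X_hat^T Y in the least-squares sense
    X_hat = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the intercept column of ones
    beta_hat, *_ = np.linalg.lstsq(X_hat, Y, rcond=None)
    return beta_hat                                    # [beta0, beta1, ..., betap]

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 5)), rng.standard_normal(3)   # n = 3 < p = 5: under-determined (Case 2)
print(fit_linear_regression(X, Y))                           # one particular solution of the normal equation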
Normal equation

Example. Give the normal equation for linear regression on the data set
feature 1 feature 2 response
1 0.2 1
0.3 4 2
5 0.6 3

lecture2 54/55
Normal equation

Example. Give the normal equation for linear regression on the data set

    feature 1   feature 2   response
    1           0.2         1
    0.3         4           2
    5           0.6         3

Solution. Data

    X̂ = [1, 1, 0.2; 1, 0.3, 4; 1, 5, 0.6],   Y = [1; 2; 3].

Compute

    X̂^T X̂ = [3, 6.3, 4.8; 6.3, 26.09, 4.4; 4.8, 4.4, 16.4],   X̂^T Y = [6; 16.6; 10].

Then the normal equation is

    [3, 6.3, 4.8; 6.3, 26.09, 4.4; 4.8, 4.4, 16.4] [β0; β1; β2] = [6; 16.6; 10]
    ⇒ [β0; β1; β2] = [0.4651; 0.4651; 0.3488]
lecture2 54/55
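A quick NumPy check of this example (a sketch; np.linalg.solve is used in place of elimination by hand):

import numpy as np

X_hat = np.array([[1.0, 1.0, 0.2],
                  [1.0, 0.3, 4.0],
                  [1.0, 5.0, 0.6]])
Y = np.array([1.0, 2.0, 3.0])

A = X_hat.T @ X_hat            # [[3, 6.3, 4.8], [6.3, 26.09, 4.4], [4.8, 4.4, 16.4]]
b = X_hat.T @ Y                # [6, 16.6, 10]
print(np.linalg.solve(A, b))   # approximately [0.4651, 0.4651, 0.3488]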
Steepest descent vs. normal equation

Steepest descent:
    iterative method
    need to choose step length
    works well with large number of features p

Normal equation:
    analytical solution
    no need to choose step length
    solving the linear system of normal equation is slow when p is large

In practice,
    p ≤ 5000: normal equation
    p > 5000: steepest descent

lecture2 55/55
