
Gradient (descent) methods and linear regression
DSA5103 Lecture 2

Yangjing Zhang
22-Aug-2023
NUS
Today’s content

1. Gradient (descent) methods


2. Linear regression with one variable
3. Linear regression with multiple variables
4. Gradient descent for linear regression
5. Normal equation for linear regression

lecture2 1/55
Gradient methods
Unconstrained problem

To minimize a differentiable function f

    min_{x∈R^n} f(x)

Recall that a global minimizer is a local minimizer, and a local
minimizer is a stationary point.

• We may try to find stationary points x, i.e., ∇f(x) = 0, for solving
an unconstrained problem
• When it is difficult to solve ∇f(x) = 0, we look for an approximate
solution via iterative methods
lecture2 2/55
Algorithmic framework

A general algorithmic framework: choose x(0) and repeat

x(k+1) = x(k) + αk p(k) , k = 0, 1, 2, . . .

until some stopping criterion is satisfied

• x(0) is an initial guess of the solution
• αk > 0 is called the step length/step size/learning rate
• p(k) is a search direction
(we hope the search direction can “improve” the iterative point in
some sense)

lecture2 3/55
Descent direction

The search direction p(k) should be a descent direction at x(k)

• We say p(k) is a descent direction at x(k) if

∇f (x(k) )T p(k) < 0

• The function value f can be reduced along this descent direction


with “appropriate” step length

∃ δ > 0 such that f (x(k) + αk p(k) ) < f (x(k) ) ∀ αk ∈ (0, δ)

lecture2 4/55
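Why the sign condition guarantees descent: a one-line first-order Taylor argument (a standard justification, added here for completeness) gives

    f(x^{(k)} + \alpha_k p^{(k)}) = f(x^{(k)}) + \alpha_k \nabla f(x^{(k)})^T p^{(k)} + o(\alpha_k),

so if \nabla f(x^{(k)})^T p^{(k)} < 0, the negative linear term dominates the remainder for all sufficiently small \alpha_k > 0, and hence f(x^{(k)} + \alpha_k p^{(k)}) < f(x^{(k)}).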
Example

Example 1:
    min_{x∈R} f(x) = ½ x²

At x(k) = −1, p(k) = 1 is a descent direction since

    ∇f(x(k)) = x(k) = −1,   ∇f(x(k))T p(k) = (−1) × 1 = −1 < 0

We observe that for αk ∈ (0, 2)

    f(x(k) + αk p(k)) < f(x(k))

lecture2 5/55
Example

Example 2:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2.

At x(0) = [0; 0], show that p(0) = −∇f(x(0)) is a descent direction.

Solution. Compute the gradient ∇f(x) = [2x1 − 2x2; 4x2 − 2x1 − 2].

Then ∇f(x(0)) = ∇f(0; 0) = [0; −2] and p(0) = −∇f(x(0)) = [0; 2].

Since ∇f(x(0))T p(0) = [0, −2][0; 2] = −4 < 0, p(0) is a descent direction
at x(0).
lecture2 6/55
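A minimal NumPy check of this computation (a sketch; the gradient formula is copied from the solution above, and the function name grad_f is my own):

import numpy as np

def grad_f(x):
    # gradient of f(x) = x1^2 + 2*x2^2 - 2*x1*x2 - 2*x2 (Example 2)
    x1, x2 = x
    return np.array([2*x1 - 2*x2, 4*x2 - 2*x1 - 2])

x0 = np.array([0.0, 0.0])
p0 = -grad_f(x0)            # steepest descent direction [0, 2]
print(grad_f(x0) @ p0)      # -4.0 < 0, so p0 is a descent direction at x0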
Example

Example 2:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2.

At x(0) = [0; 0], show that p(0) = [1; 1] and p(0) = [−2; 0.1] are descent
directions.

Solution. ∇f(x(0)) = [0; −2]

    ∇f(x(0))T p(0) = [0, −2][1; 1] = −2 < 0

    ∇f(x(0))T p(0) = [0, −2][−2; 0.1] = −0.2 < 0

lecture2 7/55
Example

Example 2:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2.

At x(0) = [0; 0], show that p(0) = [−1; −1] is not a descent direction.

Solution. ∇f(x(0)) = [0; −2]

Since ∇f(x(0))T p(0) = [0, −2][−1; −1] = 2 > 0, p(0) is not a descent
direction at x(0).

[Exercise] Construct another descent direction.

lecture2 8/55
Example

Example 2: at x(0) = [0; 0]

Descent directions: [0; 2], [1; 1], [−2; 0.1]
Not descent directions: [−1; −1]

• There can be infinitely many descent directions
• The “angle” between a descent direction and −∇f(·) is less than 90°
• Among all directions, the value of f decreases most rapidly along
the direction −∇f(·)
• The direction −∇f(·) is known as the steepest descent direction

lecture2 9/55
Steepest descent method

Algorithm (Steepest descent method)


Choose x(0) and ε > 0; Set k ← 0
while ‖∇f(x(k))‖ > ε do
    find the step length αk (e.g., by certain line search rule)
    x(k+1) = x(k) − αk ∇f(x(k))
    k ← k + 1
end(while)
return x(k)

One may choose to use a constant step length (say αk = 0.1), or find it
via line search rules:
• Exact line search
• Backtracking line search
lecture2 10/55
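A minimal Python sketch of this algorithm with a constant step length (the objective is Example 3 from the next slides, f(x) = ½x² with ∇f(x) = x; the function name and iteration cap are my own choices):

import numpy as np

def steepest_descent(grad, x0, alpha=0.1, eps=1e-4, max_iter=10_000):
    # steepest descent with a constant step length alpha
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion ||grad f(x)|| <= eps
            return x, k
        x = x - alpha * g              # x^(k+1) = x^(k) - alpha_k * grad f(x^(k))
    return x, max_iter

x_final, iters = steepest_descent(lambda x: x, x0=[-1.0], alpha=0.1, eps=1e-4)
print(x_final, iters)                  # approaches 0; compare with the iteration counts on the next slides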
Constant step length

Example 3:  min_{x∈R} f(x) = ½ x²

Apply steepest descent method with x(0) = −1, ε = 10^{−4}, and constant
step length αk = 0.1

[Figure: iterates of steepest descent on f(x) = ½ x² with αk = 0.1]

• Return x(k) = −9.40 × 10^{−5} at iteration k = 89
• When the step length is too small, the method can be slow
• As it approaches the minimizer, the method will automatically take
smaller steps

lecture2 11/55
Constant step length

Example 3: constant step length αk = 0.5

[Figure: iterates of steepest descent on f(x) = ½ x² with αk = 0.5]

• Return x(k) = −6.10 × 10^{−5} at iteration k = 15
• In this particular example, αk = 0.5 (converge in 15 steps) is better
than αk = 0.1 (converge in 89 steps)
lecture2 12/55
Constant step length

Example 3: constant step length αk = 1.8

[Figure: iterates of steepest descent on f(x) = ½ x² with αk = 1.8]

• Return x(k) = −8.51 × 10^{−5} at iteration k = 43
• Due to the big step length, the iterates are oscillating around the
solution but still converge
lecture2 13/55
Constant step length

Example 3: constant step length αk = 2.1

[Figure: iterates of steepest descent on f(x) = ½ x² with αk = 2.1]

• The step length is too large, and the method diverges

lecture2 14/55
Exact line search

Exact line search tries to find αk by solving the one-dimensional problem

    min_{α>0} φ(α) := f(x(k) + αp(k))

• In general, exact line search is the most difficult part of the steepest
descent method
• If f is a simple function, it may be possible to obtain an analytical
solution for αk by solving φ′(α) = 0

lecture2 16/55
Example

Example 4:  min_{x∈R} f(x) = ½ x²

Apply steepest descent method with exact line search, given x(0) = −1.

Solution. k = 0, p(0) = −∇f(x(0)) = −∇f(−1) = 1. Find α0 by solving

    min_{α>0} φ(α) = f(x(0) + αp(0)) = ½ (α − 1)².

Obviously, α0 = 1, and x(1) = x(0) + α0 p(0) = 0 is actually the global
minimizer.

lecture2 17/55
Example

Example 5:  min_{x=(x1;x2)∈R²} f(x) = x1² + x2² + 2x1 + 4   (A convex problem)

Apply steepest descent method with exact line search, given x(0) = [2; 1].

Solution. Calculate the gradient ∇f(x) = [2x1 + 2; 2x2], so ∇f(x(0)) = [6; 2].
Then p(0) = −∇f(x(0)) = [−6; −2], and find α0 by solving

    min_{α>0} φ(α) = f(x(0) + αp(0)) = f(2 − 6α; 1 − 2α)
                   = (2 − 6α)² + (1 − 2α)² + 2(2 − 6α) + 4

φ is quadratic (convex). By setting φ′(α) = 0, we obtain α0 = 0.5. And
x(1) = x(0) + α0 p(0) = [−1; 0]. The steepest descent method will
terminate since ∇f(x(1)) = [0; 0].

[Exercise] Verify that [−1; 0] is a global minimizer (by sufficient condition
and convexity).
lecture2 18/55
Example

Example 6:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2   (convex)

Apply steepest descent method with exact line search, given x(0) = [0; 0].
Compute x(1), x(2), and x(3).

Solution. Compute the gradient ∇f(x) = [2x1 − 2x2; 4x2 − 2x1 − 2].

k = 0: p(0) = −∇f(x(0)) = [0; 2]

    min_{α>0} φ(α) = f(x(0) + αp(0)) = f(0; 2α) = 8α² − 4α,   φ is convex.

We let 0 = φ′(α) = 16α − 4, and obtain that α0 = ¼.

    x(1) = x(0) + α0 p(0) = [0; ½]
lecture2 19/55
Example

Example 6:  min_{x=(x1;x2)∈R²} f(x) = x1² + 2x2² − 2x1x2 − 2x2   (convex)

Apply steepest descent method with exact line search, x(0) = [0; 0].
Compute x(1), x(2), and x(3).

Solution. ∇f(x) = [2x1 − 2x2; 4x2 − 2x1 − 2],   x(1) = [0; ½].

k = 1: p(1) = −∇f(x(1)) = [1; 0]

    min_{α>0} φ(α) = f(x(1) + αp(1)) = f(α; ½) = α² − α − ½,   φ is convex.

We let 0 = φ′(α) = 2α − 1, and obtain that α1 = ½.

    x(2) = x(1) + α1 p(1) = [½; ½]

k = 2: x(3) = [½; ¾]   [Exercise]

In fact, the global minimizer (x* = [1; 1]) can be found by solving
∇f(x) = 0.
lecture2 20/55
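A short NumPy sketch that reproduces these iterates. It writes Example 6 as f(x) = ½xᵀQx + cᵀx and uses the closed-form exact step α = gᵀg / gᵀQg for quadratics; that formula is my addition (obtained by solving φ′(α) = 0), not something stated on the slides:

import numpy as np

Q = np.array([[2.0, -2.0], [-2.0, 4.0]])   # f(x) = 0.5*x^T Q x + c^T x matches Example 6
c = np.array([0.0, -2.0])

x = np.array([0.0, 0.0])                   # x(0)
for k in range(3):
    g = Q @ x + c                          # gradient at the current iterate
    alpha = (g @ g) / (g @ Q @ g)          # exact line search step for a quadratic
    x = x - alpha * g
    print(k + 1, x)                        # expected: [0, 0.5], [0.5, 0.5], [0.5, 0.75]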
Contour plot

A contour is a fixed height f(x1, x2) = c.

Left: plot of z = f(x1, x2) = x1² + x2²
Right: contour plot of f(x1, x2) = 1, f(x1, x2) = 8, f(x1, x2) = 18

[Figure: surface plot of f and its contour plot]

lecture2 21/55
Contour plot

Contour plot of f(x) = x1² + 2x2² − 2x1x2 − 2x2 in Example 6

[Figure: contour plot of f with the steepest descent iterates]

    x(0) = [0; 0]
    x(1) = [0; ½]
    x(2) = [½; ½]
    x(3) = [½; ¾]
    ...
Steepest descent method with exact line search follows a zig-zag path
towards the solution. This zigzagging makes the method inherently slow.
lecture2 22/55
Steepest descent method with exact line search

Algorithm (Steepest descent method with exact line search)


Choose x(0) and ε > 0; Set k ← 0
while ‖∇f(x(k))‖ > ε do
    p(k) = −∇f(x(k))
    αk = arg min_{α>0} φ(α) = f(x(k) + αp(k))
    x(k+1) = x(k) + αk p(k)
    k ← k + 1
end(while)
return x(k)

lecture2 23/55
Properties of steepest descent method with exact line search*

Let {x(k)} be the sequence generated by steepest descent method with
exact line search

• The steepest descent method with exact line search moves in
perpendicular steps. More precisely, x(k) − x(k+1) is orthogonal
(perpendicular) to x(k+1) − x(k+2).
• Monotonic decreasing property:

    f(x(k+1)) < f(x(k)) if ∇f(x(k)) ≠ 0.

• Suppose f is a coercive¹ function with continuous first order
derivatives on Rn. Then some subsequence of {x(k)} converges.
The limit of any convergent subsequence of {x(k)} is a stationary
point of f.

¹ A continuous function f : Rn → R is said to be coercive if lim_{‖x‖→∞} f(x) = +∞
lecture2 24/55
Backtracking line search

Backtracking line search starts with a relatively large step length and
iteratively shrinks it (i.e., “backtracking”) until the Armijo condition
holds.

Algorithm (Backtracking line search)


Choose ᾱ > 0, ρ ∈ (0, 1), c1 ∈ (0, 1); Set α ← ᾱ
repeat until f(x(k) + αp(k)) ≤ f(x(k)) + c1 α∇f(x(k))T p(k)   (Armijo condition)
    α ← ρα
end(repeat)
return αk = α

lecture2 25/55
Backtracking line search

• Note that p(k) is a descent direction

∇f (x(k) )T p(k) < 0

The Armijo condition

f (x(k) + αp(k) ) ≤ f (x(k) ) + c1 α∇f (x(k) )T p(k)

requires a reasonable amount of decrease in the objective function,
rather than finding the best step length (as in exact line search).
• For example, one can set

    ᾱ = 1, ρ = 0.9, c1 = 10^{−4}

in practice. Namely, we start with step length 1 and continue with
0.9, 0.9², 0.9³, . . . until the Armijo condition is satisfied.

lecture2 26/55
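A minimal Python sketch of backtracking line search with the Armijo condition, using the parameter values suggested above (the function names are my own; the demo uses the f and x(0) of Example 7, which is worked by hand on the next slides):

import numpy as np

def backtracking(f, grad, x, p, alpha_bar=1.0, rho=0.9, c1=1e-4):
    # shrink alpha until the Armijo condition holds
    alpha = alpha_bar
    fx, slope = f(x), grad(x) @ p          # slope = grad f(x)^T p < 0 for a descent direction
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

f = lambda x: x[0]**2 + x[1]**2 + 2*x[0] + 4
grad = lambda x: np.array([2*x[0] + 2, 2*x[1]])
x0 = np.array([2.0, 1.0])
print(backtracking(f, grad, x0, -grad(x0)))   # 0.9, matching the worked Example 7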
Example

Example 7:  min_{x=(x1;x2)∈R²} f(x) = x1² + x2² + 2x1 + 4

Apply steepest descent method with backtracking line search, given
x(0) = [2; 1], ᾱ = 1, ρ = 0.9, c1 = 10^{−4}. Compute x(1).

Solution. Calculate the gradient ∇f(x) = [2x1 + 2; 2x2], so ∇f(x(0)) = [6; 2].
k = 0. p(0) = −∇f(x(0)) = [−6; −2]. Do backtracking line search:

1. α = ᾱ = 1. Check the Armijo condition

    f(x(0) + αp(0)) ≤ f(x(0)) + c1 α∇f(x(0))T p(0)
    LHS = f(−4; −1) = (−4)² + (−1)² + 2(−4) + 4 = 13
    RHS = (2² + 1² + 2 · 2 + 4) + 10^{−4} · 1 · [6, 2][−6; −2] = 12.996

The Armijo condition fails.

lecture2 27/55
Example

Example 7:  min_{x=(x1;x2)∈R²} f(x) = x1² + x2² + 2x1 + 4

Apply steepest descent method with backtracking line search, given
x(0) = [2; 1], ᾱ = 1, ρ = 0.9, c1 = 10^{−4}. Compute x(1).

Solution.
k = 0. p(0) = −∇f(x(0)) = [−6; −2]. Do backtracking line search:

2. α = ρᾱ = 0.9. Check the Armijo condition

    f(x(0) + αp(0)) ≤ f(x(0)) + c1 α∇f(x(0))T p(0)
    LHS = f(−3.4; −0.8) = (−3.4)² + (−0.8)² + 2(−3.4) + 4 = 9.4
    RHS = (2² + 1² + 2 · 2 + 4) + 10^{−4} · 0.9 · [6, 2][−6; −2] = 12.9964

The Armijo condition holds. Set α0 = 0.9.

New iterate x(1) = [−3.4; −0.8].
lecture2 28/55
Steepest descent method with backtracking line search

Algorithm (Steepest descent method with backtracking line search)


Choose x(0), ε > 0, ᾱ > 0, ρ ∈ (0, 1), c1 ∈ (0, 1); Set k ← 0
while ‖∇f(x(k))‖ > ε do
    p(k) = −∇f(x(k))
    α ← ᾱ
    repeat until f(x(k) + αp(k)) ≤ f(x(k)) + c1 α∇f(x(k))T p(k)
        α ← ρα
    end(repeat)
    αk ← α
    x(k+1) = x(k) + αk p(k)
    k ← k + 1
end(while)
return x(k)

lecture2 29/55
Linear regression with one
variable
Housing prices data

• Predict the price for a house with area 3500

[Figure 1: Housing prices data² — scatter plot of Price vs. Area]

² https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

lecture2 30/55
Housing prices data

Training set of housing prices

area (x)    price (y)
7420        13300000
8960        12250000
...         ...

• Predictor/feature/“input” vector x
• Response/target/“output” variable y
• i-th training example (xi , yi )
• n: number of samples/training examples

lecture2 31/55
Housing prices data

Learn a linear function f(x) = fβ(x) = β1 x + β0

Input a new house area x̂  ⇒  predict its price f(x̂)

• Parameters β1 , β0
• How to choose β1 , β0 ?

lecture2 32/55
Illustration

f(x) = fβ(x) = β1 x + β0

[Figure: three fitted lines]   β1 = 1, β0 = 2;   β1 = 0, β0 = 3;   β1 = 2, β0 = 2

lecture2 33/55
Linear regression with one variable

• Want to choose β0, β1 such that f(xi) is close to yi for every
training example (xi, yi)
• Want to find β0, β1 to

    minimize_{β0,β1}  (1/2) Σ_{i=1}^n (β1 xi + β0 − yi)²   (squared error)

• Objective/cost/loss function

    L(β0, β1) = (1/2) Σ_{i=1}^n (β1 xi + β0 − yi)²

• Squared error function is probably the most commonly used cost
function in linear regression

lecture2 34/55
Understand the cost function

• Assume β0 = 0 for simplicity


• Consider linear model f (x) = β1 x

• Optimization model

    minimize_{β1}  L(β1) = (1/2) Σ_{i=1}^n (β1 xi − yi)²

lecture2 35/55
Understand the cost function
Toy example
Given training data (1, 1), (2, 2.5), (4, 4) and linear model f(x) = β1 x
(assume β0 = 0). Compute the values of L(β1) = (1/2) Σ_{i=1}^n (β1 xi − yi)²
at β1 = 0.5, β1 = 1, β1 = 1.1.

β1 = 0.5
[Figure: data points and the fitted line f(x) = 0.5x]
lecture2 36/55
Understand the cost function

β1 = 1
[Figure: data points and the fitted line f(x) = x]
lecture2 37/55
Understand the cost function

β1 = 1.1
[Figure: data points and the fitted line f(x) = 1.1x]
lecture2 38/55
Understand the cost function

[Figure: the three fitted lines and the cost curve L(β1)]

• β1 = 0.5   L(β1) = 3.25
• β1 = 1     L(β1) = 0.125
• β1 = 1.1   L(β1) = 0.13
lecture2 39/55
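A quick check of these numbers (a sketch; x and y hold the toy example's three data points):

import numpy as np

x = np.array([1.0, 2.0, 4.0])
y = np.array([1.0, 2.5, 4.0])

L = lambda b1: 0.5 * np.sum((b1 * x - y) ** 2)   # cost with beta0 = 0
print(L(0.5), L(1.0), L(1.1))                    # 3.25  0.125  0.13 (up to rounding)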
Steepest descent method for univariate linear regression

• Linear regression with one variable aims to find β0, β1 that

    minimize_{β0,β1}  L(β0, β1) = (1/2) Σ_{i=1}^n (β1 xi + β0 − yi)²

• In steepest descent method, we repeat

    [β0^(k+1); β1^(k+1)] = [β0^(k); β1^(k)] − αk [∂L/∂β0 (β0^(k), β1^(k)); ∂L/∂β1 (β0^(k), β1^(k))]
                         = [β0^(k); β1^(k)] − αk ∇L(β0^(k), β1^(k))

• Calculate

    ∂L/∂β0 (β0, β1) = Σ_{i=1}^n (β1 xi + β0 − yi)
    ∂L/∂β1 (β0, β1) = Σ_{i=1}^n (β1 xi + β0 − yi) xi
lecture2 40/55
Steepest descent method for univariate linear regression

Algorithm (Steepest descent method for univariate linear regression)


Choose β0^(0), β1^(0) and ε > 0; Set k ← 0
while ‖∇L(β0^(k), β1^(k))‖ > ε do
    determine the step length αk
    β0^(k+1) = β0^(k) − αk Σ_{i=1}^n (β1^(k) xi + β0^(k) − yi)
    β1^(k+1) = β1^(k) − αk Σ_{i=1}^n (β1^(k) xi + β0^(k) − yi) xi
    k ← k + 1
end(while)
return β0^(k), β1^(k)

lecture2 41/55
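A minimal Python sketch of this algorithm with a constant step length (the toy data from the earlier slides are reused; the step length 0.01, tolerance, and iteration cap are my own choices):

import numpy as np

def univariate_lr_gd(x, y, alpha=0.01, eps=1e-6, max_iter=100_000):
    # gradient descent for L(b0, b1) = 0.5 * sum((b1*x + b0 - y)^2)
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        r = b1 * x + b0 - y                 # residuals b1*xi + b0 - yi
        g0, g1 = r.sum(), (r * x).sum()     # dL/db0 and dL/db1
        if np.hypot(g0, g1) <= eps:
            break
        b0, b1 = b0 - alpha * g0, b1 - alpha * g1
    return b0, b1

x = np.array([1.0, 2.0, 4.0]); y = np.array([1.0, 2.5, 4.0])
print(univariate_lr_gd(x, y))               # should approach the least-squares fit of the toy data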
Linear regression with multiple
variables
Multiple features

Training set of housing prices

area #bedrooms #bathrooms stories ··· price


7420 4 2 2 ··· 13300000
8960 4 4 4 ··· 12250000
...

• Predictor/feature/“input” vector x
• Response/target/“output” variable y
• i-th training example (xi , yi )
• j-th feature in i-th training example xij
• n: number of samples/training examples
• p: number of features/variables

lecture2 43/55
One feature vs. multiple features

One feature:
    Fit linear function f(x) = β1 x + β0,   β1 ∈ R, β0 ∈ R
    Cost function L(β0, β1) = (1/2) Σ_{i=1}^n (β1 xi + β0 − yi)²
    Optimization: minimize_{β0,β1} L(β0, β1)

Multiple features:
    Fit linear function f(x) = β^T x + β0,   β ∈ R^p, β0 ∈ R
    Cost function L(β0, β) = L(β0, β1, β2, . . . , βp) = (1/2) Σ_{i=1}^n (β^T xi + β0 − yi)²
    Optimization: minimize_{β0,β1,...,βp} L(β0, β1, . . . , βp)
lecture2 44/55
Steepest descent method for multivariate linear regression

• Linear regression with multiple variables aims to find β0 , β1 , . . . , βp


    minimize_{β0,β1,...,βp}  L(β0, β1, . . . , βp) = (1/2) Σ_{i=1}^n (β^T xi + β0 − yi)²

    where β^T xi = β1 xi1 + β2 xi2 + · · · + βp xip

• Calculate

    ∂L/∂β0 = Σ_{i=1}^n (β^T xi + β0 − yi)
    ∂L/∂β1 = Σ_{i=1}^n (β^T xi + β0 − yi) xi1
    ∂L/∂β2 = Σ_{i=1}^n (β^T xi + β0 − yi) xi2
    ...
    ∂L/∂βp = Σ_{i=1}^n (β^T xi + β0 − yi) xip

lecture2 45/55
One feature vs. multiple features

One feature (steepest descent):

    β0^(k+1) = β0^(k) − αk Σ_{i=1}^n (β1^(k) xi + β0^(k) − yi)
    β1^(k+1) = β1^(k) − αk Σ_{i=1}^n (β1^(k) xi + β0^(k) − yi) xi

Multiple features (steepest descent):

    β0^(k+1) = β0^(k) − αk Σ_{i=1}^n ((β^(k))^T xi + β0^(k) − yi)
    βj^(k+1) = βj^(k) − αk Σ_{i=1}^n ((β^(k))^T xi + β0^(k) − yi) xij,   for j = 1, 2, . . . , p

lecture2 46/55
Steepest descent method for multivariate linear regression

Algorithm (Steepest descent method for multivariate linear regression)


Choose β0^(0), β^(0) = (β1^(0), . . . , βp^(0))^T and ε > 0; Set k ← 0
while ‖∇L(β0^(k), β^(k))‖ > ε do
    determine the step length αk
    β0^(k+1) = β0^(k) − αk Σ_{i=1}^n ((β^(k))^T xi + β0^(k) − yi)
    for j = 1, 2, . . . , p
        βj^(k+1) = βj^(k) − αk Σ_{i=1}^n ((β^(k))^T xi + β0^(k) − yi) xij
    end(for)
    k ← k + 1
end(while)
return β0^(k), β^(k) = (β1^(k), . . . , βp^(k))^T

lecture2 47/55
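A vectorized NumPy sketch of this algorithm, where X is the n × p feature matrix and Y the response vector (as defined on the next slide). The constant step length is my own choice and works best when the features are standardized; the demo data are synthetic:

import numpy as np

def multivariate_lr_gd(X, Y, alpha=0.005, eps=1e-6, max_iter=100_000):
    # gradient descent for L(b0, beta) = 0.5 * sum((X @ beta + b0 - Y)^2)
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    for _ in range(max_iter):
        r = X @ beta + b0 - Y               # residuals beta^T x_i + b0 - y_i
        g0, g = r.sum(), X.T @ r            # dL/db0 and (dL/dbeta_1, ..., dL/dbeta_p)
        if np.sqrt(g0**2 + g @ g) <= eps:
            break
        b0, beta = b0 - alpha * g0, beta - alpha * g
    return b0, beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
print(multivariate_lr_gd(X, Y))             # should recover roughly b0 = 3, beta = [1, -2, 0.5]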
Standardization: feature scaling

area #bedrooms #bathrooms stories ··· price


7420 4 2 2 ··· 13300000
8960 4 4 4 ··· 12250000
...

• Feature matrix: an n × p matrix X, each row is a sample, each
column is a feature
• Response vector: Y = (y1, y2, . . . , yn)^T
• Standardization of each column of X (feature scaling) — transform
all features to the same scale

    X·j = (X·j − mean(X·j)) / standard deviation(X·j)

• You may also scale the response vector

    Y = (Y − mean(Y)) / standard deviation(Y)
lecture2 48/55
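A one-function NumPy sketch of this feature scaling (column-wise z-scores; NumPy's default population standard deviation is used, since the slides do not specify the convention, and the demo data are synthetic):

import numpy as np

def standardize(X):
    # column-wise: subtract the mean, divide by the standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(loc=[7000.0, 3.0], scale=[2000.0, 1.0], size=(5, 2))   # synthetic "area" and "#bedrooms" columns
Z = standardize(X)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))                # each column now has mean ~0 and std 1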
Step length

In practice, you may use backtracking line search or simply a constant


step length.

lecture2 49/55
Normal equation

The normal equation solves the linear regression problem analytically

    minimize_{β0,β1,...,βp}  L(β0, β1, . . . , βp) = (1/2) Σ_{i=1}^n (β^T xi + β0 − yi)²

• The cost function L is convex
• β̂ is a global minimizer of L if and only if ∇L(β̂) = 0
• ∇L(β̂) = 0 can be written equivalently as

    X̂^T X̂ β̂ = X̂^T Y   (normal equation)

Notation. Recall the feature matrix X ∈ R^{n×p} and response vector Y ∈ R^n.
Define

    X = [x1^T; x2^T; . . . ; xn^T],   Y = [y1; y2; . . . ; yn],
    X̂ = [1, x1^T; 1, x2^T; . . . ; 1, xn^T],   β̂ = [β0; β1; . . . ; βp]
lecture2 50/55
Normal equation

Derivation*.

    L(β̂) = (1/2) [ (β^T x1 + β0 − y1)² + (β^T x2 + β0 − y2)² + · · · + (β^T xn + β0 − yn)² ]
          = (1/2) [ (X̂1· β̂ − y1)² + (X̂2· β̂ − y2)² + · · · + (X̂n· β̂ − yn)² ]
          = (1/2) ‖X̂ β̂ − Y‖²

    ∇L(β̂) = X̂^T X̂ β̂ − X̂^T Y
lecture2 51/55
Normal equation
How to solve the normal equation X̂^T X̂ β̂ = X̂^T Y?

Case 1. When X̂^T X̂ is invertible, the normal equation implies that

    β̂ = (X̂^T X̂)^{−1} X̂^T Y

is the unique solution of linear regression.

This often happens when we face an over-determined system — the number
of training examples n is much larger than the number of features p.
We have many training samples to fit but relatively few degrees of freedom.

lecture2 52/55
Normal equation
How to solve the normal equation X̂^T X̂ β̂ = X̂^T Y?

Case 2. When X̂^T X̂ is not invertible, the normal equation will have an
infinite number of solutions.

X̂^T X̂ is not invertible when we face an under-determined problem —
n < p.
We have too many degrees of freedom and do not have enough training
samples.
We can apply any method for solving a linear system (e.g., Gaussian
elimination) to obtain a solution.

lecture2 53/55
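In code, one rarely forms (X̂^T X̂)^{−1} explicitly; a least-squares solver covers both cases. The sketch below uses NumPy's np.linalg.lstsq, which is my suggestion rather than something from the slides (in the rank-deficient Case 2 it returns the minimum-norm solution):

import numpy as np

def fit_linear_regression(X, Y):
    # solve the normal equation X_hat^T X_hat beta_hat = X_hat^T Y in the least-squares sense
    X_hat = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the intercept column of ones
    beta_hat, *_ = np.linalg.lstsq(X_hat, Y, rcond=None)
    return beta_hat                                    # [beta0, beta1, ..., betap]

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 5)), rng.standard_normal(3)   # n = 3 < p = 5: under-determined (Case 2)
print(fit_linear_regression(X, Y))                           # one particular solution of the normal equation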
Normal equation

Example. Give the normal equation for linear regression on the data set
feature 1 feature 2 response
1 0.2 1
0.3 4 2
5 0.6 3

lecture2 54/55
Normal equation

Example. Give the normal equation for linear regression on the data set

    feature 1   feature 2   response
    1           0.2         1
    0.3         4           2
    5           0.6         3

Solution. Data

    X̂ = [1, 1, 0.2; 1, 0.3, 4; 1, 5, 0.6],   Y = [1; 2; 3].

Compute

    X̂^T X̂ = [3, 6.3, 4.8; 6.3, 26.09, 4.4; 4.8, 4.4, 16.4],   X̂^T Y = [6; 16.6; 10].

Then the normal equation is

    [3, 6.3, 4.8; 6.3, 26.09, 4.4; 4.8, 4.4, 16.4] [β0; β1; β2] = [6; 16.6; 10]
    ⇒ [β0; β1; β2] = [0.4651; 0.4651; 0.3488]
lecture2 54/55
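A quick NumPy check of this example (a sketch; np.linalg.solve is used in place of elimination by hand):

import numpy as np

X_hat = np.array([[1.0, 1.0, 0.2],
                  [1.0, 0.3, 4.0],
                  [1.0, 5.0, 0.6]])
Y = np.array([1.0, 2.0, 3.0])

A = X_hat.T @ X_hat            # [[3, 6.3, 4.8], [6.3, 26.09, 4.4], [4.8, 4.4, 16.4]]
b = X_hat.T @ Y                # [6, 16.6, 10]
print(np.linalg.solve(A, b))   # approximately [0.4651, 0.4651, 0.3488]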
Steepest descent vs. normal equation

Steepest descent:
    iterative method
    need to choose step length
    works well with large number of features p

Normal equation:
    analytical solution
    no need to choose step length
    solving the linear system of normal equation is slow when p is large

In practice,
    p ≤ 5000: normal equation
    p > 5000: steepest descent

lecture2 55/55
