Lecture 2: Gradient Descent & Linear Regression

DSA5103 Lecture 2
Yangjing Zhang
22-Aug-2023
NUS
Today’s content
Gradient methods
Unconstrained problem

    min_{x ∈ R^n} f(x)
Algorithmic framework
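Starting from an initial point x^(0), gradient methods iterate

    x^(k+1) = x^(k) + α_k p^(k),    k = 0, 1, 2, . . .

where p^(k) is a descent direction of f at x^(k) and α_k > 0 is a step length.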
Descent direction
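A vector p ∈ R^n is a descent direction of f at x if ∇f(x)^T p < 0, so that f(x + αp) < f(x) for all sufficiently small α > 0. In particular, the negative gradient p = −∇f(x) is a descent direction whenever ∇f(x) ≠ 0, and this is the choice made by the steepest descent method below.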
Example

Example 1:

    min_{x ∈ R} f(x) = (1/2) x^2
Steepest descent method

One may choose to use a constant step length (say α_k = 0.1), or find it via line search rules:

• Exact line search
• Backtracking line search
Constant step length

Apply the steepest descent method with x^(0) = −1, tolerance ε = 10^{-4}, and constant step length α_k = 0.1.

[Figure: iterates of steepest descent with constant step length on f(x) = (1/2) x^2]

• When the step length is too small, the method can be slow
• As it approaches the minimizer, the method will automatically take smaller steps
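For concreteness, here is a minimal Python sketch of steepest descent with a constant step length, applied to Example 1 with x^(0) = −1, α_k = 0.1, and tolerance 10^{-4}; the function name and the gradient-norm stopping rule are illustrative choices, not taken from the slides.

```python
import numpy as np

def steepest_descent_constant(grad_f, x0, alpha=0.1, tol=1e-4, max_iter=10_000):
    """Steepest descent x <- x - alpha * grad_f(x) with a constant step length,
    stopping once the gradient norm drops below tol."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            return x, k
        x = x - alpha * g        # p^(k) = -grad f(x^(k)), constant step alpha
    return x, max_iter

# Example 1: f(x) = 0.5 * x**2, so grad f(x) = x; start from x^(0) = -1
x_final, iters = steepest_descent_constant(lambda x: x, x0=-1.0)
print(x_final, iters)            # x_final is close to the minimizer 0
```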
Constant step length

[Figures: progress of the iterates under a constant step length, shown after successive iterations]
Exact line search

• In general, exact line search is the most difficult part of the steepest descent method
• If f is a simple function, it may be possible to obtain an analytical solution for α_k by solving φ′(α) = 0
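For example, if f is the quadratic f(x) = (1/2) x^T A x + b^T x with A symmetric positive definite, and p^(k) = −∇f(x^(k)), then solving φ′(α) = 0 gives the closed form

    α_k = ∇f(x^(k))^T ∇f(x^(k)) / ( ∇f(x^(k))^T A ∇f(x^(k)) ).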
Example

Apply the steepest descent method with exact line search, given x^(0) = −1.

Solution. k = 0, p^(0) = −∇f(x^(0)) = −∇f(−1) = 1. Find α_0 by solving

    min_{α>0} φ(α) = f(x^(0) + α p^(0)) = (1/2)(α − 1)^2.

Obviously, α_0 = 1, and x^(1) = x^(0) + α_0 p^(0) = 0 is actually the global minimizer.
Example

Apply the steepest descent method with exact line search, given x^(0) = [2; 1].

Solution. Calculate the gradient ∇f(x) = [2x_1 + 2; 2x_2], so ∇f(x^(0)) = [6; 2]. Then p^(0) = −∇f(x^(0)) = [−6; −2], and find α_0 by minimizing φ(α) = f(x^(0) + α p^(0)) over α > 0.
Example

Apply the steepest descent method with exact line search, given x^(0) = [0; 0]. Compute x^(1), x^(2), and x^(3).

Solution. Compute the gradient ∇f(x) = [2x_1 − 2x_2; 4x_2 − 2x_1 − 2]. The first step gives x^(1) = [0; 1/2].
Contour plot

x^(0) = [0; 0]
x^(1) = [0; 1/2]
x^(2) = [1/2; 1/2]
x^(3) = [1/2; 3/4]
...

[Figure: contour plot of f with the iterates x^(0), x^(1), x^(2), x^(3) marked]

Steepest descent method with exact line search follows a zig-zag path towards the solution. This zigzagging makes the method inherently slow.
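The iterates above can be reproduced numerically. The sketch below assumes the objective f(x) = x_1^2 − 2 x_1 x_2 + 2 x_2^2 − 2 x_2, which is the quadratic (up to a constant) consistent with the gradient stated in the example, and uses the closed-form exact step for quadratics.

```python
import numpy as np

# Quadratic consistent with the stated gradient: f(x) = 0.5 x^T A x + b^T x
A = np.array([[2.0, -2.0],
              [-2.0, 4.0]])
b = np.array([0.0, -2.0])

def grad_f(x):
    return A @ x + b                     # = [2x1 - 2x2, 4x2 - 2x1 - 2]

x = np.array([0.0, 0.0])                 # x^(0)
for k in range(3):
    g = grad_f(x)
    alpha = (g @ g) / (g @ A @ g)        # exact line search step for a quadratic
    x = x - alpha * g
    print(f"x^({k + 1}) =", x)
# Prints x^(1) = [0, 0.5], x^(2) = [0.5, 0.5], x^(3) = [0.5, 0.75]
```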
Steepest descent method with exact line search
Properties of steepest descent method with exact line search*

Backtracking line search

Backtracking line search starts with a relatively large step length and iteratively shrinks it (i.e., “backtracking”) until the Armijo condition holds.

Parameters: ᾱ = 1, ρ = 0.9, c_1 = 10^{-4}
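A minimal Python sketch of backtracking line search with the Armijo (sufficient decrease) condition, using the parameters above; the function name and argument layout are illustrative, not from the slides.

```python
def backtracking_line_search(f, grad_f, x, p, alpha_bar=1.0, rho=0.9, c1=1e-4):
    """Shrink alpha by the factor rho until the Armijo condition
    f(x + alpha*p) <= f(x) + c1*alpha*grad_f(x)^T p holds.
    x and p are NumPy arrays; p is a descent direction at x."""
    alpha = alpha_bar
    fx = f(x)
    slope = grad_f(x) @ p      # directional derivative; negative for a descent direction
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha
```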
Steepest descent method with backtracking line search

[Figure: scatter plot of the housing prices data², Price (×10^6) against Area]

² https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
Housing prices data

• Predictor/feature/“input” vector x
• Response/target/“output” variable y
• i-th training example (x_i, y_i)
• n: number of samples/training examples
Housing prices data

Learn a model f: input a new house area x̂ ⇒ predict its price f(x̂)

• Parameters β_1, β_0
• How to choose β_1, β_0?
Illustration

f(x) = f_β(x) = β_1 x + β_0

[Figures: fitted lines for β_1 = 1, β_0 = 2; β_1 = 0, β_0 = 3; β_1 = 2, β_0 = 2]
Linear regression with one variable

• Objective/cost/loss function

    L(β_0, β_1) = (1/2) Σ_{i=1}^n (β_1 x_i + β_0 − y_i)^2
Understand the cost function

• Optimization model

    minimize_{β_1}  L(β_1) = (1/2) Σ_{i=1}^n (β_1 x_i − y_i)^2
Understand the cost function

Toy example. Given training data (1, 1), (2, 2.5), (4, 4) and the linear model f(x) = β_1 x (assume β_0 = 0), compute the values of L(β_1) = (1/2) Σ_{i=1}^n (β_1 x_i − y_i)^2 at β_1 = 0.5, β_1 = 1, β_1 = 1.1.

[Figure: data points and the fitted line for β_1 = 0.5]
Understand the cost function

[Figures: data points and the fitted lines for β_1 = 1 and β_1 = 1.1]
Understand the cost function

[Figures: the three fitted lines and the cost L(β_1) plotted against β_1]

• β_1 = 0.5:  L(β_1) = 3.25
• β_1 = 1:    L(β_1) = 0.125
• β_1 = 1.1:  L(β_1) = 0.13
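These three values follow directly from the definition of L(β_1); a quick, hypothetical check in Python:

```python
import numpy as np

# Toy training data (x_i, y_i)
x = np.array([1.0, 2.0, 4.0])
y = np.array([1.0, 2.5, 4.0])

def L(beta1):
    """L(beta1) = 0.5 * sum_i (beta1 * x_i - y_i)^2 for the toy data."""
    return 0.5 * np.sum((beta1 * x - y) ** 2)

for b1 in (0.5, 1.0, 1.1):
    print(b1, L(b1))   # 3.25, 0.125, 0.13
```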
Steepest descent method for univariate linear regression
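As a rough illustration of such a method (a sketch, not the slides' exact algorithm), steepest descent on L(β_0, β_1) = (1/2) Σ_i (β_1 x_i + β_0 − y_i)^2 with a constant step length updates β_0 and β_1 simultaneously, using ∂L/∂β_0 = Σ_i (β_1 x_i + β_0 − y_i) and ∂L/∂β_1 = Σ_i (β_1 x_i + β_0 − y_i) x_i:

```python
import numpy as np

def univariate_linreg_gd(x, y, alpha=0.01, tol=1e-6, max_iter=100_000):
    """Steepest descent on L(b0, b1) = 0.5 * sum((b1*x + b0 - y)**2)
    with a constant step length alpha."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        r = b1 * x + b0 - y                 # residuals
        g0, g1 = r.sum(), (r * x).sum()     # dL/db0, dL/db1
        if np.hypot(g0, g1) <= tol:
            break
        b0 -= alpha * g0                    # simultaneous update of both parameters
        b1 -= alpha * g1
    return b0, b1

# Fit the toy data from the previous slides
x = np.array([1.0, 2.0, 4.0]); y = np.array([1.0, 2.5, 4.0])
print(univariate_linreg_gd(x, y))
```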
Linear regression with multiple variables
Multiple features

• Predictor/feature/“input” vector x
• Response/target/“output” variable y
• i-th training example (x_i, y_i)
• j-th feature in i-th training example x_{ij}
• n: number of samples/training examples
• p: number of features/variables
One feature vs. multiple features

One feature: f(x) = β_1 x + β_0, with β_1 ∈ R, β_0 ∈ R
    Optimization: minimize_{β_0, β_1} (1/2) Σ_{i=1}^n (β_1 x_i + β_0 − y_i)^2

Multiple features: f(x) = β^T x + β_0, with β ∈ R^p, β_0 ∈ R
    Optimization: minimize_{β_0, β} (1/2) Σ_{i=1}^n (β^T x_i + β_0 − y_i)^2
Steepest descent method for multivariate linear regression

while (not converged)
    for j = 1, 2, . . . , p
        update β_j^(k) by a steepest descent step on L (similarly for β_0^(k))
    end(for)
    k ← k + 1
end(while)
return β_0^(k), β^(k) = (β_1^(k), . . . , β_p^(k))^T
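A vectorized Python sketch of this loop, assuming a constant step length α and a gradient-norm stopping test (the slides' actual stopping rule and step-length choice may differ):

```python
import numpy as np

def multivariate_linreg_gd(X, y, alpha=1e-3, tol=1e-6, max_iter=100_000):
    """Steepest descent on L(b0, b) = 0.5 * ||X @ b + b0 - y||^2,
    where X is n-by-p and b holds one coefficient per feature."""
    _, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(max_iter):
        r = X @ b + b0 - y             # residuals
        g0, g = r.sum(), X.T @ r       # dL/db0 and dL/db_j for j = 1, ..., p
        if np.sqrt(g0 ** 2 + g @ g) <= tol:
            break
        b0 -= alpha * g0               # update all coordinates simultaneously
        b -= alpha * g
    return b0, b
```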
Standardization: feature scaling
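One common choice of feature scaling is z-score standardization (subtract each feature's mean, divide by its standard deviation); this sketch is one possible implementation, not necessarily the exact recipe on the slide:

```python
import numpy as np

def standardize(X):
    """Column-wise z-score scaling: (X - mean) / std.
    Returns the statistics so the same transform can be applied to new data."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```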
Normal equation

Derivation*. Write β̂ = (β_0, β_1, . . . , β_p)^T ∈ R^{p+1}, let X̂ ∈ R^{n×(p+1)} be the data matrix whose i-th row is X̂_{i·} = [1, x_i^T], and let Y = (y_1, . . . , y_n)^T. Then

    L(β̂) = (1/2) [ (β^T x_1 + β_0 − y_1)^2 + (β^T x_2 + β_0 − y_2)^2 + · · · + (β^T x_n + β_0 − y_n)^2 ]
          = (1/2) [ (X̂_{1·} β̂ − y_1)^2 + (X̂_{2·} β̂ − y_2)^2 + · · · + (X̂_{n·} β̂ − y_n)^2 ]
          = (1/2) ‖X̂ β̂ − Y‖^2,

and the gradient is

    ∇L(β̂) = X̂^T X̂ β̂ − X̂^T Y.
Normal equation

How to solve the normal equation X̂^T X̂ β̂ = X̂^T Y ?

Case 1. When X̂^T X̂ is invertible, the normal equation implies that

    β̂ = (X̂^T X̂)^{-1} X̂^T Y.
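In practice one solves this linear system directly rather than forming the inverse explicitly; a minimal NumPy sketch (X_hat and Y denote the matrices defined above):

```python
import numpy as np

def normal_equation(X_hat, Y):
    """Solve X_hat^T X_hat beta = X_hat^T Y (Case 1: X_hat^T X_hat invertible)."""
    return np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)
```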
Normal equation

Example. Give the normal equation for linear regression on the data set

    feature 1    feature 2    response
    1            0.2          1
    0.3          4            2
    5            0.6          3

Solution. Data X̂ = [1 1 0.2; 1 0.3 4; 1 5 0.6], Y = [1; 2; 3]. Compute

    X̂^T X̂ = [3 6.3 4.8; 6.3 26.09 4.4; 4.8 4.4 16.4],    X̂^T Y = [6; 16.6; 10].

Then the normal equation is

    [3 6.3 4.8; 6.3 26.09 4.4; 4.8 4.4 16.4] [β_0; β_1; β_2] = [6; 16.6; 10]
    ⇒ β_0 ≈ 0.4651, β_1 ≈ 0.4651, β_2 ≈ 0.3488.
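A quick numerical check of this example (hypothetical script):

```python
import numpy as np

X_hat = np.array([[1.0, 1.0, 0.2],
                  [1.0, 0.3, 4.0],
                  [1.0, 5.0, 0.6]])
Y = np.array([1.0, 2.0, 3.0])

beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)
print(beta)   # approximately [0.4651, 0.4651, 0.3488]
```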
Steepest descent vs. normal equation

• Steepest descent: works well with a large number of features p
• Normal equation: solving the linear system of the normal equation is slow when p is large

In practice,
• p ≤ 5000: normal equation
• p > 5000: steepest descent