ML - Lec 5 - Regression - Gradient Descent Least Square

The document discusses linear regression, focusing on the formulation of the regression model using basis functions and the application of gradient descent for optimization. It outlines the loss function, the learning algorithm, and the closed-form solution using ordinary least squares. Additionally, it emphasizes that linear regression is a convex optimization problem, ensuring that gradient descent will reach a global optimum.

General linear regression problem

• Using our new notation for the basis functions, linear regression can be written as

$$y = \sum_{j=0}^{k} w_j \,\phi_j(x)$$

• where φ_j(x) can be either x_j for multivariate regression or one of the non-linear basis functions we defined

• … and φ_0(x) = 1 for the intercept term
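As a small illustration, here is a minimal sketch of this model with a polynomial basis; the basis choice and the weight values are made up for the example:

```python
import numpy as np

def poly_basis(x, k):
    """Map a scalar input x to the basis vector [1, x, x**2, ..., x**k]."""
    return np.array([x**j for j in range(k + 1)])

def predict(x, w, k):
    """Compute y = sum_j w_j * phi_j(x) for a single input x."""
    return poly_basis(x, k) @ w

# Example: k = 2 (quadratic basis), weights chosen arbitrarily
w = np.array([0.5, -1.0, 2.0])
print(predict(3.0, w, k=2))   # 0.5 - 1.0*3 + 2.0*9 = 15.5
```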
Gradient descent

Gradient Descent for Linear Regression

Goal: minimize the following loss function, predicting with $\hat{y}^i = \sum_j w_j \phi_j(x^i)$:

$$J_{X,y}(w) = \sum_i \big( y^i - \hat{y}^i \big)^2 = \sum_i \Big( y^i - \sum_j w_j \phi_j(x^i) \Big)^2$$

(The outer sum runs over the n examples; the inner sum runs over the k+1 basis functions.)
Gradient Descent for Linear Regression

Goal: minimize the loss function above, predicting with $\hat{y}^i = \sum_j w_j \phi_j(x^i)$:

$$J_{X,y}(w) = \sum_i \big( y^i - \hat{y}^i \big)^2 = \sum_i \Big( y^i - \sum_j w_j \phi_j(x^i) \Big)^2$$

Taking the partial derivative with respect to $w_j$:

$$\frac{\partial}{\partial w_j} J(w) = \frac{\partial}{\partial w_j} \sum_i \big( y^i - \hat{y}^i \big)^2$$
$$= -2 \sum_i \big( y^i - \hat{y}^i \big) \frac{\partial \hat{y}^i}{\partial w_j}$$
$$= -2 \sum_i \big( y^i - \hat{y}^i \big) \frac{\partial}{\partial w_j} \sum_{j'} w_{j'} \phi_{j'}(x^i)$$
$$= -2 \sum_i \big( y^i - \hat{y}^i \big)\, \phi_j(x^i)$$

(The minus sign comes from differentiating $-\hat{y}^i$ inside the squared residual.)
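One quick way to sanity-check the derived gradient is to compare it against finite differences on random data. This sketch is not from the lecture; Phi is a matrix whose rows hold the basis-function values φ_j(x^i) for each example (written Φ later in these notes), and all data here is random and illustrative:

```python
import numpy as np

def loss(w, Phi, y):
    """J(w) = sum_i (y_i - Phi_i . w)^2 for an n x (k+1) basis matrix Phi."""
    r = y - Phi @ w
    return r @ r

def grad(w, Phi, y):
    """Analytic gradient: dJ/dw_j = -2 * sum_i (y_i - yhat_i) * phi_j(x_i)."""
    return -2 * Phi.T @ (y - Phi @ w)

# Finite-difference check on random data (illustrative only)
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w = rng.normal(size=4)

eps = 1e-6
num = np.array([(loss(w + eps*e, Phi, y) - loss(w - eps*e, Phi, y)) / (2*eps)
                for e in np.eye(4)])
print(np.allclose(num, grad(w, Phi, y), atol=1e-4))  # True
```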
Gradient Descent for Linear Regression

Learning algorithm:

• Initialize weights w = 0
• For t = 1, … until convergence:
  • Predict for each example $x^i$ using w: $\hat{y}^i = \sum_{j=0}^{k} w_j \phi_j(x^i)$
  • Compute the gradient of the loss: $\frac{\partial}{\partial w_j} J(w) = -2 \sum_i \big( y^i - \hat{y}^i \big)\, \phi_j(x^i)$
    • Stacking these partials over j gives a vector g
  • Update: w = w − λ g
• λ is the learning rate.
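A minimal sketch of this loop in NumPy; the function name, default step size, and stopping rule are mine, not from the lecture:

```python
import numpy as np

def gradient_descent(Phi, y, lam=1e-3, iters=5000, tol=1e-8):
    """Batch gradient descent for linear regression.
    Phi: (n, k+1) matrix of basis-function values, y: (n,) targets,
    lam: learning rate (the lecture's lambda), iters/tol: stopping criteria."""
    w = np.zeros(Phi.shape[1])            # initialize weights w = 0
    for _ in range(iters):
        y_hat = Phi @ w                   # predictions for all examples
        g = -2 * Phi.T @ (y - y_hat)      # gradient of the squared-error loss
        w_new = w - lam * g               # update step
        if np.linalg.norm(w_new - w) < tol:   # simple convergence check
            return w_new
        w = w_new
    return w
```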

Gradient Descent for Linear Regression

We can use any of the usual tricks:
• stochastic gradient descent (if the data is too big to fit in memory); a sketch follows below
• regularization
• …
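A sketch of the stochastic variant, which updates w from one example at a time so the full data set never has to be processed in a single step; names and defaults are illustrative:

```python
import numpy as np

def sgd(Phi, y, lam=1e-3, epochs=20, seed=0):
    """Stochastic gradient descent for linear regression:
    take a gradient step on a single example at a time."""
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):        # visit examples in random order
            err = y[i] - Phi[i] @ w         # residual on a single example
            w += lam * 2 * err * Phi[i]     # per-example gradient step
    return w
```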
Linear regression is a convex optimization problem, so again gradient descent will reach a global optimum.

Proof: differentiate again to get the second derivative (the Hessian) and check that it is positive semi-definite.
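Carrying out that second differentiation in matrix form (writing the loss with the matrix Φ of basis-function values, defined formally in the OLS section below):

```latex
\[
J(w) = \lVert y - \Phi w \rVert^2 ,\qquad
\nabla J(w) = -2\,\Phi^{\top}(y - \Phi w) ,\qquad
\nabla^{2} J(w) = 2\,\Phi^{\top}\Phi .
\]
% For any vector v, v^T (2 Phi^T Phi) v = 2 ||Phi v||^2 >= 0,
% so the Hessian is positive semi-definite and J(w) is convex.
\[
v^{\top}\big(2\,\Phi^{\top}\Phi\big)\,v \;=\; 2\,\lVert \Phi v \rVert^{2} \;\ge\; 0
\quad\text{for every } v .
\]
```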


Multivariate Least Squares

Approach 2: Matrix Inversion


OLS (Ordinary Least Squares Solution)
Goal: minimize the loss function, predicting with $\hat{y}^i = \sum_j w_j \phi_j(x^i)$:

$$J_{X,y}(w) = \sum_i \big( y^i - \hat{y}^i \big)^2 = \sum_i \Big( y^i - \sum_j w_j \phi_j(x^i) \Big)^2$$

whose gradient, from the previous section, is

$$\frac{\partial}{\partial w_j} J(w) = -2 \sum_i \big( y^i - \hat{y}^i \big)\, \phi_j(x^i)$$
Notation: stack the n examples and the k+1 basis functions into a design matrix, a target vector, and a weight vector:

$$\Phi = \begin{pmatrix}
\phi_0(x^1) & \phi_1(x^1) & \cdots & \phi_k(x^1) \\
\phi_0(x^2) & \phi_1(x^2) & \cdots & \phi_k(x^2) \\
\vdots & & & \vdots \\
\phi_0(x^n) & \phi_1(x^n) & \cdots & \phi_k(x^n)
\end{pmatrix}
= \begin{pmatrix} \phi^1 \\ \vdots \\ \phi^n \end{pmatrix},
\qquad
y = \begin{pmatrix} y^1 \\ \vdots \\ y^n \end{pmatrix},
\qquad
w = \begin{pmatrix} w_0 \\ \vdots \\ w_k \end{pmatrix}$$

$\Phi$ has one row $\phi^i = \big(\phi_0(x^i), \ldots, \phi_k(x^i)\big)$ per example (n rows) and one column per basis function (k+1 columns).

Writing the gradient out one coordinate at a time, with the shorthand $\phi_j^i \equiv \phi_j(x^i)$:

$$\frac{\partial}{\partial w_0} J(w) = -2 \sum_i \big( y^i - \hat{y}^i \big)\, \phi_0^i = -2 \sum_i \big( y^i \phi_0^i - \hat{y}^i \phi_0^i \big)$$
$$\vdots$$
$$\frac{\partial}{\partial w_k} J(w) = -2 \sum_i \big( y^i - \hat{y}^i \big)\, \phi_k^i = -2 \sum_i \big( y^i \phi_k^i - \hat{y}^i \phi_k^i \big)$$

Recall that $\hat{y}^i = \sum_j w_j \phi_j^i = \phi^i w$, i.e. the i-th row of $\Phi$ times $w$.
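A concrete sketch of this notation, assuming a polynomial basis (one of several choices mentioned earlier): each row of Φ is φ^i, and the prediction ŷ^i is the dot product of that row with w.

```python
import numpy as np

def design_matrix(xs, k):
    """Build the n x (k+1) matrix Phi with Phi[i, j] = phi_j(x^i),
    here using the polynomial basis phi_j(x) = x**j (so phi_0 = 1 is the intercept)."""
    xs = np.asarray(xs, dtype=float)
    return np.column_stack([xs**j for j in range(k + 1)])

Phi = design_matrix([1.0, 2.0, 3.0], k=2)
print(Phi)
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]

w = np.array([0.5, -1.0, 2.0])
print(Phi[1] @ w)   # prediction for the second example: 0.5 - 2.0 + 8.0 = 6.5
print(Phi @ w)      # all n predictions at once
```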
Substituting $\hat{y}^i = \phi^i w$ and stacking the k+1 partial derivatives into a single vector:

$$\frac{\partial}{\partial w_j} J(w) = -2 \sum_i \big( \phi_j^i\, y^i - \phi_j^i\, \phi^i w \big), \qquad j = 0, \ldots, k$$

$$\nabla_w J(w) = -2\,\Phi^{\top} y + 2\,\Phi^{\top}\Phi\, w$$

The first term collects the sums $\sum_i \phi_j^i y^i$: $\Phi^{\top}$ is (k+1) x n and $y$ is an n-vector, so $\Phi^{\top} y$ is a (k+1)-vector. The second term collects the sums $\sum_i \phi_j^i\, \phi^i w$: $\Phi^{\top}\Phi$ is (k+1) x (k+1), so $\Phi^{\top}\Phi\, w$ is also a (k+1)-vector.
Setting the gradient to zero:

$$\nabla_w J(w) = -2\,\Phi^{\top} y + 2\,\Phi^{\top}\Phi\, w = 0
\quad\Longrightarrow\quad
\Phi^{\top}\Phi\, w = \Phi^{\top} y
\quad\Longrightarrow\quad
w = \big(\Phi^{\top}\Phi\big)^{-1}\Phi^{\top} y$$
recap: Solving linear regression

• To optimize – closed form:
• We just take the derivative w.r.t. w and set it to 0:

$$\frac{\partial}{\partial w} \sum_i (y_i - w x_i)^2 = 2 \sum_i -x_i (y_i - w x_i)$$

$$\Rightarrow\quad 2 \sum_i x_i (y_i - w x_i) = 0
\quad\Rightarrow\quad 2 \sum_i x_i y_i - 2 \sum_i w\, x_i x_i = 0$$

$$\Rightarrow\quad \sum_i x_i y_i = \sum_i w\, x_i^2
\quad\Rightarrow\quad
w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$

which equals covar(X, Y)/var(X) if mean(X) = mean(Y) = 0.
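A quick numeric check of the one-dimensional formula; the data values here are made up for illustration:

```python
import numpy as np

# Single-variable least squares (no intercept): w = sum(x*y) / sum(x*x)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])

w = (x @ y) / (x @ x)     # closed-form slope
print(w)                  # 28.5 / 14 ≈ 2.036
# Equals covar(X, Y) / var(X) when both X and Y have zero mean.
```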
The multivariate case has exactly the same shape. The scalar condition $2\sum_i x_i y_i - 2\sum_i w\, x_i x_i = 0$ becomes, with the design matrix $\Phi$, the target vector $y$, and the weight vector $w$ defined above,

$$\Phi^{\top}\Phi\, w = \Phi^{\top} y
\quad\Longrightarrow\quad
w = \big(\Phi^{\top}\Phi\big)^{-1}\Phi^{\top} y$$
LMS for general linear regression problem

$$J(w) = \sum_i \big( y^i - w^{\top}\phi(x^i) \big)^2$$

Deriving w we get: $w = \big(\Phi^{\top}\Phi\big)^{-1}\Phi^{\top} y$

where $y$ is a vector with n entries, $w$ is a vector with k+1 entries, and $\Phi$ is an n by k+1 matrix.

The matrix $\big(\Phi^{\top}\Phi\big)^{-1}\Phi^{\top}$ appearing in this solution is also known as the 'pseudo-inverse' of $\Phi$.
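A minimal NumPy sketch of this closed-form solution; the function names are mine, not from the lecture. np.linalg.lstsq solves the same least-squares problem without forming the explicit inverse, which is numerically preferable:

```python
import numpy as np

def ols_fit(Phi, y):
    """Least-squares weights for design matrix Phi and targets y,
    solved without explicitly inverting Phi^T Phi."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def ols_fit_normal_equations(Phi, y):
    """Textbook version, matching the formula w = (Phi^T Phi)^{-1} Phi^T y."""
    return np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y
```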

Another reason to start with an objective function: you can see when two learning methods are the same!
LMS versus gradient descent

$$J(w) = \sum_i \big( y^i - w^{\top}\phi(x^i) \big)^2
\qquad\qquad
w = \big(\Phi^{\top}\Phi\big)^{-1}\Phi^{\top} y$$

LMS solution:
+ Very simple in Matlab or something similar
- Requires a matrix inverse, which is expensive for a large matrix.

Gradient descent:
+ Fast for large matrices
+ Stochastic GD is very memory efficient
+ Easily extended to other cases
- Parameters to tweak (how to decide convergence? what is the learning rate? …)
Regression and Overfitting

An example: polynomial basis vectors on a small dataset – from Bishop, Ch. 1

[Figure slides: fits of 0th-, 1st-, 3rd-, and 9th-order polynomials to n = 10 points; a Root-Mean-Square (RMS) error plot illustrating over-fitting; a table of the polynomial coefficients; and the 9th-order polynomial refit as the data set size grows.]
Regularization

Penalize large coefficient values:

$$J_{X,y}(w) = \frac{1}{2} \sum_i \Big( y^i - \sum_j w_j \phi_j(x^i) \Big)^2 + \frac{\lambda}{2} \lVert w \rVert^2$$

[Figure slides: the 9th-order fit and its table of polynomial coefficients with no regularization, with a moderate penalty, and with a huge penalty (over-regularization).]
Regularized Gradient Descent for LR

Goal: minimize the regularized loss function, predicting with $\hat{y}^i = \sum_j w_j \phi_j(x^i)$:

$$J_{X,y}(w) = \frac{1}{2} \sum_i \Big( y^i - \sum_j w_j \phi_j(x^i) \Big)^2 + \frac{\lambda}{2} \lVert w \rVert^2
= \frac{1}{2} \sum_i \Big( y^i - \sum_j w_j \phi_j(x^i) \Big)^2 + \frac{\lambda}{2} \sum_j w_j^2$$

$$\frac{\partial}{\partial w_j} J(w) = -\sum_i \big( y^i - \hat{y}^i \big)\, \phi_j(x^i) + \lambda\, w_j$$
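A sketch of gradient descent on this regularized loss. The slides use λ both for the learning rate and for the penalty weight, so the code names them lr and reg to keep them apart; both defaults are illustrative:

```python
import numpy as np

def ridge_gradient_descent(Phi, y, reg=0.1, lr=1e-3, iters=5000):
    """Gradient descent on the L2-regularized loss
    J(w) = 0.5*||y - Phi w||^2 + 0.5*reg*||w||^2."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        residual = y - Phi @ w
        g = -Phi.T @ residual + reg * w    # gradient of the regularized loss
        w -= lr * g
    return w
```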
Probabilistic Interpretation of Least Squares

A probabilistic interpretation

Our least squares minimization solution can also be motivated by a probabilistic interpretation of the regression problem:

$$y = w^{\top}\phi(x) + \varepsilon$$

where ε is Gaussian noise.

The MLE for w in this model is the same as the solution we derived for the least squares criterion:

$$w = \big(\Phi^{\top}\Phi\big)^{-1}\Phi^{\top} y$$
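The step from Gaussian noise to the least-squares criterion, which the slide asserts without derivation, is short:

```latex
% Assuming \varepsilon \sim \mathcal{N}(0, \sigma^2), the likelihood of the data is
\[
p(y \mid X, w) = \prod_{i} \mathcal{N}\!\big(y^i \mid w^{\top}\phi(x^i), \sigma^2\big),
\]
% so the log-likelihood is
\[
\log p(y \mid X, w)
= -\frac{1}{2\sigma^2} \sum_i \big( y^i - w^{\top}\phi(x^i) \big)^2
  - \frac{n}{2}\log\!\big(2\pi\sigma^2\big).
\]
% Maximizing over w ignores the constant term and the positive factor 1/(2*sigma^2),
% so the MLE minimizes sum_i (y^i - w^T phi(x^i))^2 -- exactly the least-squares loss.
```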
Bias-Variance

[Example figure slides (Tom Dietterich, Oregon State): the same fitting experiment, repeated with 50 samples of 20 points each.]