ML - Lec 5 - Regression - Gradient Descent Least Square
• Using our new notation, the basis function linear regression problem can be written as

    y = ∑_{j=0}^{k} w_j φ_j(x)

• The squared loss over the training set is (writing ŷ^i = ∑_j w_j φ_j(x^i) for the prediction on example i)

    J_{X,y}(w) = ∑_i ( y^i − ŷ^i )² = ∑_i ( y^i − ∑_j w_j φ_j(x^i) )²

• Taking the partial derivative with respect to w_j:

    ∂/∂w_j J(w) = ∂/∂w_j ∑_i ( y^i − ŷ^i )²
                = 2 ∑_i ( y^i − ŷ^i ) ∂/∂w_j ( y^i − ŷ^i )
                = −2 ∑_i ( y^i − ŷ^i ) ∂/∂w_j ∑_{j'} w_{j'} φ_{j'}(x^i)
                = −2 ∑_i ( y^i − ŷ^i ) φ_j(x^i)
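A minimal NumPy sketch of this gradient computation (not from the slides; the names `Phi`, `y`, and `w` are illustrative, with `Phi[i, j] = φ_j(x^i)`):

```python
import numpy as np

def squared_loss(Phi, y, w):
    """J(w) = sum_i (y^i - yhat^i)^2, with yhat^i = sum_j w_j * phi_j(x^i)."""
    residual = y - Phi @ w            # vector of (y^i - yhat^i)
    return np.sum(residual ** 2)

def squared_loss_grad(Phi, y, w):
    """dJ/dw_j = -2 * sum_i (y^i - yhat^i) * phi_j(x^i), for all j at once."""
    residual = y - Phi @ w
    return -2.0 * Phi.T @ residual    # shape (k+1,): one entry per basis function
```

A quick finite-difference check of `squared_loss_grad` against `squared_loss` is an easy way to confirm the sign.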
Gradient Descent for Linear Regression
Learning algorithm:
• Compute the gradient of the loss: ∂/∂w_j J(w) = −2 ∑_i ( y^i − ŷ^i ) φ_j(x^i)
• Stacking these partial derivatives (one per w_j) gives a vector g
• Update: w = w − λ g
• λ is the learning rate.
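A sketch of this learning algorithm as a batch gradient-descent loop, under the same assumed `Phi`/`y` setup; the step size and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def gradient_descent(Phi, y, lam=1e-3, n_iters=1000):
    """Minimize the squared loss by repeating w <- w - lambda * g."""
    w = np.zeros(Phi.shape[1])               # one weight per basis function
    for _ in range(n_iters):
        g = -2.0 * Phi.T @ (y - Phi @ w)     # g_j = -2 * sum_i (y^i - yhat^i) * phi_j(x^i)
        w = w - lam * g                      # lam is the learning rate
    return w
```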
Gradient Descent for Linear Regression
We can use any of the usual tricks:
– stochastic gradient descent (if the data is too big to fit in memory; a sketch follows below)
– regularization
– …
Linear regression is a convex optimization problem, so again gradient descent will reach a global optimum.

    J_{X,y}(w) = ∑_i ( y^i − ŷ^i )² = ∑_i ( y^i − ∑_j w_j φ_j(x^i) )²

    ∂/∂w_j J(w) = −2 ∑_i ( y^i − ŷ^i ) φ_j(x^i)
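The stochastic variant mentioned above, sketched with single-example updates (the epoch count, step size, and shuffling scheme are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

def stochastic_gradient_descent(Phi, y, lam=1e-3, n_epochs=20, seed=0):
    """Update w using the gradient of one example's squared error at a time."""
    rng = np.random.default_rng(seed)
    n, k_plus_1 = Phi.shape
    w = np.zeros(k_plus_1)
    for _ in range(n_epochs):
        for i in rng.permutation(n):              # visit examples in random order
            residual = y[i] - Phi[i] @ w          # y^i - yhat^i for this example
            w = w + lam * 2.0 * residual * Phi[i] # w - lam * (-2 * residual * phi^i)
    return w
```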
Goal: minimize the loss function          predict with:  ŷ^i = ∑_j w_j φ_j(x^i)

    J_{X,y}(w) = ∑_i ( y^i − ŷ^i )² = ∑_i ( y^i − ∑_j w_j φ_j(x^i) )²

    ∂/∂w_j J(w) = −2 ∑_i ( y^i − ŷ^i ) φ_j(x^i)

Notation: collect the targets, the basis-function values, and the weights into

    y = [ y^1, …, y^n ]ᵀ                                      (n entries)

    Φ = [ φ_0(x^1)  φ_1(x^1)  …  φ_k(x^1) ]
        [ φ_0(x^2)  φ_1(x^2)  …  φ_k(x^2) ]
        [     ⋮          ⋮             ⋮  ]
        [ φ_0(x^n)  φ_1(x^n)  …  φ_k(x^n) ]                   (n examples × (k+1) basis vectors)

    w = [ w_0, …, w_k ]ᵀ                                      (k+1 entries)

The i-th row of Φ is written φ^i, so Φ = [ φ^1; … ; φ^n ].
Deriving the closed-form solution

Writing out one partial derivative per weight:

    ∂/∂w_0 J(w) = −2 ∑_i ( y^i − ŷ^i ) φ_0(x^i)
    …
    ∂/∂w_k J(w) = −2 ∑_i ( y^i − ŷ^i ) φ_k(x^i)

Shorthand notation: φ_ij ≡ φ_j(x^i), and recall that ŷ^i = ∑_j w_j φ_ij = φ^i w, where φ^i is the i-th row of Φ. The partial derivatives become

    ∂/∂w_0 J(w) = −2 ∑_i ( y^i φ_i0 − ŷ^i φ_i0 )
    …
    ∂/∂w_k J(w) = −2 ∑_i ( y^i φ_ik − ŷ^i φ_ik )
Substituting ŷ^i = φ^i w and stacking the k+1 partial derivatives into a single gradient vector:

    ∂/∂w_0 J(w) = −2 ∑_i ( y^i φ_i0 − φ^i w φ_i0 )
    …
    ∂/∂w_k J(w) = −2 ∑_i ( y^i φ_ik − φ^i w φ_ik )

    ∇J(w) = −2 Φᵀ y + 2 Φᵀ Φ w

Here Φᵀ is a (k+1) × n matrix, so Φᵀ y collects the ∑_i φ_ij y^i terms (k+1 entries) and Φᵀ Φ is a (k+1) × (k+1) matrix.

Setting the gradient to zero:

    −2 Φᵀ y + 2 Φᵀ Φ w = 0   ⇒   Φᵀ Φ w = Φᵀ y   ⇒   w = (Φᵀ Φ)⁻¹ Φᵀ y
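In practice the normal equations Φᵀ Φ w = Φᵀ y are solved with a linear solver rather than an explicit inverse; a minimal sketch:

```python
import numpy as np

def least_squares_closed_form(Phi, y):
    """Solve Phi^T Phi w = Phi^T y (equivalent to w = (Phi^T Phi)^-1 Phi^T y)."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# np.linalg.lstsq(Phi, y, rcond=None)[0] returns the same w and is more robust
# when Phi^T Phi is ill-conditioned.
```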
recap: Solving linear regression
• To optimize – closed form:
• We just take the derivative w.r.t. w and set it to 0:

    ∂/∂w ∑_i ( y_i − w x_i )² = −2 ∑_i x_i ( y_i − w x_i )  ⇒  ∑_i x_i y_i = ∑_i w x_i²

    ⇒  w = ∑_i x_i y_i / ∑_i x_i²

which equals covar(X,Y)/var(X) if mean(X) = mean(Y) = 0.
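A tiny numerical check of this 1D formula and its covariance interpretation on centered synthetic data (the data and the true slope of 3 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
x = x - x.mean()                                 # mean(X) = 0
y = 3.0 * x + rng.normal(scale=0.1, size=100)
y = y - y.mean()                                 # mean(Y) = 0

w = np.sum(x * y) / np.sum(x ** 2)               # w = sum x_i y_i / sum x_i^2
w_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # covar(X,Y) / var(X)
print(w, w_cov)                                  # same value, close to the slope 3
```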
The general case parallels the 1D recap: setting the derivative to zero gives

    2 ∑_i x_i y_i − 2 ∑_i w x_i x_i = 0          (1D case)

    Φᵀ Φ w = Φᵀ y                                (general basis-function case)

so again  w = (Φᵀ Φ)⁻¹ Φᵀ y.
LMS for general linear regression problem

    J(w) = ∑_i ( y^i − wᵀ φ(x^i) )²

Deriving w we get:  w = (Φᵀ Φ)⁻¹ Φᵀ y

where y is an n-entry vector, w is a (k+1)-entry vector, and Φ is an n × (k+1) matrix.
LMS solution:
+ Very simple in Matlab or something similar
- Requires a matrix inverse, which is expensive for a large matrix.

Gradient descent:
+ Fast for large matrices
+ Stochastic GD is very memory efficient
+ Easily extended to other cases
- Parameters to tweak (how to decide convergence? what is the learning rate? …)
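A small sanity check, on assumed synthetic data, that the LMS closed form and gradient descent reach the same weights:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
x = rng.uniform(-1, 1, size=n)
Phi = np.vstack([x ** j for j in range(k + 1)]).T        # polynomial basis
y = Phi @ np.array([0.5, -1.0, 2.0, 0.3]) + rng.normal(scale=0.05, size=n)

# Closed form (LMS solution)
w_lms = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Batch gradient descent
w_gd = np.zeros(k + 1)
for _ in range(20000):
    w_gd -= 1e-3 * (-2.0 * Phi.T @ (y - Phi @ w_gd))

print(np.allclose(w_lms, w_gd, atol=1e-3))               # should print True once GD has converged
```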
Regression and Overfitting
An example: polynomial basis vectors on a small dataset
– From Bishop Ch. 1
[Figure slides, from Bishop Ch. 1: 0th, 1st, 3rd, and 9th order polynomial fits to the same n = 10 data points]
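A sketch reproducing this kind of experiment numerically: fit polynomials of order 0, 1, 3, and 9 to n = 10 noisy samples of sin(2πx) and compare train/test RMSE (the noise level and test grid are illustrative choices in the spirit of Bishop's setup):

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_design(x, order):
    return np.vstack([x ** j for j in range(order + 1)]).T

# n = 10 noisy training points from a sin(2*pi*x) curve, plus a clean test grid
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for order in (0, 1, 3, 9):
    Phi = poly_design(x_train, order)
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    rmse_train = np.sqrt(np.mean((y_train - Phi @ w) ** 2))
    rmse_test = np.sqrt(np.mean((y_test - poly_design(x_test, order) @ w) ** 2))
    print(f"order {order}: train RMSE {rmse_train:.3f}, test RMSE {rmse_test:.3f}")
# Typically the 9th-order fit drives the training error to ~0 while the test error grows.
```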
Over-fitting
Regularization: add a penalty on the size of the weights to the loss:

    J_{X,y}(w) = (1/2) ∑_i ( y^i − ∑_j w_j φ_j(x^i) )² + (λ/2) ‖w‖²

[Table slide: polynomial coefficients under different amounts of regularization – none, exp(18), huge]
[Figure slide: over-regularization]
Regularized Gradient Descent for LR
Goal: minimize the loss function          predict with:  ŷ^i = ∑_j w_j φ_j(x^i)

    J_{X,y}(w) = (1/2) ∑_i ( y^i − ∑_j w_j φ_j(x^i) )² + (λ/2) ‖w‖²
              = (1/2) ∑_i ( y^i − ∑_j w_j φ_j(x^i) )² + (λ/2) ∑_j w_j²

    ∂/∂w_j J(w) = −∑_i ( y^i − ŷ^i ) φ_j(x^i) + λ w_j
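A sketch of the regularized (ridge) version: one gradient step using the derivative above, plus the standard closed form that this penalized loss admits (the closed form is not shown on the slides; `lam` and `step` are illustrative):

```python
import numpy as np

def ridge_gradient_step(Phi, y, w, lam, step):
    """One step of w <- w - step * dJ/dw with dJ/dw = -Phi^T (y - Phi w) + lam * w."""
    grad = -Phi.T @ (y - Phi @ w) + lam * w
    return w - step * grad

def ridge_closed_form(Phi, y, lam):
    """Minimizer of 1/2 * sum residual^2 + lam/2 * ||w||^2."""
    k_plus_1 = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k_plus_1), Phi.T @ y)
```

Note that with the 1/2 in front of the squared error the factor of 2 cancels, matching the derivative on the slide.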
Probabilistic Interpretation of Least Squares

A probabilistic interpretation
Our least squares minimization solution can also be motivated by a probabilistic interpretation of the regression problem:

    y = wᵀ φ(x) + ε

where ε is Gaussian noise. Maximizing the likelihood of the training data under this model yields the same solution:

    w = (Φᵀ Φ)⁻¹ Φᵀ y
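A short sketch of the standard maximum-likelihood argument behind this interpretation (σ² denotes the assumed noise variance):

```latex
% Likelihood of the data under y^i = w^T phi(x^i) + eps, with eps ~ N(0, sigma^2)
p(y^1,\dots,y^n \mid X, w) = \prod_{i=1}^{n}
    \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\Big(-\frac{(y^i - w^T\phi(x^i))^2}{2\sigma^2}\Big)

% Taking the log turns the product into a sum
\log p = -\frac{n}{2}\log(2\pi\sigma^2)
         - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \big(y^i - w^T\phi(x^i)\big)^2

% Maximizing over w is therefore equivalent to minimizing the squared loss
\sum_{i=1}^{n} \big(y^i - w^T\phi(x^i)\big)^2
\quad\Longrightarrow\quad
w_{\mathrm{ML}} = (\Phi^T\Phi)^{-1}\Phi^T y
```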
Bias-Variance
Example (figures courtesy of Tom Dietterich, Oregon State)