Lecture 15: Regression

George Lan

A. Russell Chandler III Chair Professor


H. Milton Stewart School of Industrial & Systems Engineering
Machine learning for apartment hunting
Suppose you are about to move to Atlanta, and you want to find the most reasonably priced apartment satisfying your needs:
square footage, # of bedrooms, distance to campus, ...

Living area (ft^2)    # bedrooms    Rent ($)
230                   1             600
506                   2             1000
433                   2             1100
109                   1             500
150                   1             ?
270                   1.5           ?

The learning problem
Features:
Living area, distance to campus, # bedrooms, ...
Denoted as x = (x_1, x_2, \ldots, x_n)^T

Target:
Rent, denoted as y

Training set:
X = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})
y = (y^{(1)}, y^{(2)}, \ldots, y^{(m)})^T

[Figures: scatter plots of rent vs. living area, and of rent vs. living area and location]

Linear Regression Model
Assume y is a linear function of x (the features) plus noise \epsilon:

    y = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n + \epsilon

where \epsilon is an error term capturing unmodeled effects or random noise.

Let \theta = (\theta_0, \theta_1, \ldots, \theta_n)^T, and augment the data by one dimension:

    x \leftarrow (1, x)^T

Then y = \theta^T x + \epsilon.

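To make the augmentation concrete, here is a minimal NumPy sketch (not from the slides); the array names and the reuse of the apartment table above are purely illustrative.

import numpy as np

# Raw features, one column per training example (d x m):
# row 0 = living area, row 1 = # bedrooms (values from the apartment table above).
X_raw = np.array([[230.0, 506.0, 433.0, 109.0],
                  [1.0,   2.0,   2.0,   1.0]])

# Augment each example with a leading 1 so the intercept theta_0 is absorbed
# into theta: x <- (1, x)^T, giving a (d+1) x m matrix.
X = np.vstack([np.ones(X_raw.shape[1]), X_raw])

# A prediction is then simply theta^T x; for all m examples at once:
theta = np.zeros(X.shape[0])
y_hat = theta @ X
print(X.shape, y_hat.shape)   # (3, 4) (4,)
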
Least mean square method
Given m data points, find \theta that minimizes the mean square error:

    \hat{\theta} = \arg\min_\theta L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

Our usual trick: set the gradient to 0 and solve for the parameter:

    \frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right) x^{(i)} = 0

    \Leftrightarrow \; -\frac{2}{m} \sum_{i=1}^{m} y^{(i)} x^{(i)} + \frac{2}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T} \theta = 0

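A direct NumPy transcription of this loss and gradient (a sketch, assuming the column-per-example convention X of shape (d, m) used on the next slide):

import numpy as np

def mse_loss(theta, X, y):
    # L(theta) = (1/m) * sum_i (y^(i) - theta^T x^(i))^2
    m = X.shape[1]
    residual = y - theta @ X
    return residual @ residual / m

def mse_gradient(theta, X, y):
    # dL/dtheta = -(2/m) * sum_i (y^(i) - theta^T x^(i)) x^(i)
    m = X.shape[1]
    return -(2.0 / m) * (X @ (y - theta @ X))
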
Matrix version of the gradient

    \frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \sum_{i=1}^{m} y^{(i)} x^{(i)} + \frac{2}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T} \theta = 0

Equivalent to

    \frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \left( x^{(1)}, \ldots, x^{(m)} \right) \left( y^{(1)}, \ldots, y^{(m)} \right)^T + \frac{2}{m} \left( x^{(1)}, \ldots, x^{(m)} \right) \left( x^{(1)}, \ldots, x^{(m)} \right)^T \theta = 0

Define X = (x^{(1)}, x^{(2)}, \ldots, x^{(m)}) and y = (y^{(1)}, y^{(2)}, \ldots, y^{(m)})^T; the gradient becomes

    \frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} X y + \frac{2}{m} X X^T \theta = 0

    \Rightarrow \hat{\theta} = (X X^T)^{-1} X y

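A sketch of the closed-form solution (assuming, as above, that X is the (d+1) x m matrix of augmented feature columns and y the length-m target vector); np.linalg.solve is applied to the normal equations rather than forming the inverse explicitly:

import numpy as np

def lms_closed_form(X, y):
    # Solve (X X^T) theta = X y for theta.
    return np.linalg.solve(X @ X.T, X @ y)

# Toy usage with the apartment data from the earlier slide.
X = np.array([[1.0,   1.0,   1.0,   1.0],    # augmented constant feature
              [230.0, 506.0, 433.0, 109.0],  # living area
              [1.0,   2.0,   2.0,   1.0]])   # bedrooms
y = np.array([600.0, 1000.0, 1100.0, 500.0])

theta_hat = lms_closed_form(X, y)
print(theta_hat)       # fitted parameters (theta_0, theta_1, theta_2)
print(theta_hat @ X)   # fitted rents for the training examples
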
Alternative way of obtaining \hat{\theta}
The matrix inversion in \hat{\theta} = (X X^T)^{-1} X y can be very expensive to compute.

    \frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right) x^{(i)}

Gradient descent:

    \hat{\theta}^{(t+1)} \leftarrow \hat{\theta}^{(t)} + \frac{\alpha}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{\theta}^{(t)T} x^{(i)} \right) x^{(i)}

Stochastic gradient descent (use one data point at a time):

    \hat{\theta}^{(t+1)} \leftarrow \hat{\theta}^{(t)} + \beta_t \left( y^{(i)} - \hat{\theta}^{(t)T} x^{(i)} \right) x^{(i)}

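A minimal sketch of both update rules (not part of the slides); the step sizes alpha and beta_t, the decay schedule, and the iteration counts below are illustrative choices rather than values prescribed in the lecture, and in practice the step size has to be tuned to the scale of the features.

import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    # Batch gradient descent: use all m points in every update.
    d, m = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        residual = y - theta @ X                  # y^(i) - theta^T x^(i), for all i
        theta = theta + (alpha / m) * (X @ residual)
    return theta

def stochastic_gradient_descent(X, y, beta0=0.1, num_iters=10000, seed=0):
    # SGD: update with one randomly chosen data point at a time,
    # using a decaying step size beta_t = beta0 / (1 + t).
    rng = np.random.default_rng(seed)
    d, m = X.shape
    theta = np.zeros(d)
    for t in range(num_iters):
        i = rng.integers(m)
        beta_t = beta0 / (1.0 + t)
        theta = theta + beta_t * (y[i] - theta @ X[:, i]) * X[:, i]
    return theta
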
A recap:
Stochastic gradient update rule

    \hat{\theta}^{(t+1)} \leftarrow \hat{\theta}^{(t)} + \beta_t \left( y^{(i)} - \hat{\theta}^{(t)T} x^{(i)} \right) x^{(i)}

Pros: on-line, low per-step cost
Cons: noisy updates, may converge slowly

Gradient descent

    \hat{\theta}^{(t+1)} \leftarrow \hat{\theta}^{(t)} + \frac{\alpha}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{\theta}^{(t)T} x^{(i)} \right) x^{(i)}

Pros: fast-converging, easy to implement
Cons: needs to read all of the data at every step

Solve the normal equations

    (X X^T) \hat{\theta} = X y

Pros: a single-shot algorithm, the easiest to implement
Cons: needs the inverse (X X^T)^{-1}, which is expensive to compute and can run into numerical issues (e.g., when the matrix is singular)

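When X X^T is ill-conditioned or singular, a least-squares solver is a more robust alternative to forming the inverse; a sketch, reusing the toy apartment data from above:

import numpy as np

X = np.array([[1.0,   1.0,   1.0,   1.0],
              [230.0, 506.0, 433.0, 109.0],
              [1.0,   2.0,   2.0,   1.0]])
y = np.array([600.0, 1000.0, 1100.0, 500.0])

# np.linalg.lstsq solves min_theta ||X^T theta - y||^2 directly (SVD-based),
# sidestepping an explicit inverse of X X^T.
theta_hat, residuals, rank, singular_values = np.linalg.lstsq(X.T, y, rcond=None)
print(theta_hat)
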
Geometric Interpretation of LMS
The predictions on the training data are:

    \hat{y} = X^T \hat{\theta} = X^T (X X^T)^{-1} X y

Look at the residual \hat{y} - y:

    \hat{y} - y = \left( X^T (X X^T)^{-1} X - I \right) y

    X (\hat{y} - y) = X \left( X^T (X X^T)^{-1} X - I \right) y = 0

So \hat{y} is the orthogonal projection of y onto the space spanned by the columns of X^T.

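A quick numerical check of this orthogonality property (a sketch with the same toy data; not part of the slides):

import numpy as np

X = np.array([[1.0,   1.0,   1.0,   1.0],
              [230.0, 506.0, 433.0, 109.0],
              [1.0,   2.0,   2.0,   1.0]])
y = np.array([600.0, 1000.0, 1100.0, 500.0])

theta_hat = np.linalg.solve(X @ X.T, X @ y)
y_hat = X.T @ theta_hat        # predictions on the training data
print(X @ (y_hat - y))         # approximately the zero vector: the residual is
                               # orthogonal to every row of X
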
Probabilistic Interpretation of LMS
Assume y is linear in x plus noise \epsilon:

    y = \theta^T x + \epsilon

Assume \epsilon follows a Gaussian distribution N(0, \sigma^2). Then

    p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right)

By the independence assumption, the likelihood is

    L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^m \exp\left( -\frac{\sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right)

Probabilistic Interpretation of LMS, cont.
Hence the log-likelihood is:

    \log L(\theta) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

Do you recognize the last term?

    LMS: \; \frac{1}{m} \sum_{i} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

Thus, under the independence assumption and the Gaussian noise assumption, LMS is equivalent to MLE of \theta!

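Spelling out the step (not stated explicitly on the slide): the first term of the log-likelihood does not depend on \theta and the factor -1/(2\sigma^2) is a negative constant, so

    \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \log L(\theta)
        = \arg\min_{\theta} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
        = \arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
        = \hat{\theta}_{\mathrm{LMS}}.
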
Nonlinear regression
Want to fit a polynomial regression model:

    y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_n x^n + \epsilon

Let \tilde{x} = (1, x, x^2, \ldots, x^n)^T and \theta = (\theta_0, \theta_1, \theta_2, \ldots, \theta_n)^T; then

    y = \theta^T \tilde{x} + \epsilon

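As a small illustration (not from the slides), np.vander builds exactly this feature vector \tilde{x} = (1, x, x^2, \ldots, x^n)^T:

import numpy as np

x, n = 2.0, 3
x_tilde = np.vander([x], N=n + 1, increasing=True)[0]
print(x_tilde)   # [1. 2. 4. 8.]  i.e. (1, x, x^2, x^3)
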
Least mean square method
Given m data points, find \theta that minimizes the mean square error:

    \hat{\theta} = \arg\min_\theta L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T \tilde{x}^{(i)} \right)^2

Our usual trick: set the gradient to 0 and solve for the parameter:

    \frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T \tilde{x}^{(i)} \right) \tilde{x}^{(i)} = 0

    \Leftrightarrow \; -\frac{2}{m} \sum_{i=1}^{m} y^{(i)} \tilde{x}^{(i)} + \frac{2}{m} \sum_{i=1}^{m} \tilde{x}^{(i)} \tilde{x}^{(i)T} \theta = 0

Matrix version of the gradient
Define \tilde{X} = (\tilde{x}^{(1)}, \tilde{x}^{(2)}, \ldots, \tilde{x}^{(m)}) and y = (y^{(1)}, y^{(2)}, \ldots, y^{(m)})^T; the gradient becomes

    \frac{\partial L(\theta)}{\partial \theta} = -\frac{2}{m} \tilde{X} y + \frac{2}{m} \tilde{X} \tilde{X}^T \theta = 0

    \Rightarrow \hat{\theta} = (\tilde{X} \tilde{X}^T)^{-1} \tilde{X} y

Note that \tilde{x} = (1, x, x^2, \ldots, x^n)^T.

If we choose a different maximal degree n for the polynomial, the solution will be different.

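A sketch of fitting polynomials of several maximal degrees with this same normal-equation recipe; the synthetic data, the degrees tried, and the noise level are illustrative choices, not part of the lecture:

import numpy as np

def fit_polynomial(x, y, degree):
    # Columns of X_tilde are the feature vectors (1, x_i, x_i^2, ..., x_i^degree)^T;
    # solve (X_tilde X_tilde^T) theta = X_tilde y.
    X_tilde = np.vander(x, N=degree + 1, increasing=True).T   # (degree+1, m)
    return np.linalg.solve(X_tilde @ X_tilde.T, X_tilde @ y)

# Synthetic one-dimensional data (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# Different maximal degrees give different solutions and different fits.
for n in (1, 3, 9):
    theta_hat = fit_polynomial(x, y, n)
    y_hat = np.vander(x, N=n + 1, increasing=True) @ theta_hat
    print(n, np.mean((y - y_hat) ** 2))   # training MSE typically shrinks as the degree grows
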
Example: head acceleration in an accident

[Figure: head-acceleration example data illustrating nonlinear regression]
