
Linear Regression

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made
their course materials freely available online. Feel free to reuse or adapt these slides for your own academic
purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Regression
Given:
– Data $X = \left\{ x^{(1)}, \ldots, x^{(n)} \right\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \left\{ y^{(1)}, \ldots, y^{(n)} \right\}$ where $y^{(i)} \in \mathbb{R}$
[figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear regression and quadratic regression fits]
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
– 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
– lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert


Linear Regression
• Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$ (assume $x_0 = 1$)
• Fit model by minimizing the sum of squared errors

Figures are courtesy of Greg Shakhnarovich


Least Squares Linear Regression
• Cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
• Fit by solving $\min_\theta J(\theta)$
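For concreteness, here is a minimal NumPy sketch of this cost function (the names `cost`, `X`, and `y` are illustrative, not from the slides); it assumes the design matrix X already includes the column of ones for $x_0 = 1$.

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/(2n) * sum((X @ theta - y)^2).

    X : (n, d+1) design matrix whose first column is all ones (x_0 = 1)
    y : (n,) vector of labels
    theta : (d+1,) parameter vector
    """
    n = X.shape[0]
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i
    return (residuals @ residuals) / (2 * n)

# Tiny example: points (1, 1), (2, 2), (3, 3) and theta = [0, 0.5]
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 0.5]), X, y))   # ~0.58, matching the slide example below
```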
Intuition Behind Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

For insight on $J(\theta)$, let's assume $x^{(i)} \in \mathbb{R}$ so $\theta = [\theta_0, \theta_1]$

Based on example by Andrew Ng
Intuition Behind Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

For insight on $J(\theta)$, let's assume $x^{(i)} \in \mathbb{R}$ so $\theta = [\theta_0, \theta_1]$

(for fixed $\theta$, $h_\theta(x)$ is a function of x)          (function of the parameter $\theta$)
[figure: training points in the x–y plane (left); the corresponding value of $J(\theta)$ (right)]

Based on example by Andrew Ng
Intuition Behind Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

For insight on $J(\theta)$, let's assume $x^{(i)} \in \mathbb{R}$ so $\theta = [\theta_0, \theta_1]$

(for fixed $\theta$, $h_\theta(x)$ is a function of x)          (function of the parameter $\theta$)
[figure: data points (1, 1), (2, 2), (3, 3) with the line $h_\theta(x) = 0.5x$ (left); the resulting point on the $J(\theta)$ curve (right)]

Based on example by Andrew Ng:
$$J([0, 0.5]) = \frac{1}{2 \cdot 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$$
Intuition Behind Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

For insight on $J(\theta)$, let's assume $x^{(i)} \in \mathbb{R}$ so $\theta = [\theta_0, \theta_1]$

(for fixed $\theta$, $h_\theta(x)$ is a function of x)          (function of the parameter $\theta$)
[figure: the flat line $h_\theta(x) = 0$ plotted against the data (left); the point $J([0, 0]) \approx 2.333$ on the cost curve (right)]

$J(\theta)$ is convex

Based on example by Andrew Ng
Intuition Behind Cost Function

(for fixed $\theta$, $h_\theta(x)$ is a function of x)          (function of the parameters $\theta$)
[figure sequence: candidate hypotheses plotted against the data (left), each paired with its location on the contour plot of $J(\theta_0, \theta_1)$ (right)]

Slides by Andrew Ng
Basic Search Procedure
• Choose an initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[figure sequence: the surface $J(\theta_0, \theta_1)$, with successive steps descending toward a minimum]

Figures by Andrew Ng
Gradient Descent
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
  where α is the learning rate (small), e.g., α = 0.05

[figure: $J(\theta)$ plotted against $\theta$, with gradient steps moving downhill toward the minimum]
Gradient Descent
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

For linear regression:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^{\!2} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$
Gradient Descent for Linear Regression
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
• To achieve a simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop
• Assume convergence when $\left\| \theta_{\text{new}} - \theta_{\text{old}} \right\|_2 < \epsilon$
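The following is a minimal NumPy sketch of this loop under the update rule above (names such as `gradient_descent`, `alpha`, and `eps` are illustrative, not from the slides); it computes all predictions once per iteration and updates every $\theta_j$ simultaneously.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    """Batch gradient descent for least-squares linear regression.

    X : (n, d+1) design matrix with a leading column of ones (x_0 = 1)
    y : (n,) label vector
    """
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(max_iters):
        predictions = X @ theta                 # h_theta(x^(i)) for all i, computed once
        gradient = X.T @ (predictions - y) / n  # 1/n * sum_i (h - y) * x_j^(i), all j at once
        theta_new = theta - alpha * gradient    # simultaneous update of every theta_j
        if np.linalg.norm(theta_new - theta) < eps:   # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta
```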
Gradient Descent

(for fixed $\theta$, $h_\theta(x)$ is a function of x)          (function of the parameters $\theta$)
[figure sequence: the fitted line, starting near $h(x) = -900 - 0.1x$, improving over successive gradient descent iterations as $\theta$ moves across the contour plot of $J(\theta)$ toward its minimum]

Slides by Andrew Ng
Choosing α
• α too small: slow convergence
• α too large: increasing value of $J(\theta)$
  – May overshoot the minimum
  – May fail to converge
  – May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration (see the sketch below)
• The value should decrease at each iteration
• If it doesn't, adjust α
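As a sketch of this diagnostic (again assuming the `X`, `y`, and `cost` conventions from the earlier sketches, which are illustrative rather than part of the slides), one can print $J(\theta)$ at every iteration and check that it decreases:

```python
# Monitor J(theta) during gradient descent to check that alpha is reasonable.
# Assumes X, y, and cost(theta, X, y) as defined in the earlier sketches.
import numpy as np

theta = np.zeros(X.shape[1])
alpha = 0.05
for it in range(100):
    theta -= alpha * X.T @ (X @ theta - y) / X.shape[0]
    print(f"iter {it:3d}  J(theta) = {cost(theta, X, y):.6f}")  # should decrease each iteration
```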
Extending Linear Regression to
More Complex Models
• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: $x_3 = x_1 \times x_2$

This allows the use of linear regression techniques to fit non-linear datasets, as in the sketch below.
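As one illustration of the polynomial case (function and variable names are my own, not from the slides), the following sketch builds a polynomial design matrix and fits it with ordinary least squares:

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Map a 1-D input x to rows [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

# Fit y = theta_0 + theta_1 x + theta_2 x^2 + theta_3 x^3 to noisy cubic data.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.2, size=x.shape)

X_poly = polynomial_design_matrix(x, degree=3)
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)   # ordinary least squares on the expanded inputs
print(theta)   # approximately [1, -2, 0, 0.5]
```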
Linear Basis Function Models
• Generally,
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
  where the $\phi_j(x)$ are basis functions

• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias

• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)


Linear Basis Function Models
• Polynomial basis functions:
  – These are global; a small change in x affects all basis functions

• Gaussian basis functions:
  – These are local; a small change in x only affects nearby basis functions. $\mu_j$ and s control location and scale (width).

Based on slide by Christopher Bishop (PRML)
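Below is a minimal sketch of a Gaussian basis expansion, assuming the common form $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$ used in Bishop's PRML; the function name and choice of centers are illustrative, not from the slides.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Gaussian basis expansion: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).

    Returns an (n, len(centers) + 1) design matrix with a leading bias column.
    """
    x = np.asarray(x).reshape(-1, 1)          # shape (n, 1)
    mu = np.asarray(centers).reshape(1, -1)   # shape (1, m)
    phi = np.exp(-((x - mu) ** 2) / (2 * s**2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])   # phi_0(x) = 1 acts as the bias

# Example: 9 Gaussian bumps spread over [0, 1] with width s = 0.1
Phi = gaussian_basis(np.linspace(0, 1, 20), centers=np.linspace(0, 1, 9), s=0.1)
print(Phi.shape)   # (20, 10)
```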
Linear Basis Function Models
• Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
  – These are also local; a small change in x only affects nearby basis functions. $\mu_j$ and s control location and scale (slope).

Based on slide by Christopher Bishop (PRML)


Example of Fitting a Polynomial Curve
with a Linear Model

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$
Linear Basis Function Models
• Basic linear model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$$

• Generalized linear model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$

• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton
Linear Algebra Concepts
• A vector in $\mathbb{R}^d$ is an ordered set of d real numbers
  – e.g., $v = [1, 6, 3, 4]$ is in $\mathbb{R}^4$
  – "[1, 6, 3, 4]" is a column vector: $\begin{bmatrix} 1 \\ 6 \\ 3 \\ 4 \end{bmatrix}$
  – as opposed to a row vector: $\begin{bmatrix} 1 & 6 & 3 & 4 \end{bmatrix}$

• An m-by-n matrix is an object with m rows and n columns, e.g., the 3-by-3 matrix $\begin{bmatrix} 1 & 32 & 8 \\ 4 & 78 & 6 \\ 9 & 43 & 2 \end{bmatrix}$

Based on slides by Joseph Bradley
Linear Algebra Concepts
• Transpose: reflect a vector/matrix across the diagonal:
$$\begin{bmatrix} a \\ b \end{bmatrix}^{\!\top} = \begin{bmatrix} a & b \end{bmatrix}, \qquad \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{\!\top} = \begin{bmatrix} a & c \\ b & d \end{bmatrix}$$
  – Note: $(Ax)^\top = x^\top A^\top$ (we'll define multiplication soon...)

• Vector norms:
  – The $L_p$ norm of $v = (v_1, \ldots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: $L_1$, $L_2$
  – $L_\infty = \max_i |v_i|$
• The length of a vector v is $L_2(v)$

Based on slides by Joseph Bradley
Linear Algebra Concepts

• Vector dot product: $u \cdot v = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \cdot \begin{bmatrix} v_1 & v_2 \end{bmatrix} = u_1 v_1 + u_2 v_2$
  – Note: the dot product of u with itself is $\text{length}(u)^2 = \|u\|_2^2$

• Matrix product:
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$
$$AB = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix}$$

Based on slides by Joseph Bradley

Linear Algebra Concepts
• Vector products:
  – Dot product: $u \cdot v = u^\top v = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = u_1 v_1 + u_2 v_2$
  – Outer product: $u v^\top = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \begin{bmatrix} v_1 & v_2 \end{bmatrix} = \begin{bmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{bmatrix}$

Based on slides by Joseph Bradley

Vectorization
• Benefits of vectorization
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model:
$$h(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \qquad x^\top = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$$
• Can write the model in vectorized form as $h_\theta(x) = \theta^\top x$
Vectorization
• Consider our model for n instances:
$$h_\theta\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}
\qquad
X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times (d+1)}$$
• Can write the model in vectorized form as $h_\theta(x) = X\theta$


Vectorization
• For the linear regression cost function:
$$
\begin{aligned}
J(\theta) &= \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 \\
&= \frac{1}{2n} (X\theta - y)^\top (X\theta - y)
\end{aligned}
$$
  where $X \in \mathbb{R}^{n \times (d+1)}$, $\theta \in \mathbb{R}^{(d+1) \times 1}$, and
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
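To make the equivalence concrete, here is a brief check (illustrative, not from the slides) that the summation form and the matrix form of $J(\theta)$ compute the same value, using the same design-matrix conventions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # leading column of ones
y = rng.normal(size=n)
theta = rng.normal(size=d + 1)

J_loop = sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)   # summation form
J_vec = (X @ theta - y) @ (X @ theta - y) / (2 * n)                    # (X theta - y)^T (X theta - y) / (2n)
print(np.isclose(J_loop, J_vec))   # True
```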
Closed Form Solution
• Instead of using GD, solve for the optimal $\theta$ analytically
  – Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
$$
\begin{aligned}
J(\theta) &= \frac{1}{2n} (X\theta - y)^\top (X\theta - y) \\
&\propto \theta^\top X^\top X \theta - y^\top X \theta - \theta^\top X^\top y + y^\top y \\
&\propto \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y
\end{aligned}
$$
Take the derivative, set it equal to 0, then solve for $\theta$:
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \left( \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y \right) &= 0 \\
(X^\top X)\theta - X^\top y &= 0 \\
(X^\top X)\theta &= X^\top y
\end{aligned}
$$
Closed Form Solution: $\theta = (X^\top X)^{-1} X^\top y$
Closed Form Solution
• Can obtain $\theta$ by simply plugging X and y into
$$\theta = (X^\top X)^{-1} X^\top y, \qquad
X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix}, \qquad
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
• If $X^\top X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
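A minimal sketch of this computation, assuming the same design-matrix conventions as before; it uses the pseudo-inverse mentioned above, so it also tolerates a singular $X^\top X$.

```python
import numpy as np

def closed_form_solution(X, y):
    """Solve theta = (X^T X)^{-1} X^T y via the pseudo-inverse.

    X : (n, d+1) design matrix with a leading column of ones
    y : (n,) label vector
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Equivalent and numerically preferable: np.linalg.lstsq(X, y, rcond=None)[0]
```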
Gradient Descent vs Closed Form
Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if n or d is large
  – Forming $X^\top X$ costs O(nd²), and inverting the resulting (d+1)×(d+1) matrix is roughly O(d³)
Improving Learning:
Feature Scaling
• Idea: ensure that features have similar scales

[figure: contours of $J(\theta_1, \theta_2)$ before feature scaling (elongated ellipses) and after feature scaling (roughly circular)]

• Makes gradient descent converge much faster
Feature Standardization
• Rescales features to have zero mean and unit variance
  – Let $\mu_j$ be the mean of feature j: $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$
  – Replace each value with:
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j} \quad \text{for } j = 1 \ldots d \text{ (not } x_0\text{!)}$$
    • $s_j$ is the standard deviation of feature j
    • Could also use the range of feature j ($\max_j - \min_j$) for $s_j$

• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
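A short sketch of this transformation (the helper names are illustrative, not from the slides); note how the statistics computed on the training set are reused at prediction time, as required above.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on the training set.

    Assumes X_train excludes the bias column x_0 (which must not be rescaled).
    """
    mu = X_train.mean(axis=0)
    s = X_train.std(axis=0)
    return mu, s

def standardize(X, mu, s):
    """Apply x_j <- (x_j - mu_j) / s_j using training-set statistics."""
    return (X - mu) / s

# Usage: fit on the training data, then apply the same mu, s to test data.
# mu, s = fit_standardizer(X_train)
# X_train_std = standardize(X_train, mu, s)
# X_test_std = standardize(X_test, mu, s)
```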
Quality of Fit

[figure: three fits of productivity vs. time spent — underfitting (high bias), correct fit, and overfitting (high variance)]

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples

Based on example by Andrew Ng


Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of $\theta_j$
  – Can incorporate this penalty into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label

• Can also address overfitting by eliminating features (either manually or via model selection)
Regularization
• Linear regression objective function:
$$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$$
  – λ is the regularization parameter (λ ≥ 0)
  – No regularization on $\theta_0$!
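A minimal sketch of this regularized cost (illustrative names, same design-matrix conventions as before); `theta[0]` is excluded from the penalty, matching the note above.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2n) * sum((X theta - y)^2) + (lam/2) * sum_{j>=1} theta_j^2."""
    n = X.shape[0]
    residuals = X @ theta - y
    penalty = (lam / 2) * np.sum(theta[1:] ** 2)   # no regularization on theta_0
    return residuals @ residuals / (2 * n) + penalty
```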
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

• Note that $\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$
  – This is the squared magnitude of the feature coefficient vector!

• We can also think of this as:
$$\sum_{j=1}^{d} (\theta_j - 0)^2 = \|\theta_{1:d} - \vec{0}\|_2^2$$

• L2 regularization pulls the coefficients toward 0
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

• What happens as $\lambda \to \infty$?

$$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$
[figure: a wiggly degree-4 fit of productivity vs. time spent on work]
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

• What happens as $\lambda \to \infty$? The penalty drives $\theta_1, \ldots, \theta_4$ toward 0, leaving only $\theta_0$:

$$\theta_0 + \underbrace{\theta_1}_{\approx 0} x + \underbrace{\theta_2}_{\approx 0} x^2 + \underbrace{\theta_3}_{\approx 0} x^3 + \underbrace{\theta_4}_{\approx 0} x^4$$
[figure: the fit flattens toward the constant $\theta_0$ (productivity vs. time spent on work)]
Regularized Linear Regression
• Cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Fit by solving $\min_\theta J(\theta)$
• Gradient update:
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_0^{(i)} \qquad \left( \tfrac{\partial}{\partial \theta_0} J(\theta) \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j \qquad \left( \tfrac{\partial}{\partial \theta_j} J(\theta) \text{; the last term comes from the regularization} \right)$$
Regularized Linear Regression

• We can rewrite the gradient step as:
$$\theta_j \leftarrow \theta_j (1 - \alpha\lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
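A brief sketch of this rewritten update (illustrative names, same conventions as the earlier gradient descent sketch); the bias term $\theta_0$ is deliberately left unshrunk.

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One regularized gradient step: theta_j <- theta_j * (1 - alpha*lam) - alpha * grad_j.

    theta_0 is updated without the (1 - alpha*lam) shrinkage factor.
    """
    n = X.shape[0]
    gradient = X.T @ (X @ theta - y) / n
    shrink = np.full_like(theta, 1.0 - alpha * lam)
    shrink[0] = 1.0                       # no regularization on theta_0
    return shrink * theta - alpha * gradient
```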
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
$$\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^\top y$$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for λ > 0, the inverse in the equation above always exists
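A minimal sketch of this regularized closed form (illustrative names); the identity-like matrix has a zero in its top-left entry so that $\theta_0$ is not penalized.

```python
import numpy as np

def regularized_closed_form(X, y, lam):
    """theta = (X^T X + lam * D)^{-1} X^T y, where D is the identity with D[0, 0] = 0."""
    d_plus_1 = X.shape[1]
    D = np.eye(d_plus_1)
    D[0, 0] = 0.0                          # do not regularize the bias term theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)
```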
