04 LinearRegression
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made
their course materials freely available online. Feel free to reuse or adapt these slides for your own academic
purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Regression
Given:
– Data $X = x^{(1)}, \ldots, x^{(n)}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = y^{(1)}, \ldots, y^{(n)}$ where $y^{(i)} \in \mathbb{R}$
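This setup maps directly to array shapes; a tiny illustrative sketch in NumPy (the data values and variable names are made up):

```python
import numpy as np

# Hypothetical toy dataset: n = 4 examples, d = 2 features each.
X = np.array([[2.0, 1.0],
              [1.5, 3.0],
              [0.5, 2.5],
              [3.0, 0.5]])          # shape (n, d): one row per example x^(i) in R^d
y = np.array([3.1, 4.4, 2.9, 3.6])  # shape (n,): one real-valued label y^(i) per example

n, d = X.shape
print(n, d)  # 4 2
```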
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear and quadratic regression fits]
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
– 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
– lpsa: log(prostate specific antigen level)
• Fit the model by minimizing the sum of squared errors
• Assume $x_0 = 1$
Intuition Behind Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
For insight on $J(\theta)$, let's assume $x^{(i)} \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$
Based on example by Andrew Ng
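A minimal NumPy sketch of this cost, assuming the design matrix already contains the $x_0 = 1$ column and $h_\theta(x) = \theta^\top x$ (the function name is illustrative):

```python
import numpy as np

def cost_J(theta, X, y):
    """Squared-error cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2.

    X     : (n, d+1) design matrix whose first column is all ones (x_0 = 1)
    theta : (d+1,) parameter vector
    y     : (n,) labels
    """
    n = X.shape[0]
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
    return (residuals ** 2).sum() / (2 * n)
```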
Intuition Behind Cost Function
(for fixed $\theta$, $h_\theta(x)$ is a function of $x$)   ($J$ is a function of the parameter $\theta$)
[Figure: left, data points and the hypothesis $h_\theta(x)$ vs. $x$; right, $J(\theta_1)$ vs. $\theta_1$]
Based on example by Andrew Ng
Intuition Behind Cost Function
(for fixed $\theta$, $h_\theta(x)$ is a function of $x$)   ($J$ is a function of the parameter $\theta$)
[Figure: left, data and the hypothesis $h_\theta(x)$; right, the corresponding value of $J(\theta_1)$]
$$\frac{1}{2 \cdot 3}\left[(0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2\right] \approx 0.58$$
Based on example by Andrew Ng
Intuition Behind Cost Function
(for fixed $\theta$, $h_\theta(x)$ is a function of $x$)   ($J$ is a function of the parameter $\theta$)
$$J([0, 0]) \approx 2.333$$
$J(\theta)$ is convex
[Figure: left, data and the hypothesis for $\theta = [0, 0]$; right, $J(\theta_1)$ vs. $\theta_1$]
Based on example by Andrew Ng
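Both values can be checked numerically; a small sketch, assuming the three example points are (1, 1), (2, 2), (3, 3) as implied by the arithmetic above:

```python
import numpy as np

# Toy data from the example: x^(i) in {1, 2, 3}, y^(i) = x^(i).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # first column is x_0 = 1
y = np.array([1.0, 2.0, 3.0])

def cost_J(theta, X, y):
    n = X.shape[0]
    r = X @ theta - y
    return (r ** 2).sum() / (2 * n)

print(cost_J(np.array([0.0, 0.5]), X, y))  # ~0.583 (the 0.58 above)
print(cost_J(np.array([0.0, 0.0]), X, y))  # ~2.333
print(cost_J(np.array([0.0, 1.0]), X, y))  # 0.0 at the minimum, theta_1 = 1
```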
Intuition Behind Cost Function
[Figures illustrating $J(\theta)$; slides by Andrew Ng]
Basic Search Procedure
• Choose an initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$
[Figure: surface plot of $J(\theta)$; figure by Andrew Ng]
Gradient Descent
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
where $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$
[Figure: $J(\theta)$ vs. $\theta$]
Gradient Descent
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
For linear regression:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
$$= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^{\!2}$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$$
Gradient Descent for Linear Regression
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop
• Assume convergence when $\left\| \theta_{\text{new}} - \theta_{\text{old}} \right\|_2 < \epsilon$
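Putting the update rule and the convergence test together, a minimal NumPy sketch of batch gradient descent for linear regression (function names and defaults are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    """Batch gradient descent for linear regression.

    X     : (n, d+1) design matrix with a leading column of ones (x_0 = 1)
    y     : (n,) labels
    alpha : learning rate
    eps   : convergence threshold on ||theta_new - theta_old||_2
    """
    n, D = X.shape
    theta = np.zeros(D)                     # initialize theta
    for _ in range(max_iters):
        h = X @ theta                       # compute h_theta(x^(i)) once per iteration
        grad = (X.T @ (h - y)) / n          # 1/n * sum_i (h_theta(x^(i)) - y^(i)) x_j^(i)
        theta_new = theta - alpha * grad    # simultaneous update for j = 0..d
        if np.linalg.norm(theta_new - theta) < eps:
            return theta_new
        theta = theta_new
    return theta

# Example on the toy data (1,1), (2,2), (3,3): should approach theta = [0, 1].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(X, y))
```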
Gradient Descent
[Figures showing gradient descent in action; slides by Andrew Ng]
Choosing α
• α too small: slow convergence
• α too large: increasing values of $J(\theta)$
Linear Basis Function Models
• Basic linear model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Generalized linear model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
• Polynomial basis functions ($\phi_j(x) = x^j$):
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$
• Gaussian basis functions
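As an illustration of the generalized linear model with polynomial basis functions, a small sketch (the helper name is mine, not from the slides):

```python
import numpy as np

def polynomial_features(x, p):
    """Map a scalar input x to [x^0, x^1, ..., x^p], so that
    h_theta(x) = sum_j theta_j * phi_j(x) with phi_j(x) = x^j."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x ** j for j in range(p + 1)]).T   # shape (n, p+1)

x = np.array([0.0, 0.5, 1.0, 2.0])
Phi = polynomial_features(x, p=3)         # columns: 1, x, x^2, x^3
theta = np.array([1.0, -2.0, 0.5, 0.1])   # arbitrary coefficients for the demo
print(Phi @ theta)                        # h_theta(x) for each input
```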
Linear Algebra Concepts
• An m-by-n matrix is an object with m rows and n columns, e.g., the 3-by-3 matrix
$$\begin{bmatrix} 1 & 32 & 8 \\ 4 & 78 & 6 \\ 9 & 43 & 2 \end{bmatrix}$$
Based on slides by Joseph Bradley
• Transpose: reflect vector/matrix on line:
a a
T
a b a c
Tb c d b d
b
– Note: (Ax )T =x T A T (We’ll define multiplication
soon…)
• Vector norms: X
! 1
p
|vi|p
– Lp norm of v = (v1, i
– Common
…,v k) is
norms: L1 , 2
L
– Linfinity = maxi |vi|
• Length of a vector v is
Based on slides by Joseph Bradley
L2(v)
Linear Algebra Concepts
• Matrix product:
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$
$$AB = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix}$$
Based on slides by Joseph Bradley
Linear Algebra Concepts
• Vector products:
  – Dot product: $u \cdot v = u^\top v = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = u_1 v_1 + u_2 v_2$
  – Outer product: $u v^\top = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \begin{bmatrix} v_1 & v_2 \end{bmatrix} = \begin{bmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{bmatrix}$
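These operations map one-to-one onto NumPy; a quick sketch with made-up values:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

print(A.T)                                       # transpose
print(np.allclose((A @ v).T, v.T @ A.T))         # (Ax)^T = x^T A^T  -> True
print(np.linalg.norm(v, 1),
      np.linalg.norm(v, 2),
      np.linalg.norm(v, np.inf))                 # L1, L2, L-infinity norms
print(A @ B)                                     # matrix product
print(u @ v)                                     # dot product u^T v
print(np.outer(u, v))                            # outer product u v^T
```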
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2n} (X\theta - y)^\top (X\theta - y)$$
Let $X \in \mathbb{R}^{n \times (d+1)}$ be the matrix whose rows are the inputs $x^{(i)}$, $\theta \in \mathbb{R}^{(d+1) \times 1}$, and
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}$$
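The equivalence of the summation and matrix forms is easy to sanity-check; a small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # (n, d+1), first column x_0 = 1
y = rng.normal(size=n)
theta = rng.normal(size=d + 1)

J_sum = sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)  # summation form
r = X @ theta - y
J_vec = (r @ r) / (2 * n)                                            # (1/2n)(X theta - y)^T (X theta - y)
print(np.isclose(J_sum, J_vec))   # True
```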
Closed Form Solution
• Instead of using GD, solve for the optimal $\theta$ analytically
  – Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
$$J(\theta) = \frac{1}{2n} (X\theta - y)^\top (X\theta - y)$$
$$\propto \theta^\top X^\top X \theta - y^\top X \theta - \theta^\top X^\top y + y^\top y$$
$$\propto \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y$$
Take the derivative, set it equal to 0, then solve for $\theta$:
$$\frac{\partial}{\partial \theta} \left( \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y \right) = 0$$
$$(X^\top X)\theta - X^\top y = 0$$
$$(X^\top X)\theta = X^\top y$$
Closed form solution: $\theta = (X^\top X)^{-1} X^\top y$
Closed Form Solution
• Can obtain $\theta$ by simply plugging $X$ and $y$ into $\theta = (X^\top X)^{-1} X^\top y$, where
$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \cdots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
• If $X^\top X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
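A minimal sketch of the closed form solution, using the pseudo-inverse mentioned above so a singular $X^\top X$ does not break it:

```python
import numpy as np

def fit_closed_form(X, y):
    """theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse for robustness.

    X : (n, d+1) design matrix with a leading column of ones
    y : (n,) labels
    """
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Toy check on the data (1,1), (2,2), (3,3): exact fit is theta = [0, 1].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(fit_closed_form(X, y))   # approximately [0. 1.]
```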
Gradient Descent vs. Closed Form
Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning
Closed Form Solution:
• Non-iterative
• No need for α
• Slow if d is large
  – Computing $(X^\top X)^{-1}$ of the $(d+1) \times (d+1)$ matrix is roughly $O(d^3)$
Improving Learning: Feature Scaling
• Idea: Ensure that features have similar scales
[Figure: contours of $J(\theta)$ over $(\theta_1, \theta_2)$ before and after feature scaling]
• Makes gradient descent converge much faster
Feature Standardization
• Rescales features to have zero mean and unit variance
  – Let $\mu_j$ be the mean of feature $j$: $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$
  – Replace each value with:
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j} \qquad \text{for } j = 1 \ldots d \text{ (not } x_0\text{!)}$$
• $s_j$ is the standard deviation of feature $j$
• Could also use the range of feature $j$ ($\max_j - \min_j$) for $s_j$
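A sketch of this standardization applied to the raw features (before the $x_0 = 1$ column is added); returning $\mu_j$ and $s_j$ lets the same transform be reapplied to test data:

```python
import numpy as np

def standardize(X_raw):
    """Rescale each feature to zero mean and unit variance: (x_j - mu_j) / s_j.

    X_raw : (n, d) matrix of raw features (no column of ones yet, so x_0 is untouched)
    Returns the standardized features plus (mu, s) so the same transform
    can be reapplied to test data.
    """
    mu = X_raw.mean(axis=0)            # mean of each feature j
    s = X_raw.std(axis=0)              # standard deviation of each feature j
    s[s == 0] = 1.0                    # guard against constant features
    return (X_raw - mu) / s, mu, s

X_raw = np.array([[2000.0, 3.0], [1500.0, 2.0], [900.0, 1.0]])
X_std, mu, s = standardize(X_raw)
print(X_std.mean(axis=0), X_std.std(axis=0))   # ~[0, 0] and [1, 1]
```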
Overfitting
[Figure: three fits to Productivity vs. Time Spent data]
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples
Regularization
• Linear regression objective function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
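A minimal NumPy sketch of this regularized cost; note that $\theta_0$ is excluded from the penalty, matching the sum starting at $j = 1$:

```python
import numpy as np

def cost_J_regularized(theta, X, y, lam):
    """J(theta) = 1/(2n) sum_i (h_theta(x^(i)) - y^(i))^2 + (lam/2) sum_{j>=1} theta_j^2."""
    n = X.shape[0]
    r = X @ theta - y
    penalty = (lam / 2.0) * np.sum(theta[1:] ** 2)   # do not regularize theta_0
    return (r @ r) / (2 * n) + penalty
```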
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Note that $\sum_{j=1}^{d} \theta_j^2 = \| \theta_{1:d} \|_2^2$
  – This is the squared magnitude of the feature coefficient vector!
• What happens as λ → 0? What happens as λ → ∞?
[Figure: the fit $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$ to Productivity vs. Time Spent on Work data, for different values of λ]
Regularized Linear Regression
• Cost function (with the regularization term):
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Gradient descent again uses $\frac{\partial}{\partial \theta_j} J(\theta)$, now including the regularization term
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
$$\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^\top y$$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for λ > 0, the inverse in the equation above always exists
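A sketch of this regularized closed form; the λ·diag(0, 1, …, 1) term leaves $\theta_0$ unregularized and makes the matrix invertible for λ > 0:

```python
import numpy as np

def fit_ridge_closed_form(X, y, lam):
    """theta = (X^T X + lam * E)^{-1} X^T y, where E = diag(0, 1, ..., 1).

    X   : (n, d+1) design matrix with a leading column of ones
    lam : regularization strength (> 0 guarantees the inverse exists)
    """
    D = X.shape[1]
    E = np.eye(D)
    E[0, 0] = 0.0                       # do not penalize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * E, X.T @ y)

# Toy check on the data (1,1), (2,2), (3,3).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(fit_ridge_closed_form(X, y, lam=0.1))  # theta_1 shrinks slightly below the unregularized value of 1
```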