Lecture 3

ECE553/653 Neural Networks

Linear Regression

1
Previously …
• Linear regression problem

• Feature maps

• Overfitting and underfitting


– Training/test dataset split
– Capacity of model family
– Bias-variance tradeoff

2
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ that minimizes the MSE:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2$$
Feature Maps
General strategy:
• Model family $F = \{f_\beta\}_\beta$
• Loss function $L(\beta; Z)$

Linear regression with feature map:
• Linear functions over a given feature map $\phi: x \to \mathbb{R}^d$:
$$F = \{f_\beta(x) = \beta^T \phi(x)\}$$
• MSE:
$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T \phi(x_i) \right)^2$$

4
Bias-Variance Tradeoff
• Capacity of a model family captures
“complexity” of data it can fit
– Higher capacity -> more likely to overfit (model
family has high variance)
– Lower capacity -> more likely to underfit (model
family has high bias)
• For linear regression, capacity corresponds to
feature dimension

5
Bias-Variance Tradeoff

6
Bias-Variance Tradeoff

7
Example of Underfitting/Overfitting
• Exploratory Data Analysis

8
Example of Underfitting/Overfitting
• Exploratory Data Analysis

9
Example of Underfitting/Overfitting
• Using 'sklearn.preprocessing.MinMaxScaler' for data normalization (see the sketch below):

$$x \to \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
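A minimal sketch of this normalization with scikit-learn; the small array below is made up purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix (3 examples, 2 features), for illustration only.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 500.0]])

scaler = MinMaxScaler()             # rescales each feature (column) to [0, 1]
X_scaled = scaler.fit_transform(X)  # applies (x - x_min) / (x_max - x_min) per column
print(X_scaled)
```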

10
Example of Underfitting/Overfitting
• Exploratory Data Analysis

11
Example of Underfitting/Overfitting
• Exploratory Data Analysis

12
Example of Underfitting/Overfitting
• Linear regression uses only the first-order features $x$:
$$y = w^T x + b$$
• Polynomial regression uses higher-order combination features $x'$ of $x$:
$$y = w^T x' + b$$
For example, the degree-2 polynomial features of $x = (x_1, x_2)^T$ are $x' = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$ (see the sketch below).
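A short sketch of how such degree-2 features could be generated with scikit-learn; the input point (2, 3) is an arbitrary illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])           # one example with features (x1, x2)
poly = PolynomialFeatures(degree=2)  # includes the bias term 1 by default
x_prime = poly.fit_transform(x)
print(x_prime)  # [[1. 2. 3. 4. 6. 9.]]  i.e. [1, x1, x2, x1^2, x1*x2, x2^2]
```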
13
Example of Underfitting/Overfitting
• We have features $x = (x_1, x_2)$ to predict $y$.
• In linear regression, we have
$$w^* = \arg\min_{w} \left( y - [w_1, w_2] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - b \right)^2.$$
• In polynomial regression, if we use the degree-2 polynomial features of $x$,
$$w^* = \arg\min_{w} \left( y - [w_1, w_2, w_3, \ldots, w_6] \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} - b \right)^2.$$

14
Example of Underfitting/Overfitting

• Degree-1

• Degree-3

• Degree-6

15
Agenda
• Regularization
– Strategy to address bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

16
Recall: Mean Squared Error Loss
• Mean squared error loss for linear regression:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2$$

17
Linear Regression with 𝐿2 Regularization
• Original loss + regularization:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2 + \lambda \cdot \|\beta\|_2^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2 + \lambda \sum_{j=1}^{D} \beta_j^2$$

• $\lambda$ is a hyperparameter that must be tuned (satisfies $\lambda \geq 0$); a sketch using scikit-learn follows below.
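A hedged sketch using scikit-learn's Ridge, which minimizes $\|Y - X\beta\|^2 + \alpha \|\beta\|^2$; alpha plays the role of $\lambda$ here up to the $1/n$ factor on the squared-error term. The synthetic data are for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # synthetic features
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # illustrative "true" coefficients
y = X @ beta_true + 0.1 * rng.normal(size=100)

model = Ridge(alpha=1.0)   # larger alpha "pulls" the coefficients toward zero
model.fit(X, y)
print(model.coef_)
```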

18
Intuition on 𝐿2 Regularization
• The penalty is equivalently the squared $L_2$ norm of $\beta$:

$$\lambda \sum_{j=1}^{D} \beta_j^2 = \lambda \|\beta\|_2^2 = \lambda \|\beta - 0\|_2^2$$

• "Pulling" $\beta$ to zero
– "Pulls" more as $\lambda$ becomes larger
19
Intuition on 𝐿2 Regularization
• Why does it help?
– Encourages “simple” functions
– As 𝜆 → ∞, obtain 𝛽 = 0
– Use 𝜆 to tune bias-variance tradeoff

20
Bias-Variance Tradeoff for Regularization

21
Intuition on 𝐿2 Regularization
[Figure: contours of the original loss and of the $L_2$ penalty in the $(\beta_1, \beta_2)$ plane]
• One point minimizes the original loss (the solution when $\lambda = 0$); another minimizes the regularization term, $\beta = 0$ (the solution as $\lambda \to \infty$).
• The minimizer of the full loss lies between the two; at this point, the gradients of the two terms are equal (and opposite).
• The loss varies greatly in some directions, and those directions are penalized more.
• The tradeoff depends on the choice of $\lambda$.
22
Feature Standardization
• Unregularized linear regression is invariant to
feature scaling
– Suppose we scale $x_{ij} \to 2 x_{ij}$ for all examples
– Without regularization, simply use $\beta_j \to \beta_j / 2$

• Not true for regularized regression
– The penalty term $\beta_j^2$ is scaled by $1/4$

23
Feature Standardization
• Solution: Rescale features to zero mean and
unit variance

• Must use the same transformation during training and for prediction
– Compute the standardization statistics on the training data and reuse them on the test data (see the sketch below)
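A minimal sketch with scikit-learn's StandardScaler, fitting the statistics on the training data only and reusing them on the test data; the synthetic arrays are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))  # synthetic training features
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))   # synthetic test features

scaler = StandardScaler().fit(X_train)   # mean/variance estimated on training data only
X_train_std = scaler.transform(X_train)  # zero mean, unit variance per feature
X_test_std = scaler.transform(X_test)    # reuse the training statistics
```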

24
General Regularization Strategy
• Original loss + regularization:

$$L_{\text{reg}}(\beta; Z) = L(\beta; Z) + \lambda \cdot R(\beta)$$

– Offers a way to express a preference for "simple" functions in the family
– Typically, the regularization term is independent of the data

25
Hyperparameter Tuning
• 𝜆 is a hyperparameter that must be tuned
(satisfies 𝜆 ≥ 0 )
• Naïve strategy: Try a few different candidates
𝜆 and choose the one that minimizes the test
loss
• Problem: We may overfit the test set!
– Major problem if we have more hyperparameters

26
Training/Validation/Test Split
• Goal: Choose best hyperparameter 𝜆
– Can also compare different model families, feature
maps, etc.
• Solution: Optimize $\lambda$ on a validation dataset
– Rule of thumb: 60/20/20 split

Given data 𝑍

Training data 𝑍train Val data 𝑍val Test data 𝑍test


27
Basic Cross Validation Algorithm
• Step 1: Split 𝑍 into 𝑍train, 𝑍val and 𝑍test

Training data 𝑍train Val data 𝑍val Test data 𝑍test

• Step 2: For $t \in \{1, \ldots, h\}$:
• Step 2a: Run linear regression with $Z_{\text{train}}$ and $\lambda_t$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_t)$
• Step 2b: Evaluate the validation loss $L_{\text{val}}^t = L(\hat{\beta}(Z_{\text{train}}, \lambda_t); Z_{\text{val}})$
• Step 3: Use the best $\lambda_t$
• Choose $t' = \arg\min_t L_{\text{val}}^t$, i.e., the candidate with the lowest validation loss
• Re-run linear regression with $Z_{\text{train}}$ and $\lambda_{t'}$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_{t'})$ (see the sketch below)

28
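A rough sketch of this recipe, using Ridge as the regularized linear regression and an arbitrary grid of $\lambda$ candidates; the synthetic data and the 60/20/20 split sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

# Step 1: 60/20/20 split into train / val / test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Step 2: fit on the training set for each candidate lambda, score on validation
lambdas = [0.01, 0.1, 1.0, 10.0]
val_losses = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    val_losses.append(mean_squared_error(y_val, model.predict(X_val)))

# Step 3: refit with the best lambda and report the held-out test loss
best_lam = lambdas[int(np.argmin(val_losses))]
final_model = Ridge(alpha=best_lam).fit(X_train, y_train)
print(best_lam, mean_squared_error(y_test, final_model.predict(X_test)))
```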
Alternative Cross-Validation Algorithms
• If 𝑍 is small, then splitting it can reduce
performance
– Can use 𝑍train and 𝑍val in Step 3

• Alternative: 𝑘-fold cross-validation (e.g., 𝑘=3)


– Split $Z$ into $Z_{\text{train}}$ and $Z_{\text{test}}$
– Split $Z_{\text{train}}$ into $k$ disjoint sets $Z_{\text{val}}^{s}$
– Use the $\lambda'$ that works best on average across folds $s \in \{1, \ldots, k\}$, training on $Z_{\text{train}} \setminus Z_{\text{val}}^{s}$ and validating on $Z_{\text{val}}^{s}$
– Typically chooses a better $\lambda'$ than the above strategy (see the sketch below)
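A possible 3-fold sketch with scikit-learn, scoring Ridge on each held-out fold while training on the remaining folds (a separate $Z_{\text{test}}$ would still be held out in practice); the data and $\lambda$ grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=150)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
lambdas = [0.01, 0.1, 1.0, 10.0]
avg_mse = []
for lam in lambdas:
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    avg_mse.append(-scores.mean())          # average validation MSE across the 3 folds

best_lam = lambdas[int(np.argmin(avg_mse))]
print(best_lam)
```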

29
Example: 3-Fold Cross Validation

30
𝐿1 Regularization
• Can we minimize $\|\beta\|_0 = |\{j : \beta_j \neq 0\}|$?
– That is, the number of nonzero components of $\beta$
– Improves interpretability (automatic feature selection!)
– Also serves as a strong regularizer

• Challenge: $\|\beta\|_0$ is not differentiable, making it hard to optimize

31
Intuition on 𝐿1 Regularization
[Figure: contours of the original loss and of the $L_1$ penalty in the $(\beta_1, \beta_2)$ plane]
• One point minimizes the original loss (the solution when $\lambda = 0$); another minimizes the regularization term (the solution as $\lambda \to \infty$).
• The minimizer of the full loss sits at a corner of the $L_1$ ball, so the solution is sparse ($\beta_1 = 0$)!

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2 + \lambda \sum_{j=1}^{D} |\beta_j|$$

32
𝐿1 Regularization for Feature Selection
• Step 1: Construct a lot of features and add them to the feature map
• Step 2: Use $L_1$-regularized regression to "select" a subset of features
– I.e., coefficient $\beta_j \neq 0$ -> feature $j$ is selected
• Optional: Remove unselected features from the feature map and run vanilla linear regression (see the sketch below)
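A hedged sketch of this selection step using scikit-learn's Lasso ($L_1$-regularized linear regression); the synthetic data, the sparse "true" coefficients, and the value of alpha are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_beta = np.zeros(20)
true_beta[[0, 3, 7]] = [2.0, -1.5, 0.8]   # only 3 of the 20 features actually matter
y = X @ true_beta + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_)    # indices j with beta_j != 0 ("selected" features)
print(selected)
```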

33
Agenda
• Regularization
– Strategy to address bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

34
Minimizing the MSE Loss
• Recall that linear regression minimizes the loss
$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2$$

• Closed-form solution: Compute using matrix


operations
• Optimization-based solution: Search over
candidate 𝛽

35
Vectorizing Linear Regression

$$Y \approx X\beta$$

$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}$$

36
Vectorizing Mean Squared Error
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} f_\beta(x_1) \\ \vdots \\ f_\beta(x_n) \end{bmatrix}$$

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2 = \frac{1}{n} \|Y - X\beta\|_2^2$$

where $\|z\|_2^2 = \sum_{i=1}^{n} z_i^2$.

37
Strategy 1: Closed-Form Solution

• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n} \|Y - X\beta\|_2^2 = \frac{1}{n} \nabla_\beta (Y - X\beta)^T (Y - X\beta)$$
$$= \frac{2}{n} \left( \nabla_\beta (Y - X\beta) \right)^T (Y - X\beta)$$
$$= -\frac{2}{n} X^T (Y - X\beta)$$
$$= -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$

38
Strategy 1: Closed-Form Solution
• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n} \|Y - X\beta\|_2^2 = -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$

• Setting $\nabla_\beta L(\beta; Z) = 0$, we have $X^T X \hat{\beta} = X^T Y$

39
Strategy 1: Closed-Form Solution
• Setting $\nabla_\beta L(\beta; Z) = 0$, we have $X^T X \hat{\beta} = X^T Y$

• Assuming $X^T X$ is invertible, we have

$$\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$$

(a NumPy sketch follows below)
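A minimal NumPy sketch of this closed-form solution on synthetic data; solving the normal equations with np.linalg.solve avoids forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # illustrative "true" coefficients
Y = X @ beta_true + 0.1 * rng.normal(size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # solves X^T X beta = X^T Y
# Alternatively: beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
```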

40
Closed-Form Solution for Vanilla Regression Model

• Given $L(\beta; Z) = \sum_{i=1}^{n} (y_i - x_i \beta_1 - \beta_2)^2$, we have

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

41
Proof
• Obtain $\beta_2$:

$$L(\beta; Z) = \sum_{i=1}^{n} (y_i - x_i \beta_1 - \beta_2)^2$$
$$\Rightarrow \frac{\partial L(\beta; Z)}{\partial \beta_2} = \sum_{i=1}^{n} 2 (y_i - x_i \beta_1 - \beta_2)(-1) = 0$$
$$\Rightarrow \sum_{i=1}^{n} (y_i - \beta_1 x_i - \beta_2) = 0$$
$$\Rightarrow \sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \beta_2 = 0$$
$$\Rightarrow \beta_2 = \frac{1}{n} \sum_{i=1}^{n} y_i - \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i$$
$$\Rightarrow \beta_2 = \bar{y} - \beta_1 \bar{x}$$
42
Proof
• Obtain $\beta_1$:

$$\frac{\partial L(\beta; Z)}{\partial \beta_1} = \sum_{i=1}^{n} 2 (y_i - x_i \beta_1 - \beta_2)(-x_i) = 0$$
$$\Rightarrow \sum_{i=1}^{n} \left( y_i x_i - \beta_1 x_i^2 - \beta_2 x_i \right) = 0$$

Substituting $\beta_2 = \bar{y} - \beta_1 \bar{x}$:

$$\Rightarrow \sum_{i=1}^{n} \left( y_i x_i - \beta_1 x_i^2 - \bar{y} x_i + \beta_1 \bar{x} x_i \right) = 0$$
$$\Rightarrow \sum_{i=1}^{n} \left( y_i x_i - \bar{y} x_i \right) - \beta_1 \sum_{i=1}^{n} \left( x_i^2 - \bar{x} x_i \right) = 0$$
$$\Rightarrow \beta_1 = \frac{\sum_{i=1}^{n} \left( y_i x_i - \bar{y} x_i \right)}{\sum_{i=1}^{n} \left( x_i^2 - \bar{x} x_i \right)}$$

Since $\sum_{i=1}^{n} y_i \bar{x} = n \bar{x} \bar{y}$ and $\sum_{i=1}^{n} x_i \bar{x} = n \bar{x}^2$, we have

$$\beta_1 = \frac{\sum_{i=1}^{n} \left( y_i x_i - \bar{y} x_i - \bar{x} y_i + \bar{x} \bar{y} \right)}{\sum_{i=1}^{n} \left( x_i^2 - \bar{x} x_i - \bar{x} x_i + \bar{x}^2 \right)} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
46
Example

• Five randomly selected students took a math test before they began their statistics course.

• The Statistics Department has two questions:
– What linear regression equation best predicts statistics performance, based on test scores?
– If a student made an 80 on the test, what grade would we expect her to make in statistics?
Example

• $x_i$ is the score of the aptitude test.
• $y_i$ is the statistics grade.
• Model: $y_i = \beta_1 x_i + \beta_2$

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

Student | x_i | y_i | x_i - x̄ | y_i - ȳ | (x_i - x̄)(y_i - ȳ) | (x_i - x̄)^2
1       | 95  | 85  |  17      |   8      | 136                  | 289
2       | 85  | 95  |   7      |  18      | 126                  |  49
3       | 80  | 70  |   2      |  -7      | -14                  |   4
4       | 70  | 65  |  -8      | -12      |  96                  |  64
5       | 60  | 70  | -18      |  -7      | 126                  | 324
Sum     | 390 | 385 |          |          | 470                  | 730
Mean    | 78  | 77  |          |          |                      |

$$\beta_1 = \frac{470}{730} = 0.644$$

$$\beta_2 = 77 - 0.644 \times 78 = 26.768$$
Example

• What linear regression equation best predicts statistics performance, based on test scores?

$$\hat{y} = \beta_1 x + \beta_2 = 0.644\, x + 26.768$$

where $\beta_1 = \frac{470}{730} = 0.644$ and $\beta_2 = 77 - 0.644 \times 78 = 26.768$.
Example

• If a student made an 80 on the test, what grade would we expect her to make in statistics?

$$\hat{y} = 0.644 \times 80 + 26.768 = 78.288$$

– Note: do not use values for the independent variable that are outside the range of values used to create the equation. (A NumPy check of this example appears below.)
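A small NumPy check of this worked example, using the same five students.

```python
import numpy as np

x = np.array([95.0, 85.0, 80.0, 70.0, 60.0])   # aptitude test scores
y = np.array([85.0, 95.0, 70.0, 65.0, 70.0])   # statistics grades

x_bar, y_bar = x.mean(), y.mean()              # 78, 77
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # 470/730 ~ 0.644
beta2 = y_bar - beta1 * x_bar                  # ~ 26.77
print(beta1, beta2, beta1 * 80 + beta2)        # prediction for x = 80 is ~ 78.3
```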
Note on Invertibility
• Closed-form solution only unique if 𝑋 𝑇 𝑋 is
invertible
– Otherwise, multiple solutions exist
– Example:

$$\begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}$$

– Any $\hat{\beta}_2 = 2 - \hat{\beta}_1$ is a solution (see the sketch below)
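For this rank-deficient example, NumPy's pseudoinverse picks out one particular solution (the minimum-norm one); a small sketch:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [2.0, 2.0]])
Y = np.array([2.0, 4.0])

beta_minnorm = np.linalg.pinv(X) @ Y   # minimum-norm least-squares solution
print(beta_minnorm)                    # [1. 1.], one of the many beta2 = 2 - beta1 solutions
```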

57
When Can This Happen?
• Case 1: Fewer data examples than feature
dimension (i.e., 𝑛 < 𝑑)
– Solution 1: Remove features so 𝑑 ≤ 𝑛
– Solution 2: Collect more data until 𝑑 ≤ 𝑛

• Case 2: Some feature is a linear combination


of the others
– Solution 1: Remove linearly dependent features
– Solution 2: Use 𝐿2 regularization

58
Shortcomings of Closed-Form Solution

• Computing $\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$ can be challenging

• Computing $(X^T X)^{-1}$ is $O(d^3)$
– $d = 10^4$ features -> $O(10^{12})$ operations
– Even storing $X^T X$ requires a lot of memory

59
Shortcomings of Closed-Form Solution
• Numerical accuracy issues due to "ill-conditioning"
– $X^T X$ is "barely" invertible
– Then, $(X^T X)^{-1}$ has large variance along some dimension
– Regularization helps (see the sketch below)
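A sketch of how $L_2$ regularization helps: adding $\lambda I$ to $X^T X$ (the regularized normal equations) makes a nearly collinear design solvable and well conditioned. The data are synthetic and the value of $\lambda$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-8 * rng.normal(size=100)])  # nearly collinear columns
Y = x1 + 0.1 * rng.normal(size=100)

lam = 1e-3
A = X.T @ X + lam * np.eye(2)                # regularized normal equations
beta_ridge = np.linalg.solve(A, X.T @ Y)
print(np.linalg.cond(X.T @ X), np.linalg.cond(A))  # conditioning improves dramatically
```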

60
Agenda
• Regularization
– Strategy to address bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

61
