Lecture 3

ECE553/653 Neural Networks

Linear Regression

1
Previously …
• Linear regression problem

• Feature maps

• Overfitting and underfitting


– Training/test dataset split
– Capacity of model family
– Bias-variance tradeoff

2
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ that minimizes the MSE:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2$$
Feature Maps
General strategy:
• Model family $F = \{f_\beta\}_\beta$
• Loss function $L(\beta; Z)$

Linear regression with feature map:
• Linear functions over a given feature map $\phi: x \to \mathbb{R}^d$:
$$F = \{f_\beta(x) = \beta^T \phi(x)\}$$
• MSE:
$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T \phi(x_i) \right)^2$$

4
Bias-Variance Tradeoff
• Capacity of a model family captures
“complexity” of data it can fit
– Higher capacity -> more likely to overfit (model
family has high variance)
– Lower capacity -> more likely to underfit (model
family has high bias)
• For linear regression, capacity corresponds to
feature dimension

5
Bias-Variance Tradeoff

6
Bias-Variance Tradeoff

7
Example of Underfitting/Overfitting
• Exploratory Data Analysis

8
Example of Underfitting/Overfitting
• Exploratory Data Analysis

9
Example of Underfitting/Overfitting
• Using 'sklearn.preprocessing.MinMaxScaler' for data normalization (see the sketch below):

$$x \to \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
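A minimal sketch of this normalization with scikit-learn; the small array below is made up purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix (3 examples, 2 features), for illustration only.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 500.0]])

scaler = MinMaxScaler()             # rescales each feature (column) to [0, 1]
X_scaled = scaler.fit_transform(X)  # applies (x - x_min) / (x_max - x_min) per column
print(X_scaled)
```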

10
Example of Underfitting/Overfitting
• Exploratory Data Analysis

11
Example of Underfitting/Overfitting
• Exploratory Data Analysis

12
Example of Underfitting/Overfitting
• Linear regression uses only the first-order features $x$:
$$y = w^T x + b$$
• Polynomial regression uses higher-order combination features $x'$ of $x$:
$$y = w^T x' + b$$
For example, the degree-2 polynomial features of $x = (x_1, x_2)^T$ are $x' = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$ (see the sketch below).
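A short sketch of how such degree-2 features could be generated with scikit-learn; the input point (2, 3) is an arbitrary illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])           # one example with features (x1, x2)
poly = PolynomialFeatures(degree=2)  # includes the bias term 1 by default
x_prime = poly.fit_transform(x)
print(x_prime)  # [[1. 2. 3. 4. 6. 9.]]  i.e. [1, x1, x2, x1^2, x1*x2, x2^2]
```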
13
Example of Underfitting/Overfitting
• We have features $x = (x_1, x_2)$ to predict $y$.
• In linear regression, we have
$$w^* = \arg\min_{w} \left( y - [w_1, w_2] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - b \right)^2.$$
• In polynomial regression, if we use the degree-2 polynomial features of $x$,
$$w^* = \arg\min_{w} \left( y - [w_1, w_2, w_3, \ldots, w_6] \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} - b \right)^2.$$

14
Example of Underfitting/Overfitting

• Degree-1

• Degree-3

• Degree-6

15
Agenda
• Regularization
– Strategy to address bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

16
Recall: Mean Squared Error Loss
• Mean squared error loss for linear regression:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2$$

17
Linear Regression with 𝐿2 Regularization
• Original loss + regularization:

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2 + \lambda \cdot \|\beta\|_2^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2 + \lambda \sum_{j=1}^{D} \beta_j^2$$

• $\lambda$ is a hyperparameter that must be tuned (satisfies $\lambda \geq 0$); a sketch using scikit-learn follows below.
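A hedged sketch using scikit-learn's Ridge, which minimizes $\|Y - X\beta\|^2 + \alpha \|\beta\|^2$; alpha plays the role of $\lambda$ here up to the $1/n$ factor on the squared-error term. The synthetic data are for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # synthetic features
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # illustrative "true" coefficients
y = X @ beta_true + 0.1 * rng.normal(size=100)

model = Ridge(alpha=1.0)   # larger alpha "pulls" the coefficients toward zero
model.fit(X, y)
print(model.coef_)
```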

18
Intuition on 𝐿2 Regularization
• The penalty is equivalently the squared $L_2$ norm of $\beta$:

$$\lambda \sum_{j=1}^{D} \beta_j^2 = \lambda \|\beta\|_2^2 = \lambda \|\beta - 0\|_2^2$$

• "Pulling" $\beta$ to zero
– "Pulls" more as $\lambda$ becomes larger
19
Intuition on 𝐿2 Regularization
• Why does it help?
– Encourages “simple” functions
– As 𝜆 → ∞, obtain 𝛽 = 0
– Use 𝜆 to tune bias-variance tradeoff

20
Bias-Variance Tradeoff for Regularization

21
Intuition on 𝐿2 Regularization
[Figure: contours of the original loss and of the $L_2$ penalty in the $(\beta_1, \beta_2)$ plane]
• One point minimizes the original loss (the solution when $\lambda = 0$); another minimizes the regularization term, $\beta = 0$ (the solution as $\lambda \to \infty$).
• The minimizer of the full loss lies between the two; at this point, the gradients of the two terms are equal (and opposite).
• The loss varies greatly in some directions, and those directions are penalized more.
• The tradeoff depends on the choice of $\lambda$.
22
Feature Standardization
• Unregularized linear regression is invariant to
feature scaling
– Suppose we scale $x_{ij} \to 2 x_{ij}$ for all examples
– Without regularization, simply use $\beta_j \to \beta_j / 2$

• Not true for regularized regression
– The penalty term $\beta_j^2$ is scaled by $1/4$

23
Feature Standardization
• Solution: Rescale features to zero mean and
unit variance

• Must use the same transformation during training and for prediction
– Compute the standardization statistics on the training data and reuse them on the test data (see the sketch below)
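A minimal sketch with scikit-learn's StandardScaler, fitting the statistics on the training data only and reusing them on the test data; the synthetic arrays are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))  # synthetic training features
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))   # synthetic test features

scaler = StandardScaler().fit(X_train)   # mean/variance estimated on training data only
X_train_std = scaler.transform(X_train)  # zero mean, unit variance per feature
X_test_std = scaler.transform(X_test)    # reuse the training statistics
```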

24
General Regularization Strategy
• Original loss + regularization:

$$L_{\text{reg}}(\beta; Z) = L(\beta; Z) + \lambda \cdot R(\beta)$$

– Offers a way to express a preference for "simple" functions in the family
– Typically, the regularization term is independent of the data

25
Hyperparameter Tuning
• 𝜆 is a hyperparameter that must be tuned
(satisfies 𝜆 ≥ 0 )
• Naïve strategy: Try a few different candidates
𝜆 and choose the one that minimizes the test
loss
• Problem: We may overfit the test set!
– Major problem if we have more hyperparameters

26
Training/Validation/Test Split
• Goal: Choose best hyperparameter 𝜆
– Can also compare different model families, feature
maps, etc.
• Solution: Optimize $\lambda$ on a validation dataset
– Rule of thumb: 60/20/20 split

Given data 𝑍

Training data 𝑍train Val data 𝑍val Test data 𝑍test


27
Basic Cross Validation Algorithm
• Step 1: Split 𝑍 into 𝑍train, 𝑍val and 𝑍test

Training data 𝑍train Val data 𝑍val Test data 𝑍test

• Step 2: For $t \in \{1, \ldots, h\}$:
• Step 2a: Run linear regression with $Z_{\text{train}}$ and $\lambda_t$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_t)$
• Step 2b: Evaluate the validation loss $L_{\text{val}}^t = L(\hat{\beta}(Z_{\text{train}}, \lambda_t); Z_{\text{val}})$
• Step 3: Use the best $\lambda_t$
• Choose $t' = \arg\min_t L_{\text{val}}^t$, i.e., the candidate with the lowest validation loss
• Re-run linear regression with $Z_{\text{train}}$ and $\lambda_{t'}$ to obtain $\hat{\beta}(Z_{\text{train}}, \lambda_{t'})$ (see the sketch below)

28
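A rough sketch of this recipe, using Ridge as the regularized linear regression and an arbitrary grid of $\lambda$ candidates; the synthetic data and the 60/20/20 split sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

# Step 1: 60/20/20 split into train / val / test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Step 2: fit on the training set for each candidate lambda, score on validation
lambdas = [0.01, 0.1, 1.0, 10.0]
val_losses = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    val_losses.append(mean_squared_error(y_val, model.predict(X_val)))

# Step 3: refit with the best lambda and report the held-out test loss
best_lam = lambdas[int(np.argmin(val_losses))]
final_model = Ridge(alpha=best_lam).fit(X_train, y_train)
print(best_lam, mean_squared_error(y_test, final_model.predict(X_test)))
```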
Alternative Cross-Validation Algorithms
• If 𝑍 is small, then splitting it can reduce
performance
– Can use 𝑍train and 𝑍val in Step 3

• Alternative: 𝑘-fold cross-validation (e.g., 𝑘=3)


– Split $Z$ into $Z_{\text{train}}$ and $Z_{\text{test}}$
– Split $Z_{\text{train}}$ into $k$ disjoint sets $Z_{\text{val}}^{s}$
– Use the $\lambda'$ that works best on average across folds $s \in \{1, \ldots, k\}$, training on $Z_{\text{train}} \setminus Z_{\text{val}}^{s}$ and validating on $Z_{\text{val}}^{s}$
– Typically chooses a better $\lambda'$ than the above strategy (see the sketch below)
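A possible 3-fold sketch with scikit-learn, scoring Ridge on each held-out fold while training on the remaining folds (a separate $Z_{\text{test}}$ would still be held out in practice); the data and $\lambda$ grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=150)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
lambdas = [0.01, 0.1, 1.0, 10.0]
avg_mse = []
for lam in lambdas:
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    avg_mse.append(-scores.mean())          # average validation MSE across the 3 folds

best_lam = lambdas[int(np.argmin(avg_mse))]
print(best_lam)
```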

29
Example: 3-Fold Cross Validation

30
𝐿1 Regularization
• Can we minimize $\|\beta\|_0 = |\{j : \beta_j \neq 0\}|$?
– That is, the number of nonzero components of $\beta$
– Improves interpretability (automatic feature selection!)
– Also serves as a strong regularizer

• Challenge: $\|\beta\|_0$ is not differentiable, making it hard to optimize

31
Intuition on 𝐿1 Regularization
[Figure: contours of the original loss and of the $L_1$ penalty in the $(\beta_1, \beta_2)$ plane]
• One point minimizes the original loss (the solution when $\lambda = 0$); another minimizes the regularization term (the solution as $\lambda \to \infty$).
• The minimizer of the full loss sits at a corner of the $L_1$ ball, so the solution is sparse ($\beta_1 = 0$)!

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2 + \lambda \sum_{j=1}^{D} |\beta_j|$$

32
𝐿1 Regularization for Feature Selection
• Step 1: Construct a lot of features and add them to the feature map
• Step 2: Use $L_1$-regularized regression to "select" a subset of features
– I.e., coefficient $\beta_j \neq 0$ -> feature $j$ is selected
• Optional: Remove unselected features from the feature map and run vanilla linear regression (see the sketch below)
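A hedged sketch of this selection step using scikit-learn's Lasso ($L_1$-regularized linear regression); the synthetic data, the sparse "true" coefficients, and the value of alpha are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_beta = np.zeros(20)
true_beta[[0, 3, 7]] = [2.0, -1.5, 0.8]   # only 3 of the 20 features actually matter
y = X @ true_beta + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_)    # indices j with beta_j != 0 ("selected" features)
print(selected)
```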

33
Agenda
• Regularization
– Strategy to address bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

34
Minimizing the MSE Loss
• Recall that linear regression minimizes the loss
$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2$$

• Closed-form solution: Compute using matrix


operations
• Optimization-based solution: Search over
candidate 𝛽

35
Vectorizing Linear Regression

$$Y \approx X\beta$$

$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}$$

36
Vectorizing Mean Squared Error
$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} f_\beta(x_1) \\ \vdots \\ f_\beta(x_n) \end{bmatrix}$$

$$L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta^T x_i \right)^2 = \frac{1}{n} \|Y - X\beta\|_2^2$$

where $\|z\|_2^2 = \sum_{i=1}^{n} z_i^2$.

37
Strategy 1: Closed-Form Solution

• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n} \|Y - X\beta\|_2^2 = \frac{1}{n} \nabla_\beta (Y - X\beta)^T (Y - X\beta)$$
$$= \frac{2}{n} \left( \nabla_\beta (Y - X\beta) \right)^T (Y - X\beta)$$
$$= -\frac{2}{n} X^T (Y - X\beta)$$
$$= -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$

38
Strategy 1: Closed-Form Solution
• The gradient is

$$\nabla_\beta L(\beta; Z) = \nabla_\beta \frac{1}{n} \|Y - X\beta\|_2^2 = -\frac{2}{n} X^T Y + \frac{2}{n} X^T X \beta$$

• Setting $\nabla_\beta L(\beta; Z) = 0$, we have $X^T X \hat{\beta} = X^T Y$

39
Strategy 1: Closed-Form Solution
• Setting $\nabla_\beta L(\beta; Z) = 0$, we have $X^T X \hat{\beta} = X^T Y$

• Assuming $X^T X$ is invertible, we have

$$\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$$

(a NumPy sketch follows below)
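A minimal NumPy sketch of this closed-form solution on synthetic data; solving the normal equations with np.linalg.solve avoids forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # illustrative "true" coefficients
Y = X @ beta_true + 0.1 * rng.normal(size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # solves X^T X beta = X^T Y
# Alternatively: beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
```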

40
Closed-Form Solution for Vanilla Regression Model

• Given $L(\beta; Z) = \sum_{i=1}^{n} (y_i - x_i \beta_1 - \beta_2)^2$, we have

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

41
Proof
• Obtain $\beta_2$:

$$L(\beta; Z) = \sum_{i=1}^{n} (y_i - x_i \beta_1 - \beta_2)^2$$
$$\Rightarrow \frac{\partial L(\beta; Z)}{\partial \beta_2} = \sum_{i=1}^{n} 2 (y_i - x_i \beta_1 - \beta_2)(-1) = 0$$
$$\Rightarrow \sum_{i=1}^{n} (y_i - \beta_1 x_i - \beta_2) = 0$$
$$\Rightarrow \sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \beta_2 = 0$$
$$\Rightarrow \beta_2 = \frac{1}{n} \sum_{i=1}^{n} y_i - \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i$$
$$\Rightarrow \beta_2 = \bar{y} - \beta_1 \bar{x}$$
42
Proof
• Obtain $\beta_1$:

$$\frac{\partial L(\beta; Z)}{\partial \beta_1} = \sum_{i=1}^{n} 2 (y_i - x_i \beta_1 - \beta_2)(-x_i) = 0$$
$$\Rightarrow \sum_{i=1}^{n} \left( y_i x_i - \beta_1 x_i^2 - \beta_2 x_i \right) = 0$$

Substituting $\beta_2 = \bar{y} - \beta_1 \bar{x}$:

$$\Rightarrow \sum_{i=1}^{n} \left( y_i x_i - \beta_1 x_i^2 - \bar{y} x_i + \beta_1 \bar{x} x_i \right) = 0$$
$$\Rightarrow \sum_{i=1}^{n} \left( y_i x_i - \bar{y} x_i \right) - \beta_1 \sum_{i=1}^{n} \left( x_i^2 - \bar{x} x_i \right) = 0$$
$$\Rightarrow \beta_1 = \frac{\sum_{i=1}^{n} \left( y_i x_i - \bar{y} x_i \right)}{\sum_{i=1}^{n} \left( x_i^2 - \bar{x} x_i \right)}$$

Since $\sum_{i=1}^{n} y_i \bar{x} = n \bar{x} \bar{y}$ and $\sum_{i=1}^{n} x_i \bar{x} = n \bar{x}^2$, we have

$$\beta_1 = \frac{\sum_{i=1}^{n} \left( y_i x_i - \bar{y} x_i - \bar{x} y_i + \bar{x} \bar{y} \right)}{\sum_{i=1}^{n} \left( x_i^2 - \bar{x} x_i - \bar{x} x_i + \bar{x}^2 \right)} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
46
Example

• Five randomly selected students took a math test before they began their statistics course.

• The Statistics Department has two questions:
– What linear regression equation best predicts statistics performance, based on test scores?
– If a student made an 80 on the test, what grade would we expect her to make in statistics?
Example

• $x_i$ is the score of the aptitude test.
• $y_i$ is the statistics grade.
• Model: $y_i = \beta_1 x_i + \beta_2$

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_2 = \bar{y} - \beta_1 \bar{x}$$

Student | x_i | y_i | x_i - x̄ | y_i - ȳ | (x_i - x̄)(y_i - ȳ) | (x_i - x̄)^2
1       | 95  | 85  |  17      |   8      | 136                  | 289
2       | 85  | 95  |   7      |  18      | 126                  |  49
3       | 80  | 70  |   2      |  -7      | -14                  |   4
4       | 70  | 65  |  -8      | -12      |  96                  |  64
5       | 60  | 70  | -18      |  -7      | 126                  | 324
Sum     | 390 | 385 |          |          | 470                  | 730
Mean    | 78  | 77  |          |          |                      |

$$\beta_1 = \frac{470}{730} = 0.644$$

$$\beta_2 = 77 - 0.644 \times 78 = 26.768$$
Example

• What linear regression equation best predicts statistics performance, based on test scores?

$$\hat{y} = \beta_1 x + \beta_2 = 0.644\, x + 26.768$$

where $\beta_1 = \frac{470}{730} = 0.644$ and $\beta_2 = 77 - 0.644 \times 78 = 26.768$.
Example

• If a student made an 80 on the test, what grade would we expect her to make in statistics?

$$\hat{y} = 0.644 \times 80 + 26.768 = 78.288$$

– Note: do not use values for the independent variable that are outside the range of values used to create the equation. (A NumPy check of this example appears below.)
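A small NumPy check of this worked example, using the same five students.

```python
import numpy as np

x = np.array([95.0, 85.0, 80.0, 70.0, 60.0])   # aptitude test scores
y = np.array([85.0, 95.0, 70.0, 65.0, 70.0])   # statistics grades

x_bar, y_bar = x.mean(), y.mean()              # 78, 77
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # 470/730 ~ 0.644
beta2 = y_bar - beta1 * x_bar                  # ~ 26.77
print(beta1, beta2, beta1 * 80 + beta2)        # prediction for x = 80 is ~ 78.3
```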
Note on Invertibility
• Closed-form solution only unique if 𝑋 𝑇 𝑋 is
invertible
– Otherwise, multiple solutions exist
– Example:

$$\begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}$$

– Any $\hat{\beta}_2 = 2 - \hat{\beta}_1$ is a solution (see the sketch below)
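For this rank-deficient example, NumPy's pseudoinverse picks out one particular solution (the minimum-norm one); a small sketch:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [2.0, 2.0]])
Y = np.array([2.0, 4.0])

beta_minnorm = np.linalg.pinv(X) @ Y   # minimum-norm least-squares solution
print(beta_minnorm)                    # [1. 1.], one of the many beta2 = 2 - beta1 solutions
```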

57
When Can This Happen?
• Case 1: Fewer data examples than feature
dimension (i.e., 𝑛 < 𝑑)
– Solution 1: Remove features so 𝑑 ≤ 𝑛
– Solution 2: Collect more data until 𝑑 ≤ 𝑛

• Case 2: Some feature is a linear combination


of the others
– Solution 1: Remove linearly dependent features
– Solution 2: Use 𝐿2 regularization

58
Shortcomings of Closed-Form Solution

• Computing $\hat{\beta}(Z) = (X^T X)^{-1} X^T Y$ can be challenging

• Computing $(X^T X)^{-1}$ is $O(d^3)$
– $d = 10^4$ features -> $O(10^{12})$ operations
– Even storing $X^T X$ requires a lot of memory

59
Shortcomings of Closed-Form Solution
• Numerical accuracy issues due to "ill-conditioning"
– $X^T X$ is "barely" invertible
– Then, $(X^T X)^{-1}$ has large variance along some dimension
– Regularization helps (see the sketch below)
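A sketch of how $L_2$ regularization helps: adding $\lambda I$ to $X^T X$ (the regularized normal equations) makes a nearly collinear design solvable and well conditioned. The data are synthetic and the value of $\lambda$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-8 * rng.normal(size=100)])  # nearly collinear columns
Y = x1 + 0.1 * rng.normal(size=100)

lam = 1e-3
A = X.T @ X + lam * np.eye(2)                # regularized normal equations
beta_ridge = np.linalg.solve(A, X.T @ Y)
print(np.linalg.cond(X.T @ X), np.linalg.cond(A))  # conditioning improves dramatically
```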

60
Agenda
• Regularization
– Strategy to address bias-variance tradeoff
– By example: Linear regression with 𝐿2
regularization

• Minimizing the MSE Loss


– Closed-form solution
– Gradient descent

61
