Lecture 2
Linear Regression
1
Previously …
• Artificial intelligence (AI) vs. machine learning
(ML) vs. deep learning (DL)
(Diagram: DL is a subset of ML, which is a subset of AI.)
2
Artificial Intelligence
• AI segregates the fruits using decision logic
within a rule-based engine
3
Machine Learning
• ML segregates the fruits using feature
extraction and provides a degree of confidence
4
Deep Learning
• DL-based method removes the need for
defining what each fruit looks like.
5
Machine Learning vs. Deep Learning
6
What Will We Cover in This Course?
• Supervised learning
– Linear regression
– Logistic regression
– Model ensembles
– K-nearest neighbor and decision tree
• Unsupervised learning
– Clustering and K-means
• Reinforcement learning
– Temporal difference learning
– Q learning
– Deep reinforcement learning
– Multi-agent reinforcement learning
7
Supervised Learning
• In supervised learning problems, each training data sample (x_n, y_n) consists of
– A feature/input vector x_n
– A label/output vector y_n
(Example: an image x_n labeled "Cat"; x_n is the feature/input, y_n the label/output.)
9
Linear regression
• Basic example of a supervised learning
algorithm
10
Agenda
• Function approximation
– Modern strategy for designing machine learning
algorithms
– By example: Linear regression, a simple machine
learning algorithm
• Bias-variance tradeoff
– Fundamental challenge in machine learning
11
Linear Functions
• Consider the space of linear functions f_β(x) defined by
f_β(x) = β^T x
12
Linear Functions
• Consider the space of linear functions f_β(x) defined by
f_β(x) = β^T x = [β_1 ⋯ β_d] [x_1, …, x_d]^T
13
Linear Functions
• Consider the space of linear functions f_β(x) defined by
f_β(x) = β^T x = [β_1 ⋯ β_d] [x_1, …, x_d]^T = β_1 x_1 + ⋯ + β_d x_d
14
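In code, evaluating a linear function is just a dot product. A minimal Python/NumPy sketch (the values of β and x below are made up for illustration):

```python
import numpy as np

# Hypothetical coefficient vector beta and input x, both in R^d (d = 3 here)
beta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])

# f_beta(x) = beta^T x = beta_1*x_1 + ... + beta_d*x_d
f_x = beta @ x
print(f_x)  # 0.5*1 - 1.0*2 + 2.0*3 = 4.5
```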
Linear Regression Problem
• Input: Dataset Z = {(x_1, y_1), …, (x_n, y_n)}, where x_i ∈ ℝ^d and y_i ∈ ℝ
• Output: A linear function f_β(x) = β^T x such that y_i ≈ β^T x_i
• Typical notation
• Use 𝑖 to index examples 𝑥𝑖 , 𝑦𝑖 in data 𝑍
• Use 𝑗 to index components 𝑥𝑗 of 𝒙 ∈ ℝ𝑑
• 𝑥𝑖𝑗 is component 𝑗 of input example 𝑖
• Goal: Estimate 𝛽 ∈ ℝ𝑑
Linear Regression Problem
• Input: Dataset Z = {(x_1, y_1), …, (x_n, y_n)}, where x_i ∈ ℝ^d and y_i ∈ ℝ
• Output: A linear function f_β(x) = β^T x such that y_i ≈ β^T x_i
(Figure: scatter plot of the data points (x_i, y_i); given a new input x = 800, predict y.)
Choice of Loss Function
• y_i ≈ β^T x_i if (y_i − β^T x_i)^2 is small
• Mean squared error (MSE):
L(β; Z) = (1/n) Σ_{i=1}^{n} (y_i − β^T x_i)^2
• Computationally convenient and works well in practice
(Figure: data points and the fitted line f_β(x) = β^T x; the loss L(β; Z) is the average of the squared vertical residuals.)
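The MSE loss above is a one-liner in NumPy. A minimal sketch (the toy dataset and the helper name `mse_loss` are invented for illustration, not from the slides):

```python
import numpy as np

def mse_loss(beta, X, y):
    """L(beta; Z) = (1/n) * sum_i (y_i - beta^T x_i)^2."""
    residuals = y - X @ beta          # shape (n,)
    return np.mean(residuals ** 2)

# Toy dataset: n = 4 examples with d = 2 features (illustrative values)
X = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 2.0]])
y = np.array([1.1, 2.9, 4.1, 6.0])

print(mse_loss(np.array([1.0, 1.0]), X, y))
```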
Linear Regression Problem
• Input: Dataset Z = {(x_1, y_1), …, (x_n, y_n)}, where x_i ∈ ℝ^d and y_i ∈ ℝ
• Output: A linear function f_β(x) = β^T x that minimizes the MSE:
L(β; Z) = (1/n) Σ_{i=1}^{n} (y_i − β^T x_i)^2
Linear Regression Problem
• Input: Dataset Z = {(x_1, y_1), …, (x_n, y_n)}
• Compute
β̂(Z) = arg min_{β ∈ ℝ^d} L(β; Z)
     = arg min_{β ∈ ℝ^d} (1/n) Σ_{i=1}^{n} (y_i − β^T x_i)^2
• Output: f_{β̂(Z)}(x) = β̂(Z)^T x
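The slides do not prescribe a particular solver, but one common way to compute β̂(Z) is ordinary least squares, e.g. via NumPy's `lstsq`. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = beta_true^T x + noise (made up for illustration)
n, d = 100, 3
beta_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ beta_true + 0.1 * rng.normal(size=n)

# beta_hat = argmin_beta (1/n) sum_i (y_i - beta^T x_i)^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)            # close to beta_true

# Prediction with the fitted model: f_{beta_hat}(x) = beta_hat^T x
x_new = np.array([1.0, 1.0, 1.0])
print(beta_hat @ x_new)
```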
Intuition on Minimizing MSE Loss
• Consider x ∈ ℝ and β ∈ ℝ, so f_β(x) = βx
(Figure: left, the data points in the (x, y) plane; right, the loss L(β; Z) as a function of β.)
22
Intuition on Minimizing MSE Loss
• Consider x ∈ ℝ and β ∈ ℝ, so f_β(x) = βx
(Figure: the candidate line with β = 1 over the data, and the corresponding value of L(β; Z).)
23
Intuition on Minimizing MSE Loss
• Consider x ∈ ℝ and β ∈ ℝ, so f_β(x) = βx
(Figure: the candidate line with β = 0.5 over the data, and the corresponding value of L(β; Z).)
24
Intuition on Minimizing MSE Loss
• Consider x ∈ ℝ and β ∈ ℝ, so f_β(x) = βx
(Figure: the candidate line with β = 0 over the data, and the corresponding value of L(β; Z).)
25
Intuition on Minimizing MSE Loss
• The MSE loss L(β; Z) is convex in β
(Figure: convex loss surface over the coefficients (β_1, β_2).)
27
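Because the loss is convex, simple gradient descent on L(β; Z) reaches the global minimum. A minimal sketch (the step size, iteration count, and synthetic data are illustrative assumptions, not from the slides):

```python
import numpy as np

def mse_grad(beta, X, y):
    """Gradient of L(beta; Z) = (1/n) * sum_i (y_i - beta^T x_i)^2."""
    n = len(y)
    return (-2.0 / n) * X.T @ (y - X @ beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.5]) + 0.05 * rng.normal(size=50)

beta = np.zeros(2)            # start at beta = 0
for _ in range(500):          # fixed number of steps (illustrative)
    beta -= 0.1 * mse_grad(beta, X, y)

print(beta)                   # approaches the least-squares solution
```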
“Good” Mean Squared Error?
• Need to compare to a baseline!
– I.e., a constant prediction (see the sketch below)
28
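A hedged sketch of the baseline comparison: fit a linear model and compare its MSE to that of a constant predictor (here, the mean label, which is the MSE-optimal constant). The data is synthetic, so the exact numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, 1.0]) + rng.normal(size=200)

# Fitted linear model
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_model = np.mean((y - X @ beta_hat) ** 2)

# Constant baseline: always predict the mean label
mse_baseline = np.mean((y - y.mean()) ** 2)

print(f"model MSE    = {mse_model:.3f}")
print(f"baseline MSE = {mse_baseline:.3f}")  # the model should be much lower
```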
Alternative Loss Functions
• Mean absolute error (MAE):
(1/n) Σ_{i=1}^{n} |ŷ_i − y_i|
• Mean relative error:
(1/n) Σ_{i=1}^{n} |ŷ_i − y_i| / |y_i|
• R^2 score: 1 − MSE / Variance
– "Coefficient of determination"
– Higher is better; R^2 = 1 is perfect
29
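These metrics are easy to compute directly (scikit-learn also provides `mean_absolute_error` and `r2_score`). A sketch with made-up predictions ŷ:

```python
import numpy as np

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))

def mean_relative_error(y_hat, y):
    return np.mean(np.abs(y_hat - y) / np.abs(y))   # assumes y_i != 0

def r2_score(y_hat, y):
    mse = np.mean((y - y_hat) ** 2)
    variance = np.mean((y - y.mean()) ** 2)
    return 1.0 - mse / variance

y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(mae(y_hat, y), mean_relative_error(y_hat, y), r2_score(y_hat, y))
```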
Loss Function Comparisons
30
Function Approximation
32
True Function
• Input: Dataset 𝑍
– Presume there is an unknown function 𝑓 ∗ that
generates 𝑍
33
Linear Regression
Data Z = {(x_i, y_i)}_{i=1}^{n}  →  β̂(Z) = arg min_β L(β; Z)  →  Model f_{β̂(Z)}
The loss L encodes y_i ≈ f_β(x_i)
35
Motivation: Quadratic Function
36
Motivation: Quadratic Function
f_β(x) = x/2
37
Motivation: Quadratic Function
f_β(x) = x
39
Quadratic Feature Map
• Consider the feature map φ: ℝ → ℝ^2 given by
φ(x) = (x, x^2)^T
f_β(x) = β^T φ(x) = β_1 x + β_2 x^2
40
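A sketch of the quadratic feature map and the resulting model f_β(x) = β_1 x + β_2 x^2 in NumPy (the coefficients and inputs are illustrative):

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi: R -> R^2, phi(x) = (x, x^2)."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x ** 2], axis=-1)

# f_beta(x) = beta^T phi(x) = beta_1 * x + beta_2 * x^2
beta = np.array([0.0, 1.0])          # e.g. the pure-quadratic choice from the next slide
xs = np.linspace(-2.0, 2.0, 5)
print(phi(xs) @ beta)                # [4. 1. 0. 1. 4.]
```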
Quadratic Feature Map
f_β(x) = 0·x + 1·x^2
(Figure: plot of f_β(x) against x.)
42
Training vs. Test Data
• Training data: Examples Z = {(x, y)} used to fit our model
43
Prediction
• Issue: The goal in machine learning is
prediction
– Given a new input 𝑥, predict the label 𝑦ො = 𝑓𝜷 𝒙
The errors on new inputs are very large!
44
Prediction
• Issue: The goal in machine learning is
prediction
– Given a new input 𝑥, predict the label 𝑦ො = 𝑓𝜷 𝒙
Vanilla linear regression actually works better!
45
Overfitting vs. Underfitting
• Overfitting
– Fit the training data 𝑍 well
– Fit new test data 𝑥, 𝑦 poorly
46
Overfitting vs. Underfitting
• Underfitting
– Fit the training data 𝑍 poorly
– Fit new test data 𝑥, 𝑦 poorly
47
Why Does Overfitting Happen?
• Noise in the data
– Noise in labels 𝑦: True data generating process is
complex
– Noise in features 𝑥
• Measurement error in the feature values
• Errors due to preprocessing
• Some features may be irrelevant to the function
• The training data size is too small
– The dataset does not contain enough data samples to
accurately represent all possible input data values.
• The model trains for too long on a single training dataset.
48
Training/Test Split
• Issue: How to detect overfitting vs.
underfitting?
• Solution: Use held-out test data to estimate
loss on new data
– Typically, randomly shuffle data first
Given data Z:
• Step 1: Split Z into training data Z_train and held-out test data Z_test
• Step 2: Fit β̂(Z_train) on the training data
• Step 3: Evaluate
– Training loss: L_train = L(β̂(Z_train); Z_train)
– Test (or generalization) loss: L_test = L(β̂(Z_train); Z_test)
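A minimal sketch of the split/fit/evaluate recipe with scikit-learn (the 80/20 split and the synthetic data are illustrative choices, not mandated by the slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Step 1: randomly shuffle and split into train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

# Step 2: fit on the training data only
model = LinearRegression().fit(X_train, y_train)

# Step 3: evaluate on both splits
L_train = mean_squared_error(y_train, model.predict(X_train))
L_test = mean_squared_error(y_test, model.predict(X_test))
print(f"L_train = {L_train:.4f}, L_test = {L_test:.4f}")
```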
Training/Test Split Algorithm
• Overfitting
– Fit the training data 𝑍 well
– Fit new test data 𝑥, 𝑦 poorly
51
Training/Test Split Algorithm
• Overfitting
– 𝐿train is small
– 𝐿test is large
52
Training/Test Split Algorithm
• Underfitting
– Fit the training data 𝑍 poorly
– Fit new test data 𝑥, 𝑦 poorly
53
Training/Test Split Algorithm
• Underfitting
– 𝐿train is large
– 𝐿test is also large
54
How to Fix Underfitting/Overfitting?
• Choose the right model family!
55
Role of Capacity
• Capacity of a model family captures
“complexity” of data it can fit
– Higher capacity -> more likely to overfit (model
family has high variance)
– Lower capacity -> more likely to underfit (model
family has high bias)
• For linear regression, capacity corresponds to
feature dimension
56
Bias-Variance Tradeoff
(Figure: training loss and test loss as a function of model capacity.)
57
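One way to see the tradeoff empirically is to sweep capacity, here the polynomial degree, and track training vs. test loss. A sketch on synthetic data (the true function and the degree grid are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=60)
y = 0.5 * x ** 2 + rng.normal(scale=1.0, size=60)    # hypothetical true function + noise
X = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in [1, 2, 4, 8, 12]:                      # capacity = polynomial degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    L_train = mean_squared_error(y_tr, model.predict(X_tr))
    L_test = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}: L_train = {L_train:.2f}, L_test = {L_test:.2f}")
```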
Example of Underfitting/Overfitting
• House price prediction with linear regression.
• Data fields:
1. Total Area: Total area of the house excluding the
basement
2. Bedrooms: The number of bedrooms in the
house.
3. Living Area: The living area inside the house.
4. Parking Area: The area of the parking lot.
5. …
• Target value: the transaction price of the house.
58
Example of Underfitting/Overfitting
• Exploratory Data Analysis
59
Example of Underfitting/Overfitting
• Exploratory Data Analysis
60
Example of Underfitting/Overfitting
• Use sklearn.preprocessing.MinMaxScaler for data normalization:
x ← (x − x_min) / (x_max − x_min)
61
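A sketch of min-max normalization with scikit-learn (the feature values below are made up; the lecture applies it to the house-price columns):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw feature matrix: [total area, bedrooms] for three houses
X = np.array([[1200.0, 2.0],
              [2400.0, 4.0],
              [1800.0, 3.0]])

# Each column is rescaled to [0, 1] via (x - x_min) / (x_max - x_min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# [[0.  0. ]
#  [1.  1. ]
#  [0.5 0.5]]
```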
Example of Underfitting/Overfitting
• Exploratory Data Analysis
62
Example of Underfitting/Overfitting
• Exploratory Data Analysis
63
Example of Underfitting/Overfitting
• Linear regression uses only the first-order features x:
y = w^T x + b
• Polynomial regression uses higher-order combination features x′ of x:
y = w^T x′ + b
For example, the degree-2 polynomial features of x = (x_1, x_2)^T are
x′ = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T.
64
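scikit-learn's `PolynomialFeatures` produces exactly this degree-2 expansion. A sketch for a single point x = (x_1, x_2) = (2, 3):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])             # one example with x1 = 2, x2 = 3

poly = PolynomialFeatures(degree=2)     # includes the bias term 1 by default
x_prime = poly.fit_transform(x)
print(poly.get_feature_names_out())     # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(x_prime)                          # [[1. 2. 3. 4. 6. 9.]]
```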
Example of Underfitting/Overfitting
• We have features 𝑥 = (𝑥1 , 𝑥2 ) to predict 𝑦.
• In linear regression, we have
w* = argmin_w (y − [w_1, w_2] [x_1, x_2]^T − b)^2
• In polynomial regression, if we use the degree-2 polynomial features
x′ = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T, we have
w* = argmin_w (y − [w_1, w_2, …, w_6] x′ − b)^2
65
Example of Underfitting/Overfitting
• Degree-1
• Degree-3
• Degree-6
66
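A sketch of the degree comparison in code: fit polynomial regressions of degree 1, 3, and 6 and compare train/test R^2. The data here is simulated as a stand-in for the house-price dataset, so the exact numbers are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)
# Simulated stand-in for the house data: two features, nonlinear price
X = rng.uniform(500, 3000, size=(80, 2))                  # e.g. total area, living area
y = 0.1 * X[:, 0] + 1e-5 * X[:, 1] ** 2 + rng.normal(scale=20, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 3, 6]:
    model = make_pipeline(MinMaxScaler(), PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    r2_tr = r2_score(y_tr, model.predict(X_tr))
    r2_te = r2_score(y_te, model.predict(X_te))
    print(f"degree {degree}: train R^2 = {r2_tr:.3f}, test R^2 = {r2_te:.3f}")
```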