
ECE553/653 Neural Networks

Linear Regression

1
Previously …
• Artificial intelligence (AI) vs. machine learning
(ML) vs. deep learning (DL)

[Diagram: nested circles showing DL within ML within AI]

2
Artificial Intelligence
• AI segregates the fruits using decision logic
within a rule-based engine

3
Machine Learning
• ML segregates the fruits using feature
extraction and provides a degree of confidence

4
Deep Learning
• A DL-based method removes the need to define
what each fruit looks like.

5
Machine Learning vs. Deep Learning

                Machine Learning              Deep Learning

Data            Performs well on small        Performs well on large
                to medium datasets            datasets

Hardware        Able to function on a CPU     Requires significant computing
                                              power, e.g., a GPU

Features        Features need to be           Learns features automatically
                manually identified

Training time   Quick to train                Computationally intensive

6
What We Will Cover in This Course
• Supervised learning
– Linear regression
– Logistic regression
– Model ensembles
– K-nearest neighbor and decision tree

• Unsupervised learning
– Clustering and K-means

• Reinforcement learning
– Temporal difference learning
– Q learning
– Deep reinforcement learning
– Multi-agent reinforcement learning

7
Supervised Learning
• In supervised learning problems, each training data sample consists of
  – A feature/input vector $x_n$
  – A label/output vector $y_n$

[Example: an image $x_n$ (the feature/input) paired with the label $y_n$ = "Cat" (the label/output)]
9
Linear regression
• Basic example of a supervised learning algorithm

• Captures many fundamental machine learning concepts
  – Function approximation
  – Bias-variance tradeoff
  – Regularization
  – Training/validation/test split
  – Optimization and gradient descent

10
Agenda
• Function approximation
– Modern strategy for designing machine learning
algorithms
– By example: Linear regression, a simple machine
learning algorithm
• Bias-variance tradeoff
– Fundamental challenge in machine learning

11
Linear Functions
• Consider the space of linear functions $f_\beta(x)$ defined by

  $f_\beta(x) = \beta^T x$

• $x \in \mathbb{R}^d$ is an input vector (a.k.a. features)
• $\beta \in \mathbb{R}^d$ is a parameter vector

12
Linear Functions
• Consider the space of linear functions $f_\beta(x)$ defined by

  $f_\beta(x) = \beta^T x = [\beta_1 \; \cdots \; \beta_d] \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$

• $x \in \mathbb{R}^d$ is an input vector (a.k.a. features)
• $\beta \in \mathbb{R}^d$ is a parameter vector

13
Linear Functions
• Consider the space of linear functions $f_\beta(x)$ defined by

  $f_\beta(x) = \beta^T x = [\beta_1 \; \cdots \; \beta_d] \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} = \beta_1 x_1 + \cdots + \beta_d x_d$

• $x \in \mathbb{R}^d$ is called an input (a.k.a. features)
• $\beta \in \mathbb{R}^d$ is called the parameters (a.k.a. parameter vector)
• $y = f_\beta(x)$ is called the label (a.k.a. output)

14
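To make the dot-product form concrete, here is a minimal NumPy sketch; the numbers are made up for illustration:

```python
import numpy as np

def f(beta, x):
    """Linear function f_beta(x) = beta^T x = beta_1*x_1 + ... + beta_d*x_d."""
    return beta @ x

beta = np.array([1.0, -2.0, 0.5])   # parameter vector, d = 3
x = np.array([2.0, 1.0, 4.0])       # input/feature vector
print(f(beta, x))                    # 1*2 - 2*1 + 0.5*4 = 2.0
```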
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ such that $y_i \approx \beta^T x_i$

• Typical notation
  – Use $i$ to index examples $(x_i, y_i)$ in the data $Z$
  – Use $j$ to index components $x_j$ of $x \in \mathbb{R}^d$
  – $x_{ij}$ is component $j$ of input example $i$

• Goal: Estimate $\beta \in \mathbb{R}^d$
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ such that $y_i \approx \beta^T x_i$
  – What does "$y_i \approx \beta^T x_i$" mean?

[Scatter plot of data points $(x_i, y_i)$; given a new input $x = 800$, predict $y = {?}$]
Choice of Loss Function
• $y_i \approx \beta^T x_i$ if $(y_i - \beta^T x_i)^2$ is small

• Mean squared error (MSE):

  $L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$

• Computationally convenient and works well in practice

[Plot: data points around the line $f_\beta(x) = \beta^T x$; $L(\beta; Z)$ averages the squared lengths of the vertical residuals]
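The MSE loss is a one-liner in NumPy; a small sketch with made-up data:

```python
import numpy as np

def mse_loss(beta, X, y):
    """L(beta; Z) = (1/n) * sum_i (y_i - beta^T x_i)^2."""
    residuals = y - X @ beta      # vector of y_i - beta^T x_i
    return np.mean(residuals**2)

X = np.array([[1.0], [2.0], [3.0]])   # n = 3 examples, d = 1
y = np.array([1.2, 1.9, 3.1])
print(mse_loss(np.array([1.0]), X, y))   # loss for beta = 1
```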
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
• Output: A linear function $f_\beta(x) = \beta^T x$ that minimizes the MSE:

  $L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$
Linear Regression Problem
• Input: Dataset $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
• Compute

  $\hat{\beta}(Z) = \arg\min_{\beta \in \mathbb{R}^d} L(\beta; Z) = \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$

• Output: $f_{\hat{\beta}(Z)}(x) = \hat{\beta}(Z)^T x$
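This argmin is an ordinary least-squares problem, which np.linalg.lstsq solves directly. A minimal sketch on synthetic data (all names here are my own):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Minimize L(beta; Z) = (1/n) * sum_i (y_i - beta^T x_i)^2.

    X: (n, d) matrix of inputs, y: (n,) vector of labels.
    np.linalg.lstsq solves the least-squares problem, which is
    equivalent to the argmin above (the 1/n factor does not change it).
    """
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Toy usage: recover beta = [2.0, -1.0] from noiseless synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0])
print(fit_linear_regression(X, y))  # ~ [ 2. -1.]
```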
Intuition on Minimizing MSE Loss
• Consider $x \in \mathbb{R}$ and $\beta \in \mathbb{R}$

[Figures, shown for $\beta = 1$, $\beta = 0.5$, and $\beta = 0$: the left panel plots the data $(x, y)$ with the line $y = \beta x$; the right panel plots the loss $L(\beta; Z)$ against $\beta$, with the current $\beta$ marked. The loss is lowest where the line fits the data best.]

26
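The loss curve in the right-hand panel can be traced numerically by sweeping $\beta$; a small sketch with made-up data points:

```python
import numpy as np

# Made-up 1-D data for illustration.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.2, 1.9, 3.1])

def loss(beta):
    """MSE of the line y = beta * x on the data above."""
    return np.mean((y - beta * x) ** 2)

for beta in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"beta = {beta:.1f}  L(beta; Z) = {loss(beta):.3f}")
# The printed values trace out the convex loss curve, lowest near beta = 1.
```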
Intuition on Minimizing MSE Loss
• The MSE loss is convex in $\beta$

[Figure: the loss surface $L(\beta; Z)$ over $(\beta_1, \beta_2)$ is a convex bowl]

27
“Good” Mean Squared Error?
• Need to compare to a baseline!
  – E.g., a constant prediction

• Later: Training vs. test MSE

28
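To make the baseline comparison concrete: the constant prediction that minimizes MSE is the mean label, so a model is only "good" if it beats that. A sketch with made-up numbers:

```python
import numpy as np

y_test = np.array([3.0, 1.0, 4.0, 1.5])   # made-up labels
y_pred = np.array([2.8, 1.2, 3.5, 1.9])   # made-up model predictions

mse_model = np.mean((y_test - y_pred) ** 2)
mse_baseline = np.mean((y_test - y_test.mean()) ** 2)  # constant prediction
print(mse_model, mse_baseline)  # the model is useful only if it beats the baseline
```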
Alternative Loss Functions
• Mean absolute error (MAE):

  $\frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$

• Mean relative error (MRE):

  $\frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right|$

• $R^2$ score: $1 - \frac{\text{MSE}}{\text{Variance}}$
  – "Coefficient of determination"
  – Higher is better; $R^2 = 1$ is perfect

29
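A side-by-side computation of these metrics, with made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, 1.0, 4.0, 1.5])   # made-up labels
y_pred = np.array([2.8, 1.2, 3.5, 1.9])   # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_pred - y_true))
mre = np.mean(np.abs((y_pred - y_true) / y_true))  # undefined if any y_true == 0
r2 = 1 - mse / np.var(y_true)                      # 1 - MSE / Variance
print(f"MSE={mse:.3f}  MAE={mae:.3f}  MRE={mre:.3f}  R^2={r2:.3f}")
```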
Loss Function Comparisons

       Key characteristics              Application scenarios

MSE    • Sensitive to outliers          Large errors are particularly
       • Unit differs from the          undesirable
         original data

MAE    • Less sensitive to outliers     Requires a robust model
       • Unit is the same as the
         original data

MRE    • Relative metric                Suitable for comparing performance
       • Scale-invariant                across different scales or datasets

30
Function Approximation

  Data $Z$ → Machine learning algorithm → Model $f$

• The ML algorithm outputs a model $f$ that best "approximates" the given data $Z$

31
Function Approximation
• Two design decisions
  – What is the family of candidate models $f$? (E.g., linear functions)
  – How to define "approximating"? (E.g., MSE loss)

32
True Function
• Input: Dataset $Z$
  – Presume there is an unknown function $f^*$ that generates $Z$

• Goal: Find an approximation $f_\beta \approx f^*$ in our model family $f_\beta \in F$
  – Typically, $f^*$ is not in our model family $F$

33
Linear Regression

  Data $Z = \{(x_i, y_i)\}_{i=1}^{n}$ → $\hat{\beta}(Z) = \arg\min_\beta L(\beta; Z)$ → Model $f_{\hat{\beta}(Z)}$

• $L$ encodes $y_i \approx f_\beta(x_i)$ (the MSE loss)
• The model is a linear function $f_\beta(x) = \beta^T x$

34
Linear Regression
General strategy:
• Model family: $F = \{f_\beta\}_\beta$
• Loss function: $L(\beta; Z)$

Linear regression strategy:
• Linear functions: $F = \{f_\beta(x) = \beta^T x\}_\beta$
• MSE: $L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$

Linear regression algorithm:

  $\hat{\beta}(Z) = \arg\min_\beta L(\beta; Z)$

35
Motivation: Quadratic Function

[Figures: data generated by a curved (quadratic) function, with linear fits $f_\beta(x) = x/2$ and $f_\beta(x) = x$; neither line fits the data well]

Can we get a better fit?

38
Feature Maps
General strategy:
• Model family: $F = \{f_\beta\}_\beta$
• Loss function: $L(\beta; Z)$

Linear regression with a feature map:
• Linear functions over a given feature map $\phi: x \mapsto \mathbb{R}^d$:
  $F = \{f_\beta(x) = \beta^T \phi(x)\}_\beta$
• MSE: $L(\beta; Z) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^T \phi(x_i))^2$

39
Quadratic Feature Map
• Consider the feature map $\phi: \mathbb{R} \to \mathbb{R}^2$ given by

  $\phi(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$

• Then, the model family is

  $f_\beta(x) = \beta_1 x + \beta_2 x^2$

40
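A sketch of fitting with this feature map: lift each scalar input to $[x, x^2]$, then run ordinary least squares on the lifted features unchanged (synthetic data; the names are my own):

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi: R -> R^2, phi(x) = [x, x^2]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x**2], axis=-1)

# The model f(x) = beta1*x + beta2*x^2 is still *linear in beta*,
# so ordinary least squares applies without modification.
x = np.linspace(0, 3, 50)
y = x**2                                   # data from a quadratic function
beta, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
print(beta)                                # ~ [0., 1.], i.e., f(x) = x^2
```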
Quadratic Feature Map

  $f_\beta(x) = 0 \cdot x + 1 \cdot x^2$

In our family for $\beta = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$!

41
Question
• Why not throw in lots of features?
  – Can fit any $n$ points using a polynomial of degree $n$

  $f_\beta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \cdots$

[Figure: a high-degree polynomial $f_\beta(x)$ passing through every training point]

42
Training vs. Test Data
• Training data: Examples $Z = \{(x, y)\}$ used to fit our model

• Test data: New inputs $x$ whose labels $y$ we want to predict

43
Prediction
• Issue: The goal in machine learning is prediction
  – Given a new input $x$, predict the label $\hat{y} = f_\beta(x)$

[Figure: the high-degree polynomial evaluated on new inputs. The errors on new inputs are very large!]

44
Prediction
• Issue: The goal in machine learning is prediction
  – Given a new input $x$, predict the label $\hat{y} = f_\beta(x)$

[Figure: a linear fit $f_\beta(x)$ on the same data. Vanilla linear regression actually works better!]

45
Overfitting vs. Underfitting
• Overfitting
  – Fit the training data $Z$ well
  – Fit new test data $(x, y)$ poorly

46
Overfitting vs. Underfitting
• Underfitting
  – Fit the training data $Z$ poorly
  – Fit new test data $(x, y)$ poorly

47
Why Does Overfitting Happen?
• Noise in the data
– Noise in labels 𝑦: True data generating process is
complex
– Noise in features 𝑥
• Measurement error in the feature values
• Errors due to preprocessing
• Some features may be irrelevant to the function
• The training data size is too small
– The dataset does not contain enough data samples to
accurately represent all possible input data values.
• The model trains for too long on a single set of
sample data.
48
Training/Test Split
• Issue: How to detect overfitting vs. underfitting?
• Solution: Use held-out test data to estimate the loss on new data
  – Typically, randomly shuffle the data first

  Given data $Z$ → [ Training data | Test data ]

49
Training/Test Split Algorithm
• Step 1: Split $Z$ into $Z_{train}$ and $Z_{test}$

  [ Training data $Z_{train}$ | Test data $Z_{test}$ ]

• Step 2: Run linear regression with $Z_{train}$ to obtain $\hat{\beta}(Z_{train})$

• Step 3: Evaluate
  – Training loss: $L_{train} = L(\hat{\beta}(Z_{train}); Z_{train})$
  – Test (or generalization) loss: $L_{test} = L(\hat{\beta}(Z_{train}); Z_{test})$
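A sketch of the three steps with scikit-learn, on synthetic data (all names besides the library calls are my own):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Step 1: shuffle and split Z into Z_train and Z_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: fit beta_hat(Z_train) on the training data only.
model = LinearRegression().fit(X_train, y_train)

# Step 3: evaluate the training and test (generalization) losses.
L_train = mean_squared_error(y_train, model.predict(X_train))
L_test = mean_squared_error(y_test, model.predict(X_test))
print(f"L_train={L_train:.4f}  L_test={L_test:.4f}")
```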
Training/Test Split Algorithm
• Overfitting
  – Fit the training data $Z$ well: $L_{train}$ is small
  – Fit new test data $(x, y)$ poorly: $L_{test}$ is large

52
Training/Test Split Algorithm
• Underfitting
  – Fit the training data $Z$ poorly: $L_{train}$ is large
  – Fit new test data $(x, y)$ poorly: $L_{test}$ is also large

54
How to Fix Underfitting/Overfitting?
• Choose the right model family!

55
Role of Capacity
• Capacity of a model family captures the
"complexity" of the data it can fit
  – Higher capacity -> more likely to overfit (model family has high variance)
  – Lower capacity -> more likely to underfit (model family has high bias)
• For linear regression, capacity corresponds to
feature dimension

56
Bias-Variance Tradeoff

[Figure: loss vs. capacity. Training loss decreases as capacity grows; test loss first falls, then rises. Low capacity is the underfitting regime, high capacity the overfitting regime, with the ideal in between.]

57
Example of Underfitting/Overfitting
• House price prediction with linear regression
• Data fields:
  1. Total Area: total area of the house, excluding the basement
  2. Bedrooms: the number of bedrooms in the house
  3. Living Area: the living area inside the house
  4. Parking Area: the area of the parking lot
  5. …
• Target prediction value: the transaction price of the houses
58
Example of Underfitting/Overfitting
• Exploratory Data Analysis

59
Example of Underfitting/Overfitting
• Exploratory Data Analysis

60
Example of Underfitting/Overfitting
• Using 'sklearn.preprocessing.MinMaxScaler' for data normalization:

  $x \leftarrow \frac{x - x_{min}}{x_{max} - x_{min}}$

61
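A minimal usage sketch, with made-up feature values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (e.g., areas and bedroom counts on very different scales).
X = np.array([[800.0, 2.0],
              [1200.0, 3.0],
              [2000.0, 5.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # each column mapped to [0, 1] via its min/max
print(X_scaled)
```

In practice one would fit the scaler on the training split only and reuse it on the test split, so no information leaks from the test set into preprocessing.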
Example of Underfitting/Overfitting
• Exploratory Data Analysis

62
Example of Underfitting/Overfitting
• Exploratory Data Analysis

63
Example of Underfitting/Overfitting
• Linear regression uses only the first-order features $x$:

  $y = w^T x + b$

• Polynomial regression uses higher-order combination features $x'$ of $x$:

  $y = w^T x' + b$

  For example, the degree-2 polynomial features of $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ are $x' = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]^T$.

64
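scikit-learn's PolynomialFeatures generates exactly this expansion; a quick sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])           # a single sample with x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)  # includes the bias term 1 by default
x_prime = poly.fit_transform(x)
print(poly.get_feature_names_out())  # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(x_prime)                       # [[1. 2. 3. 4. 6. 9.]]
```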
Example of Underfitting/Overfitting
• We have features $x = (x_1, x_2)$ to predict $y$.
• In linear regression, we have

  $w^* = \arg\min_w \left( y - [w_1, w_2] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - b \right)^2$

• In polynomial regression, if we use the degree-2 polynomial features of $x$:

  $w^* = \arg\min_w \left( y - [w_1, w_2, \ldots, w_6] \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_2^2 \end{bmatrix} - b \right)^2$

65
Example of Underfitting/Overfitting
• Degree-1
• Degree-3
• Degree-6

[Figures: the fitted curve for each polynomial degree]

66
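A sketch of the degree comparison on synthetic 1-D data (a made-up stand-in for the house-price features), fitting degrees 1, 3, and 6 and reporting the train/test MSE:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data from a noisy cubic.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(60, 1))
y = x[:, 0]**3 - x[:, 0] + 0.1 * rng.normal(size=60)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

for degree in [1, 3, 6]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    tr = mean_squared_error(y_train, model.predict(x_train))
    te = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree}: L_train={tr:.4f}  L_test={te:.4f}")
# Degree-1 underfits (both losses high); higher degrees fit the training
# data better, and an overly high degree can overfit (test loss rises).
```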
