
ITCS 6156/8156 Fall 2023

Machine Learning

Linear Regression

Instructor: Hongfei Xue


Email: [email protected]
Class Meeting: Mon & Wed, 4:00 PM – 5:15 PM, CHHS 376

Some content in the slides is based on Dr. Razvan’s lecture


Machine Learning as Optimization
Convexity
Convex Optimization
Gradient Descent

Gradient Descent

[Figure, repeated across five slides: the error surface J(w_1, w_2) plotted over the (w_1, w_2) plane, with successive gradient descent steps moving toward the minimum]


Taylor Expansion
Gradient Descent


Gradient Descent

• The key operation in the above update step is the calculation of each partial derivative.


Gradient Descent

• The final weight update rule:

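As a concrete illustration, here is a minimal batch gradient descent sketch for linear regression, assuming the mean-squared-error J(w) = (1/2N) Σ (h_w(x_n) − y_n)^2 used later in these slides; the function name, learning rate eta, and iteration count are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize J(w) = 1/(2N) * sum((Xw - y)^2) by batch gradient descent."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / N   # vector of partial derivatives dJ/dw_j
        w -= eta * grad                # update step: w <- w - eta * grad
    return w

# toy data: y = 1 + 2x, with a constant bias column prepended
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2 * np.arange(5.0) + 1
w = batch_gradient_descent(X, y)
# w converges to approximately [1, 2]
```

Each iteration uses the whole training set to compute the gradient, which is exactly the "batch" behavior the later SGD slides contrast against.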

Issues with Gradient Descent

• Issues with Gradient Descent:

  • Slow convergence
  • Getting stuck in local minima

• Note that the second issue does not arise for a convex
  problem, since the error surface has only one global minimum.

• More efficient algorithms exist for batch optimization,
  including Conjugate Gradient and quasi-Newton methods.
  Another approach is to consider training examples in an
  online or incremental fashion, resulting in an online
  algorithm called Stochastic Gradient Descent.
Stochastic Gradient Descent (SGD)
• Update weights after every (or a small subset of) training
example(s).

• Why SGD?
Stochastic Gradient Descent (SGD)

[Update equations: the sum in the gradient runs over 1 or K (a small number) training examples]
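A minimal SGD sketch matching the slide's "update after every (or a small subset of) training example(s)": batch_size = 1 gives pure SGD, batch_size = K gives mini-batch updates. All names and hyperparameters here are illustrative:

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.1, epochs=200, batch_size=1, seed=0):
    """Update w after each mini-batch of 1 or K examples (reshuffled each epoch)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(N)               # visit examples in random order
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= eta * grad                      # one update per mini-batch
    return w

# noise-free toy data: y = 0.5 + 3x, with a bias column
x = np.linspace(0.0, 1.0, 50)
X = np.c_[np.ones_like(x), x]
y = 0.5 + 3.0 * x
w = sgd_linear_regression(X, y)
```

The frequent, cheap updates are why SGD scales to large datasets, at the cost of noisier steps than batch gradient descent.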
Polynomial Basis Functions

• Q: What if the raw feature is insufficient for good performance?

  • Example: non-linear dependency between label and raw feature.

• A: Engineer / learn higher-level features, as functions of the raw feature.

• Polynomial curve fitting:

  - Add new features, as polynomials of the original feature.
Regression: Curve Fitting

[Figure: target function f over the (x, y) plane]

Regression: Curve Fitting

[Figure: learned curve h and target function f over the (x, y) data]

• Training: build a function h(x) based on (noisy) training examples
  (x_1, y_1), (x_2, y_2), ⋯ , (x_N, y_N).
Regression: Curve Fitting

[Figure: learned curve h and target function f over the (x, y) data]

• Testing: for an arbitrary (unseen) instance x ∈ X, compute the output
  h(x); we want it to be close to f(x).
Regression: Polynomial Curve Fitting

h(x) = h(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j

(the w_j are parameters; the x^j are features)
Polynomial Curve Fitting
• Parametric model:

  h(x) = h(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j

• Polynomial curve fitting is (Multiple) Linear Regression:

  x = [1, x, x^2, ⋯ , x^M]^T
  h(x) = h(x, w) = h_w(x) = w^T x

• Learning = minimize the Sum-of-Squares error function:

  ŵ = argmin_w J(w),   where J(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − y_n)^2

• Least Squares Estimate:

  ŵ = (X^T X)^{−1} X^T y
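This pipeline can be sketched directly: build the polynomial design matrix with rows [1, x, x^2, …, x^M] and apply the Least Squares Estimate ŵ = (X^T X)^{-1} X^T y (solved with np.linalg.solve rather than an explicit inverse; function names are mine):

```python
import numpy as np

def poly_design(x, M):
    """Design matrix with rows [1, x, x^2, ..., x^M]: polynomial features."""
    return np.vander(x, M + 1, increasing=True)

def least_squares(X, y):
    """Least Squares Estimate w = (X^T X)^{-1} X^T y, via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                       # exactly linear data, so M = 1 fits it
w = least_squares(poly_design(x, 1), y)
# w recovers [w_0, w_1] = [1, 2]
```

Solving the normal equations with a linear solve is numerically preferable to forming (X^T X)^{-1} explicitly.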
Polynomial Curve Fitting

• Generalization = how well the parameterized h(x, w) performs on
  arbitrary (unseen) test instances x ∈ X.

• Generalization performance depends on the value of M.

0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial

• Which M to pick? Why?


• Follow the wisdom of a philosopher.
Occam’s Razor

William of Occam (1288 – 1348)


English Franciscan friar, theologian and
philosopher.

“Entia non sunt multiplicanda praeter necessitatem”


• Entities must not be multiplied beyond necessity.

i.e. Do not make things needlessly complicated.


i.e. Prefer the simplest hypothesis that fits the data.
Polynomial Curve Fitting

• Model Selection: choosing the order M of the polynomial.

  - Best generalization obtained with M = 3.
  - M = 9 obtains poor generalization, even though it fits the
    training examples perfectly:
    • But M = 9 polynomials subsume M = 3 polynomials!

• Overfitting ≡ good performance on training examples, poor
  performance on test examples.
Over-fitting and Parameter Values
Overfitting
• Measure fit using the Root-Mean-Square (RMS) error:

  E_RMS(w) = sqrt( Σ_n (w^T x_n − t_n)^2 / N )

• Use 100 random test examples, generated in the same way:
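The RMS error is a direct transcription of the formula above (variable names are mine):

```python
import numpy as np

def rms_error(w, X, t):
    """E_RMS(w) = sqrt( sum_n (w^T x_n - t_n)^2 / N )."""
    return np.sqrt(np.mean((X @ w - t) ** 2))

w = np.array([1.0, 2.0])
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
t = np.array([0.0, 3.0])
err = rms_error(w, X, t)   # predictions [1, 3], residuals [1, 0] -> sqrt(1/2)
```

Dividing by N and taking the square root puts the error back on the same scale as the targets t, which makes training and test curves for different N comparable.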
Overfitting vs. Data Set Size

• More training data ⟹ less overfitting

• What if we do not have more training data?


- Use regularization
Regularization

• Penalize large parameter values:

  E(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − t_n)^2 + (λ/2) ‖w‖^2

  (the second term is the regularizer)

  w* = argmin_w E(w)
Ridge Regression

• Multiple linear regression with L2 regularization:

  J(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − t_n)^2 + (λ/2) ‖w‖^2

  ŵ = argmin_w J(w)

• Solution is ŵ = (λN I + X^T X)^{−1} X^T t

  - Prove it.
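The closed-form ridge solution ŵ = (λN I + X^T X)^{-1} X^T t transcribes into a few lines (the function name is mine; λ = 0 recovers plain least squares):

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Ridge estimate w = (lam * N * I + X^T X)^{-1} X^T t, via a linear solve."""
    N, d = X.shape
    return np.linalg.solve(lam * N * np.eye(d) + X.T @ X, X.T @ t)

X = np.c_[np.ones(4), np.arange(4.0)]
t = 1.0 + 2.0 * np.arange(4.0)
w_ls = ridge_fit(X, t, 0.0)     # lam = 0: ordinary least squares, w = [1, 2]
w_reg = ridge_fit(X, t, 10.0)   # larger lam shrinks the weights toward 0
```

Note the λN I term also makes the matrix invertible even when X^T X is singular, which is one practical benefit of ridge regression.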
9th Order Polynomial with Regularization
9th Order Polynomial with Regularization
Training & Test error vs. ln 𝜆

How do we find the optimal value of 𝜆?


Model Selection

• Put aside an independent validation set.


• Select parameters giving best performance on validation set.

ln λ ∈ {−40, −35, −30, −25, −20, −15}


K-fold Cross-Validation

Source: https://scikit-learn.org/stable/modules/cross_validation.html
K-fold Cross-Validation

• Split the training data into K folds and try a wide range of
  tuning parameter values:
  - split the data into K folds of roughly equal size
  - iterate over a set of values for λ
    • iterate over k = 1, 2, ⋯ , K
      - use all folds except k for training
      - validate (calculate the test error) on the k-th fold
    • error[λ] = average error over the K folds
  - choose the value of λ that gives the smallest error.
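This selection loop can be sketched as follows, using the closed-form ridge fit inside each fold; the structure and names are mine, and validation error is mean squared error on the held-out fold:

```python
import numpy as np

def kfold_select_lambda(X, t, lambdas, K=5, seed=0):
    """Return the lambda with the smallest average validation error over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), K)  # K folds of roughly equal size
    d = X.shape[1]
    best_lam, best_err = None, np.inf
    for lam in lambdas:                                 # iterate over values for lambda
        fold_errs = []
        for k in range(K):                              # iterate over k = 1..K
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(K) if j != k])
            # ridge fit on all folds except k
            w = np.linalg.solve(lam * len(trn) * np.eye(d) + X[trn].T @ X[trn],
                                X[trn].T @ t[trn])
            fold_errs.append(np.mean((X[val] @ w - t[val]) ** 2))
        err = np.mean(fold_errs)                        # error[lambda]
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 40)
X = np.vander(x, 4, increasing=True)                    # cubic polynomial features
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=40)
lam = kfold_select_lambda(X, t, [1e-6, 1e-3, 1.0])
```

After selecting λ this way, the usual practice is to refit on all of the training data with the chosen value.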
Regularization: Ridge vs. Lasso

• Ridge regression:

  J(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − t_n)^2 + (λ/2) Σ_{j=1}^{M} w_j^2

• Lasso:

  J(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − t_n)^2 + (λ/2) Σ_{j=1}^{M} |w_j|

  - if λ is sufficiently large, some of the coefficients w_j are driven to
    0 ⟹ sparse model
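The sparsity effect is easiest to see through the soft-thresholding operator, which is the building block of coordinate-descent lasso solvers; this is an illustration of why coefficients hit exactly zero, not a full solver:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*|w|: shrink z toward 0; any |z| <= tau becomes exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

w = soft_threshold(np.array([3.0, 0.4, -2.0, -0.1]), 0.5)
# -> [2.5, 0.0, -1.5, 0.0]: small coefficients are driven exactly to zero
```

The quadratic ridge penalty, by contrast, only scales coefficients down and never zeroes them out, which is why ridge solutions are dense.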
Regularization: Ridge vs. Lasso

Plot of the contours of the unregularized error function (blue) along with the
constraint region (3.30) for the quadratic regularizer q = 2 on the left and the lasso
regularizer q = 1 on the right, in which the optimum value for the parameter vector
w is denoted by w*. The lasso gives a sparse solution in which w_1* = 0.
Regularization

• Parameter norm penalties (term in the objective).


• Limit parameter norm (constraint).
• Dataset augmentation.
• Dropout.
• Ensembles.
• Semi-supervised learning.
• Early stopping.
• Noise robustness.
• Sparse representations.
• Adversarial training.
Questions?
