
ITCS 6156/8156 Fall 2023

Machine Learning

Linear Regression

Instructor: Hongfei Xue


Email: [email protected]
Class Meeting: Mon & Wed, 4:00 PM – 5:15 PM, CHHS 376

Some content in the slides is based on Dr. Razvan’s lecture


Machine Learning as Optimization
Convexity
Convex Optimization
Gradient Descent

Gradient Descent

[Figure, repeated across five slides: the error surface J(w_1, w_2) plotted over the (w_1, w_2) plane, with successive gradient descent steps moving toward the minimum]


Taylor Expansion
Gradient Descent


Gradient Descent

• The key operation in the above update step is the calculation of each partial derivative.


Gradient Descent

• The final weight update rule:

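As a concrete illustration, here is a minimal batch gradient descent sketch for linear regression, assuming the mean-squared-error J(w) = (1/2N) Σ (h_w(x_n) − y_n)^2 used later in these slides; the function name, learning rate eta, and iteration count are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize J(w) = 1/(2N) * sum((Xw - y)^2) by batch gradient descent."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / N   # vector of partial derivatives dJ/dw_j
        w -= eta * grad                # update step: w <- w - eta * grad
    return w

# toy data: y = 1 + 2x, with a constant bias column prepended
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2 * np.arange(5.0) + 1
w = batch_gradient_descent(X, y)
# w converges to approximately [1, 2]
```

Each iteration uses the whole training set to compute the gradient, which is exactly the "batch" behavior the later SGD slides contrast against.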

Issues with Gradient Descent

• Issues with Gradient Descent:

  • Slow convergence
  • Getting stuck in local minima

• Note that the second issue does not arise for a convex
  problem, since the error surface has only one global minimum.

• More efficient algorithms exist for batch optimization,
  including Conjugate Gradient and quasi-Newton methods.
  Another approach is to consider training examples in an
  online or incremental fashion, resulting in an online
  algorithm called Stochastic Gradient Descent.
Stochastic Gradient Descent (SGD)
• Update weights after every (or a small subset of) training
example(s).

• Why SGD?
Stochastic Gradient Descent (SGD)

[Update equations: the sum in the gradient runs over 1 or K (a small number) training examples]
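A minimal SGD sketch matching the slide's "update after every (or a small subset of) training example(s)": batch_size = 1 gives pure SGD, batch_size = K gives mini-batch updates. All names and hyperparameters here are illustrative:

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.1, epochs=200, batch_size=1, seed=0):
    """Update w after each mini-batch of 1 or K examples (reshuffled each epoch)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(N)               # visit examples in random order
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= eta * grad                      # one update per mini-batch
    return w

# noise-free toy data: y = 0.5 + 3x, with a bias column
x = np.linspace(0.0, 1.0, 50)
X = np.c_[np.ones_like(x), x]
y = 0.5 + 3.0 * x
w = sgd_linear_regression(X, y)
```

The frequent, cheap updates are why SGD scales to large datasets, at the cost of noisier steps than batch gradient descent.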
Polynomial Basis Functions

• Q: What if the raw feature is insufficient for good performance?

  • Example: non-linear dependency between label and raw feature.

• A: Engineer / learn higher-level features, as functions of the raw feature.

• Polynomial curve fitting:

  - Add new features, as polynomials of the original feature.
Regression: Curve Fitting

[Figure: target function f over the (x, y) plane]

Regression: Curve Fitting

[Figure: learned curve h and target function f over the (x, y) data]

• Training: build a function h(x) based on (noisy) training examples
  (x_1, y_1), (x_2, y_2), ⋯ , (x_N, y_N).
Regression: Curve Fitting

[Figure: learned curve h and target function f over the (x, y) data]

• Testing: for an arbitrary (unseen) instance x ∈ X, compute the output
  h(x); we want it to be close to f(x).
Regression: Polynomial Curve Fitting

h(x) = h(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j

(the w_j are parameters; the x^j are features)
Polynomial Curve Fitting
• Parametric model:

  h(x) = h(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j

• Polynomial curve fitting is (Multiple) Linear Regression:

  x = [1, x, x^2, ⋯ , x^M]^T
  h(x) = h(x, w) = h_w(x) = w^T x

• Learning = minimize the Sum-of-Squares error function:

  ŵ = argmin_w J(w),   where J(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − y_n)^2

• Least Squares Estimate:

  ŵ = (X^T X)^{−1} X^T y
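This pipeline can be sketched directly: build the polynomial design matrix with rows [1, x, x^2, …, x^M] and apply the Least Squares Estimate ŵ = (X^T X)^{-1} X^T y (solved with np.linalg.solve rather than an explicit inverse; function names are mine):

```python
import numpy as np

def poly_design(x, M):
    """Design matrix with rows [1, x, x^2, ..., x^M]: polynomial features."""
    return np.vander(x, M + 1, increasing=True)

def least_squares(X, y):
    """Least Squares Estimate w = (X^T X)^{-1} X^T y, via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                       # exactly linear data, so M = 1 fits it
w = least_squares(poly_design(x, 1), y)
# w recovers [w_0, w_1] = [1, 2]
```

Solving the normal equations with a linear solve is numerically preferable to forming (X^T X)^{-1} explicitly.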
Polynomial Curve Fitting

• Generalization = how well the parameterized h(x, w) performs on
  arbitrary (unseen) test instances x ∈ X.

• Generalization performance depends on the value of M.

0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial

• Which M to pick? Why?


• Follow the wisdom of a philosopher.
Occam’s Razor

William of Occam (1288 – 1348)


English Franciscan friar, theologian and
philosopher.

“Entia non sunt multiplicanda praeter necessitatem”


• Entities must not be multiplied beyond necessity.

i.e. Do not make things needlessly complicated.


i.e. Prefer the simplest hypothesis that fits the data.
Polynomial Curve Fitting

• Model Selection: choosing the order M of the polynomial.

  - Best generalization obtained with M = 3.
  - M = 9 obtains poor generalization, even though it fits the
    training examples perfectly:
    • But M = 9 polynomials subsume M = 3 polynomials!

• Overfitting ≡ good performance on training examples, poor
  performance on test examples.
Over-fitting and Parameter Values
Overfitting
• Measure fit using the Root-Mean-Square (RMS) error:

  E_RMS(w) = sqrt( Σ_n (w^T x_n − t_n)^2 / N )

• Use 100 random test examples, generated in the same way:
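The RMS error is a direct transcription of the formula above (variable names are mine):

```python
import numpy as np

def rms_error(w, X, t):
    """E_RMS(w) = sqrt( sum_n (w^T x_n - t_n)^2 / N )."""
    return np.sqrt(np.mean((X @ w - t) ** 2))

w = np.array([1.0, 2.0])
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
t = np.array([0.0, 3.0])
err = rms_error(w, X, t)   # predictions [1, 3], residuals [1, 0] -> sqrt(1/2)
```

Dividing by N and taking the square root puts the error back on the same scale as the targets t, which makes training and test curves for different N comparable.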
Overfitting vs. Data Set Size

• More training data ⟹ less overfitting

• What if we do not have more training data?


- Use regularization
Regularization

• Penalize large parameter values:

  E(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − t_n)^2 + (λ/2) ‖w‖^2

  (the second term is the regularizer)

  w* = argmin_w E(w)
Ridge Regression

• Multiple linear regression with L2 regularization:

  J(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − t_n)^2 + (λ/2) ‖w‖^2

  ŵ = argmin_w J(w)

• Solution is ŵ = (λN I + X^T X)^{−1} X^T t

  - Prove it.
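The closed-form ridge solution ŵ = (λN I + X^T X)^{-1} X^T t transcribes into a few lines (the function name is mine; λ = 0 recovers plain least squares):

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Ridge estimate w = (lam * N * I + X^T X)^{-1} X^T t, via a linear solve."""
    N, d = X.shape
    return np.linalg.solve(lam * N * np.eye(d) + X.T @ X, X.T @ t)

X = np.c_[np.ones(4), np.arange(4.0)]
t = 1.0 + 2.0 * np.arange(4.0)
w_ls = ridge_fit(X, t, 0.0)     # lam = 0: ordinary least squares, w = [1, 2]
w_reg = ridge_fit(X, t, 10.0)   # larger lam shrinks the weights toward 0
```

Note the λN I term also makes the matrix invertible even when X^T X is singular, which is one practical benefit of ridge regression.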
9th Order Polynomial with Regularization
9th Order Polynomial with Regularization
Training & Test error vs. ln 𝜆

How do we find the optimal value of 𝜆?


Model Selection

• Put aside an independent validation set.


• Select parameters giving best performance on validation set.

ln λ ∈ {−40, −35, −30, −25, −20, −15}


K-fold Cross-Validation

Source: https://scikit-learn.org/stable/modules/cross_validation.html
K-fold Cross-Validation

• Split the training data into K folds and try a wide range of
  tuning parameter values:
  - split the data into K folds of roughly equal size
  - iterate over a set of values for λ
    • iterate over k = 1, 2, ⋯ , K
      - use all folds except k for training
      - validate (calculate the test error) on the k-th fold
    • error[λ] = average error over the K folds
  - choose the value of λ that gives the smallest error.
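This selection loop can be sketched as follows, using the closed-form ridge fit inside each fold; the structure and names are mine, and validation error is mean squared error on the held-out fold:

```python
import numpy as np

def kfold_select_lambda(X, t, lambdas, K=5, seed=0):
    """Return the lambda with the smallest average validation error over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), K)  # K folds of roughly equal size
    d = X.shape[1]
    best_lam, best_err = None, np.inf
    for lam in lambdas:                                 # iterate over values for lambda
        fold_errs = []
        for k in range(K):                              # iterate over k = 1..K
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(K) if j != k])
            # ridge fit on all folds except k
            w = np.linalg.solve(lam * len(trn) * np.eye(d) + X[trn].T @ X[trn],
                                X[trn].T @ t[trn])
            fold_errs.append(np.mean((X[val] @ w - t[val]) ** 2))
        err = np.mean(fold_errs)                        # error[lambda]
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 40)
X = np.vander(x, 4, increasing=True)                    # cubic polynomial features
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=40)
lam = kfold_select_lambda(X, t, [1e-6, 1e-3, 1.0])
```

After selecting λ this way, the usual practice is to refit on all of the training data with the chosen value.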
Regularization: Ridge vs. Lasso

• Ridge regression:

  J(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − t_n)^2 + (λ/2) Σ_{j=1}^{M} w_j^2

• Lasso:

  J(w) = (1/2N) Σ_{n=1}^{N} (h_w(x_n) − t_n)^2 + (λ/2) Σ_{j=1}^{M} |w_j|

  - if λ is sufficiently large, some of the coefficients w_j are driven to
    0 ⟹ sparse model
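The sparsity effect is easiest to see through the soft-thresholding operator, which is the building block of coordinate-descent lasso solvers; this is an illustration of why coefficients hit exactly zero, not a full solver:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*|w|: shrink z toward 0; any |z| <= tau becomes exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

w = soft_threshold(np.array([3.0, 0.4, -2.0, -0.1]), 0.5)
# -> [2.5, 0.0, -1.5, 0.0]: small coefficients are driven exactly to zero
```

The quadratic ridge penalty, by contrast, only scales coefficients down and never zeroes them out, which is why ridge solutions are dense.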
Regularization: Ridge vs. Lasso

Plot of the contours of the unregularized error function (blue) along with the
constraint region (3.30) for the quadratic regularizer q = 2 on the left and the lasso
regularizer q = 1 on the right, in which the optimum value for the parameter vector
w is denoted by w*. The lasso gives a sparse solution in which w_1* = 0.
Regularization

• Parameter norm penalties (term in the objective).


• Limit parameter norm (constraint).
• Dataset augmentation.
• Dropout.
• Ensembles.
• Semi-supervised learning.
• Early stopping.
• Noise robustness.
• Sparse representations.
• Adversarial training.
Questions?
