
https://introml.mit.edu/

Intro to Machine Learning


Lecture 2: Linear regression and regularization

Shen Shen
Feb 9, 2024

(many slides adapted from Tamara Broderick)


Logistical issues? Personal concerns?
We’d love to help out at
[email protected]
Logistics
The 11am Sections 3 and 4 are completely full and we have many
requests to switch; the physical space is packed.
If at all possible, please help us by signing up for or switching to other slots.

OHs start this Sunday; please also join our Piazza.


Thanks for all the assignment feedback. We are adapting on the
go, but it certainly benefits future semesters.
Assignments are starting to come due now (first up, Exercises 2; keep an
eye on the "due" dates).
https://shenshen.mit.edu/demos/gifs/atlas_darpa_overall.gif

Optimization + first-principles physics


https://www.youtube.com/embed/fn3KWM1kuAw?enablejsapi=1
Outline
Recap of last (content) week.
Ordinary least-squares regression
Analytical solution (when it exists)
Cases when analytical solutions don't exist
Practically, visually, mathematically
Regularization
Hyper-parameter, cross-validation
$$\theta^* = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{Y}$$

When it exists, $\theta^* = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{Y}$ is guaranteed to be the unique minimizer of the OLS objective.

Now, the catch: $\theta^*$ may not be well-defined.

$\theta^* = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{Y}$ is not well-defined if $(\tilde{X}^\top \tilde{X})$ is not invertible.
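To make the closed form concrete, here is a minimal numpy sketch (the toy data and variable names are illustrative, not from the slides):

```python
import numpy as np

# toy data: n = 5 points, d = 2 features (X already includes the offset column)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
Y = np.array([[1.1], [1.9], [3.2], [3.9], [5.1]])

# theta* = (X^T X)^{-1} X^T Y; solving the linear system is preferred
# over explicitly forming the inverse
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_star)   # approximately [[0.04], [1.0]]
```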

Indeed, it's possible that $(\tilde{X}^\top \tilde{X})$ is not invertible.

In particular, $(\tilde{X}^\top \tilde{X})$ is not invertible if and only if $\tilde{X}$ is not full column rank.

So the catch: $\theta^* = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{Y}$ is not well-defined if $\tilde{X}$ is not full column rank.

Recall: indeed, $\tilde{X}$ is not full column rank

1. if $n < d$ (i.e., not enough data), or
2. if columns (features) in $\tilde{X}$ have a linear dependency (i.e., so-called collinearity).

In both cases, $\theta^* = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{Y}$ is not defined.
Both cases do happen in practice.

In both cases, the loss function is a "half-pipe": it is flat along some directions at the bottom, so there are infinitely many optimal hypotheses.

Side note: sometimes noise can resolve the invertibility issue, but this is undesirable.
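A quick numpy illustration of both failure modes (the matrices here are invented toy examples):

```python
import numpy as np

# Case 1: n < d (2 data points, 3 features) -- X cannot be full column rank
X1 = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])
print(np.linalg.matrix_rank(X1.T @ X1))  # 2 < 3: X1^T X1 is singular

# Case 2: collinear features (third column = first column + second column)
X2 = np.array([[1.0, 2.0,  3.0],
               [4.0, 5.0,  9.0],
               [7.0, 8.0, 15.0],
               [2.0, 1.0,  3.0]])
print(np.linalg.matrix_rank(X2.T @ X2))  # 2 < 3: singular again

# In either case (X^T X)^{-1} does not exist, so the closed-form
# theta* = (X^T X)^{-1} X^T Y is not well-defined.
```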
Outline
Recap of last (content) week.
Ordinary least-squares regression
Analytical solution (when it exists)
Cases when analytical solutions don't exist
Practically, visually, mathematically
Regularization
Hyper-parameter, cross-validation
Regularization

Ridge Regression Regularization

λ is a hyper-parameter
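The ridge objective itself appears only as images in the slides; as a reference point, a standard formulation (an assumption about the exact scaling convention) is $J_{\text{ridge}}(\theta) = \frac{1}{n}\|\tilde{X}\theta - \tilde{Y}\|^2 + \lambda\|\theta\|^2$, whose minimizer is $\theta^*_{\text{ridge}} = (\tilde{X}^\top \tilde{X} + n\lambda I)^{-1}\tilde{X}^\top \tilde{Y}$. A minimal numpy sketch under that assumption:

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Closed-form minimizer of (1/n)*||X @ theta - Y||^2 + lam*||theta||^2."""
    n, d = X.shape
    # (X^T X + n*lam*I) is invertible for any lam > 0,
    # even when X is not full column rank.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# tiny example: collinear features (second column = 2 * first),
# yet ridge still returns a unique, finite solution
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Y = np.array([[1.0], [2.0], [3.0]])
print(ridge_closed_form(X, Y, lam=0.1))
```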
Outline
Recap of last (content) week.
Ordinary least-squares regression
Analytical solution (when it exists)
Cases when analytical solutions don't exist
Practically, visually, mathematically
Regularization
Hyper-parameter, cross-validation
Cross-validation
Comments about cross-validation

It is a good idea to shuffle the data first.

It is a way to "reuse" data.

It evaluates not a single hypothesis, but rather a learning algorithm (e.g., a hypothesis class or a hyper-parameter setting); see the sketch after this list.

One could, e.g., have an outer loop for picking a good hyper-parameter or hypothesis class.
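A minimal sketch of k-fold cross-validation with an outer loop over hyper-parameters (the toy data and the helper names ridge_fit, mse, cross_validate are illustrative, not from the course code; the ridge solver follows the closed form above):

```python
import numpy as np

def ridge_fit(X, Y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

def mse(X, Y, theta):
    return float(np.mean((X @ theta - Y) ** 2))

def cross_validate(X, Y, lam, k=5, seed=0):
    """Average validation MSE of ridge with this lam over k folds."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)   # shuffle data first
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = ridge_fit(X[train], Y[train], lam)
        scores.append(mse(X[val], Y[val], theta))
    return float(np.mean(scores))

# outer loop over candidate hyper-parameters
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
Y = X @ np.array([[1.0], [-2.0], [0.5]]) + 0.1 * rng.normal(size=(40, 1))
for lam in [0.0, 0.01, 0.1, 1.0]:
    print(lam, cross_validate(X, Y, lam))
```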
Summary
One strategy for finding ML algorithms is to reduce the ML
problem to an optimization problem.
For ordinary least squares (OLS), we can find the optimizer
analytically, using basic calculus! Take the gradient and set it to
zero, as sketched below. (In general we need more than gradient information; it suffices for OLS.)
Two ways to approach the calculus problem: write it out in terms of
explicit sums, or keep it in vector-matrix form. The vector-matrix form is
easier to manage as things get complicated (and they will!). There
are some good discussions in the lecture notes.
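For concreteness, a compact version of that calculation in vector-matrix form (written here with the mean-squared-error scaling $\frac{1}{n}$; any positive scaling gives the same minimizer):

$$
\nabla_\theta \, \tfrac{1}{n}\|\tilde{X}\theta - \tilde{Y}\|^2
= \tfrac{2}{n}\,\tilde{X}^\top(\tilde{X}\theta - \tilde{Y}) = 0
\;\Longrightarrow\;
\tilde{X}^\top\tilde{X}\,\theta = \tilde{X}^\top\tilde{Y}
\;\Longrightarrow\;
\theta^* = (\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top\tilde{Y},
$$

assuming $\tilde{X}^\top\tilde{X}$ is invertible.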
Summary
What it means for the problem to be well-posed.
When there are many possible solutions, we need to indicate our
preference somehow.
Regularization is a way to construct a new optimization problem that encodes that preference.
Least-squares regularization leads to the ridge-regression formulation.
Good news: we can still solve it analytically!
Hyper-parameters and how to pick them: cross-validation.
We'd love it for you to share some lecture feedback.

Thanks!
