CS550 Lec2
Lecture 2: Regression
Once learned (using some optimization technique), these weight vectors (one for each class) can sometimes have nice interpretations, especially when the inputs are images.

[Figure: the learned weight vectors of the 4 classes, $\mathbf{w}_{car}$, $\mathbf{w}_{frog}$, $\mathbf{w}_{horse}$, $\mathbf{w}_{cat}$, visualized as images. Each one kind of looks like a "template" of what images from that class should look like.]

These images sort of look like the class prototypes we would get if we were using LwP. Yeah, "sort of". That's why the dot product of each of these weight vectors with an image from the correct class is expected to be the largest, and no wonder LwP (with Euclidean distances) acts like a linear model. A small sketch of this dot-product prediction rule is given below.
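As a small illustration of that dot-product rule (not from the slides; the class names come from the figure, and the 32x32x3 image size is an assumption), prediction with one weight vector per class just picks the class whose weight vector has the largest dot product with the input:

```python
import numpy as np

classes = ["car", "frog", "horse", "cat"]   # the 4 classes shown in the figure
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3072))              # one weight vector ("template") per class; 3072 = 32*32*3 (assumed)
x = rng.normal(size=3072)                   # a flattened input image

scores = W @ x                              # dot product of each class's weight vector with the input
print(classes[int(np.argmax(scores))])      # predict the class with the largest dot product
```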
Simple Linear Models as Building Blocks

Linear models are building blocks for dimensionality reduction methods like PCA. PCA looks very similar to the multi-output linear model, except that the values of the latent features are not known and have to be learned.

Linear models are building blocks even for deep learning models: each layer is like a multi-output linear model, followed by a nonlinearity (a minimal sketch of such a layer follows).
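Here is a minimal NumPy sketch of one such layer (not from the slides; the layer sizes and the choice of ReLU nonlinearity are assumptions): a multi-output linear model followed by an elementwise nonlinearity.

```python
import numpy as np

def layer(X, W, b):
    """One deep-net layer: a multi-output linear model followed by a nonlinearity (ReLU here)."""
    Z = X @ W + b            # linear part: each column of W is the weight vector of one output
    return np.maximum(Z, 0)  # elementwise nonlinearity

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # 5 inputs with 10 features each (assumed sizes)
W = rng.normal(size=(10, 4))   # 4 outputs, i.e. 4 weight vectors
b = np.zeros(4)
H = layer(X, W, b)             # hidden representation, shape (5, 4)
```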
Normal Equations

Let $\hat{\mathbf{x}}$ be the least-squares solution of $A\mathbf{x} = \mathbf{b}$. Since the residual $(\mathbf{b} - A\hat{\mathbf{x}})$ is perpendicular to all vectors $A\mathbf{x}$ in the column space of $A$, we have $A^\top(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}$.

Normal equation for solving $\hat{\mathbf{x}}$: $A^\top A\,\hat{\mathbf{x}} = A^\top \mathbf{b}$
Least-squares solution to $A\mathbf{x} = \mathbf{b}$: $\hat{\mathbf{x}} = (A^\top A)^{-1} A^\top \mathbf{b}$
Projection of $\mathbf{b}$ onto Col($A$): $\mathbf{p} = A\hat{\mathbf{x}} = A(A^\top A)^{-1} A^\top \mathbf{b}$
Projection matrix that multiplies $\mathbf{b}$ to give $\mathbf{p}$: $P = A(A^\top A)^{-1} A^\top$
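A small NumPy sketch (an illustration, not from the slides; the matrix sizes are assumptions) that forms these quantities and checks the perpendicularity property:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))   # tall matrix with independent columns (assumed)
b = rng.normal(size=8)

# Least-squares solution from the normal equation A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection of b onto Col(A) and the projection matrix
p = A @ x_hat
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(A.T @ (b - p), 0))   # residual is perpendicular to the column space
print(np.allclose(P @ b, p))           # P maps b to its projection p
```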
Pseudo Inverse Method

If $A$ has independent columns: $A^{+} = (A^\top A)^{-1} A^\top$ (so $A^{+}A = I$)
If $A$ has independent rows: $A^{+} = A^\top (A A^\top)^{-1}$ (so $A A^{+} = I$)

The least-squares solution can then be written as $\hat{\mathbf{x}} = A^{+}\mathbf{b}$.

The pseudo-inverse can be computed using the SVD $A = U \Sigma V^\top$: $A^{+} = V \Sigma^{+} U^\top$, where $\Sigma^{+}$ contains the inverse of all non-zero diagonal elements of $\Sigma$.
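A minimal sketch (matrix sizes assumed) of computing the pseudo-inverse from the SVD and checking it against NumPy's built-in pinv:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)

# Thin SVD: A = U S V^T, then A^+ = V S^+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_plus = Vt.T @ np.diag(1.0 / s) @ U.T   # invert the non-zero singular values

print(np.allclose(A_plus, np.linalg.pinv(A)))                          # matches np.linalg.pinv
print(np.allclose(A_plus @ b, np.linalg.lstsq(A, b, rcond=None)[0]))   # gives the least-squares solution
```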
QR (Gram-Schmidt) Method

Decompose $A = QR$, where $Q$ is an orthogonal matrix ($Q^\top Q = I$) and $R$ is an upper-triangular matrix.

Then $A^\top A = R^\top Q^\top Q R = R^\top R$, and the normal equation $A^\top A\,\hat{\mathbf{x}} = A^\top \mathbf{b}$ can be solved as $R\,\hat{\mathbf{x}} = Q^\top \mathbf{b}$ (by back-substitution, since $R$ is triangular).
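A minimal sketch (matrix sizes assumed) of solving the least-squares problem via the thin QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)

# Thin QR: A = Q R with Q^T Q = I and R upper triangular
Q, R = np.linalg.qr(A)

# The normal equation A^T A x = A^T b reduces to R x = Q^T b;
# in practice this is solved by back-substitution since R is triangular
x_hat = np.linalg.solve(R, Q.T @ b)

print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))
```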
Goal: Learn a model to predict the output for new test inputs.

Let's write the total error or "loss" of this model over the training data as
$L(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2$
where each term $(y_n - \mathbf{w}^\top \mathbf{x}_n)^2$ measures the prediction error or "loss" or "deviation" of the model on a single training input.

The goal of learning is to find the $\mathbf{w}$ that minimizes this loss and also does well on test data. Unlike models like KNN and DT, here we have an explicit problem-specific objective (loss function) that we wish to optimize.
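As a small illustration (synthetic data; the sizes and the noise level are assumptions), the squared loss can be computed directly from its definition:

```python
import numpy as np

def squared_loss(w, X, y):
    """Total loss L(w) = sum_n (y_n - w^T x_n)^2 over the training data."""
    residuals = y - X @ w
    return np.sum(residuals ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # N=50 training inputs, D=3 features (assumed)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)    # noisy outputs

print(squared_loss(w_true, X, y))             # small, since w_true generated the data
print(squared_loss(np.zeros(3), X, y))        # much larger for a poor w
```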
Linear Regression: Pictorially
Linear regression is like fitting a line or (hyper)plane to a set of points [ ] 𝑧1
𝑧2
= 𝜙( 𝑥 )
(Output )
curve or curved surface?
(Output )
Let us find the $\mathbf{w}$ that optimizes (minimizes) the above squared loss. This is the "least squares" (LS) problem (Gauss-Legendre, 18th century). We need calculus and optimization to do this!

The LS problem can be solved easily and has a closed-form solution:
$\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$

Closed-form solutions to ML problems are rare. Also, this requires a matrix inversion, which can be expensive. There are ways to handle this; we will see them later.
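A minimal sketch (synthetic data; sizes assumed) of the closed-form fit. Solving the linear system is preferable to forming the inverse of $\mathbf{X}^\top \mathbf{X}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3                                   # assumed sizes
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form least-squares solution w = (X^T X)^{-1} X^T y,
# computed by solving the linear system instead of inverting X^T X
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                    # close to w_true
```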
Proof: A bit of calculus/optimization (more on this later)

We wanted to find the minimum of $L(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2$.

Let us apply a basic rule of calculus: take the first derivative of $L(\mathbf{w})$ and set it to zero. We use the chain rule, and the fact that the partial derivative of the dot product $\mathbf{w}^\top \mathbf{x}_n$ w.r.t. each element of $\mathbf{w}$ is the corresponding element of $\mathbf{x}_n$ (so the result of this derivative is a vector of the same size as $\mathbf{w}$). This gives

$\sum_{n=1}^{N} 2\,\mathbf{x}_n (y_n - \mathbf{x}_n^\top \mathbf{w}) = \mathbf{0} \quad\Rightarrow\quad \sum_{n=1}^{N} \left( y_n \mathbf{x}_n - \mathbf{x}_n \mathbf{x}_n^\top \mathbf{w} \right) = \mathbf{0}$

To separate $\mathbf{w}$ and get a solution, we write the above in matrix/vector form, which gives

$\hat{\mathbf{w}} = \left( \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top \right)^{-1} \sum_{n=1}^{N} y_n \mathbf{x}_n = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
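To make the derivation concrete, here is a small numerical check (synthetic data, not part of the slides) that the gradient vanishes at the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=50)

def gradient(w, X, y):
    """Gradient of sum_n (y_n - w^T x_n)^2 w.r.t. w, i.e. 2 X^T (X w - y)."""
    return 2 * X.T @ (X @ w - y)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)       # closed-form least-squares solution
print(np.allclose(gradient(w_hat, X, y), 0))    # True: the first-order condition holds
```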
Problem(s) with the Solution!

We minimized the objective w.r.t. $\mathbf{w}$ and got
$\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$

Problem: the matrix $\mathbf{X}^\top \mathbf{X}$ may not be invertible. This may lead to non-unique solutions for $\hat{\mathbf{w}}$.

A fix is to add a regularizer $\lambda\, \mathbf{w}^\top \mathbf{w}$ to the squared loss and minimize the regularized objective $L_{reg}(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \lambda\, \mathbf{w}^\top \mathbf{w}$ (see the next slide). Proceeding just like the LS case, we can find the optimal $\mathbf{w}$, which is given by
$\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X} + \lambda I_D)^{-1} \mathbf{X}^\top \mathbf{y}$
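A minimal sketch (synthetic data; the value of lambda is an arbitrary assumption) of this regularized closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

lam = 0.1   # regularization strength lambda (assumed value)

# Regularized solution w = (X^T X + lambda * I_D)^{-1} X^T y;
# adding lambda * I_D makes the matrix invertible even when X^T X is singular
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_reg)
```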
A closer look at regularization

Remember: in general, weights with large magnitude are bad since they can cause overfitting on training data and may not work well on test data.

The regularized objective we minimized is
$L_{reg}(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \lambda\, \mathbf{w}^\top \mathbf{w}$
How to solve this and other (possibly more difficult) optimization problems arising in ML efficiently?
What’s the basic calculus and optimization knowledge we need for ML?