
CS 550: Machine Learning

Lecture 2: Regression

Instructor: Dr. Gagan Raj Gupta


Today’s Class

 Linear Regression problems


 Loss Functions for Regression
 Stability Issues in matrix operations
 Regularization to avoid over-fitting
 Different regularization objectives
 SGD to solve Linear Regression
Linear Models
 Consider learning to map an input $\mathbf{x} \in \mathbb{R}^D$ to the corresponding (say real-valued) output $y$
 Assume the output to be a linear weighted combination of the D input features

      $y = \mathbf{w}^\top \mathbf{x} = \sum_{d=1}^{D} w_d x_d$

  This defines a linear model with parameters given by a "weight vector" $\mathbf{w} = [w_1, w_2, \ldots, w_D]^\top$
  Each of these weights has a simple interpretation: $w_d$ is the "weight" or importance of the d-th feature in making this prediction
  The "optimal" weights are unknown and have to be learned by solving an optimization problem, using some training data
 This simple model can be used for Linear Regression
 This simple model can also be used as a "building block" for more complex models
  Even classification (binary/multiclass/multi-output/multi-label) and various other ML/deep learning models
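A minimal NumPy sketch of this prediction rule (the input features and weights below are made-up numbers, not learned):

import numpy as np

# Hypothetical 4-dimensional input and weight vector (in practice w is learned from data)
x = np.array([1.0, 0.5, -2.0, 3.0])
w = np.array([0.2, -0.1, 0.4, 0.05])

# Linear model prediction: y = w^T x = sum_d w_d * x_d
y_hat = w @ x          # same as np.dot(w, x)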
Simple Linear Models as Building Blocks
 In some regression problems, each output itself is a real-valued vector $\mathbf{y} \in \mathbb{R}^M$
  Example: Given a full-body image of a person, predict height, weight, hand size, and leg size (an M = 4 dimensional output)
  Such problems are commonly known as multi-output regression
 We can assume a separate linear model for each of the M outputs

      $y_m = \mathbf{w}_m^\top \mathbf{x}$        (each $\mathbf{w}_m$ is a D-dimensional weight vector for predicting the m-th output)

      $\mathbf{y} = \mathbf{W}\mathbf{x}$         (here $\mathbf{W}$ is an M×D weight matrix with its m-th row containing $\mathbf{w}_m^\top$)

  Learning this model will require us to learn the weight matrix $\mathbf{W}$ (or equivalently, the M weight vectors)
 Note: Learning separate models may not be ideal if these multiple outputs are somewhat correlated with each other. But this model can be extended to handle such situations (the techniques are a bit advanced to be discussed right now – but if curious, you may look up multitask learning techniques)
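A small NumPy sketch of the multi-output model $\mathbf{y} = \mathbf{W}\mathbf{x}$ (the sizes and values below are made up):

import numpy as np

# Hypothetical sizes: D = 4 input features, M = 3 outputs
x = np.array([1.0, 0.5, -2.0, 3.0])      # D-dimensional input
W = np.random.randn(3, 4)                # M x D weight matrix, row m is w_m (learned in practice)

y = W @ x                                # M-dimensional prediction: y_m = w_m^T x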
Simple Linear Models as Building Blocks
 Linear models are also used in multiclass classification problems
 Assuming K classes, we can assume the following model

      $y = \arg\max_{k \in \{1, 2, \ldots, K\}} \mathbf{w}_k^\top \mathbf{x}$

  Can think of $\mathbf{w}_k^\top \mathbf{x}$ as the score of the input $\mathbf{x}$ for the k-th class
 Once learned (using some optimization technique), these weight vectors (one for each class) can sometimes have nice interpretations, especially when the inputs are images
  The learned weight vectors of the classes, visualized as images (e.g., $\mathbf{w}_{car}$, $\mathbf{w}_{frog}$, $\mathbf{w}_{horse}$, $\mathbf{w}_{cat}$), kind of look like a "template" of what the images from that class should look like
  These templates sort of look like class prototypes, as in LwP. That is why the dot product of each of these weight vectors with an image from the correct class is expected to be the largest – no wonder LwP (with Euclidean distances) acts like a linear model
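A minimal NumPy sketch of this argmax prediction rule (sizes and weights are made up; in practice the K weight vectors are learned):

import numpy as np

K, D = 4, 10
W = np.random.randn(K, D)        # row k holds the weight vector w_k of class k
x = np.random.randn(D)           # a test input

scores = W @ x                   # score of x for each class: w_k^T x
y_pred = int(np.argmax(scores))  # predicted class = class with the largest score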
Simple Linear Models as Building Blocks
 Linear models are building blocks for dimensionality reduction methods like PCA
  This looks very similar to the multi-output model, except that the values of the latent features are not known and have to be learned
 Linear models are building blocks even for deep learning models
  In a deep learning model, each layer learns a latent feature representation of the inputs using a model like a multi-output linear model, followed by a nonlinearity
  The last (output) layer can have one or more outputs
  More on this when we discuss deep learning later
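A tiny sketch of the "multi-output linear model + nonlinearity" view of a layer (sizes are made up; a bias term is included for completeness and ReLU is used as the nonlinearity):

import numpy as np

def layer(x, W, b):
    # multi-output linear model W x + b, followed by the ReLU nonlinearity max(0, .)
    return np.maximum(0.0, W @ x + b)

# Hypothetical two-layer model with layer sizes 5 -> 8 -> 1 (random, untrained weights)
x = np.random.randn(5)
h = layer(x, np.random.randn(8, 5), np.zeros(8))   # latent feature representation
y = np.random.randn(1, 8) @ h                      # last (output) layer: a linear model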
Learning Linear Models
Linear Regression (Problem setup)
 We are given N observations (x, y)
 Let us build a linear model to predict y from the observations x, i.e., $\mathbf{y} \approx \mathbf{X}\mathbf{w}$
 Compute the "best" solution $\mathbf{w}$ to the model, i.e., the one that minimizes the SSE (sum of squared errors)
 This problem is very similar to solving $A\mathbf{x} = \mathbf{b}$, a system of linear equations
  In most cases, there may be no exact solution to this problem
  But there are many approaches we can take to get the "least squares" solution
  This minimizes the sum of squared errors

Normal Equations
 Since $(\mathbf{b} - A\hat{\mathbf{x}})$ is perpendicular to all vectors $A\mathbf{x}$ in the column space, $A^\top(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}$
 Normal equation for solving $A\mathbf{x} \approx \mathbf{b}$:  $A^\top A\,\hat{\mathbf{x}} = A^\top \mathbf{b}$
 Least squares solution to $A\mathbf{x} = \mathbf{b}$:  $\hat{\mathbf{x}} = (A^\top A)^{-1} A^\top \mathbf{b}$
 Projection of $\mathbf{b}$ onto Col(A):  $\mathbf{p} = A\hat{\mathbf{x}} = A(A^\top A)^{-1} A^\top \mathbf{b}$
 Projection matrix that multiplies $\mathbf{b}$ to give $\mathbf{p}$:  $P = A(A^\top A)^{-1} A^\top$
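A small NumPy sketch of solving a least-squares problem via the normal equations (the matrix and right-hand side below are random placeholders):

import numpy as np

# Hypothetical overdetermined system: 50 equations, 5 unknowns
A = np.random.randn(50, 5)
b = np.random.randn(50)

# Least-squares solution from the normal equations: (A^T A) x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

p = A @ x_hat                                   # projection of b onto Col(A)

# Same solution via a library least-squares routine (usually preferred numerically)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_hat, x_ls)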
Pseudo Inverse Method
 Pseudo-inverse method: the least squares solution is $\hat{\mathbf{x}} = A^{+}\mathbf{b}$
  If A has independent columns, $A^{+} = (A^\top A)^{-1}A^\top$ (a left inverse)
  If A has independent rows, $A^{+} = A^\top(AA^\top)^{-1}$ (a right inverse)
 The pseudo-inverse can be computed using the SVD $A = U\Sigma V^\top$:  $A^{+} = V\Sigma^{+}U^\top$
  $\Sigma^{+}$ contains the inverse of all non-zero diagonal elements in $\Sigma$
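A NumPy sketch of computing the pseudo-inverse from the SVD (random placeholder matrix; assumes all singular values are non-zero):

import numpy as np

A = np.random.randn(50, 5)                    # hypothetical matrix with independent columns
b = np.random.randn(50)

# SVD-based pseudo-inverse: A = U diag(s) V^T  =>  A+ = V diag(1/s) U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

x_hat = A_pinv @ b                            # least-squares solution
assert np.allclose(A_pinv, np.linalg.pinv(A)) # matches NumPy's built-in pinv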
QR (Gram-Schmidt) Method
 Decompose $A = QR$
  Where Q is an orthogonal matrix ($Q^\top Q = I$) and R is an upper-triangular matrix
 Then $A^\top A = R^\top Q^\top Q R = R^\top R$, and the normal equation $A^\top A\,\hat{\mathbf{x}} = A^\top\mathbf{b}$ can be solved as $R\hat{\mathbf{x}} = Q^\top\mathbf{b}$ (by back-substitution, since R is triangular)
 This is computationally efficient to solve, and more numerically stable than forming $A^\top A$ explicitly
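A NumPy/SciPy sketch of the QR route (random placeholder data; SciPy's triangular solver is used for the back-substitution step):

import numpy as np
from scipy.linalg import solve_triangular

A = np.random.randn(50, 5)
b = np.random.randn(50)

Q, R = np.linalg.qr(A)                 # thin QR: Q is 50x5 orthonormal, R is 5x5 upper triangular
x_hat = solve_triangular(R, Q.T @ b)   # solve R x = Q^T b by back-substitution

assert np.allclose(A.T @ A @ x_hat, A.T @ b)   # satisfies the normal equations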


Alternatives to compute Low Rank Approximations
 Let A be an m×n matrix of low numerical rank
 Suppose that you can't afford to compute the full SVD, or you don't have a good implementation
 How can you compute a low-rank (rank-k) approximation to A?
  Gram-Schmidt: keep removing a rank-1 component from A, one at a time
   Complexity: O(mnk)
  Krylov methods: restrict the matrix A to a k-dimensional "Krylov subspace" (e.g., Span{b, Ab, A²b, …, Aᵏ⁻¹b})
 Each of these approximations results in a factorization of the form $A \approx QB$ (with $B = Q^\top A$), where Q is an approximate orthonormal basis that spans the column space of A
Randomized Low Rank Approximations
 Range finding (basis) problem: Given an m×n matrix A and an integer k < min(m, n), find an orthonormal m×k matrix Q such that $A \approx QQ^\top A$
 Solving the primitive problem via randomized sampling — intuition:
  1. Draw k Gaussian random vectors $\mathbf{g}_1, \ldots, \mathbf{g}_k$
  2. Form "sample" vectors $\mathbf{y}_j = A\mathbf{g}_j$
  3. Form orthonormal vectors $\mathbf{q}_1, \ldots, \mathbf{q}_k$ such that Span($\mathbf{q}_1, \ldots, \mathbf{q}_k$) = Span($\mathbf{y}_1, \ldots, \mathbf{y}_k$)
   For instance, Gram-Schmidt can be used — pivoting is rarely required.
 If A has exact rank k, then Span($\mathbf{q}_1, \ldots, \mathbf{q}_k$) = Range(A) with probability 1.
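A minimal NumPy sketch of this randomized range finder (the test matrix is a made-up example with exact rank k; QR is used instead of explicit Gram-Schmidt):

import numpy as np

def range_finder(A, k, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    G = rng.standard_normal((n, k))   # 1. Gaussian random vectors (columns of G)
    Y = A @ G                         # 2. sample vectors y_j = A g_j
    Q, _ = np.linalg.qr(Y)            # 3. orthonormalize the samples
    return Q                          # m x k, with A ~= Q Q^T A

A = np.random.randn(200, 30) @ np.random.randn(30, 300)        # exact rank 30
Q = range_finder(A, k=30)
print(np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A))   # ~ machine precision since rank(A) = k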
Randomized SVD
 Goal: Given an m×n matrix A, compute an approximate rank-k SVD $A \approx U\Sigma V^\top$
 Algorithm:
  1. Draw an n×k Gaussian random matrix G.                        G = randn(n,k)
  2. Form the m×k sample matrix Y = AG.                           Y = A*G
  3. Form an m×k orthonormal matrix Q such that Y = QR.           [Q, ~] = qr(Y,0)
  4. Form the k×n matrix $B = Q^\top A$.                          B = Q'*A
  5. Compute the SVD of the small matrix B: $B = \hat{U}\Sigma V^\top$.   [Uhat, Sigma, V] = svd(B,0)
  6. Form the matrix $U = Q\hat{U}$.                              U = Q*Uhat
 Power iteration to improve the accuracy: The computed factorization is close to optimally accurate when the singular values of A decay rapidly. When they do not, a small number of power iteration steps (e.g., sampling with $(AA^\top)^q A$ in place of A) can be used to improve the accuracy.
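The same algorithm as a runnable NumPy sketch, mirroring the MATLAB-style pseudo-code above (the test matrix below is made up so that its singular values decay rapidly):

import numpy as np

def randomized_svd(A, k, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    G = rng.standard_normal((n, k))       # 1. n x k Gaussian random matrix
    Y = A @ G                             # 2. m x k sample matrix
    # For slowly decaying spectra, one power iteration: Y = A @ (A.T @ (A @ G))
    Q, _ = np.linalg.qr(Y)                # 3. m x k orthonormal matrix, Y = QR
    B = Q.T @ A                           # 4. k x n matrix
    U_hat, Sigma, Vt = np.linalg.svd(B, full_matrices=False)   # 5. SVD of the small matrix
    return Q @ U_hat, Sigma, Vt           # 6. U = Q * Uhat

rng = np.random.default_rng(1)
A = (rng.standard_normal((500, 50)) * (0.5 ** np.arange(50))) @ rng.standard_normal((50, 400))
U, S, Vt = randomized_svd(A, k=20)
print(np.linalg.norm(A - (U * S) @ Vt) / np.linalg.norm(A))    # small relative error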
Linear Regression
 Given: Training data with N input-output pairs $(\mathbf{x}_n, y_n)$, $\mathbf{x}_n \in \mathbb{R}^D$, $y_n \in \mathbb{R}$, $n = 1, 2, \ldots, N$
 Goal: Learn a model to predict the output for new test inputs
 Assume the function that approximates the I/O relationship to be a linear model

      $y_n \approx f(\mathbf{x}_n) = \mathbf{w}^\top\mathbf{x}_n \quad (n = 1, 2, \ldots, N)$

  Can also write all of them compactly using matrix-vector notation as $\mathbf{y} \approx \mathbf{X}\mathbf{w}$
 Let's write the total error or "loss" of this model over the training data as

      $L(\mathbf{w}) = \sum_{n=1}^{N} \ell(y_n, f(\mathbf{x}_n))$

  Each term $\ell(y_n, f(\mathbf{x}_n))$ measures the prediction error or "loss" or "deviation" of the model on a single training input
 The goal of learning is to find the $\mathbf{w}$ that minimizes this loss + does well on test data
  Unlike models like KNN and DT, here we have an explicit problem-specific objective (loss function) that we wish to optimize
Linear Regression: Pictorially
 Linear regression is like fitting a line or (hyper)plane to a set of points
  [Figure: with a single original feature (input x vs. output y), we fit a line; with two features (x1, x2 vs. output y), we can fit a plane (linear)]
 What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface?
  Do linear models become useless in such cases? No. We can even fit a curve using a linear model, after suitably transforming the inputs, e.g., $[z_1, z_2]^\top = \phi(\mathbf{x})$
  The transformation can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later
 The line/plane must also predict outputs for the unseen (test) inputs well
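A small sketch of the "transform the inputs, then fit a linear model" idea, using a hypothetical quadratic feature map $\phi(x) = [1, x, x^2]$ on made-up 1-D data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.standard_normal(100)   # nonlinear relationship

Phi = np.stack([np.ones_like(x), x, x**2], axis=1)   # transformed features phi(x)

# Ordinary least squares on the transformed features fits a curve in the original space
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # roughly recovers the true coefficients [1.0, 2.0, -0.5]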
Loss Functions for Regression
 Many possible loss functions for regression problems
  The choice of loss function usually depends on the nature of the data. Also, some loss functions result in an easier optimization problem than others
 Squared loss: $\ell(y_n, f(\mathbf{x}_n)) = (y_n - f(\mathbf{x}_n))^2$
  Very commonly used for regression. Leads to an easy-to-solve optimization problem
 Absolute loss: $\ell(y_n, f(\mathbf{x}_n)) = |y_n - f(\mathbf{x}_n)|$
  Grows more slowly than squared loss. Thus better suited when the data has some outliers (inputs on which the model makes large errors)
 Huber loss: squared loss for small errors (say up to $\delta$); absolute loss for larger errors
  Good for data with outliers
 $\epsilon$-insensitive loss (a.k.a. Vapnik loss): $\ell(y_n, f(\mathbf{x}_n)) = \max(0,\; |y_n - f(\mathbf{x}_n)| - \epsilon)$
  Zero loss for small errors (say up to $\epsilon$); absolute loss for larger errors
  Note: Can also use squared loss instead of absolute loss for the larger errors
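The four losses above as short NumPy functions (a minimal sketch; the Huber constants below are one common way of matching the two pieces at δ, which the slide leaves unspecified):

import numpy as np

def squared_loss(y, f):
    return (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    e = np.abs(y - f)
    # squared for small errors, linear for large ones, with the pieces matched at delta
    return np.where(e <= delta, e ** 2, 2 * delta * e - delta ** 2)

def eps_insensitive_loss(y, f, eps=0.1):
    return np.maximum(0.0, np.abs(y - f) - eps)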
Linear Regression with Squared Loss
 In this case, the loss function will be

      $L(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top\mathbf{x}_n)^2$

  In matrix-vector notation, can write it compactly as $L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$
 Let us find the $\mathbf{w}$ that optimizes (minimizes) the above squared loss
  This is the "least squares" (LS) problem (Gauss–Legendre, 18th century)
  We need calculus and optimization to do this!
 The LS problem can be solved easily and has a closed-form solution

      $\mathbf{w}_{LS} = \arg\min_{\mathbf{w}} L(\mathbf{w}) = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$

  Closed-form solutions to ML problems are rare
  Requires a D×D matrix inversion – can be expensive. Ways to handle this – will see later
Proof: A bit of calculus/optim. (more on this later)
 We wanted to find the minima of $L(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top\mathbf{x}_n)^2$
 Let us apply the basic rule of calculus: take the first derivative of $L(\mathbf{w})$ and set it to zero

      $\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \sum_{n=1}^{N} 2\,(y_n - \mathbf{w}^\top\mathbf{x}_n)\,\frac{\partial (y_n - \mathbf{w}^\top\mathbf{x}_n)}{\partial \mathbf{w}} = 0$      (chain rule of calculus)

 Using the fact $\frac{\partial\,\mathbf{w}^\top\mathbf{x}_n}{\partial \mathbf{w}} = \mathbf{x}_n$ (partial derivative of the dot product w.r.t. each element of $\mathbf{w}$; the result has the same size as $\mathbf{w}$), we get

      $\sum_{n=1}^{N} 2\,\mathbf{x}_n (y_n - \mathbf{x}_n^\top\mathbf{w}) = 0$

 To separate $\mathbf{w}$ to get a solution, we write the above as

      $\sum_{n=1}^{N} y_n \mathbf{x}_n - \Big(\sum_{n=1}^{N} \mathbf{x}_n\mathbf{x}_n^\top\Big)\mathbf{w} = 0 \quad\Rightarrow\quad \mathbf{w} = \Big(\sum_{n=1}^{N} \mathbf{x}_n\mathbf{x}_n^\top\Big)^{-1}\sum_{n=1}^{N} y_n\mathbf{x}_n = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$
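A quick numerical sanity check of this gradient using finite differences, on made-up data:

import numpy as np

rng = np.random.default_rng(0)
X, y, w = rng.standard_normal((20, 5)), rng.standard_normal(20), rng.standard_normal(5)

loss = lambda w: np.sum((y - X @ w) ** 2)

grad = -2 * X.T @ (y - X @ w)     # analytic gradient: -2 * sum_n x_n (y_n - x_n^T w)

eps = 1e-6                        # central finite-difference approximation
num_grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps) for e in np.eye(5)])
assert np.allclose(grad, num_grad, atol=1e-4)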
Problem(s) with the Solution!
 We minimized the objective $L(\mathbf{w})$ w.r.t. $\mathbf{w}$ and got

      $\mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$

 Problem: The D×D matrix $\mathbf{X}^\top\mathbf{X}$ may not be invertible
  This may lead to non-unique solutions for $\mathbf{w}$
 Problem: Overfitting, since we only minimized the loss defined on the training data
  Weights may become arbitrarily large to fit the training data perfectly
  Such weights may however perform poorly on the test data
 One solution: Minimize a regularized objective $L(\mathbf{w}) + \lambda R(\mathbf{w})$
  $R(\mathbf{w})$ is called the regularizer and measures the "magnitude" of $\mathbf{w}$
  $\lambda$ is the regularization hyperparameter: it controls how much we wish to regularize (needs to be tuned via cross-validation)
  The regularizer will prevent the elements of $\mathbf{w}$ from becoming too large
  Reason: Now we are minimizing training error + magnitude of the weight vector
Regularized Least Squares (a.k.a. Ridge Regression)
 Recall that the regularized objective is of the form $L(\mathbf{w}) + \lambda R(\mathbf{w})$
 One possible/popular regularizer: the squared Euclidean ($\ell_2$ squared) norm of $\mathbf{w}$

      $R(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \mathbf{w}^\top\mathbf{w}$

 With this regularizer, we have the regularized least squares problem as

      $\mathbf{w}_{ridge} = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top\mathbf{x}_n)^2 + \lambda\,\mathbf{w}^\top\mathbf{w}$

 Proceeding just like the LS case, we can find the optimal $\mathbf{w}$, which is given by

      $\mathbf{w}_{ridge} = (\mathbf{X}^\top\mathbf{X} + \lambda I_D)^{-1}\mathbf{X}^\top\mathbf{y}$

  Why is the method called "ridge" regression? Look at the form of the solution: we are adding a small value $\lambda$ to the diagonals of the D×D matrix $\mathbf{X}^\top\mathbf{X}$ (like adding a ridge/mountain to some land)
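A minimal NumPy sketch of the closed-form ridge solution (data and λ below are placeholders):

import numpy as np

def ridge_fit(X, y, lam):
    # w = (X^T X + lam * I_D)^{-1} X^T y, computed via a linear solve instead of an explicit inverse
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
w_ridge = ridge_fit(X, y, lam=0.1)    # lam is the regularization hyperparameter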
A closer look at regularization
 Remember – in general, weights with large magnitude are bad since they can cause overfitting on training data and may not work well on test data
 The regularized objective we minimized is

      $L_{reg}(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top\mathbf{x}_n)^2 + \lambda\,\mathbf{w}^\top\mathbf{w}$

 Minimizing $L_{reg}(\mathbf{w})$ w.r.t. $\mathbf{w}$ gives a solution that
  Keeps the training error small
  Has a small squared $\ell_2$ norm $\|\mathbf{w}\|_2^2 = \mathbf{w}^\top\mathbf{w}$
   Good because, consequently, the individual entries of the weight vector are also prevented from becoming too large
 Small entries in $\mathbf{w}$ are good since they lead to "smooth" models
  A model that is not "smooth" may change its test predictions drastically even with small changes in some feature's value
 Example: a typical $\mathbf{w}$ learned without regularization

      $\mathbf{x}_n$ = [1.2  0.5  2.4  0.3  0.8  0.1  0.9  2.1],   $y_n$ = 0.8
      $\mathbf{x}_m$ = [1.2  0.5  2.4  0.3  0.8  0.1  0.9  2.1],   $y_m$ = 100
      learned $\mathbf{w}$ = [3.2  1.8  1.3  2.1  10000  2.5  3.1  0.1]

  The two feature vectors are (almost) exactly the same, differing in just one feature by a small amount, yet the outputs are very different (maybe one of these two training examples is an outlier)
  Just to fit the training data, where one of the inputs was possibly an outlier, one of the weights became too big. Such a weight vector will possibly do poorly on normal test inputs
Other Ways to Control Overfitting
 Use a regularizer defined by other norms, e.g.,
  $\ell_1$ norm regularizer:  $\|\mathbf{w}\|_1 = \sum_{d=1}^{D} |w_d|$
  $\ell_0$ norm regularizer:  $\|\mathbf{w}\|_0 = \mathrm{nnz}(\mathbf{w})$  (counts the number of nonzeros in $\mathbf{w}$)
  When should I use these regularizers instead of the $\ell_2$ regularizer? Use them if you have a very large number of features, many of which are irrelevant. These regularizers can help in automatic feature selection
  Using such regularizers gives a sparse weight vector as the solution; "sparse" means many entries in $\mathbf{w}$ will be zero or near zero. Those features will be considered irrelevant by the model and will not influence prediction (see the short sketch after this slide)
  Note that optimizing loss functions with such regularizers is usually harder than ridge regression, but several advanced techniques exist (we will see some of those later)
 Use non-regularization based approaches
  Early stopping (stopping training just when we have a decent validation set accuracy)
  Dropout (in each iteration, don't update some of the weights)
  Injecting noise in the inputs
  All of these are very popular ways to control overfitting in deep learning models. More on these later when we talk about deep learning
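A hedged sketch of the sparsity effect using scikit-learn's Ridge and Lasso ($\ell_2$- vs $\ell_1$-regularized linear regression); the data, feature counts, and regularization strengths below are made up:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data: 100 examples, 50 features, but only the first 5 features actually matter
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[:5] = rng.standard_normal(5)
y = X @ w_true + 0.1 * rng.standard_normal(100)

ridge = Ridge(alpha=1.0).fit(X, y)   # l2 regularizer: weights shrink but are mostly non-zero
lasso = Lasso(alpha=0.1).fit(X, y)   # l1 regularizer: typically many exactly-zero weights

print(np.sum(np.abs(ridge.coef_) < 1e-6), "near-zero weights with ridge")
print(np.sum(np.abs(lasso.coef_) < 1e-6), "near-zero weights with lasso")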
Linear Regression as Solving System of Linear Eqs
 The form of the lin. reg. model $\mathbf{y} \approx \mathbf{X}\mathbf{w}$ is akin to a system of linear equations
 Assuming N training examples with D features each, we have

      First training example:    $y_1 = x_{11}w_1 + x_{12}w_2 + \ldots + x_{1D}w_D$
      Second training example:   $y_2 = x_{21}w_1 + x_{22}w_2 + \ldots + x_{2D}w_D$
      ...
      N-th training example:     $y_N = x_{N1}w_1 + x_{N2}w_2 + \ldots + x_{ND}w_D$

  Note: Here $x_{nd}$ denotes the d-th feature of the n-th training example
  N equations and D unknowns ($w_1, w_2, \ldots, w_D$) here
 However, in regression, we rarely have N = D but rather N > D or N < D
  Thus we have an overdetermined (N > D) or underdetermined (N < D) system
 Methods to solve over/underdetermined systems can be used for lin-reg as well
  Many of these methods don't require expensive matrix inversion
 Solving lin-reg as a system of linear equations: $\mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, where $\mathbf{X}$ is N×D, $\mathbf{w}$ is D×1, and $\mathbf{y}$ is N×1
  This is equivalent to the system of linear equations $\mathbf{X}^\top\mathbf{X}\,\mathbf{w} = \mathbf{X}^\top\mathbf{y}$, with D equations and D unknowns – now solve this!
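A short NumPy sketch of handling both cases with a least-squares solver (random placeholder data):

import numpy as np

rng = np.random.default_rng(0)

# Overdetermined case (N > D): more equations than unknowns, usually no exact solution
X_over, y_over = rng.standard_normal((100, 5)), rng.standard_normal(100)
w_over, *_ = np.linalg.lstsq(X_over, y_over, rcond=None)   # least-squares solution

# Underdetermined case (N < D): infinitely many solutions; lstsq returns the minimum-norm one
X_under, y_under = rng.standard_normal((5, 100)), rng.standard_normal(5)
w_under, *_ = np.linalg.lstsq(X_under, y_under, rcond=None)

# Neither call forms or inverts X^T X explicitly (lstsq uses an SVD-based routine internally)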
Calculus and Optimization for ML
 Regularized Linear Regression (a.k.a. Ridge Regression)

      Problem, more compactly written:   $\mathbf{w}_{ridge} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$
      Solution, more compactly written:  $\mathbf{w}_{ridge} = (\mathbf{X}^\top\mathbf{X} + \lambda I_D)^{-1}\mathbf{X}^\top\mathbf{y}$

 Getting the closed-form solution required simple calculus, but it is expensive to compute
  Especially when D is very large (since we need to invert a D×D matrix)
 How do we solve this and other (possibly more difficult) optimization problems arising in ML efficiently?
 What's the basic calculus and optimization knowledge we need for ML?
