
CIS 4526: Foundations of Machine Learning

Linear Regression
(modified from Sanja Fidler)

Instructor: Kai Zhang


CIS @ Temple University, Fall 2020
Regression Problems

• Curve Fitting
• Time Series Forecast

[Figures: example regression problems, including curve fitting and time-series forecasting]
Regression Problem

• What do all these problems have in common?
– Input: d-dimensional samples/vectors x
– Output: continuous target value y
• How do we make predictions? We need three ingredients:
– A model, a function y(x) that represents the relationship between x and y
– A loss (or cost, or objective) function, which tells us how well our model approximates the training examples
– Optimization, a way of finding the parameters of our model that minimize the loss function
Simple 1-D Example

[Figure: 1-D training points with a fitted curve y(x)]
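To make the three ingredients concrete in this 1-D setting, here is a minimal worked instance (the specific model form is the linear one introduced on the following slides):

$$\text{Model: } y(x) = w_0 + w_1 x, \qquad \text{Loss: } E(w_0, w_1) = \frac{1}{N}\sum_{n=1}^{N}\bigl(w_0 + w_1 x_n - y_n\bigr)^2, \qquad \text{Optimization: } \min_{w_0, w_1} E(w_0, w_1)$$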
Model Selection

• Model Complexity
– In what form should we parameterize the prediction function?
– How complex should the model be?
• Example: linear, quadratic, or degree-d polynomial? (1-D case)
• Common Belief
– Simple models
• less flexible, but may be easy to solve (such as a linear model)
– Complex models
• more powerful, but harder to solve, and prone to overfitting
• We will start by building simple, linear models
Higher-Dimensional Linear Regression

• Circles are training examples
• The line/plane represents the model/hypothesis
• Red lines represent the in-sample error
Linear Model

• Given $N$ training samples $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ with d-dimensional inputs $\mathbf{x}_n \in \mathbf{R}^d$
• Linear model: $y(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_d x_d$
– Equivalent form: $y(\mathbf{x}_n) = \mathbf{w}'\mathbf{x}_n$, where $\mathbf{w} = (w_0, w_1, \dots, w_d)'$ and $\mathbf{x}_n = (1, x_{n,1}, \dots, x_{n,d})'$
– So we usually apply ``augmented'', (d+1)-dimensional data: each sample gets a constant 1 prepended
• More convenient for deriving closed-form solutions
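A minimal numpy sketch of the augmentation step (array names are illustrative, not from the slides):

```python
import numpy as np

# N samples with d features: X_raw has shape (N, d)
X_raw = np.array([[0.5, 1.2],
                  [1.0, 0.3],
                  [2.0, 1.8]])

# Prepend a constant-1 column so the bias w_0 folds into w;
# the augmented matrix has shape (N, d+1)
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X)  # each row is (1, x_1, ..., x_d)
```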
Training (In-Sample) Error, E_in

• Loss function: $E_{in}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{w}'\mathbf{x}_n - y_n)^2 = \frac{1}{N}\,\|X\mathbf{w} - \mathbf{y}\|^2$
• Model: weight vector $\mathbf{w} \in \mathbf{R}^{(d+1)\times 1}$
• Input data matrix: $X \in \mathbf{R}^{N \times (d+1)}$, one augmented sample per row
• Output: target vector $\mathbf{y} \in \mathbf{R}^{N \times 1}$
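A quick numpy check of this loss on a toy fit (names are illustrative; the example is constructed so the residual is zero):

```python
import numpy as np

def in_sample_error(X, w, y):
    """E_in(w) = (1/N) * ||X w - y||^2 on the augmented data matrix X."""
    residual = X @ w - y           # shape (N,)
    return np.mean(residual ** 2)

# toy data: 3 samples, d = 1 (so d+1 = 2 columns after augmentation)
X = np.array([[1.0, 0.5],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 1.5, 2.5])
w = np.array([0.5, 1.0])           # w_0 = 0.5, w_1 = 1.0 fits y exactly
print(in_sample_error(X, w, y))    # 0.0
```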
Minimizing E_in by Closed-Form Solution

In order to minimize the function
$E_{in}(\mathbf{w}) = \frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2 = \frac{1}{N}(X\mathbf{w} - \mathbf{y})'(X\mathbf{w} - \mathbf{y}) = \frac{1}{N}(\mathbf{w}'X'X\mathbf{w} - 2\mathbf{w}'X'\mathbf{y} + \mathbf{y}'\mathbf{y})$
we need to set its gradient to 0:
$\nabla E_{in}(\mathbf{w}) = \frac{2}{N}(X'X\mathbf{w} - X'\mathbf{y}) = 0$
and then solve the resulting equation (usually easier):
$X'X\mathbf{w} = X'\mathbf{y}$
Since $X'X$ is a square matrix, we can use its inverse:
$\mathbf{w} = (X'X)^{-1}X'\mathbf{y} = X^{+}\mathbf{y}$, where $X^{+} = (X'X)^{-1}X'$ is the pseudo-inverse of $X$

Why a pseudo-inverse? It is the counterpart/generalization of the square-matrix inverse in solving linear systems. If we want to solve $X\mathbf{w} = \mathbf{y}$, $X$ is rectangular (no inverse defined); but $X^{+}$ applies here, acting as if $X$ could be inverted: $\mathbf{w} = X^{+}\mathbf{y}$.
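A minimal numpy sketch of the normal-equation solution (toy data; in practice a least-squares solver is preferred over forming the inverse explicitly):

```python
import numpy as np

# augmented data matrix X (N x (d+1)) and targets y (N,)
X = np.array([[1.0, 0.5],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.1, 1.4, 2.6])

# normal equations: solve X'X w = X'y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# equivalent, numerically safer route via least squares
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_closed, w_lstsq)  # agree up to floating-point error
```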
Pseudo-Inverse by SVD

• A non-square m-by-n matrix $A$ typically has no inverse; there are two definitions of its pseudo-inverse
– (1) Mathematically flavored definition: the Moore-Penrose generalized inverse is an n-by-m matrix $A^{+}$ satisfying $AA^{+}A = A$ and $A^{+}AA^{+} = A^{+}$, with $AA^{+}$ and $A^{+}A$ symmetric
– (2) Machine-learning flavored definition: $A^{+}\mathbf{b}$ is the least-squares solution of the linear system $A\mathbf{x} = \mathbf{b}$
• The pseudo-inverse can be computed by SVD
– Using the SVD $A = U\Sigma V'$ (assume that $A$ is of full column rank), then
$A^{+} = V\Sigma^{-1}U'$
– Properties of $U$, $V$, and $\Sigma$: $U'U = I$, $V'V = VV' = I$, and $\Sigma$ is diagonal (invertible under the full-column-rank assumption)
– How to prove that the SVD-based $V\Sigma^{-1}U'$ is the pseudo-inverse of $A$ (by using the properties of $U$, $V$, and $\Sigma$ above)?
• either verify the Moore-Penrose conditions in definition (1),
• or show it yields the solution of $\min_{\mathbf{x}}\|A\mathbf{x} - \mathbf{b}\|^2$ in definition (2)
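A short numpy verification of the SVD route (toy matrix; np.linalg.pinv is the library implementation of the same idea):

```python
import numpy as np

# rectangular matrix with full column rank (m = 4, n = 2)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])

# thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# pseudo-inverse: A+ = V @ diag(1/s) @ U'
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))  # True: matches the library
print(np.allclose(A_pinv @ A, np.eye(2)))      # True: A+ A = I (full column rank)
```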
Linear Regression Algorithm

1. Construct the augmented data matrix $X$ and target vector $\mathbf{y}$ from the training set
2. Compute the pseudo-inverse $X^{+}$
3. Return $\mathbf{w} = X^{+}\mathbf{y}$

• For numerical stability, we can replace the pseudo-inverse $(X'X)^{-1}X'$ by its SVD-based form $V\Sigma^{-1}U'$
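Putting the pieces together, a minimal end-to-end sketch (function names are illustrative, not from the slides):

```python
import numpy as np

def fit_linear_regression(X_raw, y):
    """Fit w via the pseudo-inverse of the augmented data matrix."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # augment
    return np.linalg.pinv(X) @ y   # SVD-based pseudo-inverse for stability

def predict(X_raw, w):
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    return X @ w

X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 2.1, 2.9, 4.2])
w = fit_linear_regression(X_raw, y)
print(w, predict(X_raw, w))
```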
Augmented Linear Model

• Can we obtain both (1) a closed-form solution and (2) the capacity to model nonlinear shapes?
• Nonlinear Data Augmentation
– If we want to use the following model (1-D input, degree-d polynomial):
$y(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d$
– Then the regression can be written as a linear model over augmented features:
$\mathbf{x}_n = (1, x_n, x_n^2, \dots, x_n^d)'$, so that $y(x_n) = \mathbf{w}'\mathbf{x}_n$, with data matrix
$X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^d \end{bmatrix}$
– The model is nonlinear in $x$ but still linear in $\mathbf{w}$, so the closed-form solution still applies
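A minimal numpy sketch of this polynomial augmentation (degree and data are illustrative):

```python
import numpy as np

def polynomial_features(x, degree):
    """Map 1-D inputs x to rows (1, x, x^2, ..., x^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = x ** 2 - x + 1.0                    # a nonlinear target

X = polynomial_features(x, degree=2)    # shape (5, 3)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [1, -1, 1]: a linear solve recovers the curve
```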
Minimizing Loss by Gradient Descent

• Option 1: compute the gradient and set it to 0 (matrix form, more compact; gives the closed form)
$\nabla E_{in}(\mathbf{w}) = \frac{2}{N}(X'X\mathbf{w} - X'\mathbf{y}) = 0 \;\Rightarrow\; \mathbf{w} = (X'X)^{-1}X'\mathbf{y}$
• Option 2: compute the gradient and iteratively descend ("hill climbing" in reverse)
$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\nabla E_{in}(\mathbf{w}^{(t)})$, with step size $\eta > 0$

Question: will they lead to the same solution?
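A minimal numpy sketch of option 2, compared against the closed form (learning rate and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize E_in(w) = (1/N)||Xw - y||^2 by batch gradient descent."""
    N, d1 = X.shape
    w = np.zeros(d1)
    for _ in range(n_iters):
        grad = (2.0 / N) * X.T @ (X @ w - y)  # gradient of the loss
        w -= eta * grad                       # step downhill
    return w

X = np.array([[1.0, 0.5],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.1, 1.4, 2.6])

w_gd = gradient_descent(X, y)
w_cf = np.linalg.solve(X.T @ X, X.T @ y)      # closed form for comparison
print(np.allclose(w_gd, w_cf, atol=1e-3))     # True: same minimizer
```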


Convexity/Optimality

• Gradient-based methods vs closed-form solutions
– A gradient-based solution is easy to derive, but we need to choose a step length and iterate many times
– A closed-form solution is not always achievable and needs more math; but it may reach the optimal solution in one step
• For the objective in linear regression
– The two methods theoretically find the same solution
– Because the loss function is quadratic, hence convex
• For general optimization problems
– The gradient method is more popular/feasible
• But prone to local optima
– A closed-form solution is usually difficult
• But you can be lucky if you can
– Derive a fixed-point iteration
– Or transform it into an SVD problem
Stochastic Gradient Descent

• The gradient is a sum of N terms (given N samples), in summation form:
$\nabla E_{in}(\mathbf{w}) = \frac{2}{N}\sum_{n=1}^{N}(\mathbf{w}'\mathbf{x}_n - y_n)\,\mathbf{x}_n$
– Can be quite expensive for a large sample set
• Stochastic Gradient Descent: one sample at a time
$\mathbf{w} \leftarrow \mathbf{w} - \eta\,2(\mathbf{w}'\mathbf{x}_n - y_n)\,\mathbf{x}_n$, where the sample index n can be randomly chosen
• Mini-batch Stochastic Gradient: a random subset of samples (indexed by B)
$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\frac{2}{|B|}\sum_{n \in B}(\mathbf{w}'\mathbf{x}_n - y_n)\,\mathbf{x}_n$, with |B| = 32, 64, …
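A minimal numpy sketch of the mini-batch variant (batch size, learning rate, and epoch count are illustrative; batch_size=1 gives plain SGD):

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.05, batch_size=2, n_epochs=200, seed=0):
    """Mini-batch SGD on E_in(w) = (1/N)||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    N, d1 = X.shape
    w = np.zeros(d1)
    for _ in range(n_epochs):
        for _ in range(N // batch_size):
            B = rng.integers(0, N, size=batch_size)  # random mini-batch
            grad = (2.0 / batch_size) * X[B].T @ (X[B] @ w - y[B])
            w -= eta * grad
    return w

X = np.array([[1.0, 0.5],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 1.4, 2.6, 3.4])
print(minibatch_sgd(X, y))  # a noisy estimate near the closed-form solution
```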


SGD may drive the
iterations out of a
local optimal

Gradient
Stochastic
descent
Gradient

SGD leads to fluctuating


objective function

Loss
Function
