CS464 Ch9 LinearRegression

CS464 covers linear regression. Linear regression finds the linear relationship between features (X) and an outcome (Y) by estimating parameters to minimize error. The parameters are estimated using ordinary least squares, which finds the slope and intercept that minimize the sum of squared errors between predicted and actual Y values. Gradient descent can also be used to iteratively estimate optimal parameters by minimizing a loss function.

CS464

Linear Regression

(slides based on the slides provided by Öznur Taştan and Mehmet Koyutürk)
Regression
•  Some historical sales data of houses
•  Base our predictions of housing sale prices (Y) on the
observable features such as the size of the house (X)

[Figure: housing sale price (Y) plotted against house size (X)]
Regression
•  Assume the data is generated by a function that produces a
value for the outcome variable (y) from the values of the
features (x), plus some error

•  The outcome variable is real-valued
•  So this is not classification

[Figure: noisy observations of the outcome y plotted against the feature x]
Regression

[Figure: two panels plotting target Y against feature X]

•  Assume a functional form for f(x)


•  Find a good f(x) within that family of functions
Linear Regression

•  We will focus on linear regression


Linear Regression

•  Linear regression

The parameters need to be estimated


Slope and the Intercept
•  The slope is a number that indicates how slanted the
regression line is and the direction in which it slants

[Figure: regression line over feature x1, illustrating its slope and Y-intercept]

•  The Y-intercept is the value of y where the regression line
crosses the Y axis (that is, where X equals zero)
Housing Example
•  Suppose we have a dataset giving the living areas
and prices of 47 houses from Portland, Oregon:
Single-Feature Model

[Figure: price (in $1000) plotted against living area (square feet) for the 47 houses]

If we regress on a single variable (feature), living area:

w0 = 140.27, w1 = 0.1345
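As a quick check of what these coefficients mean, the minimal Python sketch below plugs them into the fitted model price = w0 + w1 ⋅ living area (price in $1000, area in square feet). The 1650 sq ft example house is hypothetical, not one of the 47 houses.

# Single-feature model from the slide: price (in $1000) = w0 + w1 * living_area
w0, w1 = 140.27, 0.1345

def predict_price(living_area_sqft):
    """Predicted sale price in $1000 for a house of the given living area."""
    return w0 + w1 * living_area_sqft

# Hypothetical 1650 sq ft house: 140.27 + 0.1345 * 1650 ~= 362.2, i.e. about $362,200
print(predict_price(1650))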
Two-Feature Model

If the number of bedrooms were included as one of the input
features as well, we get:
w0 = 89.60
w1 = 0.1392
w2 = −8.738
How do we interpret this result, given that the
regression coefficients for a single variable were:
w0 = 140.27, w1 = 0.1345?
Multiple Linear Regression
•  The reason for this difference in the relationship between y
and x2 is that x1 and x2 are strongly correlated.
•  That is: when x2 increases, so does x1.

•  Combining several simple regressions (each using the
method of least squares) generally gives the same result as a
multiple regression only if the explanatory variables are
orthogonal (see the numerical sketch below)
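A small numerical sketch of this point (NumPy assumed; the synthetic data below is invented for illustration and is not the Portland housing data): when x1 and x2 are strongly correlated, regressing y on x2 alone yields a very different coefficient than the multiple regression that includes both features.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)                         # e.g. living area (standardized)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)        # strongly correlated with x1
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)    # y really depends only on x1

def ols(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(x2.reshape(-1, 1), y))            # simple regression on x2: large positive slope
print(ols(np.column_stack([x1, x2]), y))    # multiple regression: x2's coefficient is near 0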
Optimization Problem

•  Linear regression (like any regression) is an optimization problem

•  We need to define our constraints and an objective function
–  In linear regression, the constraint is the linearity of the
function

•  What is our objective?
–  Minimize our error in approximating the outcome variable
Loss Functions
Measure of Error
•  We can measure the prediction loss in terms of
squared error. Loss on one example:

(yi − f(xi))²   where f(xi) is the predicted value and yi is the actual value

•  Loss on n training examples: the sum of squared errors, Σi (yi − f(xi))²
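A minimal Python sketch of these two quantities (NumPy assumed; the toy numbers are illustrative only):

import numpy as np

y_true = np.array([3.0, 5.0, 4.0])   # actual values
y_pred = np.array([2.5, 5.5, 4.0])   # predicted values

loss_one = (y_true[0] - y_pred[0]) ** 2     # squared loss on a single example: 0.25
loss_all = np.sum((y_true - y_pred) ** 2)   # sum of squared losses on all n examples: 0.5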
Ordinary Least Squares (OLS)
Loss Function on Two Features
Least Squares Linear Fit to Data
•  The most popular estimation method is least squares:
–  Determine the linear coefficients w0, w that minimize the sum
of squared loss (SSL)
–  Use standard (multivariate) differential calculus:
•  differentiate SSL with respect to w0, w
•  set each partial derivative to zero
•  solve for w0, w
Minimize the Squared Loss
•  Minimize the empirical squared loss:
Direct Minimization
•  Minimize the empirical squared loss:

•  To get the optimal parameter values, take the derivative and set it to zero
Finding Optimal Parameters

This is a system of linear equations!


Regression in Matrix Notation
Solution in Matrix Form
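The matrix-form solution is the normal-equation solution w = (XᵀX)⁻¹ Xᵀy. A minimal NumPy sketch of what it computes (the design matrix here is a made-up toy example; the column of ones provides the intercept w0):

import numpy as np

# Toy design matrix: a column of ones (intercept) plus one feature column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 5.0])

# Solve the normal equations (X^T X) w = X^T y; np.linalg.solve avoids an explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # w[0] is the intercept, w[1] the slope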
Probabilistic View of Linear Regression
•  In a statistical regression model we model both the
function and noise

•  Whatever we cannot capture with our chosen family of
functions will be interpreted as noise
Maximum Likelihood Estimation
•  Given observations:

•  Find the parameters w that maximize the (conditional)
likelihood of the outputs
Likelihood of the Observed Outputs
•  Likelihood of the observed data:

•  It is often easier (but equivalent) to try to maximize
the log-likelihood:

•  The maximum likelihood estimate of w is the one that
minimizes the mean squared residual error!
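A small numerical sketch of this equivalence (NumPy assumed, synthetic data; assuming the usual Gaussian noise model implied by the probabilistic view): the log-likelihood is a constant minus the sum of squared residuals divided by 2σ², so for any fixed σ the w that maximizes it is exactly the least-squares w.

import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=n)

def log_likelihood(w, sigma2):
    resid = y - X @ w
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares estimate
w_other = w_ls + np.array([0.1, -0.1])        # any other parameter setting

print(log_likelihood(w_ls, 0.09) > log_likelihood(w_other, 0.09))   # True

# The MLE of the noise variance (next slide) is the mean squared residual at w_ls
sigma2_hat = np.mean((y - X @ w_ls) ** 2)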
The MLE of σ
•  The maximum likelihood estimate of the noise variance σ² is
σ̂² = (1/n) Σi (yi − f(xi))², with the predictions taken at the
maximum likelihood setting of the parameters,
i.e. the mean squared prediction error.
Numerical Solution
•  Matrix inversion is computationally very expensive
–  Θ(n³) for n features

•  Using the analytical form to compute the optimal
solution may not be feasible even for moderate
values of n

•  Also only possible if XᵀX is not singular → multi-
collinearity problem
–  Determinant is zero
–  Not full rank
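A brief NumPy illustration of the singular (multicollinear) case, on made-up toy data: when one column of X duplicates another, XᵀX is singular and an explicit inverse is unusable, whereas a least-squares routine based on the pseudo-inverse still returns a solution.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x, 2 * x])   # third column duplicates the second, so X^T X is singular
y = np.array([1.0, 3.0, 5.0, 7.0])

print(np.linalg.det(X.T @ X))                 # (numerically) zero determinant
# np.linalg.inv(X.T @ X) would raise or be numerically meaningless here;
# lstsq uses a pseudo-inverse (SVD) and still returns a minimum-norm solution
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)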
Gradient Descent
•  General algorithm for optimization
–  Assign values to the Θi's to minimize J(Θ)
[Figure: sum of squares of error J(Θ) plotted against the regression coefficients (wi)]
Gradient Descent
Gradient descent in more dimensions

Gradient descent: step size and stopping condition
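The gradient descent figures from these slides are not reproduced here; as a stand-in, here is a minimal batch gradient descent sketch for linear regression (NumPy assumed; the learning rate, tolerance, and toy data are illustrative choices, not values from the slides):

import numpy as np

def gradient_descent(X, y, lr=0.01, tol=1e-6, max_iters=10000):
    """Minimize the sum of squared errors J(w) = ||y - Xw||^2 by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = -2 * X.T @ (y - X @ w)          # gradient of J(w) with respect to w
        w_new = w - lr * grad                  # update = step size (learning rate) times gradient
        if np.linalg.norm(w_new - w) < tol:    # stopping condition: parameters stop changing
            return w_new
        w = w_new
    return w

# Toy data: y = 1 + 2x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
X = np.column_stack([np.ones(100), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=100)
print(gradient_descent(X, y, lr=0.005))        # close to [1, 2]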
Comments on gradient descent
Variations on gradient descent
•  Batch Gradient Descent
–  Update the weights after calculating the error over all
training examples (one epoch)
–  Pros: Stable convergence, computationally efficient
–  Cons: Might get stuck in a local minimum
•  Stochastic Gradient Descent
–  Recalculate the weights for each sample
–  Pros: Might avoid local minima
–  Cons: Error jumps around, computationally expensive
•  Mini-batch Gradient Descent
–  Hybrid approach (see the sketch below)
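A compact sketch of the three variants (NumPy assumed; the learning rate, epoch count, and batch size are illustrative, not from the slides). The only difference between them is how many samples contribute to each weight update.

import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=50, batch_size=1, rng=None):
    """Mini-batch gradient descent for squared loss. batch_size=len(X) gives batch
    gradient descent; batch_size=1 gives stochastic gradient descent."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                         # shuffle samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = -2 * Xb.T @ (yb - Xb @ w) / len(idx)    # mean gradient over the batch
            w -= lr * grad
    return w

# Batch: batch_size=len(X); stochastic: batch_size=1; mini-batch: e.g. batch_size=32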
Extending Application of Linear Regression
•  The inputs X for linear regression can be:
–  Original quantitative inputs
–  Transformation of quantitative inputs, e.g. log, exp,
square root, square, etc.
–  Polynomial transformation
•  Example: y = w0 + w1⋅x + w2⋅x² + w3⋅x³
–  Dummy coding of categorical inputs
•  Binary variable for each value of the categorical variable
–  Interactions between variables
•  Example: x3 = x1 ⋅ x2 (this is what statisticians call an interaction)
•  This allows linear regression techniques to fit much more
complicated, non-linear datasets (see the polynomial-fit example below).
Non-linear functions
Basis Functions
Different Basis Functions
Example of fitting polynomial curve with linear model
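The slide's polynomial-fit figure is not reproduced here; as a stand-in, a minimal NumPy sketch of the same idea (the cubic target and noise level are made-up illustrative choices). The non-linearity lives entirely in the transformed inputs, so the fit itself is still ordinary linear least squares:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 3.0 * x**3 + rng.normal(scale=0.1, size=200)

# Polynomial basis expansion: columns 1, x, x^2, x^3 -> the model stays linear in the weights
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # approximately [1.0, 0.5, -2.0, 3.0]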
