Regression
• Linear models
• Estimate of the regression coefficients
• Model evaluation
• Interpretation
Predicting a Variable
Let’s imagine a scenario where we'd like to predict one variable using
another variable (or a set of other variables).
Examples:
• Predicting the number of views a YouTube video will get next week
based on video length, the date it was posted, the previous number of
views, etc.
• Predicting which movies a Netflix user will rate highly based on their
previous movie ratings, demographic data, etc.
• Recommendation system
Data
The Advertising data set consists of the sales of a particular
product in 200 different markets, together with the advertising budgets
for the product in each of those markets across three media: TV, radio, and newspaper.
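A minimal sketch of loading such a data set with pandas; the file name Advertising.csv and the column layout are assumptions about how the data is stored:

```python
import pandas as pd

# Load the advertising data; the file name and columns are assumed, not confirmed.
df = pd.read_csv("Advertising.csv")

print(df.head())   # first few markets
print(df.shape)    # expected to be (200, number_of_columns)
```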
Response vs. Predictor Variables
Response vs. Predictor Variables
X: predictors, features, covariates
Y: outcome, response variable, dependent variable
n: the number of observations
p: the number of predictors
Response vs. Predictor Variables
$X = (X_1, \dots, X_p)$, where $X_j = (x_{1j}, \dots, x_{ij}, \dots, x_{nj})$, and $Y = (y_1, \dots, y_n)$

X: predictors, features, covariates
Y: outcome, response variable, dependent variable
n: the number of observations
p: the number of predictors
Linear Models
The simplest assumption is a linear relationship between the predictor and the response:

$f(X) = \beta_0 + \beta_1 X$
Linear Regression
$\hat{Y} = \hat{f}(X) = \hat{\beta}_1 X + \hat{\beta}_0$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates of $\beta_0$ and $\beta_1$, respectively, that
we compute using the observations.
Estimate of the regression coefficients
For a given data set, we want to find the line that best fits the observations.
Estimate of the regression coefficients (cont)
Is this line good?
Estimate of the regression coefficients (cont)
Maybe this one?
Estimate of the regression coefficients (cont)
Or this one?
Estimate of the regression coefficients (cont)
Question: Which line is the best?
For each observation $(x_i, y_i)$, we compute the absolute residual $r_i = |y_i - \hat{y}_i|$.
Loss function: aggregate the residuals over all observations, for example with the mean squared error,
$L(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
Estimate of the regression coefficients (cont)
Then the optimal values for $\hat{\beta}_0$ and $\hat{\beta}_1$ should be:

$\hat{\beta}_0, \hat{\beta}_1 = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\; L(\beta_0, \beta_1)$

We call this fitting or training the model.
Optimization
Optimization: Estimate of the regression coefficients
Brute force
A way to estimate $\operatorname{argmin}_{\beta_0, \beta_1} L$ is to calculate the loss function for every
possible $\beta_0$ and $\beta_1$, then select the $\beta_0$ and $\beta_1$ that minimize the loss function.
Example: evaluate the loss function for different values of $\beta_1$ while $\beta_0$ is fixed at 6.
This becomes very computationally expensive when there are many coefficients.
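A minimal sketch of this brute-force search on synthetic data (the data, grid ranges, and grid resolution are illustrative assumptions):

```python
import numpy as np

# Synthetic data standing in for one predictor and the response.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 6 + 2 * x + rng.normal(0, 1, size=50)

def mse(beta0, beta1, x, y):
    """Mean squared error of the line beta0 + beta1 * x."""
    return np.mean((y - (beta0 + beta1 * x)) ** 2)

# Evaluate the loss on a grid of candidate coefficients and keep the best pair.
beta0_grid = np.linspace(0, 12, 200)
beta1_grid = np.linspace(-5, 5, 200)
best = min(
    ((b0, b1) for b0 in beta0_grid for b1 in beta1_grid),
    key=lambda b: mse(b[0], b[1], x, y),
)
print("brute-force estimate:", best)
```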
Gradient Descent
When we can’t analytically solve for the points where the gradient is zero, we
can still exploit the information in the gradient.
The gradient $\nabla L$ at any point is the direction of steepest increase; the
negative gradient is the direction of steepest decrease.
By repeatedly stepping in the direction of the negative gradient, we can eventually reach the lowest point.
This method is called Gradient Descent.
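A small sketch of gradient descent for the two-coefficient MSE loss (the learning rate and iteration count are illustrative, untuned choices):

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iter=5000):
    """Fit y ~ beta0 + beta1 * x by following the negative gradient of the MSE."""
    beta0, beta1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        residuals = y - (beta0 + beta1 * x)
        # Partial derivatives of MSE = (1/n) * sum(residuals ** 2)
        grad_beta0 = -2.0 / n * residuals.sum()
        grad_beta1 = -2.0 / n * (residuals * x).sum()
        beta0 -= lr * grad_beta0
        beta1 -= lr * grad_beta1
    return beta0, beta1
```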
Estimate of the regression coefficients: analytical solution
Take the gradient of the loss function and find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ where the
gradient is zero:

$\nabla L = \left( \dfrac{\partial L}{\partial \beta_0}, \dfrac{\partial L}{\partial \beta_1} \right) = 0$

This does not usually yield a closed-form solution. However, for linear regression this
procedure gives us explicit formulae for $\hat{\beta}_0$ and $\hat{\beta}_1$:

$\hat{\beta}_1 = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
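A small sketch of these closed-form estimates in code (array names are illustrative; x and y are 1-D NumPy arrays of equal length):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least-squares estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat
```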
We need to evaluate the fitted model on new data, that is, data the model
did not train on: the test data.
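A minimal sketch of holding out test data (the split fraction and random seed are illustrative choices):

```python
import numpy as np

def train_test_split(x, y, test_frac=0.2, seed=0):
    """Randomly split the observations into a training set and a held-out test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return x[train], y[train], x[test], y[test]
```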
Evaluation: Model Interpretation
• The MSE of this model is very small, but the slope is -0.05. That would mean the larger the budget, the lower the sales.
• The MSE is very small, but the intercept is -0.5, which means that for a very small budget we would have negative sales.
Multi, Poly Regression and Model Selection
Part B: Multi-regression
Multiple Linear Regression
Response vs. Predictor Variables
X: predictors, features, covariates
Y: outcome, response variable, dependent variable
n: the number of observations
p: the number of predictors
Multilinear Models
In practice, it is unlikely that any response variable Y depends solely on one predictor x.
Rather, we expect Y to be a function of multiple predictors, $f(X_1, \dots, X_p)$. Using the
notation we introduced last lecture:
Multiple Linear Regression
$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{1,1} & \dots & x_{1,J} \\
1 & x_{2,1} & \dots & x_{2,J} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & \dots & x_{n,J}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_J \end{pmatrix}$$
Multilinear Model, example
[Figure: $Y = X\beta$ illustrated as a matrix product]
Multiple Linear Regression
$Y = X\beta + \epsilon$

We will again choose the MSE as our loss function, which can be
expressed in vector notation as

$\mathrm{MSE}(\beta) = \dfrac{1}{n}\,\lVert Y - X\beta \rVert^2$

Minimizing the MSE using vector calculus yields

$\hat{\beta} = \left( X^\top X \right)^{-1} X^\top Y = \underset{\beta}{\operatorname{argmin}}\, \mathrm{MSE}(\beta)$
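A brief sketch of computing this estimate (using a least-squares solver rather than forming the inverse explicitly; the array shapes are assumptions):

```python
import numpy as np

def fit_multiple_ols(X_raw, y):
    """Estimate beta_hat for Y = X beta + eps; X_raw has shape (n, J), y has shape (n,)."""
    # Prepend a column of ones for the intercept.
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    # lstsq solves the normal equations in a numerically stable way.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat
```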
Interpreting multi-linear regression
Qualitative Predictors
Example: the Credit data set contains qualitative predictors such as Gender, Student, Married, and Ethnicity alongside quantitative ones. One observation:

Income   Limit   Rating   Cards   Age   Education   Gender   Student   Married   Ethnicity   Balance
14.890   3606    283      2       34    11          Male     No        Yes       Caucasian   333
More than two levels: One hot encoding
Often, the qualitative predictor takes more than two values (e.g. ethnicity in
the credit data).
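A minimal sketch of one-hot encoding such a predictor with pandas (the column name and values mirror the Credit example and are illustrative):

```python
import pandas as pd

# Each level of the qualitative predictor becomes its own 0/1 indicator column.
df = pd.DataFrame({"Ethnicity": ["Caucasian", "Asian", "African American", "Caucasian"]})
one_hot = pd.get_dummies(df["Ethnicity"], prefix="Ethnicity")
print(one_hot)
```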
Polynomial Regression
Fitting non-linear data
We want a model:

$y = f_\beta(x)$

where $f$ is a non-linear function and $\beta$ is a vector of the parameters of $f$.
Polynomial Regression
Polynomial Regression
This looks a lot like multi-linear regression where the predictors are
powers of x!
Multi-Regression:

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{1,1} & \dots & x_{1,J} \\
1 & x_{2,1} & \dots & x_{2,J} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & \dots & x_{n,J}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_J \end{pmatrix}$$

Poly-Regression:

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_1 & \dots & x_1^M \\
1 & x_2 & \dots & x_2^M \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_n & \dots & x_n^M
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_M \end{pmatrix}$$
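A short sketch of fitting the polynomial model by building the design matrix of powers of x (the degree M and the data arrays are illustrative):

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Polynomial regression as multiple regression on 1, x, x**2, ..., x**degree."""
    X = np.vander(x, N=degree + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat
```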
Model Training
• Underfitting: when the degree is too low, the model cannot fit the trend.
• We want a model that fits the trend and ignores the noise.
• Overfitting: when the degree is too high, the model fits all the noisy data points.
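A small sketch, on synthetic data, of how train and test error behave as the degree grows (the data, split, and degrees are illustrative assumptions):

```python
import numpy as np

# Noisy non-linear data; every other point is held out as a test set.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 60))
y = np.sin(3 * x) + rng.normal(0, 0.2, size=60)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

for degree in (1, 3, 15):
    beta = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(beta, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(beta, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```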
Feature Scaling
Do we need to scale our features for polynomial regression?
If the range of $X$ is very small or very large, we run into trouble. Consider a
polynomial of degree 20: raising the largest or smallest predictor values to the 20th power
produces numbers that are numerically problematic (overflow, underflow, and instability).
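A minimal sketch of min-max scaling a predictor before constructing polynomial features (the target range of [-1, 1] is an illustrative choice):

```python
import numpy as np

def minmax_scale(x, lo=-1.0, hi=1.0):
    """Rescale x to the interval [lo, hi] so that high powers of x stay well behaved."""
    return lo + (hi - lo) * (x - x.min()) / (x.max() - x.min())
```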