
Introduction to Regression

Part A – Linear Models


Lecture Outline

• Linear models
• Estimate of the regression coefficients
• Model evaluation
• Interpretation

1
Predicting a Variable

Let’s imagine a scenario where we'd like to predict one variable using
another (or a set of other) variables.

Examples:
• Predicting the number of views a YouTube video will get next week
based on video length, the date it was posted, the previous number of
views, etc.
• Predicting which movies a Netflix user will rate highly based on their
previous movie ratings, demographic data, etc.
• Recommendation system

2
Data
The Advertising data set consists of the sales of a particular
product in 200 different markets, together with advertising budgets
for the product in each of those markets for three different
media: TV, radio, and newspaper. Everything is given in units
of $1000.

  TV     radio  newspaper  sales
  230.1  37.8   69.2       22.1
  44.5   39.3   45.1       10.4
  17.2   45.9   69.3       9.3
  151.5  41.3   58.5       18.5
  180.8  10.8   58.4       12.9

3
Response vs. Predictor Variables

There is an asymmetry in many of these problems:


The variable we would like to predict may be more
difficult to measure, may be more important than the
other(s), or may be directly or indirectly influenced by the
other variable(s).

Thus, we'd like to define two categories of variables:


• variables whose values we want to predict
• variables whose values we use to make our prediction

4
Response vs. Predictor Variables

X: predictors, features, covariates
Y: outcome, response variable, dependent variable

The data table has n observations (rows) and p predictors (columns):

  TV     radio  newspaper | sales
  230.1  37.8   69.2      | 22.1
  44.5   39.3   45.1      | 10.4
  17.2   45.9   69.3      | 9.3
  151.5  41.3   58.5      | 18.5
  180.8  10.8   58.4      | 12.9
5
Response vs. Predictor Variables
$X = X_1, \dots, X_p$, where $X_j = x_{1j}, \dots, x_{ij}, \dots, x_{nj}$, and $Y = y_1, \dots, y_n$.

X: predictors, features, covariates
Y: outcome, response variable, dependent variable

The data table has n observations (rows) and p predictors (columns):

  TV     radio  newspaper | sales
  230.1  37.8   69.2      | 22.1
  44.5   39.3   45.1      | 10.4
  17.2   45.9   69.3      | 9.3
  151.5  41.3   58.5      | 18.5
  180.8  10.8   58.4      | 12.9
6
Linear Models
Suppose we ask the question:

"How much more in sales do we expect if we double the TV advertising budget?"

We can answer questions like this by building a model, first assuming a simple form for $f$:

$$f(X) = \beta_0 + \beta_1 X$$

7
Linear Regression

… then it follows that our estimate is:

$$\hat{Y} = \hat{f}(X) = \hat{\beta}_1 X + \hat{\beta}_0$$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates of $\beta_0$ and $\beta_1$, respectively, that
we compute using the observations.

8
Estimate of the regression coefficients
For a given data set

9
Estimate of the regression coefficients (cont)
Is this line good?

10
Estimate of the regression coefficients (cont)
Maybe this one?

11
Estimate of the regression coefficients (cont)
Or this one?

12
Estimate of the regression coefficients (cont)
Question: Which line is the best?
For each observation $(x_i, y_i)$, we compute the residual as the absolute difference
between the observed and predicted value: $r_i = |y_i - \hat{y}_i|$.
Loss Function: Aggregate Residuals

How do we aggregate residuals across the entire dataset?

1. Max Absolute Error


2. Mean Absolute Error
3. Mean Squared Error
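
As a small numerical sketch (not from the slides), here is how these three aggregate error measures could be computed with NumPy; the arrays y_true and y_pred hold made-up example values.

```python
import numpy as np

# Hypothetical observed values and model predictions (illustrative only)
y_true = np.array([22.1, 10.4, 9.3, 18.5, 12.9])
y_pred = np.array([20.5, 11.2, 8.1, 17.9, 14.0])

residuals = y_true - y_pred

max_abs_error = np.max(np.abs(residuals))    # 1. Max Absolute Error
mean_abs_error = np.mean(np.abs(residuals))  # 2. Mean Absolute Error
mean_sq_error = np.mean(residuals ** 2)      # 3. Mean Squared Error

print(max_abs_error, mean_abs_error, mean_sq_error)
```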

14
Estimate of the regression coefficients (cont)

Again, we use MSE as our loss function,

$$L(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - (\beta_1 x_i + \beta_0)\right]^2.$$

We choose $\hat{\beta}_0$ and $\hat{\beta}_1$ in order to minimize the predictive errors made by
our model, i.e. to minimize our loss function.

The optimal values for $\hat{\beta}_0$ and $\hat{\beta}_1$ are then:

$$\hat{\beta}_0, \hat{\beta}_1 = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\, L(\beta_0, \beta_1).$$

We call this FITTING or TRAINING the model.

15
Optimization

How does one minimize a loss function?


The global minimum or maximum of
$L(\beta_0, \beta_1)$ must occur at a point where
the gradient (slope) is zero:

$$\nabla L = \left[\frac{\partial L}{\partial \beta_0}, \frac{\partial L}{\partial \beta_1}\right] = 0$$

• Brute Force: Try every combination


• Exact: Solve the above equation
• Greedy Algorithm: Gradient Descent

16
Optimization: Estimate of the regression coefficients
Brute force

One way to estimate $\operatorname{argmin}_{\beta_0, \beta_1} L$ is to calculate the loss function for every
possible $\beta_0$ and $\beta_1$, and then select the $\beta_0$ and $\beta_1$ that minimize the loss function.

Example: evaluate the loss function for different values of $\beta_1$ while $\beta_0$ is fixed to 6.

This is very computationally expensive when there are many coefficients.
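
A minimal sketch of this brute-force idea in NumPy, assuming a small one-predictor dataset (x, y) and a coarse grid of candidate coefficients; the data and grid ranges are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative data: one predictor x and a response y (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

def mse(beta0, beta1):
    """Mean squared error of the line y = beta1 * x + beta0 on this data."""
    return np.mean((y - (beta1 * x + beta0)) ** 2)

# Try every combination of beta0 and beta1 on a coarse grid
best = None
for beta0 in np.linspace(-10, 10, 201):
    for beta1 in np.linspace(-10, 10, 201):
        loss = mse(beta0, beta1)
        if best is None or loss < best[0]:
            best = (loss, beta0, beta1)

print("best loss, beta0, beta1:", best)
```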

17
Gradient Descent
When we can’t analytically solve for the stationary points of the gradient, we
can still exploit the information in the gradient.
The gradient ∇𝐿 at any point is the direction of the steepest increase. The
negative gradient is the direction of steepest decrease.
By repeatedly stepping in the direction of the negative gradient, we can eventually find the lowest point.
This method is called Gradient Descent.
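
A small sketch of gradient descent for simple linear regression (illustration only; the data, learning rate, and number of steps are arbitrary assumptions):

```python
import numpy as np

# Illustrative data (same made-up values as above)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

beta0, beta1 = 0.0, 0.0   # initial guess
lr = 0.01                 # learning rate (step size)

for _ in range(5000):
    y_hat = beta1 * x + beta0
    # Gradient of the MSE loss with respect to beta0 and beta1
    grad_beta0 = -2 * np.mean(y - y_hat)
    grad_beta1 = -2 * np.mean((y - y_hat) * x)
    # Step in the direction of the negative gradient
    beta0 -= lr * grad_beta0
    beta1 -= lr * grad_beta1

print(beta0, beta1)
```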

18
Estimate of the regression coefficients: analytical solution
Take the gradient of the loss function and find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ where the
gradient is zero:

$$\nabla L = \left[\frac{\partial L}{\partial \beta_0}, \frac{\partial L}{\partial \beta_1}\right] = 0$$

This does not usually yield a closed-form solution. However, for linear regression this
procedure gives us explicit formulae for $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$

where $\bar{y}$ and $\bar{x}$ are the sample means.

The line $\hat{Y} = \hat{\beta}_1 X + \hat{\beta}_0$ is called the regression line.
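
For reference, a sketch of these closed-form estimates in NumPy (using the same illustrative x and y as above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Explicit formulae for the simple linear regression coefficients
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)
```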
19
Evaluation: Test Error

We need to evaluate the fitted model on new data, that is, data the model
did not train on: the test data.

The training MSE here is 2.0, whereas the test MSE is 12.3.
The training data contains a strange point, an outlier, which confuses the model.

Fitting to meaningless patterns in the training data is called overfitting.
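
A sketch of how training and test MSE could be compared with scikit-learn; the data here is synthetic and hypothetical, and the split fraction is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 observations of one predictor
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 1))
y = 7 + 0.05 * X[:, 0] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_mse, test_mse)
```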

20
Evaluation: Model Interpretation

For linear models it’s important to interpret the parameters

• The MSE of one model is very small, but the slope is -0.05. That would mean the larger the budget, the lower the sales.
• The MSE of another model is very small, but the intercept is -0.5, which means that for a very small budget we would have negative sales.

21
Multi, Poly Regression and Model Selection
Part B: Multi-regression
Multiple Linear Regression

If you must guess someone's height, would you rather be told


• Their weight, only
• Their weight and gender
• Their weight, gender, and income
• Their weight, gender, income, and favorite number

Of course, you'd always want as much data about a person as possible.


Even though height and favorite number may not be strongly related, at
worst you could just ignore the information on favorite number. We want
our models to be able to take in lots of data as they make their
predictions.

23
Response vs. Predictor Variables

X: predictors, features, covariates
Y: outcome, response variable, dependent variable

The data table has n observations (rows) and p predictors (columns):

  TV     radio  newspaper | sales
  230.1  37.8   69.2      | 22.1
  44.5   39.3   45.1      | 10.4
  17.2   45.9   69.3      | 9.3
  151.5  41.3   58.5      | 18.5
  180.8  10.8   58.4      | 12.9
24
Multilinear Models

In practice, it is unlikely that any response variable Y depends solely on one predictor X.
Rather, we expect Y to be a function of multiple predictors, $f(X_1, \dots, X_J)$. Using the
notation we introduced last lecture,

$$Y = y_1, \dots, y_n, \qquad X = X_1, \dots, X_J, \qquad X_j = x_{1j}, \dots, x_{ij}, \dots, x_{nj},$$

we can still assume a simple form for $f$, a multilinear form:

$$f(X_1, \dots, X_J) = \beta_0 + \beta_1 X_1 + \dots + \beta_J X_J$$

Hence, $\hat{f}$ has the form:

$$\hat{f}(X_1, \dots, X_J) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_J X_J$$

25
Multiple Linear Regression

Given a set of observations,

$$\{(x_{1,1}, \dots, x_{1,J}, y_1), \dots, (x_{n,1}, \dots, x_{n,J}, y_n)\},$$

the data and the model can be expressed in vector notation,

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,J} \\ 1 & x_{2,1} & \dots & x_{2,J} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,J} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_J \end{pmatrix}.$$

26
Multilinear Model, example

For our data,

$$\text{Sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{Radio} + \beta_3 \times \text{Newspaper}$$

In linear algebra notation:

$$Y = \begin{pmatrix} \text{Sales}_1 \\ \vdots \\ \text{Sales}_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & \text{TV}_1 & \text{Radio}_1 & \text{News}_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \text{TV}_n & \text{Radio}_n & \text{News}_n \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_3 \end{pmatrix},$$

so that $Y = X \times \beta$.

27
Multiple Linear Regression

The model takes a simple algebraic form:

$$Y = X\beta + \epsilon$$

We will again choose the MSE as our loss function, which can be
expressed in vector notation as

$$\text{MSE}(\beta) = \frac{1}{n}\lVert Y - X\beta \rVert^2$$

Minimizing the MSE using vector calculus yields

$$\hat{\beta} = \left(X^\top X\right)^{-1} X^\top Y = \underset{\beta}{\operatorname{argmin}}\, \text{MSE}(\beta).$$
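
A sketch of this closed-form solution in NumPy (hypothetical design matrix; in practice np.linalg.lstsq or sklearn's LinearRegression is usually preferred over forming the inverse explicitly):

```python
import numpy as np

# Hypothetical data: n = 200 observations, J = 3 predictors
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 100, size=(200, 3))
y = 3 + X_raw @ np.array([0.05, 0.2, 0.01]) + rng.normal(0, 1, size=200)

# Prepend a column of ones for the intercept term
X = np.column_stack([np.ones(len(y)), X_raw])

# Normal equation: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```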

28
Interpreting multi-linear regression

For linear models, it is easy to interpret the model parameters.

But when we have a large number of predictors $X_1, \dots, X_J$, there will be a large
number of model parameters $\beta_1, \beta_2, \dots, \beta_J$.

Looking at the raw values of the $\beta$'s is impractical, so we visualize these values in a
feature importance graph.

The feature importance graph shows which predictors have the most impact on the
model's prediction.
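
One common way to draw such a graph is a bar chart of the fitted coefficients; this is a sketch with made-up data and hypothetical feature names, and it standardizes the predictors first so the coefficient magnitudes are comparable.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical advertising-style data: 3 predictors, 200 observations
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3 + X @ np.array([0.05, 0.2, 0.01]) + rng.normal(0, 1, size=200)

feature_names = ["TV", "radio", "newspaper"]   # hypothetical predictor names

# Standardize so the coefficient magnitudes are comparable across predictors
X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)

plt.barh(feature_names, model.coef_)           # one bar per coefficient
plt.xlabel("coefficient (standardized predictors)")
plt.title("Feature importance")
plt.show()
```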

29
Qualitative Predictors

So far, we have assumed that all variables are quantitative. But in
practice, often some predictors are qualitative.

Example: The Credit data set contains information about balance, age,
cards, education, income, limit, and rating for a number of potential
customers.

  Income   Limit  Rating  Cards  Age  Education  Gender  Student  Married  Ethnicity  Balance
  14.890   3606   283     2      34   11         Male    No       Yes      Caucasian  333
  106.02   6645   483     3      82   15         Female  Yes      Yes      Asian      903
  104.59   7075   514     4      71   11         Male    No       No       Asian      580
  148.92   9504   681     3      36   11         Female  No       No       Asian      964
  55.882   4897   357     2      68   16         Male    No       Yes      Caucasian  331
30
Qualitative Predictors

If the predictor takes only two values, then we create an indicator or
dummy variable that takes on two possible numerical values.

For example, for gender, we create a new variable:

$$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{if the } i\text{th person is male} \end{cases}$$

We then use this variable as a predictor in the regression equation:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is female} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th person is male} \end{cases}$$
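
A sketch of this in pandas/scikit-learn, using a hypothetical slice of a credit-style data set with made-up rows; the 0/1 mapping is the same idea as the indicator variable above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical slice of a credit-style data set
df = pd.DataFrame({
    "Gender":  ["Male", "Female", "Male", "Female", "Male"],
    "Balance": [333, 903, 580, 964, 331],
})

# Dummy variable: 1 if female, 0 if male
df["is_female"] = (df["Gender"] == "Female").astype(int)

model = LinearRegression().fit(df[["is_female"]], df["Balance"])
print(model.intercept_, model.coef_)   # estimated beta0, beta1
```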

31
Qualitative Predictors

Question: What is the interpretation of $\beta_0$ and $\beta_1$?

32
Qualitative Predictors

Question: What is the interpretation of $\beta_0$ and $\beta_1$?

• $\beta_0$ is the average credit card balance among males,

• $\beta_0 + \beta_1$ is the average credit card balance among females,

• and $\beta_1$ is the average difference in credit card balance between females
and males.

Example: Calculate $\beta_0$ and $\beta_1$ for the Credit data.

You should find $\beta_0 \approx \$509$ and $\beta_1 \approx \$19$.

33
More than two levels: One hot encoding

Often, the qualitative predictor takes more than two values (e.g. ethnicity in
the credit data).

In this situation, a single dummy variable cannot represent all possible


values.

We therefore create additional dummy variables:

$$x_{i,1} = \begin{cases} 1 & \text{if the } i\text{th person is Asian} \\ 0 & \text{if the } i\text{th person is not Asian} \end{cases}$$

$$x_{i,2} = \begin{cases} 1 & \text{if the } i\text{th person is Caucasian} \\ 0 & \text{if the } i\text{th person is not Caucasian} \end{cases}$$
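
A sketch with pandas' get_dummies (hypothetical ethnicity values); drop_first keeps one level as the baseline, which here happens to be African American because the levels are ordered alphabetically, matching the regression equation on the next slide.

```python
import pandas as pd

# Hypothetical qualitative predictor with three levels
df = pd.DataFrame({"Ethnicity": ["Asian", "Caucasian", "African American",
                                 "Asian", "Caucasian"]})

# One dummy column per level except the baseline level
dummies = pd.get_dummies(df["Ethnicity"], drop_first=True).astype(int)
print(dummies)
```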
34
More than two levels: One hot encoding

We then use these variables as predictors, and the regression equation becomes:

$$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is Asian} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if the } i\text{th person is Caucasian} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th person is African American} \end{cases}$$

Question: What is the interpretation of $\beta_0$, $\beta_1$, and $\beta_2$?

35
Polynomial Regression

36
Fitting non-linear data

Multi-linear models can fit large datasets with many


predictors. But the relationship between predictor and target
isn’t always linear.

We want a model

$$y = f_\beta(x),$$

where $f$ is a non-linear function and $\beta$ is a
vector of the parameters of $f$.

37
Polynomial Regression

The simplest non-linear model we can consider, for a response Y and a
predictor X, is a polynomial model of degree M:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_M x^M$$

Just as in the case of linear regression with cross terms, polynomial
regression is a special case of linear regression: we treat each power $x^m$ as a
separate predictor. Thus, we can write

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_1 & \dots & x_1^M \\ 1 & x_2 & \dots & x_2^M \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \dots & x_n^M \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_M \end{pmatrix}.$$

38
Polynomial Regression
This looks a lot like multi-linear regression where the predictors are
powers of x!
Multi-regression:

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{1,1} & \dots & x_{1,J} \\ 1 & x_{2,1} & \dots & x_{2,J} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,J} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_J \end{pmatrix}$$

Poly-regression:

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_1 & \dots & x_1^M \\ 1 & x_2 & \dots & x_2^M \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \dots & x_n^M \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_M \end{pmatrix}$$
39
Model Training

Given a dataset $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, we find the optimal
polynomial model

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_M x^M$$

as follows:

1. We transform the data by adding new predictors:
   $\tilde{x} = [1, \tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_M]$, where $\tilde{x}_m = x^m$.

2. We fit the parameters by minimizing the MSE using vector
   calculus, exactly as in multi-linear regression:

$$\hat{\boldsymbol{\beta}} = \left(\tilde{\boldsymbol{X}}^\top \tilde{\boldsymbol{X}}\right)^{-1} \tilde{\boldsymbol{X}}^\top \boldsymbol{y}$$
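
A sketch of these two steps with scikit-learn (hypothetical one-dimensional data; the degree is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=100)).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(0, 1, size=100)

# Step 1: transform x into [x, x^2, x^3] (the intercept is handled by the model)
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)

# Step 2: fit the coefficients by minimizing the MSE
model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)
```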
40
Polynomial Regression (cont)
Fitting a polynomial model requires choosing a degree.

• Degree 1, underfitting: when the degree is too low, the model cannot fit the trend.
• Degree 2: we want a model that fits the trend and ignores the noise.
• Degree 50, overfitting: when the degree is too high, the model fits all the noisy data points.

41
Feature Scaling
Do we need to scale our features for polynomial regression?

Linear regression, $Y = X\beta$, is invariant under scaling: if $X$ is scaled by some number
$\lambda$, then $\beta$ is scaled by $\frac{1}{\lambda}$ and the MSE is identical.

However, if the values of $X$ are very small or very large, we run into trouble. Consider a
polynomial of degree 20 where the maximum or minimum value of a predictor is very large
or very small: those numbers raised to the 20th power will be numerically problematic.

It is therefore always a good idea to scale $X$ when considering polynomial regression:

$$X^{\text{scaled}} = \frac{X - \bar{X}}{\sigma_X}$$

Note: sklearn's StandardScaler() can do this.
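
A sketch of scaling before building polynomial features (hypothetical data; putting both steps in a Pipeline keeps the train/test handling consistent):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical predictor with a large range
rng = np.random.default_rng(0)
x = rng.uniform(0, 10_000, size=(200, 1))
y = 1e-6 * x[:, 0] ** 2 + rng.normal(0, 1, size=200)

# Standardize first, then build polynomial features, then fit
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict(x[:5]))
```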


42
A high polynomial degree
leads to OVERFITTING!

43
