
LECTURE 3

REGRESSION ANALYSIS
- MULTIPLE REGRESSION

AGENDA

 Last class:
 Ŷi = 0.326 + 0.1578 Xi : For every $1 increase in taxi fare, what can we expect?
 r² = 0.5533 : What does it say about our model?
 H0: β1 = 0 : p-value is very, very close to 0, which implies…

 Basic Concepts of Multiple Linear Regression


 Using Categorical (Dummy) Variables
 Measures of Variation and Statistical Inference

FORMULATION OF MULTIPLE REGRESSION MODEL

 A multiple regression model relates one dependent variable to two or more independent variables through a linear function:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + \varepsilon_i$$

where Yi is the dependent variable, β0 the population intercept, β1, …, βK the population slope coefficients, X1i, …, XKi the independent variables, and εi the random error.

 K is the number of independent variables (e.g., K = 1 for simple linear regression)


 β0, β1, β2, …, βK are the K + 1 parameters in a multiple regression model with K independent variables
 b0, b1, b2, …, bK denote the sample intercept and sample slope coefficients
MULTIPLE REGRESSION, 2 EXPLANATORY VARIABLES

 Say we have 𝑛 data points or 𝑛 observations


 Our observations are in the form (X11, X21, Y1), (X12, X22, Y2), …, (X1n, X2n, Yn)

Observation #   Pre-tip fare (X1i)   RatecodeID (X2i): 1=NYC, 2=JFK   Tips (Yi)   (X1i, X2i, Yi)
#1              8.30                 1                                1.65        (8.30, 1, 1.65)
#2              15.30                1                                1.00        (15.30, 1, 1.00)
#3              7.80                 1                                1.25        (7.80, 1, 1.25)
#27             52.80                2                                5.00        (52.80, 2, 5.00)

Source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
TLC Trip Record Data: January 2019 Yellow Taxi Trip Records, published by NYC Taxi & Limousine Commission
(We will need to "fix" the RatecodeID column later…)
FORMULATION OF MULTIPLE REGRESSION MODEL

 Coefficients in a multiple regression net out the impact of each independent variable in the regression equation
 The estimated slope coefficient bj measures the change in the average value of Y as a result of a one-unit increase in Xj, holding all other independent variables constant: the "ceteris paribus" effect

$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_j X_{ji} + \cdots + b_K X_{Ki}$$

EXAMPLE – USING CATEGORICAL (DUMMY) VARIABLES

 Last time, we did a simple linear regression on taxi fare and tips.
 We want to see if the location also affects the tip.
 Column E (RatecodeID) has 2 possibilities: 1= New York City, 2 = JFK Airport
 Can we use column E as-is? Consider two trips from NYC and JFK, both with
fares of $10.

Observation #i   Pre-tip fare (X1i)   RatecodeID (X2i): 1=NYC, 2=JFK   What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i
e.g. 1           10.00                1                                Ŷ1 = b0 + 10 b1 + b2
e.g. 2           10.00                2                                Ŷ2 = b0 + 10 b1 + 2 b2

b2 vs. 2 b2? Double the bonus?
USING CATEGORICAL (DUMMY) VARIABLES

 Column E (RatecodeID) has 2 possibilities: 1= New York City, 2 = JFK Airport


 Let’s define a new column: AreaID. We are "inside" the area if we are in NYC, and "outside" the area if we are NOT in NYC (i.e., JFK, etc.).
 We can pre-process the data so that 𝑋2𝑖 = 1 if we are inside NYC and 𝑋2𝑖 = 0
if we are outside NYC

Observation #i   Pre-tip fare (X1i)   AreaID (X2i): 1=NYC, 0=non-NYC   What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i
e.g. 1           10.00                1                                Ŷ1 = b0 + 10 b1 + b2
e.g. 2           10.00                0                                Ŷ2 = b0 + 10 b1


USING CATEGORICAL (DUMMY) VARIABLES
 𝑋2𝑖 = 1 if we are inside NYC and 𝑋2𝑖 = 0 if we are outside NYC
 Interpretation:
 If 𝑏2 > 0: Everything else remaining constant, we expect to receive a bonus tip of
$|𝑏2 | when we pick up a passenger in NYC
 If b2 < 0: Everything else remaining constant, we expect our tip to decrease by $|b2| when we pick up a passenger in NYC.
 This variable captures a fixed tip difference for NYC vs. non-NYC trips, NOT a change in the tip percentage!

Observation #i   Pre-tip fare (X1i)   AreaID (X2i): 1=NYC, 0=non-NYC   What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i
e.g. 1           10.00                1                                Ŷ1 = b0 + 10 b1 + b2
e.g. 2           10.00                0                                Ŷ2 = b0 + 10 b1

USING CATEGORICAL (DUMMY) VARIABLES

 Useful when an explanatory variable isn’t numerical (e.g. colours, locations)


 Use 0, 1 variables: 0 = “is not, does not fit definition”, 1 = “is, fits definition”
 If a category has 𝑐 choices, then we need 𝑐 − 1 categorical variables
 E.g., product design: a product can be red, yellow, or blue. We want to see how colour affects popularity. In a regression model, we need 2 categorical variables:
 𝑋1 = 1 if it is red, and 0 otherwise
 𝑋2 = 1 if it is yellow, and 0 otherwise

Obs #i           Red? (X1i)   Yellow? (X2i)   What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i + ⋯
e.g. 1 (Red)     1            0               Ŷ1 = b0 + b1 + ⋯
e.g. 2 (Yellow)  0            1               Ŷ2 = b0 + b2 + ⋯
e.g. 3 (Blue)    0            0               Ŷ3 = b0 + ⋯
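As a rough sketch, the same c − 1 encoding can be produced in code. This example uses pandas (an assumption; the course itself works in Excel), and the data and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical product data with a 3-level categorical variable
df = pd.DataFrame({
    "colour": ["red", "yellow", "blue", "red"],
    "popularity": [120, 95, 80, 130],
})

# c = 3 colours -> c - 1 = 2 dummy columns. drop_first=True drops one
# level (alphabetically "blue" here), which becomes the baseline case
# where X1 = X2 = 0.
dummies = pd.get_dummies(df["colour"], prefix="is", drop_first=True).astype(int)
print(df.join(dummies))   # adds 0/1 columns is_red and is_yellow
```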
BUILDING THE MODEL
 After fixing the categorical variable for AreaID, we can fill in the regression
window.
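The fitting itself is done in Excel's regression window here. Purely as an illustrative sketch, the equivalent fit in Python with statsmodels might look like the following; the file name and column names are hypothetical placeholders, not the actual course spreadsheet:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("yellow_taxi_jan2019.csv")         # assumed file name

# Pre-process RatecodeID into the 0/1 AreaID dummy described above
df["AreaID"] = (df["RatecodeID"] == 1).astype(int)  # 1 = NYC, 0 = non-NYC (JFK)

X = sm.add_constant(df[["pre_tip_fare", "AreaID"]])  # add_constant supplies b0
model = sm.OLS(df["tips"], X).fit()
print(model.summary())  # coefficients, r-squared, F-statistic, t-tests, p-values
```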

MODEL OUTPUT
 Excel’s Output:

$$\hat{Y} = 1.3771 + 0.1488\,X_1 - 0.9521\,X_2$$


*Scientific notation: 1.7284E−226 = 1.7284 × 10^(−226) ≈ 0


INTERPRETATION OF ESTIMATES

 The estimated multiple regression equation:

$$\hat{Y} = 1.3771 + 0.1488\,X_1 - 0.9521\,X_2$$

 Ŷ = estimated taxi tips in NYC in $
 X1 = pre-tip amount in $
 X2 = area indicator (NYC = 1, non-NYC (JFK) = 0)
 Interpretation of the estimated slope coefficients:
 b1 = 0.1488 says that the estimated average tip increases by $0.1488 for each $1 increase in pre-tip taxi fare, given that the other independent variables remain constant
 b2 = −0.9521 says that the estimated average tip decreases by $0.9521 when the trip starts in NYC instead of JFK, given that the other independent variables remain constant
COMPARISON OF MODELS
 Suppose we add more explanatory variables
 𝑋1 = Pre-tip amount in $
 𝑋2 = Area indicator (NYC =1, Non-NYC (JFK) = 0)
 𝑋3 = # of riders
 X4 = New Year's Day indicator (Jan 1 = 1, otherwise = 0)


$$\hat{Y} = 1.3181 + 0.1485\,X_1 - 0.9501\,X_2 + 0.0404\,X_3 + 0.0503\,X_4$$

INTERPRETATION OF ESTIMATES

 Multiple regression model:

$$\hat{Y} = 1.3181 + 0.1485\,X_1 - 0.9501\,X_2 + 0.0404\,X_3 + 0.0503\,X_4$$

 The estimated slope coefficients:
 b1 = 0.1485 says that the estimated average tip increases by $0.1485 for each $1 increase in pre-tip taxi fare, holding all other things equal
 b2 = −0.9501 says that the estimated average tip decreases by $0.9501 when the trip starts in NYC instead of non-NYC (JFK), holding all other things equal
 b3 = 0.0404 says that the estimated average tip increases by $0.0404 for each additional rider, holding all other things equal
 b4 = 0.0503 says that the estimated average tip increases by $0.0503 if the trip is on New Year's Day, holding all other things equal
EVALUATE THE MODEL

 𝑟 2 and adjusted 𝑟 2
 F-test for overall model significance
 t-test for the significance of a particular X-variable

MEASURES OF VARIATION - 𝑟 2

 $\hat{Y} = 1.3181 + 0.1485\,X_1 - 0.9501\,X_2 + 0.0404\,X_3 + 0.0503\,X_4$

 Total variation of the Y-variable is made up of two parts:

$$SST = SSR + SSE$$

where

$$SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 \text{ (total)}, \quad SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 \text{ (regression)}, \quad SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 \text{ (error)}$$

(X-variables in this model: pre-tip fare, area, # of passengers, New Year's Day)
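A minimal sketch of this decomposition in code, assuming `y` is an array of observed tips and `y_hat` the fitted values from a regression like the one above:

```python
import numpy as np

def variation_measures(y: np.ndarray, y_hat: np.ndarray):
    """Return (SST, SSR, SSE) for observed y and fitted y_hat."""
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)      # total variation
    ssr = np.sum((y_hat - y_bar) ** 2)  # variation explained by the regression
    sse = np.sum((y - y_hat) ** 2)      # unexplained (error) variation
    return sst, ssr, sse

# r^2 is the explained share of the total variation:
#   r^2 = SSR / SST = 1 - SSE / SST
```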
MEASURES OF VARIATION - 𝑟 2

 We can ALWAYS increase 𝑟 2 by adding variables that don’t explain the changes in 𝑌
 Easier to see with less data. See “r-squared comparison” tab in spreadsheet
 We add one more column of 0/1s: 1 = odd-numbered row, 0 = even-numbered row


MEASURES OF VARIATION - 𝑟 2

 What is the net effect of adding a new 𝑋-variable?


 r² increases, even if the new X-variable explains an insignificant proportion of the variation of the Y-variable
 Is it fair to use r² for comparing models with different numbers of X-variables?

 A degree of freedom* will be lost, as a slope coefficient has to be estimated for that
new 𝑋-variable
 Did the new 𝑋-variable add enough explanatory power to offset the loss of one degree of
freedom?

 Degrees of freedom of the residual = n − (K + 1) = n − 1 − K

*Degrees of freedom: the number of independent pieces of information (data values) in the random sample.
If K + 1 parameters (the coefficients b0, b1, …, bK: intercept and slopes) must be estimated before the sum of squared errors, SSE, can be calculated from a sample of size n, the degrees of freedom equal n − (K + 1).
MEASURES OF VARIATION – ADJUSTED 𝑟 2

(Recall: $r^2 = 1 - \frac{SSE}{SST}$)

 Adjusted $r^2 = 1 - \dfrac{SSE/(n-K-1)}{SST/(n-1)} = 1 - \dfrac{n-1}{n-K-1}\,(1 - r^2)$

 Measures the proportion of variation of the Y values that is explained by the regression equation with independent variables X1, X2, …, XK, after adjusting for the sample size (n) and the number of X-variables used (K)
 Smaller than or equal to r², and can be negative
 Penalizes the excessive use of X-variables
 Useful for comparing models with different numbers of X-variables

EXAMPLE – ADJUSTED 𝑟 2
 Compare the models that we’ve built
 Number of Observations: 197,103
 SST: 1,163,798

                                1 explanatory variable   2 explanatory variables   4 explanatory variables
                                (pre-tip fare)           (pre-tip fare, area ID)
Degrees of freedom (residual)   197,101                  197,100                   197,098
SSE                             519,852                  517,136                   516,911
r²                              0.553314                 0.555647                  0.555841
Adjusted r²                     0.553312                 0.555643                  0.555832
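As a sanity check, the r² and adjusted r² rows can be reproduced from the n, SST, SSE, and K values given above:

```python
n, sst = 197_103, 1_163_798

for k, sse in [(1, 519_852), (2, 517_136), (4, 516_911)]:
    r2 = 1 - sse / sst                                  # 1 - SSE/SST
    adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))  # adjusted for d.f.
    print(f"K={k}: r2={r2:.6f}, adjusted r2={adj_r2:.6f}")
```

With n this large relative to K, the degrees-of-freedom penalty is tiny, which is why r² and adjusted r² differ only in the sixth decimal place here.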
INFERENCE: OVERALL MODEL SIGNIFICANCE

 Is the model significant? Do we need a model?


 F-test

OVERALL MODEL SIGNIFICANCE: F-TEST

 F-test for the overall model significance

 Null hypothesis 𝐻0 : 𝛽1 = 𝛽2 = ⋯ = 𝛽𝐾 = 0 (none of the 𝑋-variables affects 𝑌)

 Alternative hypothesis: 𝐻1 : 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝛽𝑖 ≠ 0 (at least one 𝑋-variable affects 𝑌)

 We want to REJECT the null hypothesis by showing that the probability of seeing our values of b1, b2, …, bK is "low" if H0 were indeed true.

 F-statistic:

$$F = \frac{MSR}{MSE} = \frac{SSR/K}{SSE/(n-K-1)} \quad \text{with } K,\ (n-1-K) \text{ degrees of freedom (d.f.)}$$

(MSR is the mean square for SSR; MSE is the mean square for SSE)

OVERALL MODEL SIGNIFICANCE: F-TEST

 $F = \frac{MSR}{MSE} = \frac{SSR/K}{SSE/(n-K-1)}$ with K, (n − 1 − K) degrees of freedom (d.f.)

 First decide on the size of the rejection region α (one tail), i.e., the level of significance

 Method 1 (with F-table): rejection region approach

 Reject H0 if F > critical value (C.V.) = F_{α, K, (n−K−1)}

 Method 2 (with Excel output): p-value approach

 p-value = P(F ≥ the calculated F-statistic)

 Reject H0 if p-value < α
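A small sketch of both approaches in code, assuming SSR, SSE, n, and K are already known (scipy's `f.sf` gives the upper-tail probability P(F ≥ x), and `f.ppf` gives the critical value):

```python
from scipy.stats import f

def overall_f_test(ssr: float, sse: float, n: int, k: int, alpha: float = 0.05):
    """F-test of H0: beta_1 = ... = beta_K = 0."""
    f_stat = (ssr / k) / (sse / (n - k - 1))   # MSR / MSE
    p_value = f.sf(f_stat, k, n - k - 1)       # P(F >= f_stat)
    crit = f.ppf(1 - alpha, k, n - k - 1)      # rejection-region cutoff
    reject = p_value < alpha                   # equivalently: f_stat > crit
    return f_stat, p_value, crit, reject
```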


OVERALL MODEL SIGNIFICANCE: F-TEST
[Figure: probability distribution of F, with α = 0.05. α = tail area = P(F ≥ C.V.), where C.V. = F_{α, K, (n−K−1)} = 2.37. The F-statistic calculated from the sample data is F = 61,664, so p-value = P(F ≥ 61,664) ≈ 0 < 5%, and H0 is rejected at the 5% significance level.]
SIGNIFICANCE OF A PARTICULAR X-VARIABLE: T-TEST
 Even if we reject the 𝐻0 in our F-test, we cannot distinguish which 𝑋-variable(s)
has a significant impact on the 𝑌-variable
 t-test for a particular 𝑋-variable’s significance
 Null 𝐻0 : 𝛽𝑖 = 0 (𝑋𝑖 has no linear relationship with 𝑌, given presence of other 𝑋-
variable(s))
 Alternative 𝐻1 : 𝛽𝑖 ≠ 0 (𝑋𝑖 is linearly related to 𝑌, given presence of other 𝑋-
variable(s))

SIGNIFICANCE OF A PARTICULAR X-VARIABLE: T-TEST

 Null H0: β1 = 0
 Method 1: rejection region approach
 Reject H0 if |t| > C.V. = t_{α/2, (n−K−1)}

 Method 2: p-value approach

 p-value = P(|T| ≥ |t|)
 Reject H0 if p-value < α

[Figure: Student's t-distribution centred at β1 = 0, with rejection regions of area α/2 in each tail. If α = 5%, the critical values are ±t_{0.025, (n−5)} ≈ ±1.96. The t-statistic calculated from the sample (the coefficient estimate divided by its standard error) is t = 348.81, far beyond the critical value.]
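The same p-value computation can be sketched in code, using the t = 348.81 and the n − 5 residual degrees of freedom shown in the figure (scipy is an assumption here):

```python
from scipy.stats import t

n, k = 197_103, 4
t_stat = 348.81                             # observed t from the figure
p_value = 2 * t.sf(abs(t_stat), n - k - 1)  # two-tailed: P(|T| >= |t|)
crit = t.ppf(1 - 0.05 / 2, n - k - 1)       # ~1.96 for alpha = 5%
print(p_value, crit)                        # p-value underflows to 0.0: reject H0
```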
EXAMPLE

 Conclusion: p-value is smaller than 5%, so reject 𝐻0 . The pre-tip fare is significantly
related to the tips, given presence of other 𝑋-variables.
 What about the other variables?


 According to the t-test results, the p-value for each of the four explanatory variables is smaller than 5%.
 This indicates each explanatory variable is significantly related to tips paid in NYC,
given presence of other 𝑋-variables.

*Scientific notation: 6.41657E−08 = 6.41657 × 10^(−8) = 0.0000000642 ≈ 0


EXAMPLE

 What does the table look like if there is an insignificant explanatory variable?
 Added a fifth variable to label rows as "odd" or "even" (see the "5var – odd/even" tab)

 The p-value for "Odd/Even transaction" is LARGER than 5%, so we cannot reject H0. This indicates that the odd/even indicator is not significantly related to tips paid in NYC, given the presence of other X-variables.
VARIABLES SELECTION STRATEGIES

 Some of the independent variables are insignificant based on t-test results


 We may consider eliminating insignificant independent variables using the following
methods:
 All possible regressions
 Backward elimination
 Forward selection
 Stepwise regression

ALL POSSIBLE REGRESSIONS

 To develop all the possible regression models between the dependent variable
and all possible combinations of independent variables
 If there are K X-variables to consider using, there are (2^K − 1) possible regression models to be developed
 The criteria for selecting the best model may include
 Mean square error (MSE)
 Adjusted 𝑟 2
 Disadvantages of all possible regressions:
 No unique conclusion: different criteria will lead to different conclusions
 Looks at overall model performance, but not individual variable significance
 When there is a large number of potential X-variables, computational time can be long

BACKWARD ELIMINATION

 Evaluate individual variable significance (e.g., starting from candidates X1, X2, X3, X4, X5); a code sketch follows the steps below

Step 1: Build a model using all potential X-variables
Step 2: Identify the least significant X-variable using the t-test
Step 3: Remove this X-variable if its p-value is larger than the specified level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model after removing this X-variable (e.g., continuing with X1, X2, X3, X4), and repeat steps 2 and 3 until all remaining X-variables are significant
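A minimal sketch of this loop built on statsmodels; the DataFrame and column names are assumptions, and this is one straightforward way to code the procedure rather than a canonical implementation:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, y_col: str, x_cols: list[str],
                       alpha: float = 0.05):
    """Repeatedly drop the least significant X-variable until all p-values < alpha."""
    x_cols = list(x_cols)
    while x_cols:
        model = sm.OLS(df[y_col], sm.add_constant(df[x_cols])).fit()
        pvals = model.pvalues.drop("const")   # intercept is never a candidate
        worst = pvals.idxmax()                # least significant X-variable
        if pvals[worst] <= alpha:             # everything significant: stop
            return model, x_cols
        x_cols.remove(worst)                  # Step 3: remove, then refit
    return None, []                           # no variable survived
```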

FORWARD SELECTION
 Evaluate individual variable significance

Step 1: Start with a model which contains only the intercept term (no X-variables)
Step 2: Identify the most significant X-variable using the t-test (e.g., try X1, X2, X3, X4, X5 one at a time; if X1 enters first, next try {X1, X2}, {X1, X3}, {X1, X4}, {X1, X5})
Step 3: Add this X-variable if its p-value is smaller than the specified level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model after including this X-variable, and repeat steps 2 and 3 until all significant X-variables are entered

STEPWISE REGRESSION

 Evaluate individual variable significance


 An 𝑋-variable entering can later leave; an 𝑋-variable eliminated can later go back in

Step 1: Start with a model which only contains the intercept term
Step 2: Identify the most significant 𝑋-variable, add this 𝑋-variable if its p-value is smaller
than the specified level of significance; otherwise terminate the procedure
Step 3: Identify the least significant 𝑋-variable from the model, remove this 𝑋-variable if
its p-value is larger than the specified level of significance
Step 4: Repeat steps 2 and 3 until all significant 𝑋-variables are entered and none of them
have to be removed

PRINCIPLE OF MODEL BUILDING

 A good model should


 Have few independent variables
 Have high predictive power
 Have low correlation between independent variables
 Be easy to interpret

