Lecture 3
REGRESSION ANALYSIS – MULTIPLE REGRESSION
AGENDA
Last class:
Ŷi = 0.326 + 0.1578 Xi   For every $1 increase in taxi fare, what can we expect?
r² = 0.5533   What does it say about our model?
H0: β1 = 0   The p-value is very, very close to 0, which implies…
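As a quick recap, here is a minimal sketch of how last class's simple regression could be reproduced in Python with statsmodels; the CSV file name and the fare/tip column names are assumptions about the TLC file, not something shown on the slides.

```python
# Sketch: re-fit last class's simple regression of tip on fare.
# File name and column names ("fare_amount", "tip_amount") are assumed.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("yellow_tripdata_2019-01.csv")

X = sm.add_constant(df["fare_amount"])   # adds the intercept column
y = df["tip_amount"]

model = sm.OLS(y, X).fit()
print(model.params)     # intercept b0 and slope b1 (the slide quotes 0.326 and 0.1578)
print(model.rsquared)   # r^2 (the slide quotes 0.5533)
print(model.pvalues)    # p-value for H0: beta1 = 0
```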
FORMULATION OF MULTIPLE REGRESSION MODEL
A multiple regression model relates one dependent variable to two or more independent variables through a linear function:
Yi = β0 + β1 X1i + β2 X2i + … + βK XKi + εi
where β0 is the population intercept and β1, β2, …, βK are the population slope coefficients (εi is the random error term).
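For reference, the same model written out in LaTeX, together with its fitted (sample) counterpart and the least-squares criterion the b's minimise; the notation simply mirrors the slides.

```latex
% Population multiple regression model (K X-variables)
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + \varepsilon_i

% Fitted (sample) regression equation
\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_K X_{Ki}

% The estimates b_0, \dots, b_K are chosen by least squares, minimising
\mathrm{SSE} = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
```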
Source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
TLC Trip Record Data: January 2019 Yellow Taxi Trip Records, published by NYC Taxi & Limousine Commission
We will need to "fix" this later…
EXAMPLE – USING CATEGORICAL (DUMMY) VARIABLES
Last time, we did a simple linear regression on taxi fare and tips.
We want to see if the location also affects the tip.
Column E (RatecodeID) has 2 possibilities: 1 = New York City, 2 = JFK Airport
Can we use column E as-is? Consider two trips from NYC and JFK, both with fares of $10.
If RatecodeID entered the model directly, the location term would contribute b2 for the NYC trip but 2·b2 for the JFK trip. Why should the JFK effect be forced to be exactly double the NYC effect? This is why we recode location as a 0/1 dummy variable.
USING CATEGORICAL (DUMMY) VARIABLES
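The recoding itself is a one-liner. A minimal sketch, assuming the TLC column names above; coding JFK as 1 and NYC as 0 is an illustrative choice, and the slides may use the opposite coding.

```python
# Sketch: turn RatecodeID (1 = NYC, 2 = JFK) into a 0/1 dummy variable.
# Coding JFK trips as 1 and NYC trips as 0 is an illustrative choice.
import pandas as pd

df = pd.read_csv("yellow_tripdata_2019-01.csv")         # file name assumed
df["is_jfk"] = (df["RatecodeID"] == 2).astype(int)      # 1 if JFK, 0 if NYC

# Location now enters the regression as a dummy: a JFK trip shifts the
# predicted tip by b2 * 1 and an NYC trip by b2 * 0, instead of 2*b2 vs b2.
```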
MODEL OUTPUT
Excel’s Output:
Ŷ = 1.3181 + 0.1485 X1 − 0.9501 X2 + 0.0404 X3 + 0.0503 X4
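For comparison with Excel's output, a hedged sketch of the same fit in statsmodels. X1 (pre-tip fare) and X2 (the location dummy) follow the slides; the slides do not identify X3 and X4, so the last two column names below are placeholders only.

```python
# Sketch: reproduce an Excel-style regression output with statsmodels.
# X1 = pre-tip fare, X2 = location dummy (per the slides); the remaining
# two regressors are placeholders, since the slides do not identify X3, X4.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("yellow_tripdata_2019-01.csv")          # file name assumed
df["is_jfk"] = (df["RatecodeID"] == 2).astype(int)

X = sm.add_constant(df[["fare_amount", "is_jfk", "passenger_count", "trip_distance"]])
y = df["tip_amount"]

results = sm.OLS(y, X).fit()
print(results.summary())   # coefficients, r^2, F-test and t-tests, like Excel's table
```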
INTERPRETATION OF ESTIMATES
r² and adjusted r²
F-test for overall model significance
t-test for a particular X-variable's significance
MEASURES OF VARIATION – r²
We can ALWAYS increase r² by adding variables that don't explain the changes in Y.
This is easier to see with less data: see the "r-squared comparison" tab in the spreadsheet.
There, we add one more column of 0/1s (1 = odd-numbered row, 0 = even-numbered row) and compare the output with and without it.
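A self-contained sketch (synthetic data, not the spreadsheet's numbers) of the same point: adding a meaningless odd/even column never lowers r², but adjusted r² typically drops.

```python
# Sketch with synthetic data: r^2 never decreases when a useless regressor
# is added, but adjusted r^2 can decrease.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(0, 50, n)                   # a "fare"-like variable
y = 0.3 + 0.15 * x + rng.normal(0, 1, n)    # a "tip"-like response with noise

odd_even = np.arange(n) % 2                 # alternating 0/1 column, no real information

m1 = sm.OLS(y, sm.add_constant(x)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x, odd_even]))).fit()

print(m1.rsquared, m2.rsquared)             # r^2 goes up (or stays the same)
print(m1.rsquared_adj, m2.rsquared_adj)     # adjusted r^2 typically goes down
```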
MEASURES OF VARIATION – r²
A degree of freedom* will be lost, as a slope coefficient has to be estimated for that new X-variable.
Did the new X-variable add enough explanatory power to offset the loss of one degree of freedom?
*Degrees of freedom: the number of independent pieces of information (data values) in the random sample. If K + 1 parameters (intercept and slopes) must be estimated before the sum of squared errors, SSE, can be calculated from a sample of size n, the degrees of freedom equal n − (K + 1) (the K + 1 coefficients b0, b1, …, bK).
MEASURES OF VARIATION – ADJUSTED r²
(Recall: r² = 1 − SSE/SST)
Adjusted r² = 1 − [SSE/(n − K − 1)] / [SST/(n − 1)] = 1 − (1 − r²)·(n − 1)/(n − K − 1)
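The formula translates directly into a tiny helper (the function name is mine, for illustration):

```python
# Adjusted r^2 from the formula above; the helper name is illustrative.
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """n = sample size, k = number of X-variables in the model."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# e.g. last class's simple regression: r^2 = 0.5533, one X-variable,
# and the slides' sample size of 197,103 observations
print(adjusted_r_squared(0.5533, 197_103, 1))   # barely differs from r^2 when n is huge
```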
EXAMPLE – ADJUSTED r²
Compare the models that we've built:
Number of observations: 197,103
SST: 1,163,798
OVERALL MODEL SIGNIFICANCE: F-TEST
Null H0: β1 = β2 = … = βK = 0 (none of the X-variables is linearly related to Y)
We want to REJECT the null hypothesis by showing that the probability of seeing our values of b1, b2, …, bK is "low" if H0 were indeed true.
F = MSR/MSE = [SSR/K] / [SSE/(n − K − 1)], with K and (n − K − 1) degrees of freedom (d.f.)
OVERALL MODEL SIGNIFICANCE: F-TEST
p-value = P(F ≥ F), the probability that an F random variable with K and (n − K − 1) d.f. exceeds the calculated statistic
[Figure: F-distribution with the upper-tail rejection region. Critical value F(α, K, n−K−1) = 2.37; the statistic calculated from the sample data is F = 61,664, far beyond the critical value, so H0 is rejected.]
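The same F-test done numerically; scipy's F distribution gives both the critical value and the p-value (the numbers plugged in are the ones quoted on the slide):

```python
# F-test for overall significance: compare the calculated F with the
# critical value, or equivalently look at the upper-tail p-value.
from scipy import stats

n, K = 197_103, 4          # sample size and number of X-variables (from the slides)
F_calc = 61_664            # F statistic quoted on the slide

critical_value = stats.f.ppf(0.95, K, n - K - 1)   # ~ 2.37 for alpha = 5%
p_value = stats.f.sf(F_calc, K, n - K - 1)         # P(F >= F_calc), essentially 0

print(critical_value, p_value)
```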
SIGNIFICANCE OF A PARTICULAR X-VARIABLE: T-TEST
Even if we reject H0 in our F-test, we cannot distinguish which X-variable(s) have a significant impact on the Y-variable.
t-test for a particular X-variable's significance:
Null H0: βi = 0 (Xi has no linear relationship with Y, given the presence of the other X-variable(s))
Alternative H1: βi ≠ 0 (Xi is linearly related to Y, given the presence of the other X-variable(s))
SIGNIFICANCE OF A PARTICULAR X-VARIABLE: T-TEST
Null H0: β1 = 0
Method 1: Rejection region approach
Test statistic T = b1 / (standard error of b1)
Reject H0 if |T| > C. V. = t(α/2, n−K−1)
If α = 5%, then t(0.025, n−5) ≈ 1.96
[Figure: Student's t-distribution with rejection regions of probability α/2 in each tail, at C. V. = −1.96 and +1.96; the calculated statistic t = 348.81 falls far beyond the upper critical value, so H0 is rejected.]
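A numeric companion to the rejection-region picture, using scipy's t distribution (t = 348.81 is the slide's statistic for the pre-tip fare):

```python
# Two-sided t-test for one slope coefficient.
from scipy import stats

n, K = 197_103, 4                 # sample size and number of X-variables
t_calc = 348.81                   # t statistic for the pre-tip fare (from the slide)

critical_value = stats.t.ppf(1 - 0.05 / 2, n - K - 1)    # ~ 1.96
p_value = 2 * stats.t.sf(abs(t_calc), n - K - 1)         # two-sided p-value, essentially 0

print(critical_value, p_value)
```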
EXAMPLE
Conclusion: the p-value is smaller than 5%, so reject H0. The pre-tip fare is significantly related to the tips, given the presence of the other X-variables.
What about the other variables?
According to the t-test results, the p-value for each of the four explanatory variables is smaller than 5%.
This indicates that each explanatory variable is significantly related to tips paid in NYC, given the presence of the other X-variables.
What does the table look like if there is an insignificant explanatory variable?
We added a fifth variable that labels rows as "odd" or "even" (see the "5var – odd/even" tab).
The p-value for "Odd/Even transaction" is LARGER than 5%, so we cannot reject H0. This indicates that the odd/even variable is not significantly related to tips paid in NYC, given the presence of the other X-variables.
VARIABLE SELECTION STRATEGIES
ALL POSSIBLE REGRESSIONS
Develop all possible regression models between the dependent variable and all possible combinations of independent variables.
If there are K X-variables to consider, there are (2^K − 1) possible regression models to develop (a sketch of this enumeration follows below).
The criteria for selecting the best model may include:
Mean squared error (MSE)
Adjusted r²
Disadvantages of all possible regressions:
No unique conclusion: different criteria can lead to different conclusions
Looks at overall model performance, but not individual variable significance
When there is a large number of potential X-variables, computational time can be long
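A compact sketch of all possible regressions with itertools and statsmodels, ranking the 2^K − 1 candidate models by adjusted r²; the candidate column names in the usage comment are placeholders, not the slides' exact variable set.

```python
# Sketch: enumerate all 2^K - 1 non-empty subsets of candidate X-variables
# and rank them by adjusted r^2.
from itertools import combinations
import pandas as pd
import statsmodels.api as sm

def all_possible_regressions(df: pd.DataFrame, y_col: str, x_cols: list[str]):
    results = []
    for k in range(1, len(x_cols) + 1):
        for subset in combinations(x_cols, k):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[y_col], X).fit()
            results.append((subset, fit.rsquared_adj))
    # best model first, by adjusted r^2
    return sorted(results, key=lambda item: item[1], reverse=True)

# Example usage (column names assumed):
# ranking = all_possible_regressions(df, "tip_amount",
#                                    ["fare_amount", "is_jfk", "passenger_count", "trip_distance"])
# print(ranking[:3])
```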
BACKWARD ELIMINATION
Step 1: Start with a model that contains all K candidate X-variables
Step 2: Identify the least significant X-variable using the t-test
Step 3: Remove this X-variable if its p-value is larger than the specified level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model without this X-variable; repeat steps 2 and 3 until every remaining X-variable is significant
FORWARD SELECTION
[Diagram: starting from a model containing nothing but the intercept, the candidate models {X1}, {X2}, {X3}, {X4}, {X5} are evaluated for individual variable significance; after X1 enters, the candidates become {X1, X2}, {X1, X3}, {X1, X4}, {X1, X5}, and so on.]
Step 1: Start with a model which only contains the intercept term
Step 2: Identify the most significant X-variable using the t-test
Step 3: Add this X-variable if its p-value is smaller than the specified level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model after including this X-variable; repeat steps 2 and 3 until all significant X-variables are entered (see the sketch below)
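A hedged Python sketch of forward selection driven by t-test p-values, following the steps above; the 5% entry threshold and the helper name are illustrative choices.

```python
# Sketch: forward selection based on each candidate's t-test p-value.
import pandas as pd
import statsmodels.api as sm

def forward_selection(df: pd.DataFrame, y_col: str, x_cols: list[str], alpha: float = 0.05):
    selected: list[str] = []
    remaining = list(x_cols)
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {}
        for x in remaining:
            X = sm.add_constant(df[selected + [x]])
            fit = sm.OLS(df[y_col], X).fit()
            pvals[x] = fit.pvalues[x]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:        # most significant candidate is not significant: stop
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```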
STEPWISE REGRESSION
Step 1: Start with a model which only contains the intercept term
Step 2: Identify the most significant X-variable; add this X-variable if its p-value is smaller than the specified level of significance, otherwise terminate the procedure
Step 3: Identify the least significant X-variable in the model; remove this X-variable if its p-value is larger than the specified level of significance
Step 4: Repeat steps 2 and 3 until all significant X-variables are entered and none of them has to be removed (a sketch follows below)
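Extending the forward-selection sketch with a removal pass gives a minimal stepwise procedure; the entry and removal thresholds are illustrative, with the removal threshold set above the entry threshold to discourage cycling.

```python
# Sketch: stepwise regression = forward entry plus backward removal each round.
import pandas as pd
import statsmodels.api as sm

def stepwise_selection(df: pd.DataFrame, y_col: str, x_cols: list[str],
                       alpha_in: float = 0.05, alpha_out: float = 0.10):
    selected: list[str] = []
    while True:
        changed = False
        # Entry step: add the most significant remaining variable, if significant.
        remaining = [x for x in x_cols if x not in selected]
        if remaining:
            pvals = {x: sm.OLS(df[y_col], sm.add_constant(df[selected + [x]])).fit().pvalues[x]
                     for x in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_in:
                selected.append(best)
                changed = True
        # Removal step: drop the least significant variable currently in the model.
        if selected:
            fit = sm.OLS(df[y_col], sm.add_constant(df[selected])).fit()
            in_model = fit.pvalues.drop("const")
            worst = in_model.idxmax()
            if in_model[worst] > alpha_out:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected
```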
PRINCIPLE OF MODEL BUILDING