Lecture 3
REGRESSION ANALYSIS
- MULTIPLE REGRESSION
AGENDA
FORMULATION OF MULTIPLE REGRESSION
MODEL
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K + \varepsilon$
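As a rough illustration of how the coefficients of such a model could be estimated by least squares, here is a minimal pure-Python sketch that solves the normal equations. The function name `fit_ols` and the five data rows are made up for illustration (they are not the NYC taxi data). The tips are generated exactly from 0.5 + 0.2 × fare + 0.1 × riders with no error term, so the recovered coefficients are easy to check.

```python
def fit_ols(rows, y):
    """Return [b0, b1, ..., bK] that minimise the sum of squared errors."""
    # Prepend a column of ones so that b0 acts as the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    M = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]


# Hypothetical observations: (pre-tip fare, number of riders).
rows = [(10, 1), (20, 2), (15, 1), (30, 3), (25, 1)]
# Tips built without noise from 0.5 + 0.2*fare + 0.1*riders.
tips = [0.5 + 0.2 * fare + 0.1 * riders for fare, riders in rows]

coefficients = fit_ols(rows, tips)
print([round(b, 4) for b in coefficients])
```

In practice a spreadsheet or statistics package would do this step. The sketch only shows that the intercept and the slopes are estimated jointly from all X-variables.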
EXAMPLE
Recall the example from the last topic: we wish to find possible factors that
affect taxi tips in NYC. The relationship between the taxi fare and the size of
the tip was estimated using a 2-variable regression model.
Today we wish to include more factors that could possibly affect tips:
Area
Number of riders
Holiday reasons
……
MULTIPLE LINEAR REGRESSION
Excel’s Output:
INTERPRETATION OF ESTIMATES
COMPARISON OF MODELS
Suppose we run another linear regression model that uses only the pre-tip taxi
fare and the # of riders as independent variables
EVALUATE THE MODEL
$r^2$ and adjusted $r^2$
F-test for overall model significance
t-test for a particular 𝑋-variable significance
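MEASURES OF VARIATION -- $r^2$
$r^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
where
$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$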
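The three sums of squares can be checked numerically. The sketch below is only an illustration: the helper name `variation_measures` and the four observations are made up (they are not the taxi data), and the two predictors are coded as -1/+1 so that the least-squares fit, Y-hat = 3 + 1.5 X1 + 1 X2, can be verified by hand. It uses the usual definition of $r^2$ as SSR divided by SST.

```python
def variation_measures(y, y_hat):
    """Return (SST, SSR, SSE) for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    return sst, ssr, sse


# Made-up data with two coded predictors.
x1 = [-1, 1, -1, 1]
x2 = [-1, -1, 1, 1]
y = [1, 3, 2, 6]
# Least-squares fitted values for this data set.
y_hat = [3 + 1.5 * a + 1.0 * b for a, b in zip(x1, x2)]

sst, ssr, sse = variation_measures(y, y_hat)
r_squared = ssr / sst
print(sst, ssr, sse)
print(round(r_squared, 4))
```

For this data SST = 14, SSR = 13 and SSE = 1, so the explained and unexplained parts add up to the total and $r^2$ is 13/14. That additivity holds only when the fitted values come from a least-squares fit that includes an intercept.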
MEASURES OF VARIATION -- $r^2$
[Figure: Venn diagram of the variation in Tips, Taxi-fare and # of riders]
The blue part: SSE, the variation in tips attributable to factors other than
taxi fare and # of riders
The grey, orange and purple parts: SSR, the total variation of the 𝑌-variable that is
explained by the regression equation with the independent variables
MEASURES OF VARIATION -- $r^2$
When a new 𝑋-variable is added to the model, a degree of freedom is lost, because a
slope coefficient has to be estimated for that new 𝑋-variable
Did the new 𝑋-variable add enough explanatory power to offset the loss of one degree of
freedom?
MEASURES OF VARIATION – ADJUSTED $r^2$
$\text{Adjusted } r^2 = 1 - (1 - r^2)\,\dfrac{n - 1}{n - K - 1}$
EXAMPLE
Compare the model that uses only the pre-tip amount against the model that uses 5
independent variables. Which one fits better?
Number of observations: 197,103 vs 197,103
Degrees of freedom ($n - K - 1$): 197,101 vs 197,097
$r^2$: 0.5533 vs 0.6075
Adjusted $r^2$: 0.5533 vs 0.6075
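The figures in this example can be reproduced with a few lines of Python. This is a sketch that assumes the standard adjusted $r^2$ formula, 1 - (1 - r²)(n - 1)/(n - K - 1); the function name `adjusted_r2` is chosen here for illustration, and the inputs are the rounded values quoted above.

```python
def adjusted_r2(r2, n, k):
    """Adjusted r-squared for n observations and k X-variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = 197_103
adj_one_variable = adjusted_r2(0.5533, n, 1)    # model with the pre-tip amount only
adj_five_variables = adjusted_r2(0.6075, n, 5)  # model with 5 independent variables

print(round(adj_one_variable, 4), round(adj_five_variables, 4))
```

With almost 200,000 observations the ratio (n - 1)/(n - K - 1) is extremely close to 1, so the adjustment does not change either value at four decimal places. The penalty for extra 𝑋-variables only becomes visible when n is small relative to K.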
INFERENCE: OVERALL MODEL SIGNIFICANCE
p-value $= P(F_{K,\,n-K-1} > F)$
Reject $H_0$ if $F >$ C.V. $F_{\alpha,\,K,\,n-K-1}$, or p-value $< \alpha$
INFERENCE: A PARTICULAR X-VARIABLE SIGNIFICANCE
$t$ with $n - K - 1$ d.f.
p-value $= P(|t_{n-K-1}| > |t|)$
Reject $H_0$ if $|t| >$ C.V. $t_{\alpha/2,\,n-K-1}$, or p-value $< \alpha$
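A small sketch of this decision rule follows. It makes two assumptions that are not in the slides: the t statistic is taken to be the estimated slope divided by its standard error (the usual form), and the t distribution is approximated by the standard normal, which is reasonable only when n - K - 1 is very large, as in the taxi example. The slope 0.2 and standard error 0.05 are hypothetical numbers, and `t_test_large_sample` is an illustrative name.

```python
from statistics import NormalDist


def t_test_large_sample(coefficient, std_error, alpha=0.05):
    """Two-sided test of H0: slope = 0, using a normal approximation to t."""
    t_stat = coefficient / std_error
    z = NormalDist()  # standard normal, standing in for t with many d.f.
    p_value = 2 * (1 - z.cdf(abs(t_stat)))
    critical_value = z.inv_cdf(1 - alpha / 2)
    reject = abs(t_stat) > critical_value
    return t_stat, p_value, critical_value, reject


# Hypothetical slope estimate and standard error.
t_stat, p_value, critical_value, reject = t_test_large_sample(0.2, 0.05)
print(round(t_stat, 2), round(critical_value, 2), reject)
```

Both forms of the rule agree: |t| exceeds the critical value exactly when the p-value is below α. For small samples the exact t distribution with n - K - 1 degrees of freedom should be used instead of the normal approximation.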
EXAMPLE
For the model that uses 5 independent variables, is the overall model significant?
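One way to get a feel for the answer is to rebuild the F statistic from the reported $r^2$. The sketch below assumes the usual definition F = MSR/MSE = (SSR/K) / (SSE/(n - K - 1)); dividing the numerator and denominator by SST expresses it in terms of $r^2$. The inputs are the rounded figures quoted for the 5-variable model, so the result is an approximation and will not match a software output exactly. The critical value 2.21 is an approximate 5% table value for F with 5 and a very large number of degrees of freedom.

```python
def f_statistic_from_r2(r2, n, k):
    """Overall F statistic computed from r-squared, n observations, k X-variables."""
    explained_per_df = r2 / k                    # (SSR / SST) / K
    unexplained_per_df = (1 - r2) / (n - k - 1)  # (SSE / SST) / (n - K - 1)
    return explained_per_df / unexplained_per_df


f_stat = f_statistic_from_r2(0.6075, 197_103, 5)
critical_value = 2.21  # approximate table value, alpha = 0.05, d.f. = (5, very large)
print(f_stat > critical_value)
```

The statistic is in the tens of thousands, far beyond the critical value, so the null hypothesis that all slope coefficients are zero would be rejected at any conventional level of significance.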
EXAMPLE
According to the t-test results, the p-values for all five independent variables are
smaller than 5%, indicating that all of them are significantly related to tips paid in
NYC.
VARIABLES SELECTION STRATEGIES
ALL POSSIBLE REGRESSIONS
Develop all the possible regression models between the dependent variable
and all possible combinations of independent variables
If there are 𝐾 𝑋-variables to consider, there are $2^K - 1$ possible
regression models to be developed
The criteria for selecting the best model may include
MSE
Adjusted $r^2$
Disadvantages of all possible regressions
No unique conclusion: different criteria lead to different conclusions
Looks at overall model performance, but not at individual variable significance
When there is a large number of potential 𝑋-variables, computational time can be long
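The count of candidate models can be made concrete by listing every non-empty combination of 𝑋-variables. The variable names below are placeholders loosely based on the factors mentioned earlier, and `all_subsets` is an illustrative helper, not part of any particular package.

```python
from itertools import combinations


def all_subsets(variables):
    """Return every non-empty combination of the candidate X-variables."""
    subsets = []
    for size in range(1, len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets


candidates = ["fare", "riders", "area", "holiday"]  # placeholder names, K = 4
models = all_subsets(candidates)
print(len(models))  # 2**4 - 1 = 15
for model in models[:5]:
    print(model)
```

Each tuple would correspond to one regression to fit and score (for example by MSE or adjusted $r^2$). The count doubles with every extra candidate, which is why computation time grows quickly: 20 candidates already give more than a million models.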
BACKWARD ELIMINATION
FORWARD SELECTION
Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant 𝑋-variable using t-test
Step 3: Add this 𝑋-variable if its p-value is smaller than the specified level of
significance; otherwise terminate the procedure
Step 4: Develop a new regression model after including this 𝑋-variable, repeat steps
2 and 3 until all significant 𝑋-variables are entered
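The four steps above can be sketched as a short loop. To keep the example self-contained, the t-test is replaced by a lookup table of made-up p-values: `toy_p_value(variable, others)` stands for "the p-value of `variable` when it is in a model together with `others`". In real use that function would refit the regression and read the p-value from its output. All names and numbers here are hypothetical.

```python
def forward_selection(candidates, p_value, alpha=0.05):
    """Add X-variables one at a time while the best candidate is significant."""
    selected = []                 # Step 1: intercept-only model
    remaining = list(candidates)
    while remaining:
        # Step 2: the most significant candidate has the smallest p-value.
        best = min(remaining, key=lambda v: p_value(v, selected))
        # Step 3: add it only if it is significant; otherwise stop.
        if p_value(best, selected) >= alpha:
            break
        # Step 4: extend the model and repeat.
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up p-values, keyed by (variable, other variables in the model).
TOY_P_VALUES = {
    ("fare", frozenset()): 0.001,
    ("riders", frozenset()): 0.010,
    ("area", frozenset()): 0.200,
    ("riders", frozenset({"fare"})): 0.020,
    ("area", frozenset({"fare"})): 0.300,
    ("area", frozenset({"fare", "riders"})): 0.400,
}


def toy_p_value(variable, others):
    return TOY_P_VALUES[(variable, frozenset(others))]


chosen = forward_selection(["fare", "riders", "area"], toy_p_value)
print(chosen)
```

With these toy numbers the procedure adds fare, then riders, and stops when area fails the significance check. Once a variable is added it is never reconsidered, which is the main difference from stepwise regression.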
STEPWISE REGRESSION
Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant 𝑋-variable, add this 𝑋-variable if its p-value is
smaller than the specified level of significance; otherwise terminate the procedure
Step 3: Identify the least significant 𝑋-variable from the model, remove this 𝑋-
variable if its p-value is larger than the specified level of significance
Step 4: Repeat steps 2 and 3 until all significant 𝑋-variables are entered and none of
them have to be removed
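A sketch of the stepwise loop follows, again with a table of made-up p-values in place of real t-tests. `toy_p_value(variable, others)` is a hypothetical stand-in for the p-value of `variable` in a model that also contains `others`. The numbers are chosen so that x1 enters first but becomes insignificant once x2 is in the model. The `max_rounds` guard is an addition of this sketch, not part of the procedure as stated.

```python
def stepwise(candidates, p_value, alpha=0.05, max_rounds=100):
    """Alternate between adding the best and removing the worst X-variable."""
    model = []  # Step 1: intercept-only model
    for _ in range(max_rounds):
        remaining = [v for v in candidates if v not in model]
        if not remaining:
            break
        # Step 2: add the most significant candidate, or stop if none qualifies.
        best = min(remaining, key=lambda v: p_value(v, model))
        if p_value(best, model) >= alpha:
            break
        model.append(best)
        # Step 3: remove the least significant variable if it is not significant.
        worst = max(model, key=lambda v: p_value(v, [m for m in model if m != v]))
        if p_value(worst, [m for m in model if m != worst]) > alpha:
            model.remove(worst)
        # Step 4: repeat.
    return model


# Made-up p-values, keyed by (variable, other variables in the model).
TOY_P_VALUES = {
    ("x1", frozenset()): 0.010,
    ("x2", frozenset()): 0.020,
    ("x3", frozenset()): 0.030,
    ("x2", frozenset({"x1"})): 0.001,
    ("x3", frozenset({"x1"})): 0.040,
    ("x1", frozenset({"x2"})): 0.300,
    ("x3", frozenset({"x2"})): 0.200,
}


def toy_p_value(variable, others):
    return TOY_P_VALUES[(variable, frozenset(others))]


final_model = stepwise(["x1", "x2", "x3"], toy_p_value)
print(final_model)
```

Here x1 is added, then x2 is added, x1 is then dropped because its p-value rises to 0.30, and the procedure ends with only x2. Forward selection on the same numbers would have kept x1 in the model.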
PRINCIPLE OF MODEL BUILDING