
10/24/24

BES: Business and Economics Statistics

Lecture 10: Multiple Regression

Outline: Intro | Model | Least Squares Method | Inference | Some issues

Multiple Regression Model
Example: Housing Values in Suburbs of Boston

This is one of the most common datasets used to learn how multiple linear
regression works. It contains information collected by the U.S. Census
about housing in the suburbs of Boston. The Boston data frame has 506 rows
and 14 columns (features).

With this data, our objective is to build a linear regression model to
predict house prices.

Example: Housing Values in Suburbs of Boston

Suppose the characteristics that turn out to be important include:

• CRIM: per capita crime rate by town
• INDUS: proportion of non-retail business acres per town
• RM: average number of rooms per dwelling
• DIS: weighted distances to five Boston employment centres
• RAD: index of accessibility to radial highways
• PTRATIO: pupil-teacher ratio by town
• LSTAT: lower status of the population (percent)
• MEDV: median value of owner-occupied homes in $1000s

Multiple Regression Model

Multiple regression analysis is the study of how a dependent variable y is
related to two or more independent variables.

The equation that describes how the dependent variable y is related to the
independent variables and an error term ε is called the multiple
regression model:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ + ε

Multiple Regression Model

The equation that describes how the mean value of y is related to the
independent variables is called the multiple regression equation:

E(y) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ

Estimated multiple regression equation:

ŷ = b₀ + b₁x₁ + b₂x₂ + ⋯ + bₚxₚ

where b₀, b₁, b₂, …, bₚ are the estimates of the unknown parameters
β₀, β₁, β₂, …, βₚ, and ŷ is the predicted value of the dependent variable.

Estimation Process


Population (unknown parameters β₀, β₁, …, βₚ):
  Multiple regression model: y = β₀ + β₁x₁ + ⋯ + βₚxₚ + ε
  Multiple regression equation: E(y) = β₀ + β₁x₁ + ⋯ + βₚxₚ

Sample data: values of x₁, x₂, …, xₚ and y for each of n observations.

Sample statistics b₀, b₁, …, bₚ provide estimates of β₀, β₁, …, βₚ:
  Estimated multiple regression equation: ŷ = b₀ + b₁x₁ + ⋯ + bₚxₚ

Least Squares Criterion

min Σ(yᵢ − ŷᵢ)²

where
• yᵢ = observed value of the dependent variable for the iᵗʰ observation
• ŷᵢ = predicted value of the dependent variable for the iᵗʰ observation
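The criterion above can be sketched in code. The following is a minimal pure-Python illustration (not the lecture's R workflow): it fits a multiple regression by solving the normal equations (XᵀX)b = Xᵀy, which is how the sum of squared residuals is minimized; the data values and function names are made up for illustration.

```python
def solve_linear_system(A, v):
    """Solve A b = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_least_squares(X, y):
    """Return [b0, b1, ..., bp] minimizing sum((y_i - yhat_i)^2).

    X is a list of observations, each a list of p predictor values.
    """
    Z = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k, n = len(Z[0]), len(Z)
    XtX = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(k)]
    return solve_linear_system(XtX, Xty)

# Illustrative data generated from y = 2 + 3*x1 - 1*x2 with no noise,
# so least squares should recover the coefficients exactly.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3], [4, 1]]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]
b = fit_least_squares(X, y)
```

In practice one lets software such as R's lm() do this, but the fitted coefficients are the same solution of the least squares criterion.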

Example: Housing Values in Suburbs of Boston

• Initially the managers believed that the house price would be closely
related to the percentage of lower status of the population (lstat).

• After reviewing the scatter diagram, the managers hypothesized that a
simple linear regression model could be used to describe the relationship
between the two variables.

Example: Housing Values in Suburbs of Boston

[Scatter diagram of housing value (medv) versus lstat; figure not
reproduced]

Example: Housing Values in Suburbs of Boston

> mean(Crossvalidation)
[1] 6.200729

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max
 -15.168   -3.990   -1.318    2.034   24.500

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.55384    0.56263   61.41   <2e-16 ***
lstat        -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16

Example: Housing Values in Suburbs of Boston
From the R output above, the estimated regression equation is

ŷ = b₀ + b₁x₁ = 34.554 − 0.950·lstat

• At the .05 level of significance, the t value of −24.53 and its
associated p-value (< 2e-16) indicate that the relationship is
significant; higher housing values are associated with lower levels of
lstat.

• With a coefficient of determination of R² = 0.5441, we see that 54.41%
of the variability in housing values can be explained by the linear
effect of lstat.

Example: Housing Values in Suburbs of Boston

This is a fairly good fit, but the managers might want to consider adding
more independent variables to explain some of the remaining variability
in the housing values.

Summary statistics:

            mean       med         sd     min    max
rm      6.284634    6.2085  0.7026171   3.561   8.78
lstat  12.653063   11.3600  7.1410615   1.730  37.97
medv   22.532806   21.2000  9.1971041   5.000  50.00

Example: Housing Values in Suburbs of Boston

Scatterplot matrix of the three variables; each square is a pairwise
scatter plot. [Figure not reproduced]

Example: Housing Values in Suburbs of Boston
> mean(Crossvalidation)
[1] 5.520818

Call:
lm(formula = medv ~ rm + lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max
 -18.076   -3.516   -1.010    1.909   28.131

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.35827    3.17283  -0.428    0.669
rm            5.09479    0.44447  11.463   <2e-16 ***
lstat        -0.64236    0.04373 -14.689   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.54 on 503 degrees of freedom
Multiple R-squared: 0.6386, Adjusted R-squared: 0.6371
F-statistic: 444.3 on 2 and 503 DF, p-value: < 2.2e-16

Inference in Multiple Linear Regression

• Model parameters: intercept and slopes

• Inference tasks:
  • Multiple coefficient of determination
  • Model assumptions
  • Testing for significance
  • Estimation and prediction

Example: Housing Values in Suburbs of Boston
• In simple linear regression, we interpret b₁ as an estimate of the
change in y for a one-unit change in the independent variable.

• In multiple regression analysis, we interpret each regression
coefficient as follows: bᵢ represents an estimate of the change in y
corresponding to a one-unit change in xᵢ when all other independent
variables are held constant.

Multiple Coefficient of Determination
The term multiple coefficient of determination indicates that we are
measuring the goodness of fit for the estimated multiple regression
equation. It is computed as

R² = SSR / SST

Relationship among SST, SSR and SSE:

• Total sum of squares: SST = Σ(yᵢ − ȳ)²
• Sum of squares due to regression: SSR = Σ(ŷᵢ − ȳ)²
• Sum of squares due to error: SSE = Σ(yᵢ − ŷᵢ)²

where each sum runs over i = 1, …, n; these satisfy SST = SSR + SSE.
Multiple Coefficient of Determination
The multiple coefficient of determination can be interpreted as the
proportion of the variability in the dependent variable that is explained
by the estimated multiple regression equation.

Example: Housing Values in Suburbs of Boston

With R² = 0.6386, 63.86% of the variability in housing value y is
explained by the estimated multiple regression equation with average
number of rooms per dwelling (rm) and lower status of the population
(lstat) as the independent variables.
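As a numeric sketch of the sums of squares and R² (pure Python, using a small made-up sample rather than the full Boston data), the following computes SST, SSR, SSE and R² for a simple least-squares fit; note that SST = SSR + SSE holds for any least-squares fit that includes an intercept.

```python
# Illustrative data (not the full Boston sample)
x = [4.98, 9.14, 4.03, 2.94, 5.33, 5.21]   # lstat-like predictor
y = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7]   # medv-like response

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept for yhat = b0 + b1*x
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares
ssr = sum((fi - ybar) ** 2 for fi in yhat)            # regression sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # error sum of squares

r2 = ssr / sst   # coefficient of determination
```

Because the fit is least squares with an intercept, the decomposition SST = SSR + SSE holds exactly (up to floating-point error), so r2 could equivalently be computed as 1 − SSE/SST.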

Adjusted Multiple Coefficient of Determination
• R² measures the goodness of fit of a single model and is not a fair
criterion for comparing models of different sizes.

• The adjusted R² criterion is more suitable because it avoids
overestimating the impact of adding an independent variable on the
amount of variability explained by the estimated regression equation.

Adjusted Multiple Coefficient of Determination
With n denoting the number of observations and p denoting
the number of independent variables:

Rₐ² = 1 − (1 − R²)(n − 1)/(n − p − 1)

The larger Rₐ² is, the better the model.
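The formula is a one-liner in code. The following pure-Python sketch applies it to the Boston output shown earlier (R² = 0.6386, n = 506, p = 2); the result is close to the 0.6371 that R reports, with the small difference coming from using the rounded R² as input.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Boston example: R^2 = 0.6386, n = 506, p = 2
ra2 = adjusted_r2(0.6386, 506, 2)
```

Note that adding a variable always increases p, so Rₐ² rises only if the gain in R² outweighs the (n − 1)/(n − p − 1) penalty.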
Assumptions about the error term ε

1. The error ε is a random variable with mean zero: E(ε) = 0.

2. The variance of ε, denoted by σ², is the same for all values of the
independent variables.

3. The values of ε are independent.

4. The error ε is a normally distributed random variable.

Testing for Significance
The F test (overall significance) is used to determine whether a
significant relationship exists between the dependent variable and the
set of all the independent variables.

The t test (individual significance) is used to determine whether each of
the individual independent variables is significant.

The F Test for Overall Significance
• The hypotheses:
  H₀: β₁ = β₂ = ⋯ = βₚ = 0
  Hₐ: one or more of the parameters is not equal to zero

• Test statistic:
  F = MSR / MSE

• Rejection rule (p-value approach): reject H₀ if p-value ≤ α.
  (A critical value approach can also be used.)

If H₀ is rejected, we have sufficient statistical evidence to conclude
that one or more of the parameters is not equal to zero and that the
overall relationship between y and the set of independent variables is
significant. Otherwise, we do not have sufficient evidence to conclude
that a significant relationship is present.
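Since MSR = SSR/p and MSE = SSE/(n − p − 1), dividing both by SST lets the F statistic be written purely in terms of R². The pure-Python sketch below uses that form on the Boston output (R² = 0.6386, n = 506, p = 2); it comes out close to the 444.3 that R reports, with the small gap due to rounding R².

```python
def f_statistic(r2, n, p):
    """Overall F statistic, written in terms of R^2:
    F = MSR/MSE = (SSR/p) / (SSE/(n-p-1)) = (R^2/p) / ((1-R^2)/(n-p-1)).
    """
    msr_over_sst = r2 / p                 # MSR divided by SST
    mse_over_sst = (1 - r2) / (n - p - 1) # MSE divided by SST
    return msr_over_sst / mse_over_sst

# Boston example: R^2 = 0.6386, n = 506, p = 2
f = f_statistic(0.6386, 506, 2)
```

The observed F is then compared with the F distribution on p and n − p − 1 degrees of freedom (here 2 and 503) to obtain the p-value.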

The t Test for Individual Significance
• The hypotheses:
  H₀: βᵢ = 0
  Hₐ: βᵢ ≠ 0

• Test statistic:
  t = bᵢ / s_bᵢ

  where s_bᵢ is the estimated standard error of bᵢ.

• Rejection rule (p-value approach): reject H₀ if p-value ≤ α.
  (A critical value approach can also be used.)

If H₀ is rejected, we have sufficient statistical evidence to conclude
that a statistically significant relationship exists between xᵢ and y.
Otherwise, we have insufficient evidence to conclude that a significant
relationship exists.
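The t statistic is just the ratio of an estimate to its standard error. As a quick pure-Python check against the Boston output shown earlier, the rm coefficient (estimate 5.09479, standard error 0.44447) reproduces the 11.463 in R's output column.

```python
def t_statistic(b_i, s_b_i):
    """t statistic for an individual coefficient: t = b_i / s_{b_i}."""
    return b_i / s_b_i

# rm coefficient from the Boston lm() output:
t_rm = t_statistic(5.09479, 0.44447)
```

The observed t is then compared with the t distribution on n − p − 1 degrees of freedom (here 503) to obtain the two-tailed p-value.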

Estimation and Prediction
• Two intervals can be used to discover how closely the
predicted value will match the true value of y
• prediction interval – for one specific value of y
• confidence interval – for the expected value of y.
• The prediction interval is wider than the confidence interval.

Estimation and Prediction
       rm  lstat  medv        fit  95% CI Lower  95% CI Upper  95% PI Lower  95% PI Upper
 1  6.575   4.98  24.0  28.941014    28.2144747    29.6675527    18.0318973     39.850130
 2  6.421   9.14  21.6  25.484206    24.9407772    26.0276341    14.5857527     36.382659
 3  7.185   4.03  34.7  32.659075    31.8307466    33.4874029    21.7427068     43.575443
 4  6.998   2.94  33.4  32.406520    31.5816204    33.2314196    21.4904117     43.322628
 5  7.147   5.33  36.2  31.630407    30.8458961    32.4149179    20.7172764     42.543538
 6  6.430   5.21  28.7  28.054527    27.3064826    28.8025714    17.1439573     38.965097
 7  6.012  12.43  22.9  21.287078    20.7422847    21.8318722    10.3885574     32.185600
 8  6.172  19.15  27.1  17.785597    17.0870100    18.4841831     6.8783061     28.692887
 9  5.631  29.93  16.5   8.104693     6.7919477     9.4174391    -2.8590771     19.068464
10  6.004  17.10  18.9  18.246507    17.6762044    18.8168091     7.3466808     29.146333
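Using the first row of the interval table above (rm = 6.575, lstat = 4.98), a quick pure-Python check confirms that both intervals are centered on the same fitted value but the prediction interval is much wider:

```python
# Values copied from row 1 of the interval table above
fit = 28.941014
ci_lower, ci_upper = 28.2144747, 29.6675527   # 95% confidence interval
pi_lower, pi_upper = 18.0318973, 39.850130    # 95% prediction interval

ci_width = ci_upper - ci_lower
pi_width = pi_upper - pi_lower
# The prediction interval also accounts for the variability of an
# individual observation around its mean, so it is wider than the
# confidence interval for the mean at the same point.
```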

Categorical Independent Variables
In many situations, however, we must work with categorical independent
variables such as gender (male, female), method of payment (cash, credit
card, check), and so on. The purpose of this section is to show how
categorical variables are handled in regression analysis.

Categorical Independent Variables
Johnson Filtration, Inc., provides maintenance service for
water-filtration systems throughout southern Florida.
Customers contact Johnson with requests for
maintenance service on their water-filtration systems. To
estimate the service time and the service cost, Johnson’s
managers want to predict the repair time necessary for
each maintenance request. Repair time is believed to
be related to two factors, the number of months since
the last maintenance service and the type of repair
problem (mechanical or electrical).
Categorical Independent Variables

Service   Repair Time   Months Since Last   Type of       Repair
Call      in Hours (y)  Service (x₁)        Repair (x₂)   Person
 1            2.9              2            Electrical    Donna Newton
 2            3.0              6            Mechanical    Donna Newton
 3            4.8              8            Electrical    Bob Jones
 4            1.8              3            Mechanical    Donna Newton
 5            2.9              2            Electrical    Donna Newton
 6            4.9              7            Electrical    Bob Jones
 7            4.2              9            Mechanical    Bob Jones
 8            4.8              8            Mechanical    Bob Jones
 9            4.4              4            Electrical    Bob Jones
10            4.5              6            Electrical    Donna Newton

Categorical Independent Variables

To incorporate the type of repair into the regression model, we define
the following variable:

x₂ = 0 if the type of repair is mechanical
x₂ = 1 if the type of repair is electrical

In regression analysis, x₂ is called a dummy or indicator variable.

Categorical Independent Variables
Call:
lm(formula = Repair.Time.in.Hours ~ Months.Since.Last.Service +
    Type.of.Repair, data = Johnson)

Residuals:
     Min       1Q   Median       3Q      Max
-0.49412 -0.24690 -0.06842 -0.00960  0.76858

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                0.93050    0.46697   1.993 0.086558 .
Months.Since.Last.Service  0.38762    0.06257   6.195 0.000447 ***
Type.of.Repair             1.26269    0.31413   4.020 0.005062 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.459 on 7 degrees of freedom
Multiple R-squared: 0.8592, Adjusted R-squared: 0.819
F-statistic: 21.36 on 2 and 7 DF, p-value: 0.001048
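A minimal pure-Python sketch of how the dummy variable combines with the fitted coefficients from the output above (the helper names are my own; the coefficients are taken from the R output):

```python
def encode_repair(repair_type):
    """Dummy variable: 0 for a mechanical repair, 1 for an electrical one."""
    return 1 if repair_type == "Electrical" else 0

def predict_repair_time(months_since_service, repair_type):
    """yhat = b0 + b1*x1 + b2*x2, with coefficients from the lm() output."""
    b0, b1, b2 = 0.93050, 0.38762, 1.26269
    return b0 + b1 * months_since_service + b2 * encode_repair(repair_type)

# Holding months-since-service fixed, an electrical repair is estimated
# to take b2 = 1.26269 hours longer than a mechanical one.
gap = predict_repair_time(6, "Electrical") - predict_repair_time(6, "Mechanical")
```

This is exactly the interpretation of a dummy coefficient: it shifts the intercept for one category relative to the baseline category.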

Residual Analysis (Self Study)

• Standardized residuals

• Detecting outliers

• Studentized deleted residuals and outliers

• Influential observations

Form of Equation
Lin-lin   Yᵢ = β₀ + β₁Xᵢ + εᵢ         X increases 1 unit → Y increases β₁ units
Log-log   ln Yᵢ = β₀ + β₁ ln Xᵢ + εᵢ  X increases 1% → Y increases β₁%
Lin-log   Yᵢ = β₀ + β₁ ln Xᵢ + εᵢ     X increases 1% → Y increases β₁/100 units
Log-lin   ln Yᵢ = β₀ + β₁Xᵢ + εᵢ      X increases 1 unit → Y increases 100β₁%
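The "100β₁%" reading of the log-lin row is an approximation: a one-unit increase in X multiplies Y by e^β₁, which is close to 1 + β₁ only when β₁ is small. A quick pure-Python check with illustrative β₁ values:

```python
import math

def exact_pct_change(b1):
    """Exact proportional change in Y for a one-unit increase in X
    in the log-lin model ln(Y) = b0 + b1*X."""
    return math.exp(b1) - 1

# For small b1, the "Y increases 100*b1 %" rule of thumb is close:
small = exact_pct_change(0.05)   # about 0.0513, vs. 0.05 from the rule
# For large b1 it is not:
large = exact_pct_change(0.7)    # about 1.014, i.e. +101%, vs. +70% from the rule
```

The same caveat applies to the log-log and lin-log rows, whose percentage readings are also first-order approximations.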

Problem Context
There are 15 companies. Each observation consists of a
value for a response variable (profits) and values for the two
explanatory variables (assets and sales). ($ billions)

Data
Symbol  Company                          Assets ($bn)  Sales ($bn)  Profits ($bn)
AES     AES Corporation                        34.806       16.07          1.234
AEP     American Electric Power                45.155       14.44          1.38
CNP     CenterPoint Energy, Inc.               19.676       10.725         0.391
ED      Consolidated Edison, Inc.              33.498       13.429         1.077
D       Dominion Resources, Inc.               42.053       16.29          1.834
DUK     Duke Energy Corporation                53.077       13.207         2.195
EIX     Edison International                   44.615       14.112         1.215
EXC     Exelon Corporation                     47.817       19.065         2.867
FE      FirstEnergy                            33.521       13.627         1.342
FPL     FPL Group Corporation                  44.821       16.68          1.754
NI      NiSource                               20.032        8.309        -0.502
PCG     PG&E                                   40.86        14.628         1.338
PEG     Public Service Enterprise Group        29.049       13.322         1.192
SO      Southern Company, Inc.                 48.347       17.11          1.492
WMB     Williams Companies, Inc.               26.006       11.256         0.694

Relationship
Both assets and sales have reasonably strong positive
correlations with profits.

R Output
Call:
lm(formula = Profits ~ Assets + Sales, data = new)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59593 -0.23446  0.01586  0.22414  0.52315

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.00559    0.50093  -4.004  0.00175 **
Assets       0.03285    0.01391   2.362  0.03592 *
Sales        0.14642    0.05317   2.754  0.01748 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.354 on 12 degrees of freedom
Multiple R-squared: 0.819, Adjusted R-squared: 0.7888
F-statistic: 27.14 on 2 and 12 DF, p-value: 3.521e-05
Improve the Multiple Linear Regression Model
The logical next step of this analysis is to remove the non-significant
variables and refit the model to see whether its performance improves.

Backward elimination starts with all the features, then gradually drops
the worst predictor one at a time until it finds the best model.


Cross-validation

Cross-validation repeatedly fits the model on one part of the data and
evaluates it on the held-out part, then combines the resulting fitness
measures to derive a more accurate estimate of the model's prediction
performance.
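The splitting step behind k-fold cross-validation can be sketched in pure Python (the function name and fold count are my own illustration; the slides' mean(Crossvalidation) values come from averaging a fitness measure over such held-out folds):

```python
def kfold_indices(n, k):
    """Split the row indices 0..n-1 into k roughly equal folds.

    Each fold is used once as the held-out validation set while the
    model is fit on the remaining rows; the k validation scores are
    then averaged.
    """
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 506 Boston rows split into 5 folds of sizes 102, 101, 101, 101, 101
folds = kfold_indices(506, 5)
```

In practice the rows would be shuffled before splitting so that each fold is representative of the whole sample.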
