L10 Multiple Regression
Example : Housing Values in Suburbs of Boston
This is one of the most commonly used datasets for learning how multiple linear regression works. It contains information collected by the U.S. Census about housing in the suburbs of Boston. The Boston data frame has 506 rows and 14 columns (features).
With these data, our objective is to build a linear regression model that predicts house prices (medv).
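As a minimal sketch (assuming the Boston data frame from the R package MASS, which matches the lm() calls shown later in these slides), the data can be loaded and a first simple model fitted as follows:

library(MASS)                                  # provides the Boston data frame
data(Boston)
dim(Boston)                                    # 506 rows, 14 columns
fit_simple <- lm(medv ~ lstat, data = Boston)  # simple linear regression of price on lstat
summary(fit_simple)                            # coefficients, R-squared, F-statistic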
Estimated Multiple Regression Equation
The sample statistics b0, b1, …, bp provide estimates of the population parameters β0, β1, …, βp.
The estimated multiple regression equation is
ŷ = b0 + b1x1 + b2x2 + ⋯ + bpxp
Least Squares Criterion
min Σ (yᵢ − ŷᵢ)²
where
• yᵢ : observed value of the dependent variable for the i-th observation
• ŷᵢ : predicted value of the dependent variable for the i-th observation
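As a small illustration (using the fit_simple model from the sketch above), the quantity minimized by least squares is simply the sum of the squared residuals:

sse <- sum(resid(fit_simple)^2)   # sum of squared errors, Σ (y_i − ŷ_i)²
sse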
Example : Housing Values in Suburbs of Boston
Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max
-15.168  -3.990  -1.318   2.034  24.500

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> mean(Crossvalidation)
[1] 6.200729
Example : Housing Values in Suburbs of Boston
Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max
-15.168  -3.990  -1.318   2.034  24.500

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16

• The estimated regression equation is ŷ = b0 + b1x1 = 34.554 − 0.950 lstat.
• At the .05 level of significance, the t-value of −24.53 and its associated p-value of < 2e-16 indicate that the relationship is significant; higher values of lstat are associated with lower housing values.
• With a coefficient of determination of R² = 0.5441, we see that about 54.4% of the variability in housing values can be explained by the linear effect of lstat.
Example : Housing Values in Suburbs of Boston
Summary statistics
            mean      med         sd    min    max
rm      6.284634   6.2085  0.7026171  3.561   8.78
lstat  12.653063  11.3600  7.1410615  1.730  37.97
medv   22.532806  21.2000  9.1971041  5.000  50.00
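A table like this can be produced directly in R; the snippet below is one of several possible ways:

vars <- Boston[, c("rm", "lstat", "medv")]
data.frame(mean = colMeans(vars),
           med  = apply(vars, 2, median),
           sd   = apply(vars, 2, sd),
           min  = apply(vars, 2, min),
           max  = apply(vars, 2, max))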
Example : Housing Values in Suburbs of Boston
Call:
lm(formula = medv ~ rm + lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max
-18.076  -3.516  -1.010   1.909  28.131

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.35827    3.17283  -0.428    0.669
rm           5.09479    0.44447  11.463   <2e-16 ***
lstat       -0.64236    0.04373 -14.689   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> mean(Crossvalidation)
[1] 5.520818
Inference in Multiple LR
Example : Housing Values in Suburbs of Boston
Call:
lm(formula = medv ~ rm + lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max
-18.076  -3.516  -1.010   1.909  28.131

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.35827    3.17283  -0.428    0.669
rm           5.09479    0.44447  11.463   <2e-16 ***
lstat       -0.64236    0.04373 -14.689   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.54 on 503 degrees of freedom
Multiple R-squared: 0.6386, Adjusted R-squared: 0.6371
F-statistic: 444.3 on 2 and 503 DF, p-value: < 2.2e-16

• In simple linear regression, we interpret b1 as an estimate of the change in y for a one-unit change in the independent variable.
• In multiple regression analysis, we interpret each regression coefficient as follows: bi represents an estimate of the change in y corresponding to a one-unit change in xi when all other independent variables are held constant.
Multiple Coefficient of Determination
The term multiple coefficient of determination indicates that we are
measuring the goodness of fit for the estimated multiple regression
equation. The multiple coefficient of determination is computed as
follows:
R² = SSR / SST
where SSR is the sum of squares due to regression and SST is the total sum of squares.
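As a small check (assuming the two-predictor model medv ~ rm + lstat used in the slides), R² can be computed by hand from the sums of squares:

fit <- lm(medv ~ rm + lstat, data = Boston)
sst <- sum((Boston$medv - mean(Boston$medv))^2)  # total sum of squares
sse <- sum(resid(fit)^2)                         # sum of squared errors
ssr <- sst - sse                                 # sum of squares due to regression
ssr / sst                                        # should match Multiple R-squared ≈ 0.6386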
Multiple Coefficient of Determination
The multiple coefficient of determination can be interpreted as the
proportion of the variability in the dependent variable that can be
explained by the estimated multiple regression equation.
Adjusted Multiple Coefficient of Determination
• 𝑹𝟐 measures the goodness of fit of a single model and is
not a fair criterion for comparing models with different
sizes.
• The adjusted 𝑹𝟐 criterion is more suitable to avoid
overestimating the impact of adding an independent
variable on the amount of variability explained by the
estimated regression equation.
Adjusted Multiple Coefficient of Determination
With n denoting the number of observations and p denoting
the number of independent variables:
R²a = 1 − (1 − R²) (n − 1) / (n − p − 1)
The larger the R²a, the better the model.
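Applied to the medv ~ rm + lstat output shown earlier (n = 506 observations, p = 2 independent variables), the formula can be checked directly:

n <- 506; p <- 2; r2 <- 0.6386
1 - (1 - r2) * (n - 1) / (n - p - 1)   # ≈ 0.637, consistent with the Adjusted R-squared of 0.6371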
Assumptions about the error term 𝜀
Testing for Significance
The F test (overall significance) is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.
The t test (individual significance) is used to determine whether each of the individual independent variables is significant.
The F test for Overall significance
• The hypotheses:
H0: β1 = β2 = … = βp = 0
HA: One or more of the parameters is not equal to 0
• Test statistic:
F = MSR / MSE
• Rejection Rule:
p-value approach: Reject H0 if p-value ≤ α
Critical value approach: Reject H0 if F ≥ Fα (F distribution with p and n − p − 1 degrees of freedom)
If H0 is rejected, the test gives us sufficient statistical evidence to conclude that one or more of the parameters are not equal to zero and that the overall relationship between y and the set of independent variables is significant.
Otherwise, we do not have sufficient evidence to conclude that a significant relationship is present.
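For example, for the medv ~ rm + lstat model shown earlier, the summary output reports F = 444.3 on 2 and 503 degrees of freedom with p-value < 2.2e-16, so at the .05 level we reject H0 and conclude that the overall relationship is significant.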
The t test for individual significance
• The hypotheses:
H0: βi = 0
HA: βi ≠ 0
• Test statistic:
t = bi / sbi, where sbi is the estimated standard error of bi
• Rejection Rule:
p-value approach: Reject H0 if p-value ≤ α
Critical value approach: Reject H0 if |t| ≥ tα/2 (t distribution with n − p − 1 degrees of freedom)
If H0 is rejected, the test gives us sufficient statistical evidence to conclude that a statistically significant relationship exists between y and xi. Otherwise, we do not have sufficient evidence to conclude that a significant relationship exists.
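For example, in the medv ~ rm + lstat output the lstat coefficient has t = −0.64236 / 0.04373 ≈ −14.69 with p-value < 2e-16, so lstat is individually significant at the .05 level.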
Estimation and Prediction
• Two intervals can be used to discover how closely the predicted value will match the true value of y:
• prediction interval – for one specific value of y
• confidence interval – for the expected value of y
• The prediction interval is wider than the confidence interval.
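A minimal sketch of both intervals in R (the rm and lstat values below are arbitrary illustrative inputs, not taken from the slides):

new_obs <- data.frame(rm = 6.0, lstat = 12.0)              # hypothetical house
predict(fit, newdata = new_obs, interval = "confidence")   # confidence interval for the mean of y
predict(fit, newdata = new_obs, interval = "prediction")   # prediction interval for an individual y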
Estimation and Prediction
[Table: fitted medv values (fit) for given rm and lstat, with 95% confidence interval and 95% prediction interval lower and upper limits]
Categorical independent Variables
In many situations, however, we must work with categorical independent variables such as gender (male, female), method of payment (cash, credit card, check), and so on. The purpose of this section is to show how categorical variables are handled in regression analysis.
Categorical independent Variables
Johnson Filtration, Inc., provides maintenance service for
water-filtration systems throughout southern Florida.
Customers contact Johnson with requests for
maintenance service on their water-filtration systems. To
estimate the service time and the service cost, Johnson’s
managers want to predict the repair time necessary for
each maintenance request. Repair time is believed to
be related to two factors, the number of months since
the last maintenance service and the type of repair
problem (mechanical or electrical).
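In R, the type of repair can be represented by a 0–1 dummy variable before fitting the model. The sketch below assumes the data frame and column names used in the slide's lm() call, and assumes mechanical is coded 0 and electrical is coded 1 (an assumption, consistent with the single Type.of.Repair coefficient in the output that follows):

# Hypothetical construction of the dummy variable from text labels
Johnson$Type.of.Repair <- ifelse(Johnson$Type.of.Repair == "electrical", 1, 0)
fit_johnson <- lm(Repair.Time.in.Hours ~ Months.Since.Last.Service + Type.of.Repair,
                  data = Johnson)
summary(fit_johnson)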
Categorical independent Variables
lm(formula = Repair.Time.in.Hours ~ Months.Since.Last.Service +
    Type.of.Repair, data = Johnson)

Residuals:
     Min       1Q   Median       3Q      Max
-0.49412 -0.24690 -0.06842 -0.00960  0.76858

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                0.93050    0.46697   1.993 0.086558 .
Months.Since.Last.Service  0.38762    0.06257   6.195 0.000447 ***
Type.of.Repair             1.26269    0.31413   4.020 0.005062 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual Analysis (Self study)
• Detecting Outliers
• Influential Observations
Form of equation
Lin-lin:  Yᵢ = β0 + β1Xᵢ + εᵢ        X increases by 1 unit → Y increases by β1 units
Lin-log:  Yᵢ = β0 + β1 ln(Xᵢ) + εᵢ   X increases by 1% → Y increases by β1/100 units
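As an illustrative sketch (not taken from the slides), a lin-log specification can be fitted in R by transforming the predictor inside the formula, here using the Boston data:

fit_linlog <- lm(medv ~ log(lstat), data = Boston)   # lin-log: medv regressed on ln(lstat)
coef(fit_linlog)                                     # slope/100 ≈ change in medv for a 1% increase in lstat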
Problem context
There are 15 companies. Each observation consists of a value for the response variable (profits) and values for the two explanatory variables (assets and sales), all measured in $ billions.
Data
Symbol  Company                          Assets ($ billions)  Sales ($ billions)  Profits ($ billions)
AES AES Corporation 34.806 16.07 1.234
AEP American Electric Power 45.155 14.44 1.38
CNP CenterPoint Energy, Inc. 19.676 10.725 0.391
ED Consolidated Edison, Inc. 33.498 13.429 1.077
D Dominion Resources, Inc. 42.053 16.29 1.834
DUK Duke Energy Corporation 53.077 13.207 2.195
EIX Edison International 44.615 14.112 1.215
EXC Exelon Corporation 47.817 19.065 2.867
FE FirstEnergy 33.521 13.627 1.342
FPL FPL Group Corporation 44.821 16.68 1.754
NI NiSource 20.032 8.309 -0.502
PCG PG&E 40.86 14.628 1.338
PEG Public Service Enterprise Group 29.049 13.322 1.192
SO Southern Company, Inc. 48.347 17.11 1.492
WMB Williams Companies, Inc. 26.006 11.256 0.694
Relationship
Both assets and sales have reasonably strong positive
correlations with profits.
R output
Call:
lm(formula = Profits ~ Assets + Sales, data = new)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59593 -0.23446  0.01586  0.22414  0.52315

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.00559    0.50093  -4.004  0.00175 **
Assets       0.03285    0.01391   2.362  0.03592 *
Sales        0.14642    0.05317   2.754  0.01748 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Improve the multiple linear regression model
The logical next step of this analysis is to remove the non-significant variables and refit the model to see whether its performance improves.
Backward elimination starts with all the features, then gradually drops the worst predictors one at a time until it finds the best model; a sketch of this in R is shown below.
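One way to do this in R (a sketch, assuming the profits model and the data frame name new from the output above) is the built-in step() function; note that step() drops predictors by AIC rather than by p-value:

full_model <- lm(Profits ~ Assets + Sales, data = new)
reduced    <- step(full_model, direction = "backward")   # backward elimination by AIC
summary(reduced)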
Cross-validation
Cross-validation combines fitness measures in prediction to derive a more accurate estimate of model prediction performance.
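The slides report mean(Crossvalidation) values of 6.20 (lstat only) and 5.52 (rm + lstat) without showing the code, so the exact procedure behind that object is not known; a sketch of one common way to obtain a cross-validated error in R is cv.glm() from the boot package:

library(boot)
library(MASS)                                    # Boston data
cv_fit <- glm(medv ~ rm + lstat, data = Boston)  # glm with the default gaussian family is ordinary least squares
cv_err <- cv.glm(Boston, cv_fit, K = 10)         # 10-fold cross-validation
cv_err$delta[1]                                  # cross-validated estimate of prediction error (MSE)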