MULTIPLE REGRESSION
Multiple regression model

y = β0 + β1x1 + β2x2 + ... + βpxp + ε

where
n is the number of observations
p is the number of explanatory variables
y is the dependent variable
x1, x2, ..., xp are the explanatory variables
β0 is the y-intercept (constant term)
β1, β2, ..., βp are the slope coefficients for each explanatory variable
ε is the error (residual) term

Multiple regression equation

E(y) = β0 + β1x1 + β2x2 + ... + βpxp

Estimated multiple regression equation

ŷ = b0 + b1x1 + b2x2 + ... + bpxp

where
b0 is the estimated y-intercept (constant term)
b1, b2, ..., bp are the estimated slope coefficients for each explanatory variable
ESTIMATION PROCESS
The sample statistics b0, b1, b2, ..., bp are computed from sample data and provide estimates of the unknown population parameters β0, β1, β2, ..., βp. Substituting them into the model gives the estimated multiple regression equation ŷ = b0 + b1x1 + b2x2 + ... + bpxp.
LEAST SQUARES METHOD
min Σ(yi − ŷi)²

Predicted values of the dependent variable are computed using the estimated multiple regression equation.
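As a sketch of how this minimization can be carried out in practice, the snippet below fits a two-variable model with NumPy's least squares solver. The data values are made-up placeholders, not taken from the examples in these notes.

# A minimal sketch of the least squares fit using NumPy.
# The data below are illustrative placeholders only.
import numpy as np

x1 = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])     # first explanatory variable
x2 = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 5.0])      # second explanatory variable
y  = np.array([5.1, 9.8, 10.9, 15.2, 18.3, 20.1])  # dependent variable

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve min sum (y_i - yhat_i)^2 for b = (b0, b1, b2).
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

print("b0, b1, b2 =", b)
print("SSE =", np.sum((y - y_hat) ** 2))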
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

i.e., SST = SSR + SSE
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination (R square)
The coefficient of determination is the proportion of the variance in the response variable y that can be predicted from the predictor variables x1, x2, ..., xp. It shows how well the regression line fits the data.
Its value varies from 0 to 1: 0 means there is no linear relationship between the predictors x1, x2, ..., xp and the response variable y, and 1 means there is a perfect linear relationship between input and output.
Adjusted R square:

Ra² = 1 − (1 − R²) × (n − 1) / (n − p − 1)
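These formulas translate directly into code. The sketch below computes SST, SSE, SSR, R² and adjusted R² from observed and predicted values; the function name and arguments are our own, not from the notes.

# Sketch: R^2 and adjusted R^2 from observed values y and predictions y_hat.
import numpy as np

def r_squared(y, y_hat, p):
    """p = number of explanatory variables; n = len(y) observations."""
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    sse = np.sum((y - y_hat) ** 2)      # sum of squares due to error
    ssr = sst - sse                     # sum of squares due to regression
    r2 = ssr / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj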
Example – Butler Trucking Company – ANOVA Summary
Relationship
SST = SSR + SSE
SSR = SSR(x1) + SSR(x2) + ... + SSR(xn)
MSR = SSR / k
MSE = SSE / (n − k − 1)

For the Butler Trucking example, with n = 10 observations and k = 2 independent variables:
MSR(x1) = 15.871 / 1 = 15.871 (degrees of freedom for SSR(miles): 1)
MSR(x2) = 5.729 / 1 = 5.729 (degrees of freedom for SSR(deliveries): 1)
MSE = 2.299 / 7 = 0.328
F value for x1 = 15.871 / 0.328 = 48.3
F value for x2 = 5.729 / 0.328 = 17.4
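For illustration, this arithmetic can be reproduced in a few lines of Python using the figures already given above:

# Reproducing the Butler Trucking ANOVA arithmetic from the figures above.
ssr_miles, ssr_deliveries = 15.871, 5.729
sse, n, k = 2.299, 10, 2

mse = sse / (n - k - 1)              # 2.299 / 7 = 0.328
f_miles = ssr_miles / mse            # about 48.3
f_deliveries = ssr_deliveries / mse  # about 17.4
print(round(mse, 3), round(f_miles, 1), round(f_deliveries, 1))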
Correlation Coefficient (Multiple R)

R² = SSR / SST = 21.6 / 23.899 = 0.9038
Multiple R = √R² = √0.9038 ≈ 0.951

With n = 10 and k = 2 independent variables, the adjusted R² is
Ra² = 1 − (1 − 0.9038) × (10 − 1) / (10 − 2 − 1) ≈ 0.876
Multiple Regression Model
ŷ = b0 + b1x1 + b2x2
where
y = annual salary ($000)
x1 = years of experience
x2 = score on programmer aptitude test
Solving for the Estimates of β0, β1, β2

[Diagram: the input data (x1, x2, y) — e.g. (4, 78, 24), (7, 100, 43), ..., (3, 89, 30) — are fed to a least squares computer package for solving multiple regression problems, whose output is the estimates b0, b1, b2, R², etc.]
Solving for the Estimates of b0, b1, b2
b1 = 1.404
b2 = 0.251
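As a sketch of the "computer package" step, the snippet below fits the two-variable model with statsmodels. Since the full dataset is not reproduced in these notes, the rows are hypothetical placeholders in the spirit of the programmer example.

# Sketch: fitting the model with statsmodels OLS on placeholder data.
import numpy as np
import statsmodels.api as sm

experience = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6])              # x1, hypothetical
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91])    # x2, hypothetical
salary = np.array([24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33])  # y, hypothetical

X = sm.add_constant(np.column_stack([experience, test_score]))
model = sm.OLS(salary, X).fit()

print(model.params)    # b0, b1, b2
print(model.rsquared)  # R^2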
Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. A well-fitted model produces more accurate outcomes.
Model fitting is the essence of machine learning. If your model doesn’t fit your data correctly,
the outcomes it produces will not be accurate enough to be useful for practical decision-
making.
A measure of model fit tells us how well our regression line captures the underlying
data.
Testing for Significance

For the model E(y) = β0 + β1x1 + β2x2 + ... + βpxp, the overall test of significance uses the hypotheses:
H0: β1 = β2 = . . . = βp = 0
Ha: one or more of the parameters is not equal to zero

With two independent variables, as in the Butler Trucking example:
H0: β1 = β2 = 0
Ha: one or both of the parameters is not equal to zero
Two tests are commonly used: the t test and the F test. Both require an estimate of σ², the variance of the error term ε in the regression model.
F TEST
F = MSR / MSE

SSR for miles and deliveries = 15.871 + 5.729 = 21.6
MSR for miles and deliveries = SSR / p = 21.6 / 2 = 10.8
F = 10.8 / 0.328 = 32.92

The critical value F.01 = 9.55 for 2 numerator and 7 denominator degrees of freedom. Since F = 32.92 far exceeds the critical value, we reject H0: β1 = β2 = 0.
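The critical value and p-value can be checked with scipy; a sketch using the numbers above (stats.f is the F distribution):

# Sketch: verifying the F test above with 2 and 7 degrees of freedom.
from scipy import stats

msr, mse = 10.8, 0.328
f_value = msr / mse                       # about 32.92
f_crit = stats.f.ppf(0.99, dfn=2, dfd=7)  # about 9.55 at alpha = .01
p_value = stats.f.sf(f_value, 2, 7)

print(round(f_value, 2), round(f_crit, 2), p_value)
# f_value > f_crit (and p_value < .01), so H0 is rejected.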
Testing for Significance: t Test
Hypotheses: H0: βi = 0
            Ha: βi ≠ 0

Test statistic: t = bi / sbi

Example: t = b1 / sb1 = 0.06113 / 0.00989 = 6.18
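A sketch of the same t test in scipy, assuming the Butler Trucking residual degrees of freedom n − k − 1 = 10 − 2 − 1 = 7:

# Sketch: two-tailed t test for an individual coefficient.
from scipy import stats

b1, s_b1, df = 0.06113, 0.00989, 7
t = b1 / s_b1                         # about 6.18
p_value = 2 * stats.t.sf(abs(t), df)  # two-tailed p-value
t_crit = stats.t.ppf(0.975, df)       # about 2.365 at alpha = .05

print(round(t, 2), round(t_crit, 3), p_value)
# |t| > t_crit, so H0: beta_i = 0 is rejected.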
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1209.433 8066.345 0.150 0.882
Month -3.286 78.110 -0.042 0.967
MachineHours 44.707 4.237 10.551 1.71e-10 ***
ProductionRuns 931.803 107.050 8.704 6.86e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4222 on 24 degrees of freedom
Multiple R-squared: 0.8639, Adjusted R-squared: 0.8469
F-statistic: 50.79 on 3 and 24 DF, p-value: 1.519e-10

If the number of observations is not given, it can be recovered from the residual degrees of freedom: df = n − k − 1, so 24 = n − 3 − 1 and n = 28.
Categorical Independent Variables
Categorical variables have to be converted into numeric variables before they can be used in a regression.
REGRESSION ANALYSIS WITH DUMMY VARIABLES
Linear regression uses quantitative variables, referred to as "numeric" variables; these are variables that represent a measurable quantity.
Examples include:
•Number of square feet in a house
•The population size of a city
•Age of an individual
Dummy Variables: Numeric variables used in regression analysis to represent categorical
data that can only take on one of two values: zero or one.
We may, however, wish to use categorical variables as predictor variables. These are variables that take on names or labels and fit into categories. Examples include:
•Gender (e.g. “male”, “female”)
•Marital status (e.g. “married”, “single”, “divorced”)
The number of dummy variables we must create equals k − 1, where k is the number of different values the categorical variable can take on.
Example 1: Create a Dummy Variable with Only Two Values
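The original example is not reproduced in these notes; the following is a minimal sketch of the idea using pandas, with hypothetical column names and values.

# Sketch: creating a dummy variable for a two-valued category with pandas.
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],  # k = 2 values
    "salary": [52, 61, 58, 49],                      # hypothetical numeric variable
})

# drop_first=True creates k - 1 = 1 dummy column (e.g. gender_male).
df = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)
print(df)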
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to determine the individual contribution of each variable to the dependent variable. When predictors are highly correlated, the coefficient estimates in the model become unreliable.
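A common diagnostic is the variance inflation factor (VIF), which appears in the exercise output later in these notes. A sketch of computing it with statsmodels; the generated data are illustrative, with x2 deliberately correlated with x1:

# Sketch: VIF for each predictor; X includes the constant column.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=50)  # deliberately correlated with x1
X = sm.add_constant(np.column_stack([x1, x2]))

# Skip column 0 (the constant); report VIF for each real predictor.
for i in range(1, X.shape[1]):
    print(f"x{i}: VIF = {variance_inflation_factor(X, i):.2f}")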
Multiple linear regression also assumes that there is no correlation between consecutive residuals, i.e., that the residuals are independent. One way to check this assumption is the Durbin-Watson test, a statistical test used to detect the presence of autocorrelation in the residuals of a regression analysis.

In general, if the test statistic is less than 1.5 (positive autocorrelation) or greater than 2.5 (negative autocorrelation), there is potentially a serious autocorrelation problem. If it is between 1.5 and 2.5, autocorrelation is likely not a cause for concern.
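A sketch of computing the statistic with statsmodels on illustrative data:

# Sketch: Durbin-Watson statistic on the residuals of a fitted OLS model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)  # illustrative data

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)

# Roughly: dw < 1.5 suggests positive autocorrelation, dw > 2.5 negative,
# and values between 1.5 and 2.5 are usually not a cause for concern.
print(round(dw, 3))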
Multiple linear regression assumes that the residuals have constant variance at every point in the linear model (homoscedasticity): the amount of error in the residuals is similar at each point. It also assumes that the residuals of the model are normally distributed.
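A sketch of checking the normality assumption on illustrative data; the Shapiro-Wilk test is one common choice, though the notes themselves do not prescribe a specific test:

# Sketch: residual normality check with the Shapiro-Wilk test.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.8 * x + rng.normal(size=100)  # illustrative data

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# H0: the residuals are normally distributed.
stat, p = stats.shapiro(resid)
print(f"Shapiro-Wilk p-value: {p:.3f}")

# Constant variance can be eyeballed by plotting resid against the fitted
# values: under homoscedasticity the spread should look similar throughout.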
2. Predict the value of price if the area is 2,000 sq ft, the number of bedrooms is 3, and the number of bathrooms is 2.
Q - Below is the regression output related to annual post-college earnings (USD), which is modeled on the college's annual education cost (USD), its graduation rate (%), and debt, the percentage of students paying loans (%).

1. Interpret the significance of each input based on its p-value.
2. Interpret based on the VIF value of each predictor.
3. Interpret based on the Durbin-Watson value.

VIF values:
Miles      1.026963
Deliveries 15.26963

Durbin-Watson test:
Autocorrelation  D-W Statistic  p-value
1.6143758        1.304288       0.028