
MULTIPLE LINEAR REGRESSION

• Multiple Regression Model
• Least Squares Method
• Coefficient of Determination
• Model Assumptions
• Testing for Significance
• Using the Estimated Regression Equation for Estimation and Prediction
• Categorical Independent Variables
MULTIPLE LINEAR REGRESSION
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable.

Multiple regression model

y = β0 + β1x1 + β2x2 + … + βpxp + ε

where
p is the number of independent variables
y is the dependent variable
x1, x2, …, xp are the explanatory variables
β0 is the y-intercept (constant term)
β1, β2, …, βp are the slope coefficients for each explanatory variable
ε is the error term (residual)
Multiple regression equation

E(y) = β0 + β1x1 + β2x2 + … + βpxp

where
E(y) is the mean or expected value of the dependent variable
x1, x2, …, xp are the explanatory variables
β0 is the y-intercept (constant term)
β1, β2, …, βp are the slope coefficients for each explanatory variable
Estimated multiple regression equation

ŷ = b0 + b1x1 + b2x2 + … + bpxp

where
ŷ is the point estimate of the dependent variable
x1, x2, …, xp are the explanatory variables
b0 is the estimated y-intercept (constant term)
b1, b2, …, bp are the estimated slope coefficients for each explanatory variable
ESTIMATION PROCESS

Multiple Regression Model
E(y) = β0 + β1x1 + β2x2 + … + βpxp + ε
The unknown parameters are β0, β1, β2, …, βp.

Sample Data:
x1, x2, …, xp and y for each observation

Estimated Multiple Regression Equation
ŷ = b0 + b1x1 + b2x2 + … + bpxp
The sample statistics b0, b1, b2, …, bp are computed from the sample data and provide estimates of β0, β1, β2, …, βp.
LEAST SQUARES METHOD

• Least Squares Criterion

min Σ(yi − ŷi)²

• Computation of Coefficient Values

The formulas for the regression coefficients b0, b1, b2, …, bp involve the use of matrix algebra. We will rely on computer software packages to perform the calculations.
LEAST SQUARES METHOD
The least squares method is a procedure for using sample data to find the estimated regression equation.

Predicted values of the dependent variable are computed using the estimated multiple regression equation

ŷ = b0 + b1x1 + b2x2 + … + bpxp

If 2 independent variables are used to compute the output, the equation is

ŷ = b0 + b1x1 + b2x2

where
b0 is the intercept
b1 is the slope coefficient of the first variable
b2 is the slope coefficient of the second variable
Example – Butler Trucking Company

Assignment Miles (x) Time (y)


1 100 9.3
2 50 4.8
3 100 8.9
4 100 6.5
5 50 4.2
6 80 6.2
7 75 7.4
8 65 6.0
9 90 7.6
10 90 6.1

Linear equation = ?

Predict the value of y when x = 110


Example – Butler Trucking Company

Assignment Miles (x1) Deliveries (x2) Time (y)


1 100 4 9.3
2 50 3 4.8
3 100 4 8.9
4 100 2 6.5
5 50 2 4.2
6 80 2 6.2
7 75 3 7.4
8 65 4 6.0
9 90 3 7.6
10 90 2 6.1

Linear equation: y = ?
R square = ?
Multiple R = ?
Adjusted R square = ?

Predict the value of y for given values of x1 and x2.
Example – Butler Trucking Company

Linear Equation is y = -0.8687 + 0.0611 x1 + 0.9234 x2

Predict the value of y for given values of x1 and x2.

R Square = 0.9038

Multiple R = 0.9506 (SQRT of R square)

Adjusted R square = 0.8763
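These results can be reproduced in R (the package whose output appears later in these notes). A minimal sketch, with object and variable names of our own choosing:

    # Butler Trucking data from the table above
    butler <- data.frame(
      Miles      = c(100, 50, 100, 100, 50, 80, 75, 65, 90, 90),
      Deliveries = c(4, 3, 4, 2, 2, 2, 3, 4, 3, 2),
      Time       = c(9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1)
    )

    # Fit the model Time = b0 + b1*Miles + b2*Deliveries by least squares
    fit <- lm(Time ~ Miles + Deliveries, data = butler)
    summary(fit)   # coefficients, R-squared = 0.9038, adjusted R-squared = 0.8763

    # Predict travel time for a 110-mile assignment with 3 deliveries
    predict(fit, newdata = data.frame(Miles = 110, Deliveries = 3))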


Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

SST = SSR + SSE

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination (R square)

The coefficient of determination measures the proportion of the variance in the response variable y that can be predicted from the predictor variables x1, x2, …, xp. It shows how accurately the regression line fits the data.

It measures the goodness of fit.

The value of the coefficient of determination varies from 0 to 1. A value of 0 means there is no linear relationship between the predictor variables x1, x2, …, xp and the response variable y, and a value of 1 means there is a perfect linear relationship between input and output.

R² is calculated using sums of squares (SS):

R² = SSR/SST

Types of sums of squares:
1. Regression sum of squares (SSR)
2. Residual (error) sum of squares (SSE)
3. Total sum of squares (SST)

Relationships:
SST = SSR + SSE
SSR = SSR(x1) + SSR(x2) + … + SSR(xp)
Adjusted Multiple Coefficient
of Determination

Adding independent variables, even ones that are


not statistically significant, causes the prediction
errors to become smaller, thus reducing the sum of
squares due to error, SSE.
Because SSR = SST – SSE, when SSE becomes smaller,
SSR becomes larger, causing R2 = SSR/SST to
increase.
The adjusted multiple coefficient of determination
compensates for the number of independent
variables in the model.
Adjusted Multiple Coefficient
of Determination

Ra² = 1 − (1 − R²) × (n − 1)/(n − p − 1)
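A quick check of the Butler Trucking value (R² = 0.9038, n = 10, p = 2), sketched in R:

    # Adjusted R-squared: Ra^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
    adj_r2(0.9038, n = 10, p = 2)   # 0.8763, matching the output above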
Example – Butler Trucking Company – ANOVA Summary

Relationship
SST = SSR + SSE
SSR = SSR(x1) + SSR(x2) + … + SSR(xp)

SSR = 15.871 + 5.729 = 21.6


SSE = 2.299

SST = 15.871 + 5.729 + 2.299= 23.899

MSR = SSR / k
MSR(x1) = 15.871 / 1 = 15.871   (degrees of freedom for SSR(miles): k = 1)
MSR(x2) = 5.729 / 1 = 5.729     (degrees of freedom for SSR(deliveries): k = 1)
MSE = SSE / (n − k − 1) = 2.299 / 7 = 0.328
F value for x1 = 15.871 / 0.328 = 48.3
F value for x2 = 5.729 / 0.328 = 17.4

where n = 10 (total number of observations) and k = 2 (total number of independent variables)
Correlation Coefficient (Multiple R)

Correlation coefficient = (sign of slope) × SQRT(coefficient of determination)
                        = (sign of slope) × SQRT(R²)
                        = +SQRT(0.9038) = +0.9506

R² = SSR/SST = 21.6/23.899 = 0.9038

Adjusted R-squared is computed with n = 10 and k = 2 (number of independent variables).
Multiple Regression Model

Example: Programmer Salary Survey


A software firm collected data for a sample of 20
computer programmers. A suggestion was made that
regression analysis could be used to determine if
salary was related to the years of experience and the
score on the firm’s programmer aptitude test.
The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a
sample of 20 programmers are shown on the next slide.
Multiple Regression Model

Exper. Test Salary Exper. Test Salary


(Yrs.) Score ($000s) (Yrs.) Score ($000s)
4 78 24.0 9 88 38.0
7 100 43.0 2 73 26.6
1 86 23.7 10 75 36.2
5 82 34.3 5 81 31.6
8 86 35.8 6 74 29.0
10 84 38.0 8 87 34.0
0 75 22.2 4 79 30.1
1 80 23.1 6 94 33.9
6 83 30.0 3 70 28.2
6 91 33.0 3 89 30.0
Multiple Regression Model

Suppose we believe that salary (y) is related to


the years of experience (x1) and the score on the
programmer aptitude test (x2) by the following
regression model:

ŷ = b0 + b1x1 + b2x2

where
y = annual salary ($000)
x1 = years of experience
x2 = score on programmer aptitude test
Solving for the Estimates of β0, β1, β2

Input Data                Least Squares              Output
x1    x2    y             Computer Package           b0 = ?
4     78    24      →     for Solving          →     b1 = ?
7     100   43            Multiple Regression        b2 = ?
.     .     .             Problems                   R² = ?
3     89    30                                       etc.
Solving for the Estimates of b0, b1, b2

Regression Equation Output

Predictor Coef SE Coef T p


Constant 3.17394 6.15607 0.5156 0.61279
Experience 1.4039 0.19857 7.0702 1.9E-06
Test Score 0.25089 0.07735 3.2433 0.00478
Estimated Regression Equation

SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)

Note: The predicted salary will be in thousands of dollars.

Now predict the salary for experience = 5 years and a test score of 70.
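Worked out by hand from the estimated equation (remember the result is in $1000s):

    # SALARY = 3.174 + 1.404*(EXPER) + 0.251*(SCORE)
    3.174 + 1.404 * 5 + 0.251 * 70   # = 3.174 + 7.02 + 17.57 = 27.764, i.e. about $27,764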
Interpreting the Coefficients

In multiple regression analysis, we interpret each


regression coefficient as follows:

bi represents an estimate of the change in y


corresponding to a 1-unit increase in xi when all
other independent variables are held constant.
Interpreting the Coefficients

b1 = 1.404

Salary is expected to increase by $1,404 for


each additional year of experience (when the variable
score on the programmer aptitude test is held constant).
Interpreting the Coefficients

b2 = 0.251

Salary is expected to increase by $251 for each


additional point scored on the programmer aptitude
test (when the variable years of experience is held
constant).
MODEL FITTING

Model fitting is a measure of how well a machine learning model generalizes to data similar
to that on which it was trained. A model that is well fitted produces more accurate outcomes.

Model fitting is the essence of machine learning. If your model doesn’t fit your data correctly,
the outcomes it produces will not be accurate enough to be useful for practical decision-
making.

A measure of model fit tells us how well our regression line captures the underlying
data.
Testing for Significance

In simple linear regression, the F and t tests provide


the same conclusion.

In multiple regression, the F and t tests have different


purposes.
Testing for Significance: F Test

The F test is used to determine whether a significant


relationship exists between the dependent variable
and the set of all the independent variables.

The F test is referred to as the test for overall


significance.
Testing for Significance: t Test

If the F test shows an overall significance, the t test is


used to determine whether each of the individual
independent variables is significant.

A separate t test is conducted for each of the


independent variables in the model.

We refer to each of these t tests as a test for individual


significance.
Testing for Significance: F Test

Hypotheses       H0: β1 = β2 = . . . = βp = 0
                 Ha: One or more of the parameters is not equal to zero.

Test Statistic   F = MSR/MSE

Rejection Rule   Reject H0 if p-value < α or if F > Fα,
                 where Fα is based on an F distribution with p d.f.
                 in the numerator and n − p − 1 d.f. in the denominator.
F Test for Overall Significance

Hypotheses       H0: β1 = β2 = 0
                 Ha: One or both of the parameters is not equal to zero.

Rejection Rule   For α = .05 and d.f. = 2, 17: F.05 = 3.59
                 Reject H0 if p-value < .05 or F > 3.59
TESTING FOR SIGNIFICANCE
In a multiple regression equation, the mean or expected value of y is a linear function of the x's:

E(y) = β0 + β1x1 + β2x2 + … + βpxp

If β1 = β2 = … = βp = 0, then E(y) = β0 + (0)x1 + … + (0)xp = β0, so the x's carry no information about y.

Hypotheses
H0: β1 = β2 = … = βp = 0
Ha: One or more of the parameters is not equal to zero

Two tests are commonly used. Both require an estimate of σ², the variance of the error term ε in the regression model:

t test
F test
F TEST

F = MSR/MSE = 10.8/0.328 = 32.92

SSR for miles and deliveries = 15.871 + 5.729 = 21.6
MSR for miles and deliveries = SSR/p = 21.6/2 = 10.8
MSE = SSE/(n − p − 1) = 2.299/7 = 0.328

Critical value F.01 = 9.55 for 2 degrees of freedom in the numerator and 7 in the denominator.
The computed F = 32.92 is much larger than the critical value, so we reject H0: β1 = β2 = 0.
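The critical value quoted above can be confirmed with base R's quantile function:

    qf(0.99, df1 = 2, df2 = 7)   # 9.55, the F critical value at alpha = .01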
Testing for Significance: t Test

Hypotheses       H0: βi = 0
                 Ha: βi ≠ 0

Test Statistic   t = bi / s(bi)

Rejection Rule   Reject H0 if p-value < α or
                 if t < −tα/2 or t > tα/2, where tα/2
                 is based on a t distribution
                 with n − p − 1 degrees of freedom.
t Test for Significance
of Individual Parameters

Hypotheses       H0: βi = 0
                 Ha: βi ≠ 0

Rejection Rule   For α = .05 and d.f. = 17: t.025 = 2.11
                 Reject H0 if p-value < .05, or
                 if t < −2.11 or t > 2.11
t Test – for each parameter

t = b1/s(b1) = 0.06113/0.00989 = 6.18
t = b2/s(b2) = 0.923/0.221 = 4.18

Critical value t.005 = 3.499 for 7 degrees of freedom.

For b1: 6.18 > 3.499
For b2: 4.18 > 3.499
We reject H0 for both input parameters.
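Again, the critical value is a one-liner in base R:

    qt(0.995, df = 7)   # 3.499, the t critical value with upper-tail area .005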
R OUTPUT FOR PRODUCTION DATA

t = b1/s(b1), so s(b1) = b1/t and b1 = t × s(b1)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1209.433 8066.345 0.150 0.882
Month -3.286 78.110 -0.042 0.967
MachineHours 44.707 4.237 10.551 1.71e-10 ***
ProductionRuns 931.803 107.050 8.704 6.86e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4222 on 24 degrees of freedom
Multiple R-squared: 0.8639, Adjusted R-squared: 0.8469
F-statistic: 50.79 on 3 and 24 DF, p-value: 1.519e-10

If the number of observations is not given, recover it from the residual degrees of freedom:
24 = n − k − 1 = n − 3 − 1, so n = 28.
Categorical Independent Variables

In many situations we must work with categorical


independent variables such as gender (male, female),
method of payment (cash, check, credit card), etc.

For example, x2 might represent gender where x2 = 0


indicates male and x2 = 1 indicates female.

In this case, x2 is called a dummy or indicator variable.


CATEGORICAL INDEPENDENT VARIABLES IN REGRESSION
ANALYSIS
A categorical variable (also called a qualitative variable) refers to a characteristic that can't be quantified.
Categorical variables can be either nominal or ordinal.

Example 1:

Risk  Age  Pressure  Smoker
12    57   150       Yes
23    48   169       Yes
13    60   157       No
55    52   144       Yes
28    71   154       No
17    66   162       No
19    69   143       Yes

In this example, "Smoker" is a categorical variable (nominal data).

Example 2:

ID  Gender  Age  Income   Education
1   Male    25   $50,000  High School
2   Female  30   $60,000  College
3   Male    28   $55,000  College
4   Female  35   $70,000  Graduate

In this example, "Gender" and "Education" are categorical variables. "Gender" is nominal, while "Education" is ordinal.

Categorical variables have to be converted into numerical variables before they can be used for regression.
REGRESSION ANALYSIS WITH DUMMY VARIABLES

Linear regression uses quantitative variables referred to as "numeric" variables; these are
variables that represent a measurable quantity.

Examples include:
•Number of square feet in a house
•The population size of a city
•Age of an individual
Dummy Variables: Numeric variables used in regression analysis to represent categorical
data that can only take on one of two values: zero or one.

But we may wish to use categorical variables as predictor variables. These are variables that
take on names or labels and can fit into categories. Examples include:
•Gender (e.g. “male”, “female”)
•Marital status (e.g. “married”, “single”, “divorced”)

The solution is to use dummy variables.


Create these variables that take on one of two values: zero or one.

The number of dummy variables we must create equals k-1 where k is the number of different
values the categorical variable can take on.
Example 1: Create a Dummy Variable with Only Two Values

The categorical variable Gender can


take on two different values (“Male” or
“Female”),
we only need to create k-1 = 2-1 = 1
dummy variable.

To create this dummy variable, we can


choose one of the values (“Male” or
“Female”) to represent 0 and the other
to represent 1.
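A sketch in R, reusing the small Gender/Age/Income table from the earlier example (the data frame and column names are our own):

    # Gender takes two values, so k - 1 = 2 - 1 = 1 dummy is needed
    df <- data.frame(
      Gender = c("Male", "Female", "Male", "Female"),
      Age    = c(25, 30, 28, 35),
      Income = c(50000, 60000, 55000, 70000)
    )

    # Dummy-code Gender: 0 = Male (baseline), 1 = Female
    df$GenderFemale <- ifelse(df$Gender == "Female", 1, 0)

    # The dummy can now enter the regression like any numeric variable
    fit <- lm(Income ~ Age + GenderFemale, data = df)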
Example 2: Create a Dummy Variable with Multiple Values
Suppose we have the following dataset and we would like to use marital status and age to
predict income:

Since it is currently a categorical


variable that can take on three
different values (“Single”, “Married”,
or “Divorced”), we need to create k-1
= 3-1 = 2 dummy variables.

Let “Single” be our baseline value


since it occurs most often.
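A sketch of the two-dummy coding in R; the data values here are made up for illustration:

    # Marital status takes three values, so k - 1 = 3 - 1 = 2 dummies are needed
    df <- data.frame(
      MaritalStatus = c("Single", "Married", "Single", "Divorced", "Single"),
      Age           = c(25, 40, 31, 52, 28),
      Income        = c(45000, 70000, 52000, 64000, 48000)   # illustrative values
    )

    # Explicit dummies: the baseline "Single" gets 0 on both
    df$Married  <- ifelse(df$MaritalStatus == "Married", 1, 0)
    df$Divorced <- ifelse(df$MaritalStatus == "Divorced", 1, 0)

    # Or let R create the dummies: relevel() sets "Single" as the baseline level
    df$MaritalStatus <- relevel(factor(df$MaritalStatus), ref = "Single")
    fit <- lm(Income ~ Age + MaritalStatus, data = df)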
CATEGORICAL DATASET – EXAMPLE (CONVERT INTO NUMERICAL)

Fund Name                        Fund Type  Net Asset Value ($)  5 Year Average Return (%)  Expense Ratio (%)  Morningstar Rank
Amer Cent Inc & Growth Inv       DE         28.88                12.39                      0.67               2-Star
American Century Intl. Disc      IE         14.37                30.53                      1.41               3-Star
American Century Tax-Free Bond   FI         10.73                 3.34                      0.49               4-Star
American Century Ultra           DE         24.94                10.88                      0.99               3-Star
Ariel                            DE         46.39                11.32                      1.03               2-Star
Artisan Intl Val                 IE         25.52                24.95                      1.23               3-Star
Artisan Small Cap                DE         16.92                15.67                      1.18               3-Star
Baron Asset                      FI         50.67                16.77                      1.31               5-Star
Brandywine                       DE         36.58                18.14                      1.08               4-Star
Brown Cap Small                  IE         35.73                15.85                      1.20               4-Star

After conversion (Fund Type coded DE = 1, IE = 2, FI = 3; Morningstar Rank as a number):

Fund Name                        Fund Type  Net Asset Value ($)  5 Year Average Return (%)  Expense Ratio (%)  Morningstar Rank
Amer Cent Inc & Growth Inv       1          28.88                12.39                      0.67               2
American Century Intl. Disc      2          14.37                30.53                      1.41               3
American Century Tax-Free Bond   3          10.73                 3.34                      0.49               4
American Century Ultra           1          24.94                10.88                      0.99               3
Ariel                            1          46.39                11.32                      1.03               2
Artisan Intl Val                 2          25.52                24.95                      1.23               3
Artisan Small Cap                1          16.92                15.67                      1.18               3
Baron Asset                      3          50.67                16.77                      1.31               5
Brandywine                       1          36.58                18.14                      1.08               4
Brown Cap Small                  2          35.73                15.85                      1.20               4
ASSUMPTIONS OF MULTIPLE LINEAR REGRESSION

Assumption 1: Linear Relationship


Multiple linear regression assumes that there is a linear relationship between each predictor
variable and the response variable.

How to Determine if this Assumption is Met


The easiest way to determine if this assumption is met is to create a scatter plot of each
predictor variable against the response variable, or to examine the correlation coefficient values.
Assumption 2. No Multicollinearity: None of the predictor or input variables are highly correlated with
each other.

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated,
making it difficult to determine the individual contribution of each variable to the dependent variable.

When one or more predictor variables are highly correlated, the regression model suffers
from multicollinearity, which causes the coefficient estimates in the model to become unreliable.

How to Determine if this Assumption is Met

Create a correlation matrix, or calculate the VIF (variance inflation factor) value for each
predictor variable.
VIF values start at 1 and have no upper limit. A VIF of 1 indicates no correlation.

•VIF < 5: Low multicollinearity.


•5 < VIF < 10: Moderate multicollinearity.
•VIF > 10: High multicollinearity.
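A sketch using the car package's vif() function, applied to the Butler model fitted earlier (one value is reported per predictor):

    library(car)   # install.packages("car") if needed

    fit <- lm(Time ~ Miles + Deliveries, data = butler)   # butler from the earlier sketch
    vif(fit)       # values above 10 signal high multicollinearity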
Assumption 3: Independence or No Autocorrelation in the Residuals
Multiple linear regression assumes that each observation in the dataset is independent.

In the residuals, the value at observation i is independent of the value at observation i + 1.

No correlation between consecutive residuals. It’s assumed that the residuals are independent.

How to Determine if this Assumption is Met

One way to determine if this assumption is met is to perform a Durbin-Watson test, which is
used to detect the presence of autocorrelation in the residuals of a regression.

The Durbin-Watson test is a statistical test used to detect the presence of autocorrelation in the
residuals of a regression analysis.
In general, if the test value is less than 1.5 (positive autocorrelation) or greater than 2.5 (negative autocorrelation), there is potentially a serious autocorrelation problem.
If the test value is between 1.5 and 2.5, autocorrelation is likely not a cause for concern.

The DW statistic always produces


a value between 0 and 4.

Correlation vs. Autocorrelation


Correlation measures the relationship between two variables, whereas autocorrelation measures the
relationship of a variable with lagged values of itself.
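A sketch of the test using the lmtest package's dwtest() (car::durbinWatsonTest() is an equivalent alternative), again on the model fitted earlier:

    library(lmtest)   # install.packages("lmtest") if needed

    fit <- lm(Time ~ Miles + Deliveries, data = butler)
    dwtest(fit)       # DW statistic near 2 suggests no autocorrelation in the residuals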
Assumption 4: The variance of the residuals is constant (Homoscedasticity)

Multiple linear regression assumes that the residuals have constant variance at every point in
the linear model.

The amount of error in the residuals is similar at each point of the linear model.

How to Determine if this Assumption is Met


The simplest way to determine if this assumption is met is to create a plot of standardized
residuals versus predicted values.
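A minimal sketch of that plot in base R, reusing the fitted model from the sketches above (rstandard() returns the standardized residuals):

    plot(fitted(fit), rstandard(fit),
         xlab = "Predicted values", ylab = "Standardized residuals")
    abline(h = 0, lty = 2)   # residuals should scatter in an even band around this line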
Assumption 5: Multivariate Normality

Multiple linear regression assumes that the residuals of the model are normally distributed.

How to Determine if this Assumption is Met


There are two common ways to check if this assumption is met:
1. Compare the values of the residuals with their Z scores
2. Check the assumption visually using a histogram (sketched below)
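A sketch of the visual check in base R; a Q-Q plot is a common companion to the histogram:

    hist(residuals(fit))     # roughly bell-shaped if the normality assumption holds
    qqnorm(residuals(fit)); qqline(residuals(fit))   # points near the line support normality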
NONLINEAR REGRESSION
Nonlinear regression is a mathematical model that fits an equation to data using a fitted curve.

Linear regression uses a straight-line equation, while nonlinear regression relates the two variables through a nonlinear (curved) relationship.

[Figure: three example fits; the first is linear and the other two are nonlinear.]
COEFFICIENTS

[Exercise, based on a regression output not reproduced here:]
1. Find the t value for variable Time
2. Find the standard error for variable Share
3. Find the coefficient for variable Work
4. Comment on the VIF value for variable Rating
5. Calculate the VIF value for a given R square value
            Df   Sum Sq    Mean Sq   F value   Pr(>F)
SquareFeet   1   2803636   ______    ______    2.24e-13 ***
Bedrooms     1    799206   ______    ______    2.38e-05 ***
Bathrooms    1    427256   ______    ______    0.00168 **
Residuals  124   5138423   ______

1. Find the value for Multiple R,


2. R square,
3. Adjusted R square and
4. standard error.
5. F value
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -56.4083 172.0038 -0.328 0.743504
SquareFeet 0.3564 _____ 3.341 0.001102 **
Bedrooms 104.5993 29.1233 ______ 0.000472 ***
Bathrooms _______ 42.1867 3.211 0.001685 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 203.6 on 124 degrees of freedom


Multiple R-squared: 0.4396, Adjusted R-squared: ____________
F-statistic: 32.42 on 3 and 124 DF, p-value: 1.535e-15

1. Fill in all the blanks in the table.


2. Create a least squares line (equation) for multiple linear regression for the given analysis.
3. Predict the value for price if the area is 2000 sq ft, No of bedrooms is 3 and No of bathrooms is 2
Q – Below is the regression output for annual post-college earnings (USD), based on the college's
annual education cost (USD), its graduation rate (%), and debt, the percentage of students paying loans (%).

1. Interpret the value of the coefficient of


determination.

2. Comment on the significance of predictors


mentioning their hypotheses and specific p-value
at a 5% level of significance.

3. Predict the earnings if the cost = 25000,


grad = 60, and debt = 80.

4. Interpret the values of VIF for Grad.


VIF Value for Grad: 22.56
EXAMPLE -

VIF Value –

Miles 1.026963
Deliveries 1.026963

1. Interpret the significance of each input based on the p-value.


2. Interpret based on R2 value.
3. Write the equation.
4. Find output where the value of miles is 10 and deliveries is 30.
VIF Value –

Month      MachineHours  ProductionRuns
1.045953   1.063087      1.102359

Durbin-Watson test –

Autocorrelation  D-W Statistic  p-value
1.6143758        1.304288       0.028

1. Interpret the significance of each input based on the p-value.
2. Interpret based on the VIF value of each predictor.
3. Interpret based on the Durbin-Watson value.
