
Elementary Linear Regression Analysis

“Regression towards mediocrity” - the term “Regression” was first used (1885, 1886, 1889) by Sir Francis Galton (1822-1911), a famous geneticist.

According to some surveys, Regression is one of the most used as well as most misused methods.

In modern terminology, regression is classified under supervised learning in machine learning.

1. The Problem Statement/Objectives

There are two main objectives for which we may wish to do regression.

a. Prediction: In many situations, a set of inputs, 𝑋1, 𝑋2, …, 𝑋𝑝, is readily available, but the output, 𝑌, cannot be easily obtained. In this setting, we can predict 𝑌 using regression.

b. Inference: We are often interested in understanding the association between 𝑌 and 𝑋1, 𝑋2, …, 𝑋𝑝.

Example

1. Is there a relationship between advertising budget and sales?

2. How strong is the relationship between advertising budget and sales?

3. Which media are associated with sales?

4. How large is the association between each medium and sales?

5. How accurately can we predict future sales?

6. Is the relationship linear?

7. Is there constructive interaction among the advertising media?

2. Data

( Y₁ )   ( X₁₁  X₂₁  …  Xₚ₁ )
( Y₂ )   ( X₁₂  X₂₂  …  Xₚ₂ )
(  ⋮ ) , (  ⋮    ⋮        ⋮  ).
( Yₙ )   ( X₁ₙ  X₂ₙ  …  Xₚₙ )

The data represent a simple random sample; 𝑛 = sample size. The variable 𝑌 represents a quantity that we wish to forecast. The variables 𝑋1, 𝑋2, …, 𝑋𝑝 represent quantitative or qualitative variables that we think may be related to 𝑌. The data could be collected over time or locations.

Example

𝑌 = Sales of a particular product (in thousands of units).

𝑝 = 3.

𝑋1 = amount (in thousands of dollars) spent on TV advertisement.

𝑋2 = amount (in thousands of dollars) spent on Radio advertisement.

𝑋3 = amount (in thousands of dollars) spent on Newspaper advertisement.

Snapshot of the Example Data (Source: Kaggle) (Advertisement Data.xlsx)

Sales TV Radio Newspaper


22.1 230.1 37.8 69.2
10.4 44.5 39.3 45.1
12 17.2 45.9 69.3
16.5 151.5 41.3 58.5
17.9 180.8 10.8 58.4
7.2 8.7 48.9 75
11.8 57.5 32.8 23.5
13.2 120.2 19.6 11.6
4.8 8.6 2.1 1
15.6 199.8 2.6 21.2
. . . .
. . . .
3. The Model

a. Mathematical Description

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 + 𝜀, or

𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝑝 𝑋𝑝𝑖 + 𝜀𝑖 , 𝑖 = 1, 2, … , 𝑛

We shall call

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 ,

the TRUE relationship between 𝑌 and the set of variables, 𝑋1 , 𝑋2 , … , 𝑋𝑝 .

b. Graphical Representation

Model: 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝜀𝑖 , 𝑖 = 1, 2, … , 𝑛

Figure 1: Linear Regression with One Independent Variable - Data and Fitted
Line. Each black dot is a pair (𝑌𝑖 , 𝑋1𝑖 ).

Model: 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + 𝜀𝑖 , 𝑖 = 1, 2, … , 𝑛

Figure 2: Linear Regression with Two Independent Variables - Data and Fitted Plane. Each red dot is a triplet (𝑌𝑖 , 𝑋1𝑖 , 𝑋2𝑖 ).
4. Components of the Model

a. Independent Variable(s): 𝑋1, 𝑋2, …, 𝑋𝑝 – also called predictors, input variables, or features.

b. Dependent Variable: 𝑌 – also called the response or output variable.

c. Coefficients of the Independent Variables in the TRUE model: 𝛽0, 𝛽1, 𝛽2, …, 𝛽𝑝.

d. The Error Term: 𝜀 – assumed random.

5. Assumptions of the Model

a. 𝑌 is modelled as a random variable.

b. Each 𝑋𝑖 is, usually, assumed to be deterministic.

c. Linearity (Non-zero Correlation Coefficient): Each 𝑋𝑖 is linearly related with 𝑌. 𝐶𝑜𝑟𝑟(𝑋𝑖 , 𝑌) ≠ 0.

d. No Linear Correlation between Independent Variables (No Multi-Collinearity): 𝐶𝑜𝑟𝑟(𝑋𝑖 , 𝑋𝑗 ) = 0, for all pairs (𝑖, 𝑗), 𝑖 ≠ 𝑗.

e. No Autocorrelation between Errors: 𝐶𝑜𝑟𝑟(𝜀𝑖 , 𝜀𝑗 ) = 0, for all pairs (𝑖, 𝑗), 𝑖 ≠ 𝑗.

f. No Correlation between Independent Variables and Error Terms: 𝐶𝑜𝑟𝑟(𝑋𝑖 , 𝜀𝑖 ) = 0, for all 𝑖.

g. Mean of Errors is Zero: 𝐸(𝜀𝑖 ) = 0, for 𝑖 = 1, 2, … , 𝑛. This implies

𝐸(𝑌𝑖 |𝑋1𝑖 , 𝑋2𝑖 , … , 𝑋𝑝𝑖 ) = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝑝 𝑋𝑝𝑖 , 𝑖 = 1, 2, … , 𝑛.

h. Constant Variance for Errors: 𝑉𝑎𝑟(𝜀𝑖 ) = 𝜎², for 𝑖 = 1, 2, … , 𝑛. This implies

𝑉𝑎𝑟(𝑌𝑖 |𝑋1𝑖 , 𝑋2𝑖 , … , 𝑋𝑝𝑖 ) = 𝜎², for 𝑖 = 1, 2, … , 𝑛.

i. Normality of Errors: 𝜀𝑖 ~ 𝑁(0, 𝜎²), for 𝑖 = 1, 2, … , 𝑛. As a consequence of this and the above,

𝑌𝑖 |𝑋1𝑖 , 𝑋2𝑖 , … , 𝑋𝑝𝑖 ~ 𝑁(𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝑝 𝑋𝑝𝑖 , 𝜎²), for 𝑖 = 1, 2, … , 𝑛.

This means that each 𝑌𝑖 , given 𝑿𝟏𝒊 , 𝑿𝟐𝒊 , … , 𝑿𝒑𝒊 , is normally distributed with mean as
a linear function of 𝑋1𝑖 , 𝑋2𝑖 , … , 𝑋𝑝𝑖 and constant variance 𝜎 2 .

Figure 3: The Assumptions in Linear Regression. The solid line represents the
conditional mean of Y, given 𝑋1 , 𝑋2 , … , 𝑋𝑝 . The purple curves represent the
conditional distributions of Y, given 𝑋1 , 𝑋2 , … , 𝑋𝑝 .

j. 𝒏 > 𝒑 (Thumb Rule: 𝑛 should be at least 1.5 ∗ 𝑝 to 2 ∗ 𝑝.)

6. Type of Variables

a. Quantitative – E.g., Age, Income, Sales, Stock Return etc.; usually continuous. In the linear regression setting, 𝑌 is quantitative.

b. Categorical – E.g., Education Level, Brand, Season, Gender etc. Independent variables 𝑋1, 𝑋2, …, 𝑋𝑝 can be categorical.

7. Parameters of the Model

a. Coefficients: 𝛽0, 𝛽1, 𝛽2, …, 𝛽𝑝 – coefficients of 𝑋1, 𝑋2, …, 𝑋𝑝 in the TRUE model.

b. Error Variance: 𝜎² – TRUE common variance of the errors.

8. Doing the Regression

a. Scatter Plot (plot of 𝒀 vs 𝑿): First, look at the scatter plots. How large is the association between each medium and sales?

Figure 4 Scatter Plots. Top: Sales vs TV. Middle: Sales vs Newspaper. Bottom: Sales vs Radio.

What do we see in the scatter plots?

1. Sales clearly go up as TV advertisement expenditure goes up. Positive relationship.

2. The relationship between Sales and TV is linear (approximately).

3. If you draw an imaginary line through the points, the points are quite close to
the line. Strong linear relationship.

4. Sales and expenditure on Newspaper do not seem to have a strong
relationship. One can, however, find a mild linear positive relation.

5. Sales and Radio seem to have a positive linear relationship that is not as
strong as that with TV, but stronger than that with Newspaper.

6. Specifically,

Correlation Coefficients
TV Radio Newspaper
Sales 0.901208 0.349631 0.15796
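For readers who want to reproduce these numbers in code, here is a minimal Python sketch (an assumption: the data sit in a file named Advertisement Data.xlsx with the column names shown in the snapshot above):

    import pandas as pd

    # Load the advertising data; file and column names are assumed from the snapshot above.
    ads = pd.read_excel("Advertisement Data.xlsx")
    # Correlation of Sales with each advertising medium.
    print(ads[["TV", "Radio", "Newspaper"]].corrwith(ads["Sales"]))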

9. Estimation of the Parameters

Ordinary Least Squares Method (OLS) (You may skip this section if you wish.)

1. Note that we do not need any of the assumptions stated above for the OLS
method.

2. Let us first consider a model with only one independent variable. Here 𝒑 = 𝟏.

The model is:

𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝜀𝑖 , 𝑖 = 1, 2, … , 𝑛.

The following are the data points.

(𝑌1 , 𝑋11 ), (𝑌2 , 𝑋12 ), … , (𝑌𝑛 , 𝑋1𝑛 ).

Let us write the Residual Sum of Squares (RSS) as

RSS = ∑ᵢ₌₁ⁿ ε̂ᵢ² = ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² = ∑ᵢ₌₁ⁿ (Yᵢ − β̂₀ − β̂₁X₁ᵢ)².

The β̂ (pronounced as “β-hat”) denotes an estimate¹ of β.

We find the estimates by minimising the RSS with respect to β̂₀ and β̂₁. The estimates are:

β̂₁ = ∑ᵢ₌₁ⁿ (X₁ᵢ − X̄₁)(Yᵢ − Ȳ) / ∑ᵢ₌₁ⁿ (X₁ᵢ − X̄₁)²,   β̂₀ = Ȳ − β̂₁X̄₁, and

σ̂² = (1/(n − 2)) ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²,   ε̂ᵢ = Yᵢ − Ŷᵢ,   i = 1, 2, …, n.

¹ Technically, β̂ and other such quantities are called “Estimators”. An estimator has a probability distribution. For a given sample, the value of the estimator is called an estimate.
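As a minimal sketch (not part of the original notes), the p = 1 formulas above can be computed directly in Python; x1 and y stand for the observed X₁ and Y columns:

    import numpy as np

    def ols_simple(x1, y):
        """OLS estimates for the model Y = b0 + b1*X1 + error."""
        x1, y = np.asarray(x1, float), np.asarray(y, float)
        n = len(y)
        b1 = np.sum((x1 - x1.mean()) * (y - y.mean())) / np.sum((x1 - x1.mean()) ** 2)
        b0 = y.mean() - b1 * x1.mean()
        resid = y - (b0 + b1 * x1)                  # estimated errors
        sigma2_hat = np.sum(resid ** 2) / (n - 2)   # RSS / (n - 2)
        return b0, b1, sigma2_hat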

3. When we have 𝒑 ≥ 𝟏 independent variables, writing

    ( Y₁ )        ( 1  X₁₁  X₂₁  …  Xₚ₁ )        ( β₀ )
Y = ( Y₂ ) ,  X = ( 1  X₁₂  X₂₂  …  Xₚ₂ ) ,  β = ( β₁ ) ,
    (  ⋮ )        ( ⋮   ⋮    ⋮        ⋮  )        (  ⋮ )
    ( Yₙ )        ( 1  X₁ₙ  X₂ₙ  …  Xₚₙ )        ( βₚ )

we can show that

β̂ = (β̂₀, β̂₁, …, β̂ₚ)′ = (X′X)⁻¹X′Y;   σ̂² = (1/(n − p − 1)) ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)².

𝑋′ denotes the transpose of the matrix 𝑋.
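A minimal Python sketch of this matrix form (assuming X_raw is an n × p array of predictor values) is:

    import numpy as np

    def ols_matrix(X_raw, y):
        """OLS in matrix form: beta_hat = (X'X)^(-1) X'Y with an intercept column."""
        X_raw, y = np.asarray(X_raw, float), np.asarray(y, float)
        n, p = X_raw.shape
        X = np.column_stack([np.ones(n), X_raw])       # prepend the column of ones
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # avoids forming the inverse explicitly
        resid = y - X @ beta_hat
        sigma2_hat = resid @ resid / (n - p - 1)       # unbiased estimate of sigma^2
        return beta_hat, sigma2_hat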

4. Geometry of Least Squares

[Figure: Geometry of least squares for p = 1 and n = 3. The observation vector (Y₁, Y₂, Y₃)′ is projected onto the space generated by β₀(1, 1, 1)′ + β₁(X₁₁, X₁₂, X₁₃)′ for various values of β₀ and β₁. The projection is the fitted vector (Ŷ₁, Ŷ₂, Ŷ₃)′, and the residual vector (ε̂₁, ε̂₂, ε̂₃)′ joins the fitted vector to the observation vector.]

a. Estimation for the Example Data

𝑌̂ = 𝛽̂0 + 𝛽̂1 𝑋1 + 𝛽̂2 𝑋2 + ⋯ + 𝛽̂𝑝 𝑋𝑝

𝑌̂𝑖 = 𝛽̂0 + 𝛽̂1 𝑋1𝑖 + 𝛽̂2 𝑋2𝑖 + ⋯ + 𝛽̂𝑝 𝑋𝑝𝑖 , 𝑖 = 1, 2, … , 𝑛

Is there a relationship between TV advertising budget and Sales?

We have used JMP to do the regression analysis for the example data. One can use any other software to do the same analysis.
In the example, we take Y = Sales, X₁ = TV. β̂₀ = 6.9748, β̂₁ = 0.0555, σ̂ = 2.2957. The estimated regression equation is:

Ŝales = 6.9748 + 0.0555 ∗ TV

b. Interpret the Estimated Parameters

1. Intercept: β̂₀ = 6.9748. This means that (even) when TV = 0, i.e., there is no expenditure on TV advertisement, the estimated average Sales will be 6.9748 thousand units of the product.

2. Slope: β̂₁ = 0.0555. This represents the increase (decrease) in average Sales when TV advertisement expenditure is increased (decreased) by one unit. In other words, if TV advertisement expenditure is increased (decreased) by $1000, then the average sales will go up (go down) by about 55.5 units.

3. Error Variance: σ̂² = 2.2957² = 5.27. This is the estimated common variance of the error distribution. The smaller the estimated SE of the errors, the closer the data points are to the fitted line. We shall see below how to use it.

10. How Good is the Estimated Model? How strong is the relationship between TV
advertising budget and Sales?

i. R-Squared (𝑹𝟐 )
∑ᵢ₌₁ⁿ (Yᵢ − Ȳ)² = ∑ᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² + ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²;

𝑇𝑆𝑆 = 𝑅𝑒𝑔𝑆𝑆 + 𝑅𝑆𝑆.

TSS = ∑ᵢ₌₁ⁿ (Yᵢ − Ȳ)² is the total sum of squares in Y. TSS measures the total variability in the response variable Y.

RegSS measures the amount of variability in Y that is captured by the regression model.

Finally, RSS is the amount of variability in Y that is left unexplained by the regression model.
R² = RegSS/TSS = (TSS − RSS)/TSS = 1 − RSS/TSS.   0 ≤ R² ≤ 1.

R² can, therefore, be interpreted as the fraction (percentage, when multiplied by 100) of the variability in Y that is explained by the regression model, i.e., by the set of independent variables in the model.

The higher the value of 𝑅 2 , the better the regression model, generally
speaking.

Note that when we have only one independent variable, say 𝑋1 , in the
regression model, then 𝑅 2 = 𝑟 2 , where 𝑟 = 𝐶𝑜𝑟𝑟(𝑌, 𝑋1 ).

E.g., in the regression of Sales on TV, 𝑅 2 = 0.812176 = 0.9012082 = 𝑟 2 .

For the one variable regression model between Sales and TV, 𝑅 2 = 0.812176.
This means 81.22% of variation in Sales is explained by the regression model
or the independent variable, TV.

Therefore, we can conclude that this is a good regression model. Hold on!

ii. Adjusted R-Squared (Adjusted R²)

Since RSS always decreases as more variables are added to the model, R² always increases as more variables are added.

Adjusted R² = 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)].

A large value of adjusted R² indicates a model with a small error. Maximizing the adjusted R² is equivalent to minimizing RSS/(n − p − 1). While RSS always decreases as the number of variables in the model increases, RSS/(n − p − 1) may increase or decrease, due to the presence of p in the denominator.

The intuition behind the adjusted 𝑅 2 is that once all of the right variables have
been included in the model, adding additional noise variables will lead to only
an exceedingly small decrease in RSS. Since adding noise variables leads to
an increase in 𝑝, such variables will lead to an increase in 𝑅𝑆𝑆/(𝑛 − 𝑝 − 1),
and consequently a decrease in the adjusted 𝑅 2 . Therefore, in theory, the
model with the largest adjusted 𝑅 2 will have only correct variables and no
noise variables. Unlike the 𝑹𝟐 statistic, the adjusted 𝑹𝟐 statistic pays a
price for the inclusion of unnecessary variables in the model.
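A minimal Python sketch of these two measures, computed from the observed and fitted values, is:

    import numpy as np

    def r_squared(y, y_hat, p):
        """R-squared and adjusted R-squared for a model with p predictors."""
        y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - rss / tss
        adj_r2 = 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))
        return r2, adj_r2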

iii. AIC (Akaike Information Criterion): Assuming normal errors, we can estimate the AIC, omitting an irrelevant constant, as

AIC = (1/n)(RSS + 2pσ̂²).

The smaller the AIC, the better the model.

iv. BIC (Bayesian Information Criterion): Assuming normal errors, we can estimate the BIC, omitting an irrelevant constant, as

BIC = (1/n)(RSS + log(n) p σ̂²).

The smaller the BIC, the better the model.

AIC and BIC are more theoretically justified criteria. However, we shall not
discuss them here further.
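A minimal sketch of the two criteria in Python, exactly as written above (constants omitted), is:

    import numpy as np

    def aic_bic(rss, n, p, sigma2_hat):
        """AIC and BIC estimates as defined above; smaller values indicate better models."""
        aic = (rss + 2 * p * sigma2_hat) / n
        bic = (rss + np.log(n) * p * sigma2_hat) / n
        return aic, bic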

The following is the output from JMP.

11. Properties of the Estimated Parameters

We state the standard errors and the distributions of the β estimates below. We shall use them for further analysis.

1. β̂₀ ~ N(β₀, Var(β̂₀)),   E(β̂₀) = β₀,   Var(β̂₀) = σ² [1/n + X̄₁² / ∑ᵢ₌₁ⁿ (X₁ᵢ − X̄₁)²],   SE(β̂₀) = √Var(β̂₀).

2. β̂₁ ~ N(β₁, Var(β̂₁)),   E(β̂₁) = β₁,   Var(β̂₁) = σ² / ∑ᵢ₌₁ⁿ (X₁ᵢ − X̄₁)²,   SE(β̂₁) = √Var(β̂₁).

Note that the formulas for Var(β̂₀) and Var(β̂₁) contain the term σ². But σ² is UNKNOWN; all other terms are known. Therefore, Var(β̂₀) and Var(β̂₁) are also UNKNOWN. So, we estimate them by substituting σ² by its estimate, σ̂². We use these estimates for testing hypotheses regarding β₀ and β₁. This is a crucial point. This fact ensures that each standardised β̂ has a t distribution.

For 𝑝 ≥ 1, 𝐸(𝛽̂ ) = 𝛽; 𝑉𝑎𝑟(𝛽̂ ) = 𝜎 2 (𝑋 ′ 𝑋)−1 ; 𝛽̂ ~𝑀𝑁(𝛽, 𝜎 2 (𝑋 ′ 𝑋)−1 ).

E(β̂) = (E(β̂₀), E(β̂₁), …, E(β̂ₚ))′, and Var(β̂) represents the covariance matrix of β̂.

3. E(σ̂²) = σ².
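A minimal Python sketch of how these standard errors can be obtained, substituting σ̂² for the unknown σ² (X_raw is the n × p matrix of predictors), is:

    import numpy as np

    def coef_standard_errors(X_raw, sigma2_hat):
        """Estimated covariance matrix sigma2_hat*(X'X)^(-1) and coefficient standard errors."""
        X_raw = np.asarray(X_raw, float)
        n = X_raw.shape[0]
        X = np.column_stack([np.ones(n), X_raw])            # design matrix with intercept
        cov_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)  # estimate of Var(beta_hat)
        se = np.sqrt(np.diag(cov_beta_hat))                 # SE(beta_0_hat), ..., SE(beta_p_hat)
        return cov_beta_hat, se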

a. ANOVA Table

Is the model as a whole good enough?

Using ANOVA, we can test whether the model, as a whole, is statistically significant or not.

H₀: β₁ = β₂ = ⋯ = βₚ = 0 vs H₁: at least one βᵢ ≠ 0. Equivalently,

H₀: R²_TRUE = 0 vs H₁: R²_TRUE ≠ 0, where R²_TRUE represents the TRUE R² of the model.

Test Statistic

F = [RegSS/p] / [RSS/(n − p − 1)]

Distribution of the Test Statistic

When H₀ is true, F has an F distribution with degrees of freedom p and n − p − 1.

In the example (see JMP output below), F_obs = F Ratio = 856.1767, and p-value = Prob > F < 0.001.

Therefore, we reject the null hypothesis. So, we conclude that the model is
significant.
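A minimal Python sketch of this overall F test (the p-value comes from scipy's F distribution) is:

    import numpy as np
    from scipy import stats

    def overall_f_test(y, y_hat, p):
        """F statistic RegSS/p over RSS/(n-p-1) and its p-value under H0."""
        y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)
        reg_ss = np.sum((y_hat - y.mean()) ** 2)
        f_stat = (reg_ss / p) / (rss / (n - p - 1))
        p_value = stats.f.sf(f_stat, p, n - p - 1)   # P(F > f_stat) when H0 is true
        return f_stat, p_value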

b. Which independent variables are important for predicting/explaining the dependent variable?

i. Individual 𝒕-tests

Test for 𝜷𝟎

𝐻0 : 𝛽0 = 0 vs 𝐻1 : 𝛽0 ≠ 0.

You may say that, as β̂₀ = 6.9748, there is no need for this test. You are right! However, remember that you may not always get such a high value for β̂₀, in every situation and with every data set.

Test Statistic

t = (β̂₀ − 0) / ŜE(β̂₀)

ŜE(β̂₀) is the estimated SE(β̂₀), defined above. When H₀ is true, t has a t distribution with n − 2 degrees of freedom.

In the above table, ŜE(β̂₀) = 0.322553, and t = 21.62. Prob>|t| represents the p-value, which is < 0.0001.

Therefore, we reject H₀. That is, in the true model, as is evident, β₀ is NOT equal to zero.

Test for 𝜷𝟏

𝐻0 : 𝛽1 = 0 vs 𝐻1 : 𝛽1 ≠ 0.

Test Statistic

t = (β̂₁ − 0) / ŜE(β̂₁)

ŜE(β̂₁) is the estimated SE(β̂₁), defined above. When H₀ is true, t has a t distribution with n − 2 degrees of freedom.

In the above table, ŜE(β̂₁) = 0.001896, and t = 29.26. Prob>|t| represents the p-value, which is < 0.0001.

Therefore, we reject H₀. That is, in the true model, β₁ is NOT equal to zero.

c. Verify the Assumptions
It is extremely useful to examine the estimated errors or the residuals
graphically to test if the data conform to the statistical model.

i. Test of No Autocorrelation Among Errors

The plot shows that the errors are quite random. They do not show any
systematic patterns. Most of the points are within their 95% confidence
bounds.

Durbin-Watson Test for First Order Autocorrelation

𝐻0 : 𝜌1 = 0 vs 𝐻1 : 𝜌1 ≠ 0.

Test Statistic

DW ≈ 2(1 − ρ̂₁).

ρ̂₁ is the estimated autocorrelation of order 1. ρ₁ = Corr(εᵢ₋₁, εᵢ) is the TRUE autocorrelation of order 1. It is easy to see that

DW ≈ 0, if ρ̂₁ ≈ 1, i.e., almost perfect positive linear autocorrelation;
DW ≈ 2, if ρ̂₁ ≈ 0, i.e., no linear autocorrelation;
DW ≈ 4, if ρ̂₁ ≈ −1, i.e., almost perfect negative linear autocorrelation.

From the above table, DW = 2.0294364. This is quite close to 2. Also, the p-value = 0.5827. Therefore, we do not reject the null hypothesis, i.e., it can be assumed that there is no autocorrelation in the errors.
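A minimal Python sketch of the Durbin-Watson statistic computed directly from the residuals is:

    import numpy as np

    def durbin_watson(resid):
        """DW = sum of squared successive differences of the residuals over their
        sum of squares; approximately 2*(1 - rho_hat_1)."""
        resid = np.asarray(resid, float)
        return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)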

ii. Test of Constant Variance of Errors

Plot of the residuals versus the 𝑋 (𝑇𝑉) values reveals that the variation in the
residuals along the 𝑋 (𝑇𝑉) axis is more or less constant. This implies that the
errors can be assumed to be uncorrelated with 𝑋 (𝑇𝑉) and the errors have
constant variance.

iii. Test of Normality of Errors

There are many methods for testing the normality of errors. We mention three ‘rough’ methods here that are easy to use and most often used.

1. Histogram: Draw the histogram of the errors (standardised errors) and check whether it “looks” like a histogram of a normal distribution or not.

Figure 5 Histogram of Estimated Errors with a fitted normal curve

In the histogram above, we also fitted the normal curve. The fit is quite close to a normal distribution.

2. Q-Q Plot (Quantile-Quantile Plot): A quantile is, essentially, a percentile. In a theoretical Q-Q plot, the actual quantiles are plotted against themselves. For any distribution, this is a straight line with slope = 1.

For the normal distribution, it is a straight line passing through the origin that makes a 45 degree angle (slope = 1) with the horizontal (and therefore with the vertical) axis.

In practice, we plot the empirical quantiles obtained from the data and superimpose them on the theoretical line. If we find a good match, we conclude that the data have come from a normal distribution.

Figure 6 Q-Q Plot of Errors

In the figure above, the empirical quantiles, represented by the dots, match quite well with the straight line representing the theoretical quantiles of a normal distribution.

3. Skewness and Kurtosis: We know that any normal distribution has Skewness = 0 and Kurtosis = 3, when these are measured using the third and fourth moments of the normal distribution. If we find that the estimated skewness and kurtosis of the errors are ‘close’ to 0 and 3 respectively, we can very ‘loosely’ conclude that the error distribution is normal.
Kurtosis -0.03344
Skewness -0.01767

From the table above, we find that 𝐒𝐤𝐞𝐰𝐧𝐞𝐬𝐬 ≈ 𝟎. Note that Excel
computes “Kurtosis” as standard kurtosis minus 3. Therefore,
𝑲𝒖𝒓𝒕𝒐𝒔𝒊𝒔 ≈ 𝟑. We can, then, conclude that the error distribution is
approximately normal!
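A minimal Python sketch of these ‘rough’ checks (note that scipy, like Excel, reports excess kurtosis, i.e., kurtosis minus 3, by default) is:

    import numpy as np
    from scipy import stats

    def normality_checks(resid):
        """Skewness and excess kurtosis of the residuals; values near 0 suggest normality."""
        resid = np.asarray(resid, float)
        return stats.skew(resid), stats.kurtosis(resid)   # kurtosis(..., fisher=True) by default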

12. Use the Model

a. Prediction

Given a value of TV, say, TV = 100, we find

Ŝales = 6.9748 + 0.0555 ∗ 100 = 12.5248.

This means that if we spend $100,000 on TV advertisement, Sales will be 12524.8 units.

b. Interpretation of the Predicted Value

This Sales figure of 12524.8 units is the estimated average sales. In other words, if we spend $100,000 repeatedly in one location, or once in each of several separate locations, the average sales will be 12524.8 units. If we spend in one location repeatedly, it is possible that sometimes the sales will be more than or less than 12524.8 units. Similarly, if we spend in separate locations, it is possible that the sales could be more or less than 12524.8 units in some locations.

c. Prediction Error

The estimate of the SE of this estimated average sale is given by σ̂ · √( 1/n + (X₁₀ − X̄₁)² / ∑ᵢ₌₁ⁿ (X₁ᵢ − X̄₁)² ). The following diagram depicts the confidence intervals (formula given below) for the estimated average sales.

i. Confidence Interval for Prediction

Ŷ ∓ t_{n−2, α/2} · σ̂ · √( 1/n + (X₁₀ − X̄₁)² / ∑ᵢ₌₁ⁿ (X₁ᵢ − X̄₁)² )

Figure 7 Fitted Regression Line with Confidence Intervals

We observe that the confidence interval is narrowest when TV = Average TV. The interval widens as we move further away from the Average TV in either direction. This means that we can predict sales better for TV values closer to the average TV. Our prediction will be poorer for TV values that are further away from the average TV.

d. Prediction Interval for Predicting a Single Value of 𝒀

If, on the other hand, we wanted to predict a single value of Sales for a single
expenditure of 𝑇𝑉 advertisement of $100,000, we would have more error in
our estimate. Note that the variable, Sales (𝑌), has a larger variance than that
of the average Sales, 𝑌̅. Also, the TRUE mean of 𝑌 is constant (zero variance).
In this case, the formula for the estimate of the SE is given in the Prediction
Interval below. We call it a prediction interval to avoid confusion with
confidence interval.

Ŷ ∓ t_{n−2, α/2} · σ̂ · √( 1 + 1/n + (X₁₀ − X̄₁)² / ∑ᵢ₌₁ⁿ (X₁ᵢ − X̄₁)² )
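A minimal Python sketch of both intervals for a new predictor value x0, using the single-predictor formulas above, is:

    import numpy as np
    from scipy import stats

    def intervals_simple(x1, y, b0, b1, sigma_hat, x0, alpha=0.05):
        """Confidence interval for the mean response and prediction interval for a single Y at x0."""
        x1, y = np.asarray(x1, float), np.asarray(y, float)
        n = len(y)
        y0_hat = b0 + b1 * x0
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
        lev = 1.0 / n + (x0 - x1.mean()) ** 2 / np.sum((x1 - x1.mean()) ** 2)
        ci = t_crit * sigma_hat * np.sqrt(lev)         # half-width of the confidence interval
        pi = t_crit * sigma_hat * np.sqrt(1.0 + lev)   # half-width of the prediction interval
        return (y0_hat - ci, y0_hat + ci), (y0_hat - pi, y0_hat + pi)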

13. Regression with one Independent Categorical Variable (Salary Data.xlsx)

Categorical variables are modelled as a set of dummy variables, usually coded as 0 and 1.

Let us consider the following example.

𝑌 = annual income, 𝑋1 = experience, and 𝑋2 represents the gender.

Here Y and X₁ are quantitative in nature, while X₂ is categorical. We can code X₂ as follows:

X₂ = 1, if gender is male;   X₂ = 0, if gender is female.

First let us look at the scatter plot between 𝑌 = 𝑆𝑎𝑙𝑎𝑟𝑦 and 𝑋1 = 𝐸𝑥𝑝𝑒𝑟𝑖𝑒𝑛𝑐𝑒:

Figure 8: Scatter Plot - Actual Salary vs Experience.

Though the points seem to have a positive linear direction, they look divided into two groups – one above (males), and the other below (females).

Ignoring this group phenomenon, we run the following regression:

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝜀.
The results are briefly summarised in the following graph and the table thereafter:

Figure 9: Fitted Regression Line - Salary on Experience (Adjusted R-Squared = 0.3085).


Term Estimate Std Error t Ratio Prob>|t|
Intercept 4593.3296 160.7238 28.58 <.0001*
Experience (X1) 64.787174 6.837469 9.48 <.0001*

Note that the Adjusted R² is quite low (0.3085). Further, as the fitted regression line separates the two groups, the salary of the members of the group below, i.e., the females, will almost always be overestimated, while that of the members of the group above, i.e., the males, will almost always be underestimated.

Let us now include the variable 𝑋2 representing gender in the regression model.

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀.

In fact, by virtue of the way we defined 𝑋2, the equation is equivalent to

Y = β₀ + β₂ + β₁X₁ + ε,   if gender is male;
Y = β₀ + β₁X₁ + ε,        if gender is female.

We observe that the lines represented by the two equations are parallel with the same
slope 𝛽1, while the intercepts are different. 𝛽0 is the intercept for the females, and
𝛽0 + 𝛽2 is the intercept for the males. In case 𝛽2 > 0, the males will have a higher
average income than that of females with the same experience level! See the figure
below. More on this in class!
The results of the second regression are summarised in the following graph and the table thereafter. Note the big improvement in the Adjusted R². In addition, note that there are, in fact, two lines and not one. One line goes through the points representing females, and the other through the points representing males. Though the lines are parallel (this means that the change in income with a 1 year change in experience is the same for both females and males), the intercepts are quite different (the starting salaries are different)!

Figure 10: Fitted Regression Lines - Salary on Experience and Gender.


Term Estimate Std Error t Ratio Prob>|t|
Intercept 3913.1516 54.63814 71.62 <.0001*
Experience (X1) 67.841535 2.216469 30.61 <.0001*
Gender Code (X2) 1222.7387 29.74915 41.10 <.0001*

Figure 11 Regression with one Categorical variable, Gender.

Let us now include an interaction (between experience and gender) term in the
regression.

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋1 𝑋2 + 𝜀.

Again, by virtue of the way we defined 𝑋2, the equation is equivalent to

Y = β₀ + β₂ + (β₁ + β₃)X₁ + ε,   if gender is male;
Y = β₀ + β₁X₁ + ε,               if gender is female.

Therefore, the intercept (the starting salary) will be

Intercept = β₀ + β₂, if gender is male;   Intercept = β₀, if gender is female.

And the slope (the rate of increase/decrease of salary with experience) will be

Slope = β₁ + β₃, if gender is male;   Slope = β₁, if gender is female.
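A minimal Python sketch of fitting this third model (the variable names are assumptions based on the description of Salary Data.xlsx; gender is coded 1 for male, 0 for female) is:

    import numpy as np

    def fit_salary_interaction(experience, gender_male, salary):
        """OLS fit of salary on experience, a gender dummy, and their interaction."""
        x1 = np.asarray(experience, float)
        x2 = np.asarray(gender_male, float)   # 1 = male, 0 = female
        y = np.asarray(salary, float)
        X = np.column_stack([np.ones(len(y)), x1, x2, x1 * x2])   # intercept, X1, X2, X1*X2
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Female line: beta0 + beta1*X1; male line: (beta0 + beta2) + (beta1 + beta3)*X1.
        return beta_hat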

The results of the third regression are summarised in the graph below and the table
thereafter.

Figure 12: Fitted Regression Lines - Salary on Experience, Gender, and the interaction between Experience and Gender (Adjusted R-Squared = 0.9378).

Term Estimate Std Error t Ratio Prob>|t|
Intercept 4216.6947 72.63405 58.05 <.0001*
Experience (X1) 54.500739 3.075562 17.72 <.0001*
Gender Code (X2) 680.59371 97.09586 7.01 <.0001*
Experience*Gender Code (X1*X2) 24.038991 4.128495 5.82 <.0001*

There is a marginal improvement in the Adjusted 𝑅 2 . However, the two lines now
have different slopes.

Comparison between the second and third regressions

Though there is a small increase in Adjusted R² in the third regression, compared to the second, the estimates of the coefficients reveal important differences.

First, note that the intercept term in the second model (3913.15) is lower than that in
the third (4216.69). This implies that the starting salary of the females is lower
according to the second model than that given by the third. Further, males have a
much higher (by 1222.73) starting salary according to the second model than that
estimated by the third (by 680.59). In other words, if the third model is the TRUE
model, then the second model does very poorly in capturing the true relationship. The
second model, in fact, underestimates the starting salary of the females, while
overestimating that for males.

Let us now focus on the slope coefficients. By the very nature of the two models, the second regression has a common slope. However, if we compare the slope values of the third model with those of the second, we find that, in the second model, the slope of the line representing the females is overestimated (67.84 against 54.50), whereas that for the males is underestimated (67.84 against 78.54 = 54.50 + 24.04).

Therefore, in conclusion, we can claim that, notwithstanding only a marginal improvement in Adjusted R², including the interaction term allows the regression to capture the characteristics of the data much better than excluding it, whenever the TRUE model contains an interaction term.

14. Caution

a. Outliers
An observation that is substantially different from all other ones can make a
significant difference in the results of regression analysis.

Outliers play a key role in regression. More importantly, distant points can
have a strong influence on statistical models - deleting outliers from a
regression model can sometimes give completely different results.

Let us see the example below.

Figure 13 Outliers with influence

It is NOT always wise to drop an observation just because it is an outlier. Outliers can be legitimate observations and are sometimes the most interesting ones.

Here is an example of an outlier without influence.

Figure 14 Outliers without influence

Although one 𝑌 value is unusual given its 𝑋 value, it has little influence on the
regression line because it is in the middle of the 𝑋-range.

Outliers with respect to the independent variables are called leverage points.
An observation with an unusual 𝑋 value, i.e., it is far from the mean, 𝑋̅, has
leverage on (i.e., the potential to influence) the regression line. The further
away from the mean 𝑋̅, in either direction, the more leverage the observation
has on the fitted regression. High leverage does not necessarily mean that it
influences the regression coefficients.

It is possible to have a high leverage and yet follow the general pattern of the
rest of the data. High leverage observations can affect the regression model,
too. Their response variables need not be outliers.

Figure (b) is an example of a high-leverage observation that does not influence the regression line, because its value of Y puts it in line with the general pattern of the data.

Figure 15 High Leverage observation without influence

Figure (c) shows how a high-leverage observation can influence the regression line.

Figure 16 High Leverage observation with influence

The dashed line represents the general pattern of the data. The solid line
represents the fitted regression, having a large influence of the leverage point
and thereby changing the slope drastically.

In summary:

1. Outliers are points that fall away from the cloud of points.

2. Outliers that fall horizontally away from the centre of the cloud are called leverage points.

3. High leverage points that actually influence the slope of the regression line are called influential points.

4. To determine whether a point is influential, visualize the regression line with and without the point. Does the slope of the line change considerably? If so, the point is influential. If not, it is not.

b. Bootstrapping
When model assumptions are unreliable, one alternative approach to
statistical inference is bootstrapping. Bootstrapping is a robust approach to
statistical inference.
In bootstrapping, we use only the data that we have collected and computing
power to estimate the uncertainty surrounding our parameter estimates.

Our primary assumption is that our original sample represents the population. We can learn about uncertainty in our parameter estimates through repeated samples (with replacement) from our original sample. There are many methods of bootstrapping. We describe one below.

If we wish to use bootstrapping to obtain confidence intervals for our coefficients in our example, we could follow these steps (a code sketch follows the list):
1. Take a (bootstrap) sample of 200 with replacement, so that some observations will get sampled several times and others not at all. This is case resampling, so that all information from a given instance (Sales, TV, Newspaper, Radio) remains together.

2. Fit a regression model to the bootstrap sample, saving 𝛽̂0 , 𝛽̂1 .

3. Repeat the two steps above a large number of times (say 1000 times).

4. A histogram of the 1000 bootstrap estimates for each parameter can be plotted to show the bootstrap distribution.

5. A 95% confidence interval for each parameter can be found by taking the middle 95% of each bootstrap distribution, i.e., by picking off the 2.5 and 97.5 percentiles.
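A minimal Python sketch of this case-resampling bootstrap for the Sales-on-TV regression (assuming tv and sales are arrays of length 200) is:

    import numpy as np

    def bootstrap_ci(tv, sales, n_boot=1000, seed=None):
        """Percentile bootstrap confidence intervals for (beta0_hat, beta1_hat)."""
        rng = np.random.default_rng(seed)
        tv, sales = np.asarray(tv, float), np.asarray(sales, float)
        n = len(sales)
        boot = np.empty((n_boot, 2))
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)            # resample cases with replacement
            X = np.column_stack([np.ones(n), tv[idx]])
            boot[b], *_ = np.linalg.lstsq(X, sales[idx], rcond=None)
        # Middle 95% of each bootstrap distribution: the 2.5 and 97.5 percentiles.
        return np.percentile(boot, [2.5, 97.5], axis=0)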

Prepared by Professor Malay Bhattacharyya
