Elementary Regression Analysis
There are two main objectives for which we may wish to carry out a regression analysis.
Example
The data represent a simple random sample. 𝑛 = sample size. The variable 𝑌
represents a quantity that we wish to forecast. The variables 𝑋1 , 𝑋2 , … , 𝑋𝑝 represent
quantitative or qualitative variables that we think may be related to 𝑌. The data could
be collected over time or across locations.
Example
𝑝 = 3.
a. Mathematical Description
𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝 + 𝜀.
We shall call the systematic part of the model,
𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝,
the conditional mean of 𝑌, given 𝑋1, 𝑋2, …, 𝑋𝑝.
b. Graphical Representation
Model: 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝜀𝑖 , 𝑖 = 1, 2, … , 𝑛
Figure 1: Linear Regression with One Independent Variable - Data and Fitted
Line. Each black dot is a pair (𝑌𝑖 , 𝑋1𝑖 ).
Figure 3: The Assumptions in Linear Regression. The solid line represents the
conditional mean of Y, given 𝑋1 , 𝑋2 , … , 𝑋𝑝 . The purple curves represent the
conditional distributions of Y, given 𝑋1 , 𝑋2 , … , 𝑋𝑝 .
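To make these assumptions concrete, the following minimal Python sketch (not part of the original notes; all parameter values are illustrative) simulates data from the one-variable model 𝑌𝑖 = 𝛽0 + 𝛽1𝑋1𝑖 + 𝜀𝑖 with independent, constant-variance normal errors, so that the conditional mean of 𝑌 given 𝑋1 is the straight line 𝛽0 + 𝛽1𝑋1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" parameters (not taken from the notes).
beta0, beta1, sigma = 7.0, 0.05, 2.0
n = 200

X1 = rng.uniform(0, 300, size=n)        # one independent variable
eps = rng.normal(0, sigma, size=n)      # i.i.d. N(0, sigma^2) errors
Y = beta0 + beta1 * X1 + eps            # Y_i = beta0 + beta1*X1_i + eps_i

# The conditional mean of Y given X1 is the straight line beta0 + beta1*X1.
print(Y[:5])
```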
6. Type of Variables
3. If you draw an imaginary line through the points, the points lie quite close to
the line, indicating a strong linear relationship.
5. Sales and Radio seem to have a positive linear relationship that is not as
strong as that with TV, but stronger than that with Newspaper.
6. Specifically,
Correlation Coefficients

          TV          Radio       Newspaper
Sales     0.901208    0.349631    0.15796
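These correlations are straightforward to reproduce in software; a minimal Python sketch, assuming the example data are stored in a file advertising.csv with columns Sales, TV, Radio, and Newspaper (the file and column names are assumptions, not part of the notes):

```python
import pandas as pd

# Assumed file and column names for the advertising example data.
df = pd.read_csv("advertising.csv")

# Correlation of Sales with each advertising budget.
print(df[["TV", "Radio", "Newspaper"]].corrwith(df["Sales"]))
```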
Ordinary Least Squares Method (OLS) (You may skip this section if you wish.)
1. Note that we do not need any of the assumptions stated above for the OLS
method.
2. Let us first consider a model with only one independent variable. Here 𝒑 = 𝟏.
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝜀𝑖 , 𝑖 = 1, 2, … , 𝑛.
We find the estimates by minimising the residual sum of squares,
𝑅𝑆𝑆 = ∑ⁿᵢ₌₁(𝑌𝑖 − 𝛽0 − 𝛽1𝑋1𝑖)²,
with respect to 𝛽0 and 𝛽1. The estimates are:
𝛽̂1 = ∑ⁿᵢ₌₁(𝑋1𝑖 − 𝑋̅1)(𝑌𝑖 − 𝑌̅) / ∑ⁿᵢ₌₁(𝑋1𝑖 − 𝑋̅1)²,  𝛽̂0 = 𝑌̅ − 𝛽̂1𝑋̅1, and
¹ Technically, 𝛽̂ and other such quantities are called “Estimators”. An estimator has a probability distribution.
For a given sample, the value of the estimator is called an estimate.
𝜎̂₂² = (1/(𝑛 − 2)) ∑ⁿᵢ₌₁(𝑌𝑖 − 𝑌̂𝑖)², where 𝑌̂𝑖 = 𝛽̂0 + 𝛽̂1𝑋1𝑖 and 𝜀̂𝑖 = 𝑌𝑖 − 𝑌̂𝑖, 𝑖 = 1, 2, …, 𝑛.
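The closed-form expressions above translate directly into a few lines of numpy. The sketch below uses small made-up arrays rather than the course data, purely to illustrate the formulas:

```python
import numpy as np

# Toy data (illustrative only, not the advertising data).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([8.1, 8.9, 9.7, 11.2, 11.8])
n = len(y)

# beta1_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat = ybar - beta1_hat * xbar
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x            # fitted values
resid = y - y_hat                            # residuals eps_hat_i
sigma2_hat = np.sum(resid ** 2) / (n - 2)    # unbiased estimate of sigma^2

print(beta0_hat, beta1_hat, sigma2_hat)
```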
In matrix notation, with 𝑋 denoting the 𝑛 × (𝑝 + 1) design matrix whose first column is a column of 1s,
𝛽̂ = (𝛽̂0, 𝛽̂1, …, 𝛽̂𝑝)′ = (𝑋′𝑋)⁻¹𝑋′𝑌;  𝜎̂₂² = (1/(𝑛 − 𝑝 − 1)) ∑ⁿᵢ₌₁(𝑌𝑖 − 𝑌̂𝑖)².
[Figure: The geometry of least squares, illustrated for 𝑝 = 1 and 𝑛 = 3. The observation vector (𝑌1, 𝑌2, 𝑌3)′, the fitted vector (𝑌̂1, 𝑌̂2, 𝑌̂3)′, and the residual vector (𝜀̂1, 𝜀̂2, 𝜀̂3)′ are shown relative to the space generated by 𝛽0(1, 1, 1)′ + 𝛽1(𝑋1, 𝑋2, 𝑋3)′.]
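The matrix formula 𝛽̂ = (𝑋′𝑋)⁻¹𝑋′𝑌 can be verified numerically by building the design matrix with a leading column of 1s; a minimal sketch with the same toy arrays as above:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([8.1, 8.9, 9.7, 11.2, 11.8])

# Design matrix: first column of 1s (intercept), then X1.
X = np.column_stack([np.ones_like(x), x])

# beta_hat = (X'X)^{-1} X'y  (np.linalg.solve is preferred to an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # [beta0_hat, beta1_hat]
```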
We have used JMP to do the regression analysis for the example data. One can
use any other software to do the same analysis.
In the example, we take 𝑌 = 𝑆𝑎𝑙𝑒𝑠, 𝑋1 = 𝑇𝑉. 𝛽̂0 = 6.9748, 𝛽̂1 = 0.0555,
𝜎̂ = 2.2957. The estimated regression equation is:
𝑆𝑎𝑙𝑒𝑠̂ = 6.9748 + 0.0555 ∗ 𝑇𝑉
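The same estimates can be reproduced outside JMP; a minimal Python sketch, assuming the advertising data sit in advertising.csv with columns Sales and TV (file and column names are assumptions about how the data are stored):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("advertising.csv")          # assumed file name

# Ordinary least squares regression of Sales on TV.
model = smf.ols("Sales ~ TV", data=df).fit()
print(model.params)        # should be close to beta0 = 6.9748, beta1 = 0.0555
print(model.summary())     # full output: R-squared, t-tests, F-test, etc.
```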
10. How Good is the Estimated Model? How strong is the relationship between TV
advertising budget and Sales?
i. R-Squared (𝑹𝟐 )
∑ⁿᵢ₌₁(𝑌𝑖 − 𝑌̅)² = ∑ⁿᵢ₌₁(𝑌̂𝑖 − 𝑌̅)² + ∑ⁿᵢ₌₁(𝑌𝑖 − 𝑌̂𝑖)²;
𝑇𝑆𝑆 = ∑ⁿᵢ₌₁(𝑌𝑖 − 𝑌̅)² is the total sum of squares in 𝑌; it measures the total
variability in the response variable 𝑌. 𝑅𝑒𝑔𝑆𝑆 = ∑ⁿᵢ₌₁(𝑌̂𝑖 − 𝑌̅)² is the regression
(explained) sum of squares, and 𝑅𝑆𝑆 = ∑ⁿᵢ₌₁(𝑌𝑖 − 𝑌̂𝑖)² is the residual sum of squares.
Then 𝑅² = 𝑅𝑒𝑔𝑆𝑆/𝑇𝑆𝑆 = 1 − 𝑅𝑆𝑆/𝑇𝑆𝑆.
The higher the value of 𝑅 2 , the better the regression model, generally
speaking.
Note that when we have only one independent variable, say 𝑋1 , in the
regression model, then 𝑅 2 = 𝑟 2 , where 𝑟 = 𝐶𝑜𝑟𝑟(𝑌, 𝑋1 ).
For the one variable regression model between Sales and TV, 𝑅 2 = 0.812176.
This means 81.22% of variation in Sales is explained by the regression model
or the independent variable, TV.
Therefore, we can conclude that this is a good regression model. Hold on!
𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑 𝑅² = 1 − [𝑅𝑆𝑆/(𝑛 − 𝑝 − 1)] / [𝑇𝑆𝑆/(𝑛 − 1)].
The intuition behind the adjusted 𝑅 2 is that once all of the right variables have
been included in the model, adding additional noise variables will lead to only
an exceedingly small decrease in RSS. Since adding noise variables leads to
an increase in 𝑝, such variables will lead to an increase in 𝑅𝑆𝑆/(𝑛 − 𝑝 − 1),
and consequently a decrease in the adjusted 𝑅 2 . Therefore, in theory, the
model with the largest adjusted 𝑅 2 will have only correct variables and no
noise variables. Unlike the 𝑹𝟐 statistic, the adjusted 𝑹𝟐 statistic pays a
price for the inclusion of unnecessary variables in the model.
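Both statistics are simple functions of the sums of squares defined above; a minimal sketch, assuming y and y_hat are numpy arrays of observed and fitted values and p is the number of independent variables in the model:

```python
import numpy as np

def r_squared_stats(y, y_hat, p):
    """Return (R-squared, adjusted R-squared) for a fitted regression."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    rss = np.sum((y - y_hat) ** 2)         # residual sum of squares
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    return r2, adj_r2
```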
AIC and BIC are more theoretically justified criteria. However, we shall not
discuss them here further.
1. 𝛽̂0 ~ 𝑁(𝛽0, 𝑉𝑎𝑟(𝛽̂0)), 𝐸(𝛽̂0) = 𝛽0, 𝑉𝑎𝑟(𝛽̂0) = 𝜎²[1/𝑛 + 𝑋̅1²/∑ⁿᵢ₌₁(𝑋1𝑖 − 𝑋̅1)²],
𝑆𝐸(𝛽̂0) = √𝑉𝑎𝑟(𝛽̂0).
2. 𝛽̂1 ~ 𝑁(𝛽1, 𝑉𝑎𝑟(𝛽̂1)), 𝐸(𝛽̂1) = 𝛽1, 𝑉𝑎𝑟(𝛽̂1) = 𝜎²/∑ⁿᵢ₌₁(𝑋1𝑖 − 𝑋̅1)²,
𝑆𝐸(𝛽̂1) = √𝑉𝑎𝑟(𝛽̂1).
Note that the formulas for 𝑉𝑎𝑟(𝛽̂0) and 𝑉𝑎𝑟(𝛽̂1) contain the term 𝜎². But 𝜎² is
UNKNOWN. All other terms are known. Therefore, 𝑉𝑎𝑟(𝛽̂0) and 𝑉𝑎𝑟(𝛽̂1) are also
UNKNOWN. So, we estimate them by substituting 𝜎² with its estimate, 𝜎̂₂². We use
these estimates for testing hypotheses regarding 𝛽0 and 𝛽1. This is a crucial point:
it ensures that each standardised 𝛽̂ has a 𝑡 distribution rather than a normal distribution.
3. 𝐸(𝜎̂₂²) = 𝜎².
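For the one-variable model, the standard errors follow directly from the formulas above once 𝜎² is replaced by 𝜎̂₂²; a minimal numpy sketch, assuming x and y hold the observed data:

```python
import numpy as np

def simple_ols_se(x, y):
    """OLS estimates and standard errors for Y = b0 + b1*X + eps."""
    n = len(y)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    sigma2_hat = np.sum(resid ** 2) / (n - 2)          # estimate of sigma^2
    se_b0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / sxx))
    se_b1 = np.sqrt(sigma2_hat / sxx)
    return b0, b1, se_b0, se_b1
```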
a. ANOVA Table
Is the model as a whole good enough?
𝐻0: 𝑅𝑇𝑅𝑈𝐸² = 0 vs 𝐻1: 𝑅𝑇𝑅𝑈𝐸² ≠ 0, where 𝑅𝑇𝑅𝑈𝐸² represents the true 𝑅² of the model.
Test Statistic
𝐹 = [𝑅𝑒𝑔𝑆𝑆/𝑝] / [𝑅𝑆𝑆/(𝑛 − 𝑝 − 1)],
which, under 𝐻0, follows an 𝐹 distribution with 𝑝 and 𝑛 − 𝑝 − 1 degrees of freedom.
Therefore, we reject the null hypothesis. So, we conclude that the model is
significant.
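Given the sums of squares, the F statistic and its p-value can be computed with scipy; the numbers below are placeholders rather than the values from the JMP output:

```python
from scipy import stats

# Placeholder inputs: regression SS, residual SS, number of predictors, sample size.
reg_ss, rss, p, n = 4000.0, 900.0, 1, 200

F = (reg_ss / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)    # P(F_{p, n-p-1} > observed F)
print(F, p_value)
```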
Test for 𝜷𝟎
𝐻0 : 𝛽0 = 0 vs 𝐻1 : 𝛽0 ≠ 0.
You may say that as 𝛽̂0 = 6.9748, there is no need for this test. You are right!
However, remember that you may not always get such a high value for 𝛽̂0 in
every situation or with every data set.
Test Statistic
𝑡 = (𝛽̂0 − 0) / 𝑆𝐸̂(𝛽̂0),
which, under 𝐻0, follows a 𝑡 distribution with 𝑛 − 2 degrees of freedom (𝑛 − 𝑝 − 1 in general).
Test for 𝜷𝟏
𝐻0 : 𝛽1 = 0 vs 𝐻1 : 𝛽1 ≠ 0.
Test Statistic
𝑡 = (𝛽̂1 − 0) / 𝑆𝐸̂(𝛽̂1)
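Each t statistic is simply the estimate divided by its estimated standard error, and the p-value comes from a t distribution with 𝑛 − 2 degrees of freedom in the one-variable model; a minimal sketch with placeholder inputs:

```python
from scipy import stats

# Placeholder inputs: an estimate, its standard error, and the sample size.
beta_hat, se_hat, n = 0.0555, 0.004, 200

t = (beta_hat - 0) / se_hat
p_value = 2 * stats.t.sf(abs(t), n - 2)   # two-sided p-value
print(t, p_value)
```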
The plot shows that the errors are quite random. They do not show any
systematic patterns. Most of the points are within their 95% confidence
bounds.
𝐻0: 𝜌1 = 0 vs 𝐻1: 𝜌1 ≠ 0, where 𝜌1 denotes the lag-1 autocorrelation of the errors.
Test Statistic
𝐷𝑊 ≈ 2(1 − 𝜌̂1).
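The Durbin–Watson statistic is available directly in statsmodels; a minimal sketch, assuming resid holds the residuals in time order (placeholder values are used here):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

resid = np.array([0.3, -0.1, 0.4, -0.5, 0.2, 0.1])   # placeholder residuals

dw = durbin_watson(resid)        # sum of squared successive differences / RSS
rho1_hat = 1 - dw / 2            # implied lag-1 autocorrelation, since DW ≈ 2(1 - rho1)
print(dw, rho1_hat)
```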
A plot of the residuals versus the 𝑋 (𝑇𝑉) values reveals that the variation in the
residuals along the 𝑋 (𝑇𝑉) axis is more or less constant. This suggests that the
errors can be assumed to be uncorrelated with 𝑋 (𝑇𝑉) and to have constant
variance.
In practice, we plot the empirical quantiles obtained from the data and
superimpose them on the theoretical line. If we find a good match, we
conclude that the data have come from a normal distribution.
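Such a normal quantile–quantile plot can be produced with scipy; a minimal sketch, using simulated placeholder residuals in place of the actual regression residuals:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

resid = np.random.default_rng(1).normal(size=100)   # placeholder residuals

# Empirical quantiles of the residuals against theoretical normal quantiles.
stats.probplot(resid, dist="norm", plot=plt)
plt.show()
```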
From the table above, we find that 𝐒𝐤𝐞𝐰𝐧𝐞𝐬𝐬 ≈ 𝟎. Note that Excel
computes “Kurtosis” as the excess kurtosis, i.e., the standard kurtosis minus 3. Therefore,
𝑲𝒖𝒓𝒕𝒐𝒔𝒊𝒔 ≈ 𝟑. We can, then, conclude that the error distribution is
approximately normal!
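The same check can be done in Python; note that scipy's kurtosis function, like Excel, reports excess kurtosis (kurtosis minus 3) by default. A minimal sketch with placeholder residuals:

```python
import numpy as np
from scipy.stats import skew, kurtosis

resid = np.random.default_rng(2).normal(size=500)   # placeholder residuals

print(skew(resid))                     # approximately 0 for normal errors
print(kurtosis(resid, fisher=False))   # plain kurtosis; approximately 3 for normal errors
print(kurtosis(resid))                 # excess kurtosis (Excel-style); approximately 0
```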
a. Prediction
This Sales of 12524.8 units (6.9748 + 0.0555 × 100 = 12.5248 thousand units, for a TV
budget of 100 thousand dollars) is the estimated average sales. In other words, if
we spend $100,000 repeatedly in one location, or once in each of several separate
locations, the average sales will be 12524.8 units. If we spend in one location
repeatedly, it is possible that sometimes the sales will be more than or less than
12524.8 units. Similarly, if we spend in separate locations, it is possible that
the sales could be more or less than 12524.8 units in some locations.
c. Prediction Error
𝑌̂ ∓ 𝑡(𝑛−2, 𝛼/2) ∗ 𝜎̂ ∗ √[1/𝑛 + (𝑋10 − 𝑋̅1)² / ∑ⁿᵢ₌₁(𝑋1𝑖 − 𝑋̅1)²]
If, on the other hand, we wanted to predict a single value of Sales for a single
expenditure of 𝑇𝑉 advertisement of $100,000, we would have more error in
our estimate. Note that the variable, Sales (𝑌), has a larger variance than that
of the average Sales, 𝑌̅. Also, the TRUE mean of 𝑌 is constant (zero variance).
In this case, the formula for the estimate of the SE is given in the Prediction
Interval below. We call it a prediction interval to avoid confusion with
confidence interval.
𝑌̂ ∓ 𝑡(𝑛−2, 𝛼/2) ∗ 𝜎̂ ∗ √[1 + 1/𝑛 + (𝑋10 − 𝑋̅1)² / ∑ⁿᵢ₌₁(𝑋1𝑖 − 𝑋̅1)²]
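Both intervals can be computed directly from these formulas; a minimal sketch, assuming x and y are arrays of the observed TV and Sales values (in the same units as the fitted equation) and x0 is the new TV budget of interest:

```python
import numpy as np
from scipy import stats

def mean_and_prediction_intervals(x, y, x0, alpha=0.05):
    """Confidence interval for the mean response and prediction interval at x0."""
    n = len(y)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

    y0_hat = b0 + b1 * x0
    leverage = 1 / n + (x0 - x.mean()) ** 2 / sxx
    ci_half = t_crit * sigma_hat * np.sqrt(leverage)        # half-width, mean response
    pi_half = t_crit * sigma_hat * np.sqrt(1 + leverage)    # half-width, single response
    return (y0_hat - ci_half, y0_hat + ci_half), (y0_hat - pi_half, y0_hat + pi_half)
```

For x0 = 100, the first interval is for the average Sales and the second, wider one is for a single Sales value.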
First let us look at the scatter plot between 𝑌 = 𝑆𝑎𝑙𝑎𝑟𝑦 and 𝑋1 = 𝐸𝑥𝑝𝑒𝑟𝑖𝑒𝑛𝑐𝑒:
[Figure: Scatter plot of Salary against Experience (0–50 years); Salary ranges from 0 to about 5000.]
Though the points seem to have a positive linear direction, they look divided into two
groups – one above (males), and the other below (females).
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝜀.
The results are briefly summarised in the following graph and the table thereafter:
[Figure: Fitted regression line of Salary on Experience, superimposed on the scatter plot. Adjusted R-Squared = 0.3085.]
Note that the Adjusted 𝑅² is quite low (0.3085). Further, as the fitted regression line
passes between the two groups, the salary of the members belonging to the group below,
i.e., the females, will almost always be overestimated, while that of the members
of the group above, i.e., the males, will almost always be underestimated.
Let us now include the variable 𝑋2 representing gender in the regression model.
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀.
𝑌 = 𝛽0 + 𝛽2 + 𝛽1𝑋1 + 𝜀, if gender is male,
𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝜀,      if gender is female.
We observe that the lines represented by the two equations are parallel with the same
slope 𝛽1, while the intercepts are different. 𝛽0 is the intercept for the females, and
𝛽0 + 𝛽2 is the intercept for the males. In case 𝛽2 > 0, the males will have a higher
average income than that of females with the same experience level! See the figure
below. More on this in class!
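In Python this dummy-variable regression can be fitted with the formula interface; a minimal sketch, assuming the data are stored in salary.csv with columns Salary, Experience, and Gender (all of these names are assumptions about how the example data are stored):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("salary.csv")                      # assumed file name

# C(Gender) creates a 0/1 dummy; one level becomes the baseline intercept beta0.
model2 = smf.ols("Salary ~ Experience + C(Gender)", data=df).fit()
print(model2.params)         # intercept, gender shift (beta2), common slope (beta1)
print(model2.rsquared_adj)
```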
The results of the second regression are summarised in the following graph and the
table thereafter. Note the big improvement in the Adjusted 𝑅 2 . In addition, note that
there are, in fact, two lines and not one. One line goes through the points representing
females, and the other males. Though the lines are parallel (this means that the change
in income with 1 year change in experience is same for both females and males), the
intercepts are quite different (starting salaries are different)!
[Figure: Actual and Predicted Salary versus Experience for the second regression, showing two parallel fitted lines (one for males, one for females).]
Let us now include an interaction (between experience and gender) term in the
regression.
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋1 𝑋2 + 𝜀.
𝐼𝑛𝑡𝑒𝑟𝑐𝑒𝑝𝑡 = 𝛽0 + 𝛽2, if gender is male,
𝐼𝑛𝑡𝑒𝑟𝑐𝑒𝑝𝑡 = 𝛽0,      if gender is female,
And the slope (the rate of increase/decrease of salary with experience) will be
𝑆𝑙𝑜𝑝𝑒 = 𝛽1 + 𝛽3, if gender is male,
𝑆𝑙𝑜𝑝𝑒 = 𝛽1,      if gender is female.
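Fitting the interaction model only requires changing the formula; a minimal sketch continuing the assumed salary.csv example, where Experience * C(Gender) expands to both main effects plus their interaction:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("salary.csv")                      # assumed file name

# Main effects plus the Experience x Gender interaction term (beta3).
model3 = smf.ols("Salary ~ Experience * C(Gender)", data=df).fit()
print(model3.params)         # separate intercepts and separate slopes for the two groups
print(model3.rsquared_adj)
```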
The results of the third regression are summarised in the graph below and the table
thereafter.
Figure 12: Fitted Regression Line - Salary on Experience, Gender, and the interaction between Experience and Gender.
There is a marginal improvement in the Adjusted 𝑅 2 . However, the two lines now
have different slopes.
First, note that the intercept term in the second model (3913.15) is lower than that in
the third (4216.69). This implies that the starting salary of the females is lower
according to the second model than that given by the third. Further, the male premium
in starting salary is much higher according to the second model (1222.73) than that
estimated by the third (680.59). In other words, if the third model is the TRUE
model, then the second model does very poorly in capturing the true relationship. The
second model, in fact, underestimates the starting salary of the females, while
overestimating that of the males.
Let us now focus on the slope coefficients. By the very nature of the two models, the
second regression has a common slope. However, if we compare the slope values of
the third model with that of the second, we find that, in the second model, the slope of
the line representing the females is overestimated (67.84 against 54.50), whereas that
for the males is underestimated (67.84 against 78.53 = 54.50 + 24.03).
14. Caution
a. Outliers
An observation that is substantially different from all other ones can make a
significant difference in the results of regression analysis.
Outliers play a key role in regression. More importantly, distant points can
have a strong influence on statistical models - deleting outliers from a
regression model can sometimes give completely different results.
Although one 𝑌 value is unusual given its 𝑋 value, it has little influence on the
regression line because it is in the middle of the 𝑋-range.
Outliers with respect to the independent variables are called leverage points.
An observation with an unusual 𝑋 value, i.e., it is far from the mean, 𝑋̅, has
leverage on (i.e., the potential to influence) the regression line. The further
away from the mean 𝑋̅, in either direction, the more leverage the observation
has on the fitted regression. High leverage does not necessarily mean that it
influences the regression coefficients.
It is possible to have a high leverage and yet follow the general pattern of the
rest of the data. High leverage observations can affect the regression model,
too. Their response variables need not be outliers.
Figure (c) shows how a high leverage observation can influence the regression
line.
The dashed line represents the general pattern of the data. The solid line
represents the fitted regression, having a large influence of the leverage point
and thereby changing the slope drastically.
1. Outliers are points that fall away from the cloud of points.
2. Outliers that fall horizontally away from the centre of the cloud are
called leverage points.
3. High leverage points that actually influence the slope of the regression
line are called influential points.
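Leverage is usually quantified through the hat values; for a one-variable regression with an intercept, the leverage of observation 𝑖 is ℎ𝑖 = 1/𝑛 + (𝑋𝑖 − 𝑋̅)² / ∑ⱼ(𝑋𝑗 − 𝑋̅)², which grows as 𝑋𝑖 moves away from 𝑋̅. A minimal numpy sketch (not from the notes) illustrating this:

```python
import numpy as np

def leverage(x):
    """Hat values for a one-variable regression with intercept."""
    x = np.asarray(x, dtype=float)
    return 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])   # last point is far from the mean
print(leverage(x))                          # the unusual X value gets the largest leverage
```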
b. Bootstrapping
When model assumptions are unreliable, one alternative, robust approach to
statistical inference is bootstrapping.
In bootstrapping, we use only the data that we have collected and computing
power to estimate the uncertainty surrounding our parameter estimates.
3. Repeat the two steps above a large number of times (say 1000 times).
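The two steps referred to above are typically (i) draw a sample of 𝑛 observations from the data with replacement and (ii) refit the regression on the resampled data and record the coefficient estimates; repeating them many times gives an empirical distribution for each coefficient. A minimal sketch of this case-resampling bootstrap for the slope, assuming x and y are numpy arrays:

```python
import numpy as np

def bootstrap_slope(x, y, n_boot=1000, seed=0):
    """Bootstrap distribution of the OLS slope by resampling (x, y) pairs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)           # step 1: resample rows with replacement
        xb, yb = x[idx], y[idx]
        slopes[b] = (np.sum((xb - xb.mean()) * (yb - yb.mean()))
                     / np.sum((xb - xb.mean()) ** 2))   # step 2: refit, record the slope
    return slopes

# The standard deviation of the returned slopes estimates SE(beta1_hat)
# without relying on the normality assumption.
```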