Regression Analysis
• Predictive purposes, for example setting normal quotas or baseline sales. We can also use the
estimated equation to determine “normal” and “abnormal” or outlier observations.
• Decision purposes.
Data Requirement
• If independent variables are nominally scaled (e.g., brand choice), then appropriate caution
must be exercised so that results from the analysis can be interpreted. For example, it may
be necessary to create dummy variables that take the values 0 and 1, as sketched below.
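The following is a minimal illustration (not part of the original notes) of creating 0/1 dummy
variables from a nominal variable with Python and pandas; the brand and sales values are
hypothetical.

import pandas as pd

# Hypothetical nominal brand-choice variable and a numeric outcome.
data = pd.DataFrame({"brand": ["A", "B", "C", "A", "B"],
                     "sales": [10, 12, 9, 11, 13]})

# drop_first=True keeps only k-1 dummies, so the dummies are not perfectly
# collinear with an intercept (one brand serves as the reference category).
dummies = pd.get_dummies(data["brand"], prefix="brand", drop_first=True)
data = pd.concat([data, dummies], axis=1)
print(data)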
A regression analysis typically proceeds through the following steps.
1. Decide on the purpose of the model and an appropriate dependent variable to meet that purpose.
2. Decide on the independent variables to be included in the analysis.
3. Estimate the parameters of the regression equation.
4. Interpret the estimated parameters, goodness of fit, and the qualitative and quantitative assessment of the parameters.
5. Evaluate the assumptions underlying the analysis.
6. If some assumptions are not satisfied, modify and revise the estimated equation.
We will examine these steps with the assumption that the purpose of the model has already been decided,
so that we need to perform only the remaining steps.
Decision about Independent Variables
Here are some suggestions for the variable(s) to be included in the regression analysis as independent
variables.
• Based on theory.
• Prior research.
• Statistical approaches.
Estimating Parameters
[Figure: two plots illustrating the slope and intercept of a straight line through two points.]
For a line through (x1, y1) = (5, 10) and (x2, y2) = (20, 20),
Slope = (y2 − y1)/(x2 − x1) = (20 − 10)/(20 − 5) = 0.66.
For a line through (x1, y1) = (5, 20) and (x2, y2) = (20, 10),
Slope = (y2 − y1)/(x2 − x1) = (10 − 20)/(20 − 5) = −0.66,
and the intercept is y1 − b × x1 = 20 − (−0.66 × 5) = 23.33.
[Figure: the fitted line y = 1 + 0.7 × x plotted with the mean ȳ, showing for one observation the
total variation (y − ȳ)² and the explained variation (ŷ − ȳ)², with
R² = Explained variation / Total variation.]
As you can see from the above examples, estimating parameters is nothing more than assigning
appropriate values to the parameters. Let us re-write our observations in a somewhat different
format and see an alternative approach to obtaining parameter estimates.
yi = (0, 0, 1, 1, 3)    xi = (−2, −1, 0, 1, 2)
Our regression equation can be written as
yi = a + b × xi + Ei i = 1, · · · , 5.
Suppose we summed both sides of the above equation over all observations; then we could write
Σ_{i=1}^{5} yi = Σ_{i=1}^{5} a + b Σ_{i=1}^{5} xi + Σ_{i=1}^{5} Ei.
Dividing both sides by 5 and invoking our first assumption, that the errors average to zero, gives ȳ = a + bx̄.
Subtracting ȳ = a + bx̄ from the equation for yi gives (yi − ȳ) = b(xi − x̄) + Ei. Suppose now we
multiply both sides by (xi − x̄); then we would get the somewhat more complicated expression
(xi − x̄)(yi − ȳ) = b(xi − x̄)(xi − x̄) + Ei (xi − x̄).
Let us now sum both sides over the observations and divide by (5 − 1), or (N − 1), where N is the number of
observations. This would lead to
Σ_{i=1}^{N} (xi − x̄)(yi − ȳ) / (N − 1) = b Σ_{i=1}^{N} (xi − x̄)(xi − x̄) / (N − 1) + Σ_{i=1}^{N} Ei (xi − x̄) / (N − 1).
We now have to make our second assumption, which states that the independent variable and the error
term are not correlated. That is, Σ_{i=1}^{N} Ei (xi − x̄) = 0. This is one of the more difficult assumptions
to test, but one that is required to derive the value of b. With this assumption, we are in a position
to write the estimate of b, or b̂. That is,
b̂ = Σ_{i=1}^{N} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{N} (xi − x̄)(xi − x̄).
We are also assuming that the xi − x̄ are not all equal to zero; that is, there is some variation in the
independent variable, variation that is useful for explaining variation in the dependent variable. Once we
know the estimate of b, we can go back to ȳ = a + bx̄ and solve for a. We will call this â, and it
can be obtained as â = ȳ − b̂x̄. Implicit in our effort to compute various averages is the assumption
that each observation is equally weighted. This assumption is satisfied if the error variability across
observations is about the same; that is, (yi − ŷi)² is similar over all the observations.
Let us see the applicability of the above work to our example. First note that ȳ = 1 and x̄ = 0.
Then yi − ȳ and xi − x̄ are
yi − ȳ = (0 − 1, 0 − 1, 1 − 1, 1 − 1, 3 − 1)    xi − x̄ = (−2 − 0, −1 − 0, 0 − 0, 1 − 0, 2 − 0).
This simplifies to
yi − ȳ = (−1, −1, 0, 0, 2)    xi − x̄ = (−2, −1, 0, 1, 2).
This would result in
(yi − ȳ)(xi − x̄) = (2, 1, 0, 0, 4)    and    (xi − x̄)² = (4, 1, 0, 1, 4),
so that Σ(yi − ȳ)(xi − x̄) = 7 and Σ(xi − x̄)² = 10. This would mean that b̂ = 7/10 = 0.7 and â = 1.
Note that our equation in this case would be ŷi = 1 + 0.7 × xi. This is exactly the same equation
written on our graph. Note that we could also estimate the proportion of variability explained by the
independent variable by computing R² and a set of other summary measures.
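A minimal sketch in Python (not part of the original SAS workflow) that verifies these hand
computations for the toy data, giving b̂ = 0.7, â = 1 and an R² of about 0.817:

import numpy as np

y = np.array([0.0, 0.0, 1.0, 1.0, 3.0])
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Slope and intercept from the formulas derived above.
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()

# R-squared as explained variation over total variation.
y_hat = a_hat + b_hat * x
r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

print(b_hat, a_hat, r_squared)   # 0.7, 1.0, about 0.817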
If there are two independent variables, x1 and x2, similar steps lead to the two equations
(yi − ȳ)(x1i − x̄1) = b1 (x1i − x̄1)(x1i − x̄1) + b2 (x2i − x̄2)(x1i − x̄1)
(yi − ȳ)(x2i − x̄2) = b1 (x1i − x̄1)(x2i − x̄2) + b2 (x2i − x̄2)(x2i − x̄2).
We would sum both sides of both equations and divide by N − 1. Moreover, for simplicity, we
could make the following substitutions.
Syx1 = Σ_{i=1}^{N} (yi − ȳ)(x1i − x̄1) / (N − 1)
Syx2 = Σ_{i=1}^{N} (yi − ȳ)(x2i − x̄2) / (N − 1)
Sx2x1 = Sx1x2 = Σ_{i=1}^{N} (x1i − x̄1)(x2i − x̄2) / (N − 1)
Sx1x1 = Σ_{i=1}^{N} (x1i − x̄1)(x1i − x̄1) / (N − 1)
Sx2x2 = Σ_{i=1}^{N} (x2i − x̄2)(x2i − x̄2) / (N − 1)
These terms are averages of sums of squares and cross products (SSCP). These are very useful
quantities in various multivariate analysis procedures. After substituting these terms, we may write
our earlier equations as
Syx1 = b1 Sx1x1 + b2 Sx2x1
Syx2 = b1 Sx1x2 + b2 Sx2x2.
Suppose we assumed that Sx1x2 = 0; then we could at once write the estimates for b1 and b2. That
is,
b̂1 = Syx1 / Sx1x1
b̂2 = Syx2 / Sx2x2.
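The following sketch (hypothetical numbers, not from the notes) computes the S-terms in Python and
solves the two equations simultaneously, which covers both the special case Sx1x2 = 0 and the
general case discussed next.

import numpy as np

y  = np.array([2.0, 4.0, 5.0, 4.0, 7.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
n = len(y)

def S(u, v):
    # Average cross product of deviations from the means, as defined above.
    return np.sum((u - u.mean()) * (v - v.mean())) / (n - 1)

# The two equations in matrix form: solve for (b1, b2).
A = np.array([[S(x1, x1), S(x1, x2)],
              [S(x1, x2), S(x2, x2)]])
rhs = np.array([S(y, x1), S(y, x2)])
b1_hat, b2_hat = np.linalg.solve(A, rhs)     # fails only under perfect multicollinearity
a_hat = y.mean() - b1_hat * x1.mean() - b2_hat * x2.mean()
print(b1_hat, b2_hat, a_hat)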
If Sx1x2 ≠ 0, then we need to solve these two equations simultaneously to obtain the estimates.
There is also a possibility that Sx1x2 = Sx1x1, which would also imply that Sx1x2 = Sx2x2. This
would result in the two unknowns collapsing into just one, that is, (b1 + b2). This condition is called
perfect multicollinearity. Note that
• On average, the difference between the observed value (yi) and the predicted value (ŷi) is
zero.
• On average, the estimated errors and the values of the independent variables are not
related to each other.
• The squared differences between the observed value and the predicted value are similar
across observations.
• There is some variation in the independent variable. If there is more than one variable in
the equation, then no two variables should be perfectly correlated.
Intercept
• The intercept provides a measure of the mean of the dependent variable when the slope(s) are
zero.
• If the slope(s) are not zero, then the intercept is equal to the mean of the dependent variable minus
the slope × the mean of the independent variable.
Slope
• The change in the dependent variable as we change the independent variable.
• A zero slope means that the independent variable does not have any influence on the dependent
variable.
• For a linear model, the slope is not equal to the elasticity. That is because the elasticity is the percent
change in the dependent variable resulting from a one percent change in the independent variable
(see the note below).
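To illustrate the distinction with the toy equation from above: for ŷ = 1 + 0.7 × x, the slope is 0.7
everywhere, but the elasticity at a point is b̂ × x/ŷ; at x = 2 (so ŷ = 2.4) it is 0.7 × 2/2.4 ≈ 0.58,
and it changes as we move along the line.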
4. Decide whether to reject or accept the null hypothesis. At a particular probability level, if the
tabled3 value is less than the computed statistic, then we should reject the null hypothesis,
and vice versa. There is an alternative for this step: most computer programs print the
statistic as well as the probability of the computed statistic. In such a situation, if the probability
is less than or equal to 0.05, then we reject the null hypothesis.
Let us apply all this to our small problem. First the SAS input.
options nocenter nodate ps = 70 ls =80 nonumber formchar=|----|+|-----|;
data toy;
input y x;
datalines;
0 -2
0 -1
1 0
1 1
3 2
;;;;
proc reg; model y = x; run;
3. I am here referring to a table of t- or F-statistics.
[SAS output: Analysis of Variance and Parameter Estimates tables (values not reproduced here).]
Our null hypothesis for this example would state that “variable x does not explain statistically
significant variation in y”. Our computed F-statistic is 13.4 with a probability of 0.035, which suggests
that we should reject the null hypothesis. Moreover, R² is 0.817, which indicates that a substantial
proportion of the variation in y is accounted for by variable x. Since there is only one variable in our
equation, the conclusion from the F-statistic will also be matched by the t-statistic; that is, reject the
null hypothesis that b = 0.
Evaluating Assumptions
Of the various assumptions in our analysis, the following assumptions lend themselves to some form of test
procedure.
1. The squared differences between the observed dependent variable value and the predicted
value are similar for all observations.
2. Each observation has equal influence on the estimated parameters.
3. The independent variables are not correlated, or the correlation among them is low.
4. If the dependent variable is sorted in ascending or descending order, then the estimated
residuals (yi − ŷi) are not related to each other.
Let us see how all these things apply to our simple example, along with some statistical
derivations. Suppose our regression equation can be written as
yi = a + b × xi + Ei,  i = 1, · · · , 5.
For the first observation, the predicted value is
ŷ1 = â + b̂x1
where â and b̂ are used to denote the estimated intercept and slope respectively. It follows that
the estimated residual for observation i is Êi = yi − (â + b̂xi ) and sum of squared residuals is
Σ_{i=1}^{n} Êi², and the standard deviation, often denoted by s, is
s = √( Σ_{i=1}^{n} Êi² / (n − 2) ).
Note that under the assumptions of linear regression, it can be shown that
E(â) = a
E(b̂) = b
var(â) = s² Σ_{i=1}^{n} xi² / ( n Σ_{i=1}^{n} (xi − x̄)² )
var(b̂) = s² / Σ_{i=1}^{n} (xi − x̄)²
cov(â, b̂) = −s² x̄ / Σ_{i=1}^{n} (xi − x̄)²
and the square root of var(Ê1) is usually reported as the standard error of the residual. The following
output indicates that SAS generates the numbers we would expect.
INTERCEP X
Obs Dfbetas Dfbetas
1 0.7559 -1.0690
2 -0.2750 0.1945
3 0.0000 0.0000
4 -1.0000 -0.7071
5 2.1213 3.0000
Sum of Residuals 0
Sum of Squared Residuals 1.1000
Predicted Resid SS (Press) 4.4337
The variance of the first predicted value is var(ŷ1) = s² [ 1/n + (x1 − x̄)²/Σ_{i=1}^{n}(xi − x̄)² ] =
(1.1/3) × (1/5 + 4/10) = 0.22, and the square root of 0.22 gives the standard error of prediction of
0.469 for this observation. Similarly,
" #
2 1 (x1 − x̄)2
var(Ê1 ) = s 1 − − Pn 2
,
n i=1 (xi − x̄)
1.1 1 4
= 1− −
3 5 10
= 0.14667,
and the square root of this is 0.383. Note that the column Student Residual is the ratio of the column
Residual to Std Err Residual. Note that all the other remaining measures reported above
(Cook's D, Rstudent, etc.) require estimates obtained with a particular observation deleted.
For example, consider estimating a and b when the first observation is deleted, denoted by â(1) and b̂(1). It
is possible to obtain these estimates without actually conducting separate regression analyses.
Thus,
â(1) = â − Ê1 / ( n(1 − h11) )
b̂(1) = b̂ − x1 Ê1 / ( (1 − h11) Σ_{i=1}^{n}(xi − x̄)² ),
where h11 is the first diagonal element of the hat matrix H (see the notes above). For the first observation,
â(1) and b̂(1) are equal to 0.8 and 0.9 respectively. Similarly, RSTUDENT is the normalized residual
when the ith observation is excluded from the analysis. For the first observation,
RSTUDENT(1) = Ê1 / ( s(1) √(1 − h11) ),
where s(1) is the estimated standard error when the first observation is excluded; it can be
estimated by
s²(1) = [ (n − p)s² − Ê1²/(1 − h11) ] / (n − p − 1)
      = [ 3 × (1.1/3) − (0.4 × 0.4)/(1 − 0.6) ] / (5 − 2 − 1)
      = 0.5 × (1.1 − 0.4) = 0.35,
where p = 2 is the number of estimated parameters.
Then substituting the square root of 0.35 into the expression for RSTUDENT, we obtain
RSTUDENT(1) = 0.4 / ( 0.5916 × √0.4 ) = 1.069,
which is reported for the first observation.
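A minimal sketch (plain Python, not the original SAS run) that reproduces these influence
calculations for the toy data: the leverage h11, the deleted-observation variance s²(1) and
RSTUDENT for the first observation.

import numpy as np

y = np.array([0.0, 0.0, 1.0, 1.0, 3.0])
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
n, p = X.shape                                  # n = 5 observations, p = 2 parameters

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # [1.0, 0.7]
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                    # 1.1 / 3

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h11 = np.diag(H)[0]                             # 0.6 for the first observation

s2_del = ((n - p) * s2 - resid[0] ** 2 / (1 - h11)) / (n - p - 1)   # 0.35
rstudent_1 = resid[0] / (np.sqrt(s2_del) * np.sqrt(1 - h11))        # about 1.069
print(beta_hat, h11, s2_del, rstudent_1)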
A Realistic Example
As you might be aware, computer systems vary dramatically in price. My interest in the
following example is to use regression analysis to predict the likely prices that may be charged
by retailers. Using a variety of sources, including retailer websites and the local Pennysaver, in
December 2001 I compiled information about 40 desktop systems. Although each computer
can be characterized by a number of features, I focused on four attributes: central processing unit
(CPU) speed in MHz, amount of random access memory in megabytes (RAM), size of hard
disk in gigabytes (HARDDISK) and size of monitor in inches (the smallest screen that one can buy
is 15 inches). My SAS input follows:
options nocenter nodate ps=80 ls=80;
data pc;
input price cpu ram harddisk monitor retail $ cpu_type $;
cards;
828.00 1000 128 20 17 Selltek EZ Celeron
949.00 1400 128 20 17 Pctek Pentium 4
969.98 1000 256 40 17 Datamatrix Celeron
978.00 800 256 20 17 Selltek Power 800Mhz Celeron
1009.99 900 128 60 17 FutureShop eMachines Celeron
1068.00 1000 256 20 17 Selltek Power 1000Mhz Celeron
1128.00 1300 256 20 17 Selltek Power 1300Mhz Pentium 4
1149.99 1400 256 20 17 TCC System #1 Pentium 4
1169.99 1200 256 40 17 TCC System #2 AMD K7
1176.53 1100 128 20 15 Gateway 300Cb Celeron
1199.00 1100 128 40 17 Business Depot HP 7917 / Pavilion Celeron
1229.99 1100 256 20 17 FutureShop Compaq 5310 Celeron
1238.53 1000 128 20 15 Gateway E1800 Celeron
1249.00 1100 256 40 17 RadioShack Compaq Presario 5310CA Celeron
1249.98 1500 256 40 17 Datamatrix Pentium 4
1249.99 1000 192 60 17 FutureShop HP XT858 Pentium 3
1249.99 1200 256 40 17 FutureShop Cicero SC2511 Celeron
1269.98 1600 256 40 17 Datamatrix AMD K7
1299.99 1300 128 40 17 FutureShop HP 7935 AMD Athlon
1329.99 1200 256 40 17 FutureShop Compaq 5320 Celeron
1349.00 1200 256 40 17 RadioShack Compaq Presario 5320CA Celeron
1378.00 1200 256 40 17 Selltek Ultimate 1200Mhz Pentium 3
1399.00 1100 128 20 17 Dell Dimension 2100 Celeron
1478.00 1600 256 40 17 Selltek Ultimate 1600Mhz Pentium 4
1549.00 1600 256 20 17 Dell Dimension 4300S Pentium 4
1549.99 1200 256 60 17 FutureShop Sony PC540 Celeron
1628.00 1800 256 40 17 Selltek Ultimate 1800Mhz Pentium 4
1649.99 1500 256 60 17 FutureShop eMachines Pentium 4
1749.00 1500 256 60 17 RadioShack Compaq Presario 5330CA Pentium 4
1749.00 1700 256 40 17 Pctek Pentium 4
1749.00 1000 256 40 17 Business Depot Compaq Presario 5330CA Celeron
1849.99 1700 256 60 19 TCC System #3 Pentium 4
1899.00 1500 256 40 17 RadioShack HP 7955/MX70 Pentium 4
[SAS output for MODEL1 (dependent variable PRICE): Analysis of Variance and Parameter
Estimates tables (values not reproduced here).]
• The null hypothesis states that variation in price cannot be explained by CPU speed,
amount of RAM, size of hard disk and size of monitor. We reject this hypothesis, because the
probability of the F-statistic is less than or equal to 0.05.
• Note that the parameters associated with the variables CPU and RAM have the correct signs4
and are statistically significant (the probability of the t-statistic is less than 0.05).
• The parameters associated with the variables HARDDISK and MONITOR have the correct signs
but are not statistically significant. That means these parameters could be equal to zero.
• Consider a desktop with a 1 GHz CPU, 256 megabytes of RAM, about a 40 gigabyte hard
drive and a 17 inch monitor. For such a machine, I should expect to pay about
$1,218. This is computed from the estimated equation as
−526.65 + 0.833 × 1000 + 1.525 × 256 + 2.781 × 40 + 24.098 × 17 ≈ 1,218.
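A small sketch (Python, not from the notes) that applies the coefficients reported in the SAS
output to this and similar configurations:

# Coefficients as reported in the SAS output for MODEL1.
coef = {"intercept": -526.65, "cpu": 0.833, "ram": 1.525,
        "harddisk": 2.781, "monitor": 24.098}

def predicted_price(cpu, ram, harddisk, monitor):
    return (coef["intercept"] + coef["cpu"] * cpu + coef["ram"] * ram
            + coef["harddisk"] * harddisk + coef["monitor"] * monitor)

print(predicted_price(1000, 256, 40, 17))                                        # about 1218
print(predicted_price(1500, 256, 40, 17) - predicted_price(1000, 256, 40, 17))   # 0.833 * 500 = 416.5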
Note that, holding everything else the same, if we decide to purchase a desktop computer with a 1.5
GHz CPU, the price of the computer would go up by about $416.50 (0.833 × 500). A constructed equation
like this would be a useful tool for understanding competitive market behaviour. Let us now turn our
attention to evaluating the assumptions. First the SAS input, followed by the relevant output.
[SAS output, columns (1)–(6): Obs, RETAIL, Dep Var PRICE, Predicted Value, Std Err Predict,
Residual, Std Err Residual, Student Residual (values not reproduced here).]
Column 1 contains the values of the dependent variable (yi). This variable is sorted in ascending order to help
us interpret the other statistical measures.
Column 2 contains the predicted values of the dependent variable (ŷi). For the first observation,
ŷ1 = −526.65 + 0.833 × 1000 + 1.525 × 128 + 2.781 × 20 + 24.098 × 17 = 966.9.
Column 3 is the standard error associated with the predicted value; a larger number indicates that the
values of the independent variables are farther away from the “average” observation. For the first
observation the independent variable vector x1 is [1 1000 128 20 17]. Then var(ŷ1) = s² x1′(X′X)⁻¹x1.
Column 4 contains the residual or error values, (yi − ŷi).
Column 5 is the standard error associated with the residual; here a smaller number indicates that the
values of the independent variables are farther away from the “average” observation.
Column 6 contains the Student residuals, also called normalized residuals (generally, normalized means divided by
the standard error). If the residuals are normally distributed, then normalized residuals
greater than 2 in absolute value should be considered extreme observations.
[SAS output, columns (7)–(11): Obs, RETAIL, a plot of the studentized residuals on a −2 to 2 scale,
Cook's D, Rstudent, Hat Diag H, Cov Ratio (values not reproduced here).]
Column 7 is a plot of the normalized residuals; these numbers generally vary between −2 and 2.
Column 8, Cook's D, is a summary measure of the influence of a single observation on the total
change in all the other residuals when that observation is excluded from the estimation. In our case,
Cook's D ≥ 8/(N − 2(k + 1)) = 8/(40 − 10), or 0.267, would be considered an influential observation (see
observation number 38).
Column 9, Rstudent, is similar to Cook's D with the exception that the error variances are estimated
without the ith observation.
Column 10, Hat Diag H (the diagonal of the hat matrix H, also sometimes denoted hii), is a ratio
of the variability for an observation to the sample variability in the independent variables. If each
observation had equal influence on the regression equation, then the average influence would be
k/N, and an observation with hii ≥ 2k/N (2 × 4/40, or 0.2, for our example) would be considered an
influential observation. There are a number of observations with this problem, especially towards
the end of the dataset, that is, the higher priced desktop systems.
Column 11, Cov Ratio (covariance ratio), is the ratio of the covariances when the ith observation is excluded to
the sample covariances. A value of COVRATIO close to 1 indicates “average” influence by
an observation, while an absolute value of (COVRATIO − 1) ≥ 3(k + 1)/(N − k − 1) indicates a significantly
influential observation. For our case, COVRATIO ≥ 1 + (3 × 5)/35, or 1.429, would identify observations
with higher than normal influence.
[SAS output, columns (12)–(14): Obs, RETAIL, Dffits, and Dfbetas for INTERCEP, CPU, RAM,
HARDDISK and MONITOR (values not reproduced here).]
Column 12, Dffits, indicates the influence of an observation on the overall fit of the model. A DFFITS value outside
the range ±2√((k − 1)/N) is considered an influential observation. In our case, ±2√(3/40), or ±0.548,
would mark an influential observation.
15
Variance
Variable DF Inflation
INTERCEP 1 0.00000000
CPU 1 1.67434954
RAM 1 1.87908442
HARDDISK 1 1.46192377
MONITOR 1 1.18063429
Durbin-Watson D 1.634 19
(For Number of Obs.) 40
1st Order Autocorrelation 0.177 20
Sum of Residuals 0
Sum of Squared Residuals 2292100.5421
Predicted Resid SS (Press) 3350277.8100
Column 15, variance inflation, is a measure of collinearity among the independent variables; a larger
number indicates that the variables are highly correlated. This does not appear to be a problem in our
illustration.
Column 16, eigenvalue, is another measure of the degree to which the independent variables are correlated
(see the next item for interpreting these).
Column 17, condition index, is the square root of the ratio of the largest eigenvalue to a particular eigenvalue.
Column 18, var prop (proportion of variance shared), is the degree to which two or more variables have
common variability.
There is a graphical alternative for visualizing the various diagnostics discussed above. Consider the measure
COVRATIO. If the observations are sorted in ascending or descending order, then a plot of COVRATIO against
observation number can be used to visually understand the nature of violations related to this measure.
Several such graphs are provided below for illustrative purposes.
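A sketch (Python with matplotlib, not the original SAS graphs) of how such a plot could be
produced; the diagnostic values here are random placeholders, since the actual values are not
reproduced, and the cutoff shown is the DFFITS limit used in these notes.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
dffits = rng.normal(0, 0.3, size=40)        # placeholder diagnostic values, illustration only
k, N = 4, 40
cutoff = 2 * np.sqrt((k - 1) / N)           # the +/- 0.548 limit used above

plt.scatter(np.arange(1, N + 1), dffits)
plt.axhline(cutoff, linestyle="--")
plt.axhline(-cutoff, linestyle="--")
plt.xlabel("Observation")
plt.ylabel("DFFITS")
plt.show()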
[Figure: COVRATIO − 1 plotted against observation number for the 40 observations, with reference
lines at ±3k/(N − k − 1) = ±15/35.]
Note that there are seven observations outside the limits.
[Figure: DFFITS plotted against observation number, with reference lines at ±2√((k − 1)/N) =
±2√(3/40).]
Note that there are four observations outside the limits and observation number 38 is particularly
noteworthy.
[Figure: DFBETAS for the CPU coefficient plotted against observation number, with reference lines
at ±2/√N = ±2/√40.]
Note that there are two observations outside the limits.
Testing Normality
The purpose of this material is to provide procedures that can be used to evaluate univariate
normality. If the tests reveal problems, then it is advisable to turn to alternative approaches to
analysis, including transformation or weighted least squares.
The moments around the mean of a distribution reveal departures from normality. Suppose we
have a random variable y with a population mean of µ1; then the rth moment about the mean is
defined as
µr = E(y − µ1)^r, for r > 1,
where E is used to denote the expected value or average. If we know the mean (µ1) and the variance
(µ2), then it is possible to describe the univariate normal distribution. This is because its higher-order
moments are either zero or can be written as functions of the mean or variance. Consequently, if we
examine and test higher-order moments, it should be possible to detect departures from normality.
We will look at the second, third and fourth moments for a sample and population below.
The population variance (µ2 ) is the expected value of the squared difference of the values from the
population mean:
µ2 = E(y − µ1 )2 .
The sample variance (s²) is usually computed as
s² = ( 1/(N − 1) ) Σ_{i=1}^{N} (yi − ȳ)².
Skewness is a measure of the tendency of the deviations to be larger in one direction than in the other.
The heaviness of the tails is measured by kurtosis or the coefficient of kurtosis (b2 ). The population
kurtosis is defined as
µ4/µ2² − 3, where µ4 = E(y − µ1)⁴.
The sample fourth moment is calculated as
g2 = [ N(N + 1) / ((N − 1)(N − 2)(N − 3)) ] Σ_{i=1}^{N} (yi − ȳ)⁴ / s⁴ − 3(N − 1)² / ((N − 2)(N − 3)).
To convert the fourth moment to kurtosis (b2) we need to compute
b2 = 3 (N − 1)/(N + 1) + g2 (N − 2)(N − 3) / ((N + 1)(N − 1)).
For a normally distributed variable, b2 is equal to 3. In large samples, a hypothesis test for b2 can be
performed by converting b2 to a unit normal deviate. That is,
zb2 = [ b2 − 3 + 6/(N + 1) ] √( (N + 1)²(N + 3)(N + 5) / (24N(N − 2)(N − 3)) ),
5. PROC UNIVARIATE in SAS reports the third and fourth moments but not the coefficients of skewness and
kurtosis as indicated below.
and this estimate is approximately normally distributed under the null hypothesis of population normality.
Note that values greater than zero indicate that the distribution is more peaked, with longer tails,
than the normal distribution; values less than zero indicate a distribution that is flatter in the centre
and has shorter tails than the normal distribution.
Omnibus Tests of Normality
It is possible to combine the tests of skewness and kurtosis into one test that detects departure from
normality due to either of these measures. Such tests are called omnibus tests. The test statistic is
K² = (z√b1)² + (zb2)²,
where the K 2 statistic has approximately a chi-square (χ2 ) distribution, with 2 degrees of freedom
when the population is normally distributed.
There are many other tests to determine the departure of a variable from normality. The program
NORMTEST also prints a statistic called the Shapiro-Wilk test6. It is based on the assumption that the ordered
observations of a normally distributed variable will have equal and similar weights. Thus, the weight
assigned to the first observation (the lowest value of yi, let us call it y(1)) is 1/N, the second
observation (one that is more than or equal to y(1), let us call it y(2)) has a weight of 2/N, and so
on7. The test statistic of Shapiro-Wilk (W) is
W = ( Σ_{i=1}^{N} ai y(i) )² / Σ_{i=1}^{N} (yi − ȳ)²,
where ai is the weight associated with the ith ordered observation and the variable y is ordered such that y(1) ≤ y(2) ≤ · · · ≤
y(N). Small values of W correspond to departures from normality.
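As a cross-check, equivalent tests are available in Python's scipy (a sketch, not the SAS macro used
in these notes): scipy.stats.normaltest implements the D'Agostino-Pearson omnibus K² test based on
skewness and kurtosis, and scipy.stats.shapiro gives the Shapiro-Wilk W statistic.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
resid = rng.normal(0, 1, size=40)            # placeholder residuals for illustration

k2, p_k2 = stats.normaltest(resid)           # omnibus K-squared statistic and p-value
w, p_w = stats.shapiro(resid)                # Shapiro-Wilk W and p-value
print(k2, p_k2, w, p_w)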
We will examine below the SAS input and output used to conduct these tests. As you have seen above,
the numerical calculations involved are extensive. To assist you with these calculations, I have a
SAS macro8. To access this macro, I would use the following SAS input.
%include "c:\sas6_12\normtest.sas";
%normtest(stprc,predpc);
In this instance predpc is the name of the SAS dataset and stprc is the variable whose normality is being
tested. SAS will produce two sorts of output, one graphical and another textual. These follow here,
first the SAS output and then the graphical output.
6. Shapiro, S. S. and Wilk, M. B. (1965) “An analysis of variance test for normality”, Biometrika, vol. 52,
591–611.
7. This is an intuitive description of the statistic and not the exact method.
8. This macro is a modified version of the one that appeared in The American Statistician and was originally written
by D'Agostino, Ralph B., Albert Belanger and Ralph B. D'Agostino Jr. (1990) “A Suggestion for Using Powerful
and Informative Tests of Normality”, Vol. 44, pp. 316–321. The macro for your usage is kept in the file
G:\courses\COST6060\NORMTEST.SAS.
[Figure: standardized residuals plotted against normalized rank (a normal probability plot), with
both axes running from about −3 to 3.]
• Presence of Collinearity
1. Create new index variables that may capture the correlations among independent variables
conceptually (for example, SES instead of income, occupation, education, etc.).
2. Determine the stability of the parameters by excluding one or more variables.
3. Use statistical procedures for dealing with this problem, for example, transformations or an
alternative criterion to minimize.
• Presence of Autocorrelated Errors
1. May be caused by missing variables, competitive variables or customer loyalty; if so, include the
missing variables.
2. Re-estimate the model with autocorrelated errors.
1. Use a limited number of explanatory variables. Avoid including every available variable in your
regression model. If there are a large number of variables, then create indices or groupings with a
conceptual idea in mind. Then use such selected variables to estimate models.
2. Use a large sample; 40 to 50 observations per variable included will give better stability to the
estimates than 5 to 10 observations.
1. By group differences,
2. Interaction effects,
3. Effects that occur only at certain levels.
• Mediating effects of variables. I will indicate first by a picture that variable x affects y and variable
w affects x (w → x → y). If you include, say, variable w in a regression on y, we may get unexpected results.
y = a + b × x + ey
x = c + d × w + ex
• Non-linear effects.
1. Measurement errors,
2. Response effects,
3. Truncation of variables.
y = Xβ + u (1)
In the least squares method, I want to find the estimate β̂ of the regression parameter β so as to minimize the sum
of squared residuals. Mathematically I may write
f(β) = (y − Xβ)′(y − Xβ). (2)
To minimize this function, I obtain the first derivative of f(β) with respect to β and set it equal to
zero. Thus, I may write
∂f/∂β = −2X′y + 2X′Xβ = 0, or
β̂ = (X′X)⁻¹X′y. (3)
It can be shown that E(β̂) = β and V(β̂) = σ²(X′X)⁻¹, where E and V denote statistical
expectation and variance respectively.
I made four important assumptions in deriving these estimates. First, it is assumed that E(u) = 0,
which implies that the mean of the random noise is zero. Second, it is also assumed that E(X′u) = 0,
which implies that the random noise values and the independent variable values are not correlated. The third
assumption requires that E(uu′) = σ²I_N, where I_N denotes an identity matrix of size N × N. In
words, this assumption requires that each element of the random noise vector u be independent and
identically distributed. This assumption is clearly violated if the observed dependent variable takes
either 0 or 1 values. (As an exercise you may show this.) Similarly, if successive values of the dependent
variable are related, as in the case of time series data, then this assumption is also violated. Finally, the
matrix (X′X) is nonsingular, which is equivalent to stating that the rank of the matrix X is k. Note that the mere
presence of high correlation among the set of independent variables does not violate this assumption.
It is also possible to show (with a lot of algebraic manipulation) that the estimated value of σ² is
(û′û)/(N − k). Note also that the second derivatives of f(β) with respect to β are positive. This assures
me that I have actually minimized the function.
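A minimal sketch of these matrix formulas in Python, applied to the toy data used earlier (not part
of the original notes):

import numpy as np

y = np.array([0.0, 0.0, 1.0, 1.0, 3.0])
X = np.column_stack([np.ones(5), np.array([-2.0, -1.0, 0.0, 1.0, 2.0])])
N, k = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y, solved without forming the inverse
u_hat = y - X @ beta_hat
sigma2_hat = (u_hat @ u_hat) / (N - k)          # u'u / (N - k), the unbiased estimate of sigma^2
print(beta_hat, sigma2_hat)                     # [1.0, 0.7] and about 0.3667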
Suppose I assume further that the u vector is normally distributed. This is an extension of the third
assumption that I have written above. Then the likelihood of observing u1 is given by
f(u1) = ( 1/√(2πσ²) ) exp( −u1²/(2σ²) ). (4)
If there are N independent observations, then the joint likelihood of observing f(u1), f(u2), · · · , f(uN)
will be denoted by L and may be written as
L = (2πσ²)^(−N/2) exp( −(y − Xβ)′(y − Xβ)/(2σ²) ), (5)
so that
log L = −(N/2) log(2πσ²) − (1/(2σ²)) (y − Xβ)′(y − Xβ). (6)
Taking the first derivatives of log L gives
∂ log L/∂β = −( 1/(2σ²) ) ( −2X′y + 2X′Xβ ) = 0
∂ log L/∂σ² = −N/(2σ²) + ( 1/(2σ⁴) ) (y − Xβ)′(y − Xβ) = 0
Solving for β̂ and σ̂², I may obtain
β̂ = (X′X)⁻¹X′y and
σ̂² = (y − Xβ̂)′(y − Xβ̂) / N.
Although the estimate of the vector β is the same under the least squares and maximum likelihood methods,
the estimates of σ² are not equal. In fact, the σ² estimate based on the maximum likelihood method is
biased, while the estimate based on the least squares method is unbiased. Finally, note also that the second
derivatives of log L with respect to β and σ² are negative. This assures me that I have actually
maximized the function.
Finally, it is possible to obtain the log L value if u′u is known from the least squares estimation
procedure. To obtain this, substitute the unbiased value of σ̂² into the expression for log L. Thus, we may
write
log L = −(N/2) log(2π) − (N/2) log( u′u/(N − k) ) − ( (N − k)/(2u′u) ) (u′u)
      = −(N/2) log(2π) − (N/2) log( u′u/(N − k) ) − (N − k)/2. (7)
In expression (7), u′u is the sum of squared residuals and the remaining terms contain known constants.
Thus, it is possible to obtain the logarithm of the likelihood if one knows the sum of squares, the criterion used in
the least squares method.
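Continuing the toy example, a short Python sketch of expression (7), recovering log L from the
least squares sum of squared residuals u'u (an added illustration, not in the original notes):

import numpy as np

y = np.array([0.0, 0.0, 1.0, 1.0, 3.0])
X = np.column_stack([np.ones(5), np.array([-2.0, -1.0, 0.0, 1.0, 2.0])])
N, k = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta_hat
ssr = u @ u                                   # u'u = 1.1 for the toy data

logL = -N / 2 * np.log(2 * np.pi) - N / 2 * np.log(ssr / (N - k)) - (N - k) / 2
print(logL)                                   # about -3.59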
Durbin-Watson Statistic is a commonly used statistic to test whether successive values of the random
noise are related to each other. It is estimated by
dw = Σ_{i=2}^{N} (ûi − ûi−1)² / Σ_{i=1}^{N} ûi²,
and the expected value of this statistic for normally distributed, uncorrelated random noise is 2. (The
corresponding expected value of the first-order autocorrelation of the residuals is 0.)
F-statistic is used to test whether the β vector is significantly different from zero; it is the ratio of the
mean sum of squares due to regression to the error mean sum of squares, i.e.
( β̂′X′y/k ) / ( û′û/(N − k) ).
t-statistic is (β̂i − βi)/SE(β̂i), and it is distributed according to a t-distribution with (N − k) degrees of
freedom. Note that the value of βi in the above expression is zero under the null hypothesis.
Cook's Distance (CDi) is a measure of the change in the regression coefficients that would occur
if the ith case were omitted. The measure reveals observations that are most influential in affecting the
estimated regression equation. It is affected both by the case being an outlier on the dependent
variable and by its values on the set of predictors. It is computed as
CDi = ( β̂(−i) − β̂ )′ (X′X) ( β̂(−i) − β̂ ) / ( (k + 1) MSres ),
where β̂(−i) is the vector of estimated regression coefficients with the ith observation deleted,
and MSres is the residual variance for all the observations. It is easier to compute Cook's D by
CDi = ( 1/(k + 1) ) ri² hii/(1 − hii),
where ri is the standardized residual when the ith observation is excluded and hii is the diagonal of
X_i (X′X)⁻¹ X_i′.
Standard Error of Prediction If x0 is the vector of independent variable values and y0
is the corresponding value of the dependent variable, then the standard error of prediction is given by
√var(ŷ0) = √( x0′ (X′X)⁻¹ x0 s² ).
Rstudent Residuals are normalized residuals computed with the ith observation excluded:
RSTUDENT = ri / ( si √(1 − hii) ),
where ri is the normalized residual, si is the standard error when the ith observation is excluded from
the analysis, and hii is the diagonal of X_i (X′X)⁻¹ X_i′. Observations with RSTUDENT larger than 2
in absolute value may be considered extreme observations.
COVRATIO is the ratio of the determinant of the covariance matrix when the ith observation is deleted, denoted by
s²(−i) (X(i)′X(i))⁻¹, to that of the covariance matrix using all the data, s²(X′X)⁻¹. That is,
COVRATIO = det[ s²(−i) (X(i)′X(i))⁻¹ ] / det[ s² (X′X)⁻¹ ].
HAT matrix H is
H = X(X′X)⁻¹X′,
the ratio of the covariation within an observation to the average covariation. The diagonal entries of this
matrix (hii) are often used for detecting influential observations.
DFFITS measures the change in fit when the ith observation is deleted, or DFFITS = xi [ β̂ − β̂(−i) ].
DFBETA is the change in the estimated coefficients when the ith observation is deleted: DFBETAi = β̂ − β̂(−i).
VIF If Ri² is the multiple correlation coefficient of X_i regressed on the remaining explanatory variables,
then VIFi = 1/(1 − Ri²).
Condition Index If λmax, λ2, · · · , λk denote the eigenvalues associated with the matrix (X′X), then
Condition Index = √( λmax/λi ).
Proportions of variance of the kth regression coefficient shared with the jth component. If the eigenvectors
are represented by vkj and the jth eigenvalue by λj, then the variance of the kth coefficient can be decomposed as
var(β̂k) = s² Σ_{j=1}^{k} v²kj / λj,
and the proportion shared with the jth component is the jth term of this sum divided by the total.
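An illustrative Python sketch (hypothetical X matrix, not from the notes) of the collinearity measures
defined above: the VIF for one variable and the condition indices from the eigenvalues of X′X.

import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=40)
x2 = 0.5 * x1 + rng.normal(size=40)          # deliberately correlated with x1
X = np.column_stack([np.ones(40), x1, x2])

# VIF for x1: regress x1 on the remaining explanatory variables (here just x2).
Z = np.column_stack([np.ones(40), x2])
gamma = np.linalg.solve(Z.T @ Z, Z.T @ x1)
r2 = 1 - np.sum((x1 - Z @ gamma) ** 2) / np.sum((x1 - x1.mean()) ** 2)
vif_x1 = 1 / (1 - r2)

# Condition indices: square roots of lambda_max / lambda_i for X'X.
eigvals = np.linalg.eigvalsh(X.T @ X)
condition_index = np.sqrt(eigvals.max() / eigvals)
print(vif_x1, condition_index)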