Normal Probability Plot
Shibdas Bandyopadhyay
[email protected]
Indian Statistical Institute
Abstract
Normal probability plots are used to verify graphically the normality assumption for data from a univariate population that are mutually independent and identically distributed. The normal probability plot is a standard option in most statistical packages. In the context of design of experiments or regression, though the observations are assumed to be mutually independent and homoscedastic, they have different unknown expectations, so the raw data are inappropriate for a normality check. To overcome the problem of unequal expectations, it is common to use the residuals of a fitted regression model. The residuals have zero expectation, but they are heteroscedastic and mutually dependent; it is therefore equally inappropriate to use the residuals for a normality check. In this study, mutually independent homoscedastic components with zero mean are extracted from the residuals through principal component analysis; these are then used for the normal probability plot. The technique is illustrated with data.
Key words and phrases: Normal probability plot, principal component analysis.
AMS (1991) subject classification: 62P.
1. Introduction
Let $Y_1, Y_2, \ldots, Y_n$ be mutually independent with common mean $\mu$ and standard deviation $\sigma$. To check graphically whether the data are from a common normal distribution, one plots $Y_{(i)}$, the $i$th order statistic of $Y_1, Y_2, \ldots, Y_n$, against $\Phi^{-1}(c_i)$, $i = 1, 2, \ldots, n$; if the line plot is nearly linear, one is satisfied with the normality assumption. In the plot, $\sigma$ happens to be the slope of the straight line of $Y_{(i)}$ on $\Phi^{-1}(c_i)$; the $c_i$'s are chosen to estimate $\sigma$ 'efficiently' (David and Nagaraja, 2003). The $c_i$'s currently used in statistical packages such as Minitab are (Blom, 1958):

$$c_i = (i - \tfrac{3}{8})/(n + \tfrac{1}{4}), \quad i = 1, 2, \ldots, n. \tag{1.1}$$
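For concreteness, this plot is easy to construct in code. The following is a minimal sketch in Python, assuming NumPy, SciPy and Matplotlib; the function names blom_positions and normal_probability_plot are ours for illustration, not taken from any package:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def blom_positions(n):
    # Blom's (1958) plotting positions c_i = (i - 3/8)/(n + 1/4)
    i = np.arange(1, n + 1)
    return (i - 3.0 / 8.0) / (n + 1.0 / 4.0)

def normal_probability_plot(y):
    # plot the order statistics of y against Phi^{-1}(c_i)
    y_ordered = np.sort(y)                 # Y_(1) <= ... <= Y_(n)
    q = norm.ppf(blom_positions(len(y)))   # Phi^{-1}(c_i)
    plt.plot(q, y_ordered, "o-")
    plt.xlabel("Phi-inverse(Ci)")
    plt.ylabel("ordered observations")
    plt.show()

# With draws from a common normal distribution the plot is nearly
# linear, with slope approximately sigma (here sigma = 2):
rng = np.random.default_rng(0)
normal_probability_plot(rng.normal(loc=10.0, scale=2.0, size=25))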
In the context of design of experiments or regression, the observations follow the linear model

$$Y = X\beta + \varepsilon \tag{1.2}$$

where $Y$ is an $n \times 1$ response vector, $X$ is an $n \times p$ design matrix of rank $r \le p$, $\beta$ is a $p \times 1$ vector of unknown parameters, and $\varepsilon$ is an $n \times 1$ unobservable vector of error components; the error components are assumed to be mutually independent and identically distributed with zero mean and standard deviation $\sigma$.
Though the $n$ components of $Y$ are independently distributed with common standard deviation $\sigma$, the components of $Y$ do not have a common mean. The $i$th component $Y_i$ of $Y$ has mean $\mu_i = X_i'\beta$, where $X_i'$ is the $i$th row of $X$, $i = 1, 2, \ldots, n$.
So a line plot of $Y_{(i)}$ on $\Phi^{-1}(c_i)$ is not meaningful for checking the normality of the $Y_i$'s.
It has become standard practice, as in Minitab, to work with $\hat\varepsilon$, the $n \times 1$ vector of residuals:

$$\hat\varepsilon = Y - X\hat\beta \tag{1.3}$$

and make a line plot of $\hat\varepsilon_{(i)}$, the $i$th ordered component of $\hat\varepsilon$, on $\Phi^{-1}(c_i)$, the $c_i$'s given by (1.1).
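A hypothetical sketch of this practice, under the same conventions as the snippet above (the helper name residuals is ours; np.linalg.lstsq returns a least-squares $\hat\beta$ even when $X$ is not of full column rank):

import numpy as np

def residuals(X, y):
    # epsilon_hat = y - X beta_hat, with beta_hat a least-squares solution
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta_hat

# The standard practice then amounts to
#     normal_probability_plot(residuals(X, y))
# with the plotting helper sketched earlier.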
We use match factory data (Roy et al., 1959) for illustration. The data are the scores of $n = 25$ workers on three psychological tests $U_1$, $U_2$, $U_3$, together with their efficiency index $Y$.
The components of $\hat\varepsilon'$, after fitting the regression

$$Y = \beta_1 + \beta_2 U_1 + \beta_3 U_2 + \beta_4 U_3 \tag{1.4}$$

(with $X_1 \equiv 1$, $X_2 = U_1$, $X_3 = U_2$ and $X_4 = U_3$), are the $1 \times 25$ vector

$$(3.33,\ -0.18,\ -0.88,\ -3.62,\ -5.16,\ -2.24,\ 0.92,\ 3.42,\ -0.22,\ -0.52,\ -1.61,\ -1.37,\ -1.27,\ 1.31,\ 0.12,\ 1.16,\ 2.17,\ 0.66,\ 0.88,\ -3.07,\ 0.055,\ -2.28,\ 0.69,\ 3.87,\ 3.84).$$
Fig. 1: Normal probability plot of the ordered residuals $\hat\varepsilon_{(i)}$ against $\Phi^{-1}(c_i)$ (horizontal axis: Phi-inverse(Ci)).
But this line plot of $\hat\varepsilon_{(i)}$ on $\Phi^{-1}(c_i)$, with $c_i = (i - \tfrac{3}{8})/(n + \tfrac{1}{4})$, $i = 1, 2, \ldots, n$, is not appropriate for checking the normality of the $Y_i$'s. It is true that, when the mutually independent $Y_i$'s are normally distributed with means $\mu_i = X_i'\beta$ and common standard deviation $\sigma$, the residuals $\hat\varepsilon_i$ have zero means; but, as shown in the next section, they are heteroscedastic and mutually dependent.
2. Principal components of the residuals

Consider the regression model $Y = X\beta + \varepsilon$ of (1.2), along with the assumptions stated below it. One may write $\hat\varepsilon$ as, for $\hat\beta = (X'X)^- X'Y$,

$$\hat\varepsilon = Y - X\hat\beta = (I_n - X(X'X)^- X')\,Y \equiv HY \tag{2.1}$$

where $(X'X)^-$ is a g-inverse of $X'X$ and $H = I_n - X(X'X)^- X'$. It follows that $\hat\varepsilon$ has zero mean and covariance matrix $\sigma^2 H$. Since $H$ is symmetric and idempotent of rank $n - r$, it admits the spectral decomposition

$$H = P \begin{pmatrix} I_{n-r} & 0 \\ 0 & 0 \end{pmatrix} P', \qquad PP' = P'P = I_n.$$
$P$ is a non-stochastic orthogonal matrix and depends only on the design matrix $X$. $P'\hat\varepsilon$ has a singular normal distribution: the mass of the joint density of the $n$ components of $P'\hat\varepsilon$ lies in $(n - r)$ dimensions, with zero mean and covariance matrix $\sigma^2 \begin{pmatrix} I_{n-r} & 0 \\ 0 & 0 \end{pmatrix}$. Thus, if we write $P = (P^{(1)} \; P^{(2)})$, where $P^{(1)}$ consists of the first $(n - r)$ columns of $P$ (the characteristic vectors corresponding to the $(n - r)$ non-zero characteristic roots of $H$) and $P^{(2)}$ consists of the remaining $r$ columns of $P$, then the $(n - r)$ components of $P^{(1)\prime}\hat\varepsilon$ are independent and identically distributed normal with zero mean and standard deviation $\sigma$, while the remaining $r$ components of $P^{(2)\prime}\hat\varepsilon$ are identically zero (zero mean and zero variance).
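The whole construction is simple to code. The following Python sketch (the function name is ours; np.linalg.pinv is used as one convenient choice of g-inverse of $X'X$, and np.linalg.eigh supplies the eigen analysis of $H$) extracts the $n - r$ components and plots them:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def pc_residual_plot(X, y):
    # normal probability plot of the n - r independent, homoscedastic
    # components P(1)' epsilon_hat extracted from the residuals
    n = X.shape[0]
    r = np.linalg.matrix_rank(X)
    # H = I_n - X (X'X)^- X', with the Moore-Penrose inverse as g-inverse
    H = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
    eps_hat = H @ y                        # residuals, as in (2.1)
    # H is symmetric idempotent: its eigenvalues are r zeros and
    # n - r ones, and eigh returns them in ascending order, so the
    # last n - r columns of P are the characteristic vectors for the
    # non-zero roots, i.e. P(1)
    _, P = np.linalg.eigh(H)
    z = np.sort(P[:, r:].T @ eps_hat)      # ordered P(1)' epsilon_hat
    m = n - r
    c = (np.arange(1, m + 1) - 3.0 / 8.0) / (m + 1.0 / 4.0)
    plt.plot(norm.ppf(c), z, "o-")
    plt.xlabel("Phi-inverse(Ci)")
    plt.ylabel("PC of residuals")
    plt.show()

For the match factory data, calling pc_residual_plot with the $25 \times 4$ design matrix of (1.4) and the efficiency indices would plot the 21 ordered components against $\Phi^{-1}(c_i)$, $c_i = (i - \tfrac{3}{8})/21.25$, as in Fig. 2 below.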
Fig. 2 is a line plot of the $i$th order statistic of the 21 components of $P^{(1)\prime}\hat\varepsilon$ on $\Phi^{-1}(c_i)$, with $c_i = (i - \tfrac{3}{8})/(21 + \tfrac{1}{4})$ (since $r = p = 4$, $n - r = 21$), $i = 1, 2, \ldots, 21$.
Fig. 2: Normal probability plot of the ordered components of $P^{(1)\prime}\hat\varepsilon$ (vertical axis: PC of residuals) against $\Phi^{-1}(c_i)$ (horizontal axis: Phi-inverse(Ci)).
We do not wish to compare the two figures. We only want to point out that the suggested analysis with principal components is an appropriate method and is not difficult to implement in any package that has an eigen-analysis module.
References