
Normal Probability Plot

Shibdas Bandyopadhyay
shibdas@isical.ac.in
Indian Statistical Institute

Abstract

Normal probability plots are used to verify graphically the normality assumption
for mutually independent and identically distributed data from a univariate
population. The normal probability plot is a standard option in most statistical
packages. In the context of design of experiments or regression, the observations
are assumed to be mutually independent and homoscedastic, but they have different
unknown expectations, so the raw data are inappropriate for a normality check. To
overcome the problem of unequal expectations, it is common to use the residuals of
a fitted regression model. The residuals have zero expectation, but they are
heteroscedastic and mutually dependent, so it is also inappropriate to use the
residuals for a normality check. In this study, mutually independent homoscedastic
components with zero mean are extracted from the residuals through principal
component analysis; these are then used for the normal probability plot. The
technique is illustrated with data.

Key words and phrases: Normal probability plot, principal component analysis.
AMS (1991) subject classification: 62P.

1. Introduction
Let $Y_1, Y_2, \ldots, Y_n$ be mutually independent with common mean $\mu$ and
standard deviation $\sigma$. To check graphically whether the data are from a
common normal distribution, one plots $Y_{(i)}$, the $i$th order statistic of
$Y_1, Y_2, \ldots, Y_n$, against $\Phi^{-1}(c_i)$, $i = 1, 2, \ldots, n$; if the
line plot is nearly linear, one is satisfied with the normality assumption. In the
plot, $\sigma$ happens to be the slope of the straight line of $Y_{(i)}$ on
$\Phi^{-1}(c_i)$; the $c_i$'s are chosen to estimate $\sigma$ 'efficiently' (David
and Nagaraja, 2003). The $c_i$'s currently used in statistical packages such as
Minitab (Blom, 1958) are

$$c_i = (i - \tfrac{3}{8})/(n + \tfrac{1}{4}), \quad i = 1, 2, \ldots, n. \qquad (1.1)$$

The line plot of $Y_{(i)}$ on $\Phi^{-1}(c_i)$ is called the Normal Probability Plot.
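
For concreteness, the plot is easy to produce outside a package such as Minitab.
The following is a minimal sketch in Python (the language and all names here are
our illustrative choices, not part of any cited package) that computes Blom's
plotting positions (1.1) and plots the order statistics against $\Phi^{-1}(c_i)$:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def normal_probability_plot(y):
    # Order statistics Y_(1) <= ... <= Y_(n)
    y_ordered = np.sort(np.asarray(y, dtype=float))
    n = len(y_ordered)
    i = np.arange(1, n + 1)
    c = (i - 3.0 / 8.0) / (n + 1.0 / 4.0)   # Blom's c_i, eq. (1.1)
    q = norm.ppf(c)                         # Phi^{-1}(c_i)
    plt.plot(q, y_ordered, "o-")
    plt.xlabel("Phi-inverse(c_i)")
    plt.ylabel("Ordered observations")
    plt.show()

If the points fall close to a straight line, its slope estimates $\sigma$ and its
intercept estimates $\mu$.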

While testing hypotheses about $\mu$, it is natural to check the normality
assumption using a normal probability plot. The use of the normal probability plot
to check the normality assumption has become common in other situations as well.
In this study, we consider the use of the normal probability plot to check the
normality assumption for the response in the context of regression and design of
experiments.

Consider the standard linear regression model:

$$Y = X\beta + \varepsilon \qquad (1.2)$$

where $Y$ is the $n \times 1$ response vector, $X$ is the $n \times p$ design
matrix of rank $r \le p$, $\beta$ is the $p \times 1$ vector of unknown parameters,
and $\varepsilon$ is the $n \times 1$ unobservable vector of error components; the
error components are assumed to be mutually independent and identically
distributed with zero mean and standard deviation $\sigma$.

Though the $n$ components of $Y$ are independently distributed with common
standard deviation $\sigma$, the components of $Y$ do not have a common mean: the
$i$th component $Y_i$ of $Y$ has mean $\mu_i = X_i'\beta$, where $X_i'$ is the
$i$th row of $X$, $i = 1, 2, \ldots, n$.
1
So, a line plot of Y(i ) on  (ci ) is not meaningful to check the normality of Y i ’s.
It has become a standard practice, as in Minitab, to work with ˆ , the n1 residuals:

ˆ = Y – X ̂ (1.3)
and make a line plot of ˆi , the i component of ˆ , on  (ci ) , ci ’s given by (1.1).
th 1

We use match factory data (Roy et al., 1959) for illustration. The data are the
scores of n = 25 workers on three psychological tests $U_1$, $U_2$, $U_3$,
together with their efficiency index $Y$. After fitting the regression

$$Y = \beta_1 + \beta_2 U_1 + \beta_3 U_2 + \beta_4 U_3 \qquad (1.4)$$

(with $X_1 \equiv 1$, $X_2 = U_1$, $X_3 = U_2$ and $X_4 = U_3$), the $1 \times 25$
vector $\hat{\varepsilon}'$ of residuals is:

(3.33, –0.18, –0.88, –3.62, –5.16, –2.24, 0.92, 3.42, –0.22, –0.52,
–1.61, –1.37, –1.27, 1.31, 0.12, 1.16, 2.17, 0.66, 0.88, –3.07,
0.055, –2.28, 0.69, 3.87, 3.84).

Fig. 1 is a line plot of $\hat{\varepsilon}_{(i)}$ on $\Phi^{-1}(c_i)$, with
$c_i = (i - \tfrac{3}{8})/25.25$, $i = 1, 2, \ldots, 25$.

[Figure: regression residuals plotted against Phi-inverse(c_i)]

Fig. 1: Normal probability plot with regression residuals

But this line plot of $\hat{\varepsilon}_{(i)}$ on $\Phi^{-1}(c_i)$, with
$c_i = (i - \tfrac{3}{8})/(n + \tfrac{1}{4})$, $i = 1, 2, \ldots, n$, is not
appropriate for checking the normality of the $Y_i$'s. It is true that, when the
mutually independent $Y_i$'s are normally distributed with mean
$\mu_i = X_i'\beta$ and common standard deviation $\sigma$, the
$\hat{\varepsilon}_i$'s are distributed as normal with mean zero; but their
standard deviations are different multiples (depending on $X$) of $\sigma$. Also,
the $\hat{\varepsilon}_i$'s are not mutually independent. So, one needs a
modification (Hocking, 2003).
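
Both defects can be seen numerically in a small sketch (continuing the
illustrative Python setup): the covariance matrix of $\hat{\varepsilon}$ is
$\sigma^2 H$ for the projection matrix $H$ computed below, whose diagonal entries
are unequal (heteroscedasticity) and whose off-diagonal entries are non-zero
(dependence).

import numpy as np

def residual_projection(X):
    # H = I_n - X (X'X)^- X'; np.linalg.pinv supplies a g-inverse of X'X
    n = X.shape[0]
    return np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T

# Cov(eps_hat) = sigma^2 * H: unequal diag(H) => heteroscedastic residuals,
# nonzero off-diagonal H[i, j] => mutually dependent residuals.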
This study suggests a natural modification: extracting independent and identically
distributed normal components from $\hat{\varepsilon} = Y - X\hat{\beta}$ using
principal component analysis. The suggested modification cannot be carried out
with statistical tables and a calculator; it is computer-intensive. One needs a
principal component analysis module, which is common in most statistical packages,
such as Eigen Analysis in Minitab.

2. Extraction of independent and identically distributed components using
principal component analysis

Consider the regression model $Y = X\beta + \varepsilon$ of (1.2), along with the
assumptions stated there. One may write $\hat{\varepsilon}$, with
$\hat{\beta} = (X'X)^- X'Y$, as

$$\hat{\varepsilon} = Y - X\hat{\beta} = (I_n - X(X'X)^- X')\,Y \equiv HY \qquad (2.1)$$

where $(X'X)^-$ is a g-inverse of $X'X$ and $H = I_n - X(X'X)^- X'$. It follows
that $\hat{\varepsilon}$ has a singular normal distribution, the mass of the joint
density of the $n$ components of $\hat{\varepsilon}$ lying in $(n - r)$
dimensions, with zero mean and covariance matrix $\sigma^2 H$,
rank$(H) = n - r$. Since $H$ is symmetric and idempotent of rank $(n - r)$, the
characteristic roots of $H$ are 1 with multiplicity $(n - r)$ and 0 with
multiplicity $r$. Using the spectral decomposition of $H$, we may write

$$H = P \begin{pmatrix} I_{n-r} & 0 \\ 0 & 0 \end{pmatrix} P', \qquad PP' = P'P = I_n.$$

$P$ is a non-stochastic orthogonal matrix and depends only on the design matrix
$X$. $P'\hat{\varepsilon}$ has a singular normal distribution, the mass of the
joint density of the $n$ components of $P'\hat{\varepsilon}$ lying in $(n - r)$
dimensions, with zero mean and covariance matrix
$\sigma^2 \begin{pmatrix} I_{n-r} & 0 \\ 0 & 0 \end{pmatrix}$. Thus, if we write
$P = (P^{(1)} \; P^{(2)})$, where $P^{(1)}$ consists of the first $(n - r)$
columns of $P$ (the characteristic vectors corresponding to the $(n - r)$
non-zero characteristic roots of $H$) and $P^{(2)}$ consists of the remaining $r$
columns of $P$, then the $(n - r)$ components of $P^{(1)\prime}\hat{\varepsilon}$
are independent and identically distributed normal with zero mean and standard
deviation $\sigma$, while the remaining $r$ components of
$P^{(2)\prime}\hat{\varepsilon}$ are identically zero (zero mean and zero
variance).
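
A sketch of this extraction in the illustrative Python setup: eigen-decompose
$H$, take the eigenvectors belonging to the unit characteristic roots as
$P^{(1)}$, and form $P^{(1)\prime}\hat{\varepsilon}$.

import numpy as np

def iid_components(X, y, tol=1e-8):
    n = X.shape[0]
    H = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
    # H is symmetric idempotent, so its roots are (numerically) 0 or 1
    eigvals, eigvecs = np.linalg.eigh(H)
    P1 = eigvecs[:, eigvals > 1.0 - tol]   # P^(1): vectors for the (n - r) unit roots
    eps_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return P1.T @ eps_hat                  # (n - r) i.i.d. N(0, sigma^2) components

The returned components can then be passed to the earlier plotting sketch, with
$n$ replaced by $n - r$ in (1.1).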

For the match factory data, $(P'\hat{\varepsilon})'$ is the $1 \times 25$ vector

$(P'\hat{\varepsilon})'$ = (1.84, –0.36, 1.30, 1.40, 0.59, –4.58,
–2.55, –1.93, 0.26, –1.56, 0.82, 2.83,
0.68, 2.77, 3.97, –2.59, –3.63, 2.11,
5.54, –0.86, –0.061, 0, 0, 0, 0). (2.2)

Notice that each of the last four components of $P'\hat{\varepsilon}$, that is, of
$P^{(2)\prime}\hat{\varepsilon}$, is 0, as it should be.

Fig. 2 is a line plot of the $i$th order statistic of the 21 components of
$P^{(1)\prime}\hat{\varepsilon}$ on $\Phi^{-1}(c_i)$, with
$c_i = (i - \tfrac{3}{8})/21.25$ (since $r = p = 4$, $n - r = 21$),
$i = 1, 2, \ldots, 21$.

[Figure: principal components of the residuals plotted against Phi-inverse(c_i)]

Fig. 2: Normal probability plot with principal components of regression residuals

We do not wish to compare the two figures. We only want to point out that the
suggested analysis with principal components is an appropriate method and is not
difficult to implement in packages that have an eigen analysis module.

References

Blom, G. (1958). Statistical Estimates and Transformed Beta-Variables. Wiley, New York.
David, H. A. and Nagaraja, H. N. (2003). Order Statistics. Wiley-Interscience.
Hocking, R. R. (2003). Methods and Applications of Linear Models. Wiley-Interscience.
Roy, J., Chakravarty, I. M. and Laha, R. G. (1959). Handbook of Methods of Applied
Statistics, Vol. 1. John Wiley & Sons, Inc.
