DATA ANALYSIS
Environmental Management Series
This series has been established to meet the need for a set of in-depth volumes
dealing with environmental issues, particularly with regard to a sustainable future.
The series provides a uniform and quality coverage, building up to form a library
of reference books spanning major topics within this diverse field.
Please contact the Publisher or one of the Series' Editors if you would
like to contribute to the Series.
Edited by
C.N. HEWITT
INSTITUTE OF ENVIRONMENTAL & BIOLOGICAL
SCIENCES, LANCASTER UNIVERSITY,
LANCASTER LA1 4YQ, UK
Apart from any fair dealing for the purposes of research or private
study, or criticism or review, as permitted under the UK Copyright
Designs and Patents Act, 1988, this publication may not be
reproduced, stored, or transmitted, in any form or by any means,
without the prior permission in writing of the publishers, or in the case
of reprographic reproduction only in accordance with the terms of the
licences issued by the Copyright Licensing Agency in the UK, or in
accordance with the terms of licences issued by the appropriate
Reproduction Rights Organization outside the UK. Enquiries concerning
reproduction outside the terms stated here should be sent to the
publishers at the London address printed on this page.
The publisher makes no representation, express or implied, with
regard to the accuracy of the information contained in this book and
cannot accept any legal responsibility or liability for any errors or
omissions that may be made.
A catalogue record for this book is available from the British Library
In recent years there has been a dramatic increase in public interest and
concern for the welfare of the planet and in our desire and need to
understand its workings. The commensurate expansion in activity in the
environmental sciences has led to a huge increase in the amount of data
gathered on a wide range of environmental parameters. The arrival of
personal computers in the analytical laboratory, the increasing automation
of sampling and analytical devices and the rapid adoption of remote
sensing techniques have all aided in this process. Many laboratories and
individual scientists now generate thousands of data points every month
or year.
The assimilation of data on any given variable, whether straightforward
(for example, the annual average concentrations of a pollutant in a single
city) or more complex (say, spatial and temporal variations of a wide range
of physical and chemical parameters at a large number of sites), is not in
itself useful. Raw numbers convey very little readily assimilated
information: it is only when they are analysed, tabulated, displayed and
presented that they can serve the scientific and management functions for
which they were collected.
This book aims to aid the active environmental scientist in the process
of turning raw data into comprehensible, visually intelligible and useful
information. Basic descriptive statistical techniques are first covered, with
univariate methods of time series analysis (of much current importance as
the implications of increasing carbon dioxide and other trace gas concen-
trations in the atmosphere are grappled with), regression, correlation and
multivariate factor analysis following. Methods of analysing and deter-
mining errors and detection limits are covered in detail, as are graphical
methods of exploratory data analysis and the visual representation of
NICK HEWITT
Lancaster
Contents
Foreword . . . . . v
Preface . . . . . vii
List of Contributors . . . xi
List of Contributors
M.J. ADAMS
School of Applied Sciences, Wolverhampton Polytechnic, Wulfruna
Street, Wolverhampton WV1 1SB, UK
A.C. BAJPAI
Department of Mathematical Sciences, Loughborough University of
Technology, Loughborough LE11 3TU, UK
I.M. CALUS
72 Westfield Drive, Loughborough LE11 3QL, UK
K.A. CARLBERG
29 Hoffman Place, Belle Mead, New Jersey 08502, USA
A.C. DAVISON
Department of Statistics, University of Oxford, 1 South Parks Road,
Oxford OX1 3TG, UK
J.A. FAIRLEY
Department of Mathematical Sciences, Loughborough University of
Technology, Loughborough LE11 3TU, UK
A.A. LIABASTRE
Environmental Laboratory Division, US Army Environmental Hygiene
Activity-South, Building 180, Fort McPherson, Georgia 30330-5000,
USA
M.S. MILLER
Automated Compliance Systems, 673 Emory Valley Road, Oak Ridge,
Tennessee 37830, USA
J.M. THOMPSON
Department of Biomedical Engineering and Medical Physics,
University of Keele Hospital Centre, Thornburrow Drive, Hartshill,
Stoke-on-Trent, Staffordshire ST4 7QB, UK. Present address:
Department of Medical Physics and Biomedical Engineering, Queen
Elizabeth Hospital, Birmingham B15 2TH, UK
P.C. YOUNG
Institute of Environmental and Biological Sciences, Lancaster
University, Lancaster, Lancashire LA1 4YQ, UK
T. YOUNG
Institute of Environmental and Biological Sciences, Lancaster
University, Lancaster, Lancashire LA1 4YQ, UK. Present address:
Maths Techniques Group, Bank of England, Threadneedle Street,
London EC2R 8AH, UK
Chapter 1
Descriptive Statistical
Techniques
A.C. BAJPAI,a IRENE M. CALUSb and J.A. FAIRLEYa
aDepartment of Mathematical Sciences, Loughborough University of
Technology, Loughborough, Leicestershire LE11 3TU, UK; b72 Westfield
Drive, Loughborough, Leicestershire LE11 3QL, UK
1 RANDOM VARIATION
The air quality in a city in terms of, say, the level of sulphur dioxide
present, cannot be adequately assessed by a single measurement. This is
because air pollutant concentrations in the city do not have a fixed value
but vary from one place to another. They also vary with respect to time.
Similar considerations apply in the assessment of water quality in a river
in terms of, say, the level of nitrogen or number of faecal coliforms
present, or in assessing the activity of a radioactive pollutant. In such
situations, while it may be that some of the variation can be attributed to
known causes, there still remains a residual component which cannot be
fully explained or controlled and must be regarded as a matter of chance.
It is this random variation that explains why, for instance, two samples of
water, of equal volume, taken at the same point on the river at the same
time give different coliform counts, and why, in the case of a radioactive
source, the number of disintegrations in, say, a 1-min time interval varies
from one interval to another.
Random variation may be caused, wholly or in part, by errors in
measurement or it may simply be inherent in the nature of the variable
under consideration. When a die is cast, no error is involved in counting
the number of dots on the uppermost face. The score is affected by a
multitude of factors-the force with which the die is thrown, the angle at
which it is thrown, etc.-which combine to produce the end result.
2 TABULAR PRESENTATION
[The listing of the raw colony counts for the 40 dishes is not reproduced here.]
In Table 1 the data are presented as a frequency table. It shows the
number of dishes with no colony, the number with one colony, and so on.
To form this table you will probably find it easiest to work your way
systematically through the data, recording each observation in its appro-
priate category by a tally mark, as shown. In fact, these particular data
may well be recorded in this way in the first place.
The variate here is 'number of colonies' and the column headed 'number
of dishes' shows the frequency with which each value of the variate
TABLE 1
Frequency table for colony counts

Number of colonies    Tally                Number of dishes
0                     |||||                        5
1                     ||||| ||||                   9
2                     ||||| ||||| ||              12
3                     ||||| ||                     7
4                     |||||                        5
5                     ||                           2
                                                  40
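The tallying just described is easily reproduced on a computer. The following is a minimal Python sketch using collections.Counter; the list of counts is an illustrative stand-in chosen only to match the frequencies in Table 1, since the raw listing is not reproduced in this extract.

```python
from collections import Counter

# Illustrative stand-in for the 40 raw colony counts (chosen only to match
# the frequencies in Table 1; the original listing is not reproduced here).
counts = [0]*5 + [1]*9 + [2]*12 + [3]*7 + [4]*5 + [5]*2

freq = Counter(counts)                 # tally each value of the variate
for value in sorted(freq):
    print(value, freq[value])          # number of colonies, number of dishes
print("Total:", sum(freq.values()))    # 40 dishes in all
```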
TABLE 2
Frequency table for nitrate ion concentration
measurements

Concentration    Frequency
0·45              1
0·46              2
0·47              4
0·48              8
0·49              8
0·50             10
0·51              5
0·52              2
                 40
only 8 different values were taken by the variate. If, however, the deter-
minations had been made on different water specimens, there would have
been a greater amount of variation. Giving the frequency corresponding
to each value taken by the variate would then make the table too unwieldy
and grouping becomes advisable. This is illustrated by the next example,
where a similar situation exists.
TABLE 3
Frequency table for lead concentration measurements

Lead concentration    True boundaries    Frequency
2·0-2·9                1·95-2·95            1
3·0-3·9                2·95-3·95            2
4·0-4·9                3·95-4·95            1
5·0-5·9                4·95-5·95            6
6·0-6·9                5·95-6·95           16
7·0-7·9                6·95-7·95            8
8·0-8·9                7·95-8·95            8
9·0-9·9                8·95-9·95            4
10·0-10·9              9·95-10·95           3
11·0-11·9             10·95-11·95           1
                                           50
mean the retention of too much detail and show little improvement on the
original data.
As with the measurements of nitrate ion concentration in Example 2, we
are here dealing with a variate which is, in essence, continuous. In this
case, readings were recorded to 1 decimal place. Thus 2·9 represents a
value between 2·85 and 2·95, 3·0 represents a value between 2·95 and 3·05,
and so on. Hence there is not really a gap between the upper end of one
group and the lower end of the next. The true boundaries of the first group
are 1·95 and 2·95, of the next 2·95 and 3·95, and so on, as shown in Table
3. Notice that no observation falls on these boundaries and hence no
ambiguity arises when allocating an observation to a group. If the boun-
daries were 2·0, 3·0, 4·0, etc., then the problem would arise of whether 3·0,
say, should be allocated to the group 2·0-3·0 or to the group 3·0-4·0.
Various conventions exist for dealing with this problem but it can be
avoided by a judicious choice of boundaries, as seen here.
2.2 Table of cumulative frequencies
It may be of interest to know on how many days the lead concentration
was below a stated level. From Table 3 it is readily seen that there were
no observations below 1·95, 1 below 2·95, 3 (= 1 + 2) below 3·95,
4 (= 1 + 2 + 1) below 4·95 and so on. The complete set of cumulative
frequencies thus obtained is shown in Table 4.
In the case of a frequency table, it has been mentioned that it may be
more useful to think in terms of the relative frequency, i.e. the proportion
of observations falling into each category. Similar considerations apply in
TABLE 4
Table of cumulative frequencies for data in Table 3

Lead concentration below    Cumulative frequency    Cumulative percentage
1·95                                  0                       0
2·95                                  1                       2
3·95                                  3                       6
4·95                                  4                       8
5·95                                 10                      20
6·95                                 26                      52
7·95                                 34                      68
8·95                                 42                      84
9·95                                 46                      92
10·95                                49                      98
11·95                                50                     100
3 DIAGRAMMATIC PRESENTATION
For some people a diagram conveys more than a table of figures, and ways
of presenting data in this form will now be considered.
[Fig. 1 panels (a) and (b) appear here: dot diagrams with horizontal axis
BOD (mg/litre).]
Fig. 1. Dot diagrams for data recorded at Station 12A on River Clyde:
(a) 1988; (b) 1987 and 1988.
[A further figure, a line diagram with horizontal axis 'Number of Colonies',
appears here.]
the heights of the lines (or bars) represent the frequencies. This type of
diagram is appropriate here as it emphasises the discrete nature of the
variate.
3.3 Histogram
For data in which the variate is of the continuous type, as in Examples 2
and 3, the frequency distribution can be displayed as a histogram. Each
frequency is represented by the area of a rectangle whose base, on a
horizontal scale representing the variate, extends from the lower to the
upper boundary of the group. Taking the distribution in Table 3 as an
example, the base of the first rectangle should extend from 1·95 to 2·95, the
next from 2·95 to 3·95, and so on, with no gaps between the rectangles, as
shown in Fig. 3.
In the case of the distribution of nitrate ion concentration readings in
Table 2, 0·45 is considered as representing a value between 0·445 and
0·455, 0·46 as representing a value between 0·455 and 0·465, and so on.
Thus, when the histogram is drawn, the bases of the rectangles would have
[Fig. 3 appears here: a histogram with horizontal axis Lead Concentration
(μg/m³).]
[A percentage cumulative frequency curve appears here: vertical axis
Percentage Cumulative Frequency (0-100), horizontal axis Lead
Concentration (μg/m³).]
[Fig. 6 appears here: dot diagrams of the Day and Night data, horizontal
axis CO Concentration (ppm).]
beginning at 20.00 h.
Day: 5·8 6·9 6·7 6·7 6·3 5·8 5·5 6·1 6·8 7·0 7·4 6·4
Night: 5·0 3·8 3·5 3·3 3·1 2·4 1·8 1·5 1·3 1·3 2·0 3·4
From a glance at the two sets of data it is seen that higher values were
recorded during the daytime period. (This is not surprising because the
density of traffic, which one would expect to have an effect on CO con-
centration, is higher during the day.) This difference between the two sets
of data is highlighted by the two dot diagrams shown in Fig. 6.
4.1.1 Definition
One way of indicating where each set of data is located on the scale of
measurement is to calculate the arithmetic mean. This is the measure of
location in most common use. Often it is simply referred to as the mean,
as will sometimes be done in this chapter. There are other types of mean,
e.g. the geometric mean of which mention will be made later. However, in
that case the full title is usually given so there should be no misunderstand-
ing. In common parlance the term 'average' is also used for the arithmetic
mean although, strictly speaking, it applies to any measure of location:
    Arithmetic mean = (Sum of all observations)/(Total number of observations)        (1)
Applying eqn (1) to the daytime recordings gives the mean CO concen-
tration as

    (5·8 + 6·9 + ... + 7·4 + 6·4)/12 = 77·4/12 = 6·45 ppm
Similarly, for the night recordings the mean is 32·4/12, i.e. 2·7 ppm.
Using Σ to denote 'the sum from i = 1 to i = n of', this can be written
as

    x̄ = (1/n) Σ_{i=1}^{n} x_i        (2)

Where there is no possibility of misunderstanding, the use of the suffix i
on the right-hand side can be dropped and the sum simply written as Σx.
= 0
We have, therefore, the general result that the sum of the deviations from
the arithmetic mean is always zero.
The second property concerns the values of the squares of the deviations
from the mean, i.e. (x − x̄)². Continuing with the same set of 4 values of
x, squaring the deviations and summing gives

    Σ(x − x̄)² = (−3)² + 1² + 4² + (−2)² = 30

Now let us look at the sum of squares of deviations from a value other
than x̄. Choosing 2, say, gives

    Σ(x − 2)² = (1 − 2)² + (5 − 2)² + (8 − 2)² + (2 − 2)² = 46

Choosing 10 gives

    (1 − 10)² + (5 − 10)² + (8 − 10)² + (2 − 10)² = 174

Both these sums are larger than 30, as also would have been the case if
numbers other than 2 and 10 had been chosen. This illustrates the general
result that the sum of the squares of the deviations from the arithmetic mean
is less than the sum of the squares of the deviations taken from any other
    S = Σ_{i=1}^{n} (x_i − a)² = (x₁ − a)² + (x₂ − a)² + ... + (x_n − a)²

whence
TABLE 5
Calculation of arithmetic mean for data in Table 1

x     f    fx
0     5     0
1     9     9
2    12    24
3     7    21
4     5    20
5     2    10
     40    84
(1) is given by

    (5 × 0 + 9 × 1 + 12 × 2 + 7 × 3 + 5 × 4 + 2 × 5)/40 = 84/40

The calculations can be set out as shown in Table 5.

    Mean number of colonies per dish = 84/40 = 2·1

It will be noted that Σfx is the sum of all the observations (in this case the
total number of colonies observed) and Σf is the total number of observa-
tions (in this case the total number of dishes).
An extension to the general case is easily made. If the variate takes
values x₁, x₂, ..., x_n with frequencies f₁, f₂, ..., f_n respectively, the
arithmetic mean is given by

    x̄ = (f₁x₁ + f₂x₂ + ... + f_nx_n)/(f₁ + f₂ + ... + f_n)        (3)

which may be written more simply as Σfx/Σf.
On examination of eqn (3), it will be seen that x₁, x₂, ..., x_n are weighted
according to the frequencies f₁, f₂, ..., f_n with which they occur. The
mean for the data in Example 2 can be found by a straightforward
application of this formula to Table 2, thus giving 0·49. In Examples 1 and
2 no information was lost by putting the data into the form of a frequency
table. In each case it would be possible, from the frequency table, to say
what the original 40 observations were. The same is not true of the
frequency table in Example 3. By combining values together in each class
interval, some of the original detail was lost and it would not be possible
to reproduce the original data from Table 3. To calculate the mean from
such a frequency table, the values in each class interval are taken to have
the value at the mid-point of the interval. Thus, in Table 3, observed values
in the first group (1·95-2·95) are taken to be 2·45, in the next group
(2·95-3·95) they are taken to be 3·45, and so on. This is, of course, an
approximation, but an unavoidable one. It may be of interest to know that
the mean calculated in this way from Table 3 is 7·15, whereas the original
50 recordings give a mean of 7·08, so the approximation is quite good.
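This grouped-mean approximation is easy to check numerically. A minimal Python sketch, using the mid-points and frequencies of Table 3, is:

```python
# Grouped mean for the lead concentration data of Table 3: each observation in
# a group is represented by the mid-point of its true boundaries.
midpoints   = [2.45, 3.45, 4.45, 5.45, 6.45, 7.45, 8.45, 9.45, 10.45, 11.45]
frequencies = [1, 2, 1, 6, 16, 8, 8, 4, 3, 1]

mean = sum(f * m for f, m in zip(frequencies, midpoints)) / sum(frequencies)
print(round(mean, 2))   # 7.15, as quoted above (the ungrouped data give 7.08)
```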
Fig. 7. Histograms showing distributions that are (a) symmetric; (b) skew.
Rewriting this as

    log GM = (1/n)(log x₁ + log x₂ + ... + log x_n)

leads to

    GM = ⁿ√(x₁x₂ ... x_n)
Both the GEMS report on air quality and the UK Blood Lead Monitor-
ing Programme5-7 include geometric means in their presentation of results.
5 MEASURES OF DISPERSION
Both sets of data give a mean of 14·13. (This would be unlikely to happen
in practice, but the figures have been chosen to emphasise the point that
is being made here.) It can, however, be seen at a glance that B's results
show a much smaller degree of scatter, and thus better precision, than
those obtained by A. Although the terms precision and accuracy tend to
be used interchangeably in everyday speech, the theory of errors makes a
clear distinction between them. A brief discussion of the difference
between these two features of a measuring technique or instrument will be
useful at this point.
Mention has already been made of the occurrence of random error in
measurements. Another type of error that can occur is a systematic error
(bias). Possible causes, cited by Lee & Lee,1 are instrumental errors such
as the zero incorrectly adjusted or incorrect calibration, reagent errors
such as the sample used as a primary standard being impure or made up
The exceptionally high value at the upper extreme of the Clyde conductiv-
ity data (for which the median was found in Section 4.3) would have a
huge effect on the range, but none at all on the interquartile range, which
will now be calculated.
In the calculation of the median from a set of data there is a universal
convention that when there is no single middle value, the median is taken
to be midway between the two middle values. There is, however, no
universally agreed procedure for finding the quartiles, or, indeed, percen-
tiles generally. It can be argued that, in the same way that the median is
found from the ordered data by counting halfway from one extreme to the
other, the quartiles should be found by counting halfway from each
extreme to the median. Applying this to the conductivity data, there are
two middle values, the 4th and 5th, which are halfway between the 1st and
8th (the median). They are 260 and 279, so the lower quartile would
then be taken as (260 + 279)/2, i.e. 269·5. Similarly, the upper quartile
would then be (380 + 488)/2, i.e. 434, giving the interquartile range as
434 - 269·5 = 164·5.
Another approach is based on taking the Pth percentile of n ordered
observations to be the (P/100)(n + 1)th value. For the conductivity data,
n = 15 and the quartiles (P = 25 and P = 75) would be taken as the 4th
and 12th values, i.e. 260 and 488. For a larger set of data, the choice of
approach would have less effect.
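The difference between the two conventions is easy to demonstrate in code. The sketch below implements the (P/100)(n + 1)th-value rule in Python; the data are hypothetical stand-ins, not the Clyde conductivity values.

```python
# A minimal sketch of the (P/100)(n + 1)th-value percentile convention.
def percentile_n_plus_1(x, p):
    """Pth percentile taken as the (p/100)(n + 1)th ordered value."""
    x = sorted(x)
    k = (p / 100.0) * (len(x) + 1)       # position in the ordered data
    i = int(k)                            # rank of the lower neighbour
    if i < 1:
        return x[0]
    if i >= len(x):
        return x[-1]
    frac = k - i
    return x[i - 1] + frac * (x[i] - x[i - 1])   # interpolate between ranks

# Example with 15 hypothetical ordered observations:
x = [5, 7, 8, 9, 12, 13, 15, 18, 21, 22, 25, 30, 33, 40, 52]
lower = percentile_n_plus_1(x, 25)   # the 4th value, here 9
upper = percentile_n_plus_1(x, 75)   # the 12th value, here 30
print(lower, upper, upper - lower)   # interquartile range
```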
This measure of spread is, however, of limited usefulness and is not often
met with nowadays.
figure of 240 V stated for the UK electricity supply is, in fact, the r.m.s.
value of the voltage.
The calculations required when finding a standard deviation are more
complicated than for other measures of dispersion, but this disadvantage
has been reduced by the aids to computation now available. A calcu-
lator with the facility to carry out statistical calculations usually offers
a choice of two possible values. To explain this we must now make a slight
digression.
    s = √[Σ(x − x̄)²/(n − 1)]        (5)
TABLE 6
Calculation of s for Laboratory B's measurements of %Fe

x         x − x̄      (x − x̄)²        x²
14·12     −0·01       0·0001       199·3744
14·10     −0·03       0·0009       198·8100
14·15      0·02       0·0004       200·2225
14·11     −0·02       0·0004       199·0921
14·17      0·04       0·0016       200·7889
70·65                 0·0034       998·2879

    Σ(x − x̄)² = Σx² − (Σx)²/n        (6)
The %Fe results obtained by Laboratory B are again chosen for illustra-
tion.
From Table 6, Σx = 70·65 and Σx² = 998·2879. Substitution in eqn (6)
gives

    Σ(x − x̄)² = 998·2879 − 70·65²/5 = 998·2879 − 998·2845 = 0·0034

s is then found as before.
4 digits, the process of squaring and adding nearly doubled this number.
The example will serve to show a hazard that may exist, not only in the
use of the shortcut formula but in similar situations when a computer or
calculator is used.
In the previous calculations all the figures were carried in the working
and there was no rounding off. Now, let us look at the effect of working
to a maximum of 5 significant figures. The values of x² would then be taken
as 199·37, 198·81, 200·22, 199·09 and 200·79, giving Σx² = 998·28. Sub-
stitution in eqn (6) gives Σ(x − x̄)² = 998·28 − 998·28 = 0, leading to
s = O. What has happened here is that the digits discarded in the rounding
off process are the very ones that produce the correct result. This occurs
when the two terms on the right-hand side of eqn (6) are near to each other
in value. Hence, in the present example, 7 significant figures must be
retained in the working in order to obtain 2 significant figures in the
answer.
Calculators and computers carry only a limited number of digits in their
working. Having seen what can happen when just 5 observations are
involved, only a little imagination is required to realise that, with a larger
amount of data, overflow of the capacity could easily occur. The conse-
quent round-off errors can then lead to an incorrect answer, of which the
unsuspecting user of the calculator will be unaware (unless an obviously
ridiculous answer like s = 0 is obtained).
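The hazard is easily reproduced. The Python sketch below applies the shortcut formula (6) to Laboratory B's %Fe values, once with the squares carried in full precision and once with each square rounded to 5 significant figures, as in the working above.

```python
# Demonstration of the round-off hazard in the shortcut formula (6):
# sum(x^2) - (sum x)^2 / n fails when the two terms agree in their leading digits.
x = [14.12, 14.10, 14.15, 14.11, 14.17]     # Laboratory B's %Fe measurements
n = len(x)

def round_sig(v, sig=5):
    """Round v to `sig` significant figures."""
    return float(f"{v:.{sig}g}")

sum_x          = sum(x)                                   # 70.65
sum_x2_full    = sum(xi * xi for xi in x)                 # 998.2879
sum_x2_rounded = sum(round_sig(xi * xi) for xi in x)      # 998.28 (5 sig. figs)

print(sum_x2_full    - sum_x ** 2 / n)   # about 0.0034, giving s = 0.029
print(sum_x2_rounded - sum_x ** 2 / n)   # about zero (here slightly negative):
                                         # the digits that matter have been lost
```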
The number of digits in the working can be reduced by coding the data,
i.e. making a change of origin and/or size of unit. For Laboratory B's %Fe
data, subtracting 14·1 from each observation (i.e. moving the origin to
x = 14·1) gives the coded values 0·02, 0·00, 0·05, 0·01 and 0·07. The
spread of these values is just the same as that of the original data and they
will yield the same value of s. The number of digits in the working will,
however, be drastically reduced. This illustrates how a convenient number
can be subtracted from each item of data without affecting the value of s.
Apart from reducing possible risk arising from round-off error, fewer
digits mean fewer keys to be pressed, thus saving time and reducing
opportunities of making wrong entries. In fact, the %Fe data could be
even further simplified by making 0·01 the unit, so that the values become
2, 0, 5, 1 and 7. It would then be necessary, when s has been calculated,
to convert back to the original units by multiplying by 0·01. A more
detailed explanation of coding is given in Ref. 4.
TABLE 7
Calculation of s for frequency distribution in Table 1

    x̄ = 84/40 = 2·1        s = √(73·60/39) = 1·37

    Σf(x − x̄)² = Σfx² − (Σfx)²/Σf        (8)
Applying eqn (7) to the colony count data in Table 1 gives the calculations
shown in Table 7. Applying eqn (8) to the same data,

    Σfx² = 0 + 9 + 48 + 63 + 80 + 50 = 250

and hence

    Σf(x − x̄)² = 250 − 84²/40 = 73·60
Calculation of s then proceeds as before.
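For completeness, the calculation in Table 7 and eqn (8) can be written out as a short Python sketch:

```python
# Standard deviation from a frequency table (eqn (8)) for the colony-count data
# of Table 1: values x with frequencies f.
x = [0, 1, 2, 3, 4, 5]
f = [5, 9, 12, 7, 5, 2]

n       = sum(f)                                       # 40 observations
sum_fx  = sum(fi * xi for fi, xi in zip(f, x))         # 84
sum_fx2 = sum(fi * xi * xi for fi, xi in zip(f, x))    # 250
ss      = sum_fx2 - sum_fx ** 2 / n                    # 73.6
s       = (ss / (n - 1)) ** 0.5                        # about 1.37
print(sum_fx / n, s)                                   # mean 2.1, s 1.37
```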
6 MEASURES OF SKEWNESS
TABLE 8
Stem-and-leaf display for River Clyde (Station 20)
alkalinity data
Stem    Leaf            (Number of leaves)
 4      2                        1
 5      0                        1
 6      88266706008             11
 7      406                      3
 8      5982                     4
 9      02                       2
10      28                       2
11      527                      3
12      26                       2
13      3                        1
TABLE 9
Ordered stem-and-leaf display
for data in Table 8
Stem Leaf
4 2
5 0
6 00026667888
7 046
8 2589
9 02
10 28
11 257
12 26
13 3
TABLE 10
Back-to-back stem-and-leaf display for River Clyde
alkalinity data
9 2
4 3
688 4 2
9885234 5 0
20205 6 88266706008
21 7 406
780 8 5982
62 9 02
6 10 28
40 11 527
12 26
13 3
Fig. 8. Box-and-whisker display for River Clyde (Station 20) alkalinity data.
to compare various data sets, e.g. blood lead concentrations in men and
women, in different years or in various areas where surveys were carried
out. Figure 9 gives an indication of how this can be done, enabling a rapid
visual assessment of any differences between sets of data to be made quite
easily.
[Fig. 9 appears here: side-by-side box-and-whisker displays for YEAR 1 and
YEAR 2; vertical axis: blood lead concentration.]
Fig. 9. Box-and-whisker displays comparing blood lead concentrations in
men and women in successive years.
ACKNOWLEDGEMENT
REFERENCES
1. Lee, J.D. & Lee, T.D., Statistics and Numerical Methods in BASIC for Biolo-
gists. Van Nostrand Reinhold, Wokingham, 1982.
2. GEMS: Global Environment Monitoring System, Global Pollution and
Health. United Nations Environment Programme and World Health Or-
ganization, London, 1987.
3. GEMS: Global Environment Monitoring System, Air Quality in Selected
Urban Areas 1975-1976. World Health Organization, Geneva, 1978.
4. Bajpai, A.C., Calus, I.M. & Fairley, J.A., Statistical Methods for Engineers
and Scientists. John Wiley, Chichester, 1978.
5. Department of the Environment, UK Blood Lead Monitoring Programme
1984-1987: Results for 1984. Pollution Report No. 22, Her Majesty's Stationery
Office, London, 1986.
6. Department of the Environment, UK Blood Lead Monitoring Programme
1984-1987: Results for 1985. Pollution Report No. 24, Her Majesty's Stationery
Office, London, 1987.
7. Department of the Environment, UK Blood Lead Monitoring Programme
1984-1987: Results for 1986. Pollution Report No. 26, Her Majesty's Stationery
Office, London, 1988.
8. Marriott, F.H.C., A Dictionary of Statistical Terms. 5th edn, Longman
Group, UK, Harlow, 1990.
9. Tukey, J.W., Exploratory Data Analysis. Addison-Wesley, Reading, MA,
1977.
10. Erickson, B.H. & Nosanchuk, T.A., Understanding Data. Open University
Press, Milton Keynes, 1979.
Chapter 2
Environmetric Methods of
Nonstationary Time-Series
Analysis: Univariate Methods
PETER YOUNG and TIM YOUNG*
Centre for Research in Environmental Systems, Institute of
Environmental and Biological Sciences, Lancaster University, Lancaster,
Lancashire, LA1 4YQ, UK
1 INTRODUCTION
[Figure residue: (a) a series on a vertical scale 340-352 (atmospheric CO2,
ppm); (b) a series on a vertical scale 20-160 plotted against the years
1750-1950; a further series plotted against sample numbers 7-67; and a
geological time axis (250-50, Permian/Triassic/Jurassic/Cretaceous/
Tertiary, labelled 'Time (kyrs B.P.)').]
We must start at the beginning: the first step in the evaluation of any data
set is to look at it carefully. This presumes the availability of a good
computer filing system and associated plotting facilities. Fortunately,
most scientists and engineers have access to microcomputers with appro-
priate software; either the IBM-PC-AT/IBM PS2 and their compatibles,
or the Macintosh SE/II family. In this chapter, we will use mainly the
microCAPTAIN program developed at Lancaster,3 which is designed for
the IBM-type machines but will be extended to the Macintosh in the near
future. Other programs, such as StatGraphics®, can also provide similar
facilities for the basic analysis of time-series data, but they do not provide
the recursive estimation tools which are central to the microCAPTAIN
approach to time-series analysis.
Visual appraisal of time-series data is dependent very much on the
background and experience of the analyst. In general, however, factors
such as nonstationarity of the mean value and the presence of pronounced
periodicity will be quite obvious. Moreover, the eye is quite good at
analysing data and perceiving underlying patterns, even in the presence of
background noise: in other words, the eye can effectively 'filter' off the
effects of stochastic (random) influences from the data and reveal aspects
of the data that may be of importance to their understanding within the
context of the problem under consideration. The comments above on the
CO2 and sunspot data are, for example, typical of the kind of initial
observations the analyst might make on these two time-series.
Having visually appraised the data, the next step is to consider their
more quantitative statistical properties. There are, of course, many dif-
ferent statistical procedures and tests that can be applied to time-series
data and the reader should refer to any good text on time-series analysis
for a comprehensive appraisal of the subject.2,4 Here, we will consider only
those statistical procedures that we consider to be of major importance in
day-to-day analysis; namely, correlation analysis in the time-domain, and
spectral analysis in the frequency-domain.
A discrete time-series is a set of observations taken sequentially in time;
thus N observations, or samples, taken from a series y(t) at times t₁,
t₂, ..., t_k, ..., t_N, may be denoted by y(t₁), y(t₂), ..., y(t_k), ..., y(t_N).
In this chapter, however, we consider only sampled data observed at some
fixed interval δt: thus we then have N successive values of the series
available for analysis over the observation interval of N samples, so that we
can use y(1), y(2), ..., y(k), ..., y(N) to denote the observations made
at equidistant time intervals t₀, t₀ + δt, t₀ + 2δt, ..., t₀ + kδt, ...,
t₀ + Nδt. If we adopt t₀ as the origin and δt as the sampling interval, then
we can regard y(k) as the observation at time t = t_k.
A stationary time-series is one which can be considered in a state of
statistical equilibrium; while a strictly stationary time-series is one in which
its statistical properties are unaffected by any change in the time origin to.
Thus for a strictly stationary time-series, the joint distribution of any set
of observations must be unaffected by shifting the observation interval. A
nonstationary time-series violates these requirements, so that its statistical
description may change in some manner over any selected observation
interval.
What do we mean by 'statistical description'? Clearly a time-series can
be described by numerous statistical measures, some of which, such as the
sample mean ȳ and the sample variance σ_y² are very well known, i.e.

    ȳ = (1/N) Σ_{k=1}^{N} y(k);        σ_y² = (1/N) Σ_{k=1}^{N} [y(k) − ȳ]²
If we are to provide a reasonably rich description of a fairly complex
time-series, however, it is necessary to examine further the temporal
patterns in the data and consider other, more sophisticated but still
    r(n) = c_n/c₀;        n = 0, 1, 2, ..., r − 1
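A minimal Python sketch of this calculation is given below. It assumes the usual 'biased' autocovariance estimate c_n = (1/N) Σ [y(k) − ȳ][y(k + n) − ȳ], since the precise estimator used in the original working is not reproduced in this extract.

```python
# Sample autocorrelation r(n) = c_n / c_0 under the biased autocovariance estimate.
import numpy as np

def autocorrelation(y, max_lag):
    y = np.asarray(y, dtype=float)
    N = len(y)
    d = y - y.mean()
    c0 = np.sum(d * d) / N
    return np.array([np.sum(d[: N - n] * d[n:]) / N / c0
                     for n in range(max_lag + 1)])

# Example: a noisy 12-sample periodicity shows up clearly in r(n).
rng = np.random.default_rng(0)
t = np.arange(200)
y = np.sin(2 * np.pi * t / 12) + 0.3 * rng.standard_normal(200)
print(np.round(autocorrelation(y, 12), 2))   # r(12) is close to its maximum
```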
[Block diagram: white noise input e(k) passed through the autoregressive
filter 1/(1 + a₁z⁻¹ + a₂z⁻² + ... + a_nz⁻ⁿ) to give the output y(k).]
Filter
quency-domain: after all, we observe the series and plot them initially in
temporal terms, so that the first patterns we see in the data are those which
are most obvious in a time-domain context. An alternative way of analys-
ing a time-series is on the basis of Fourier-type analysis, i.e. to assume that
the series is composed of an aggregation of sine and cosine waves with
different frequencies. One of the simplest and most useful procedures
which uses this approach is the periodogram introduced by Schuster in the
late nineteenth century.2
The intensity of the periodogram I(f_i) is defined by

    I(f_i) = (2/N){[Σ_{k=1}^{N} y(k)cos(2πf_i k)]² + [Σ_{k=1}^{N} y(k)sin(2πf_i k)]²};
                                                        i = 1, 2, ..., q

where q = (N − 1)/2 for odd N and q = N/2 for even N. The periodo-
gram is then the plot of I(f_i) against f_i, where f_i = i/N is the ith harmonic of
the fundamental frequency 1/N, up to the Nyquist frequency of 0·5 cycles
per sampling interval (which corresponds to the smallest identifiable wave-
length of two samples). Since I(f_i) is obtained by multiplying y(k) by sine
and cosine functions of the harmonic frequency, it will take on relatively
large values when this frequency coincides with a periodicity of this
frequency occurring in y(k). As a result, the periodogram maps out the
spectral content of the series, indicating how its relative power varies over
the range of frequencies between f_i = 0 and 0·5. Thus, for example,
pronounced seasonality in the series with period T = 1/f_i samples will
induce a sharp peak in the periodogram at f_i cycles/sample; while if the
seasonality is amplitude modulated or the period is not constant then the
peak will tend to be broader and less well defined.
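A direct Python implementation of the periodogram as defined above, evaluated at the harmonics f_i = i/N, might look as follows (the test series is an illustrative stand-in):

```python
# Periodogram I(f_i) evaluated at the harmonics f_i = i/N of the fundamental.
import numpy as np

def periodogram(y):
    y = np.asarray(y, dtype=float)
    N = len(y)
    q = (N - 1) // 2 if N % 2 else N // 2
    k = np.arange(1, N + 1)
    intensities = []
    for i in range(1, q + 1):
        f = i / N
        c = np.sum(y * np.cos(2 * np.pi * f * k))
        s = np.sum(y * np.sin(2 * np.pi * f * k))
        intensities.append((2.0 / N) * (c * c + s * s))
    return np.arange(1, q + 1) / N, np.array(intensities)

# A series with a strong 12-sample cycle produces a sharp peak near f = 1/12.
rng = np.random.default_rng(0)
t = np.arange(120)
y = np.sin(2 * np.pi * t / 12) + 0.2 * rng.standard_normal(120)
f, I = periodogram(y)
print(f[np.argmax(I)])   # close to 0.083 cycles/sample
```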
The sample spectrum is simply the periodogram with the frequency f_i
allowed to vary continuously over the range 0 to 0·5 cycles, rather than
restricting it to the harmonics of the fundamental frequency (often, as in
later sections of the present chapter, the sample spectrum is also referred
to as the periodogram). This sample spectrum is, in fact, related directly
to the autocovariance function by the relationship,
in this regard. It is true that the sample spectrum obtained in this manner
has high variance about the theoretical 'true' spectrum, which has led to
the computation of 'smoothed' estimates obtained by choosing λ_k in the
above expression to have suitably chosen weights called the lag window.
However, the raw sample spectrum remains a fundamentally important
statistical characterisation of the data which is complementary in the
frequency-domain with the autocovariance or autocorrelation function in
the time-domain.
There is also a spectral representation of the data which is complemen-
tary with the partial autocorrelation function, in the sense that it depends
directly on autoregression estimation: this is the autoregressive spectrum.
Having identified and estimated an AR model for the time-series data in
the time-domain, its frequency-domain characteristics can be inferred by
noting that the spectral representation of the backward shift operator, for
a sampling interval δt, is given by,

    z⁻ʳ = exp(−j2πrf_iδt) = cos(2πrf_iδt) − jsin(2πrf_iδt);        0 ≤ f_i ≤ 0·5

so that by substituting for z⁻ʳ, r = 1, 2, ..., n, in the AR transfer
function, it can be represented as a frequency-dependent complex number
of the form A(f_i) + jB(f_i). The spectrum associated with this represen-
tation is then obtained simply by plotting the squared amplitude A(f_i)² +
B(f_i)², or its logarithm, either against f_i in the range 0 to 0·5 cycles/sample
interval, or against the period 1/f_i in samples. This spectrum, which is
closely related to the maximum entropy spectrum,4 is much smoother than
the sample spectrum and appears to resolve spectral peaks rather better
than the more directly smoothed versions of sample spectrum.
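A minimal sketch of this autoregressive spectrum is given below, assuming AR coefficients a₁, ..., a_n have already been estimated; the substituted frequency response is 1/(1 + Σ a_r e^(−j2πfr)), with the squared amplitude plotted against f in cycles/sample. The AR(2) coefficients used in the example are purely illustrative.

```python
# AR spectrum: substitute z^-r = exp(-j*2*pi*f*r) into the AR transfer function
# 1/(1 + a_1 z^-1 + ... + a_n z^-n) and take the squared amplitude.
import numpy as np

def ar_spectrum(a, n_freq=256):
    a = np.asarray(a, dtype=float)          # a[0] = a_1, ..., a[-1] = a_n
    f = np.linspace(0.0, 0.5, n_freq)       # cycles per sample
    r = np.arange(1, len(a) + 1)
    denom = 1.0 + np.array([np.sum(a * np.exp(-2j * np.pi * fi * r)) for fi in f])
    return f, 1.0 / np.abs(denom) ** 2

f, S = ar_spectrum([-1.5, 0.75])            # illustrative stable AR(2) coefficients
print(f[np.argmax(S)])                      # frequency of the spectral peak (~0.083)
```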
3 RECURSIVE ESTIMATION
and,

    P(k) = P(k − 1) − P(k − 1)z(k)[σ² + zᵀ(k)P(k − 1)z(k)]⁻¹zᵀ(k)P(k − 1)        (2)
In this algorithm, y(k) is the kth observation of the time-series data; â(k)
is the recursive estimate at the kth recursion of the autoregressive para-
meter vector a, as defined for the AR model; and P(k) is a symmetric,
n × n matrix which provides an estimate of the covariance matrix associ-
ated with the parameter estimates. A full derivation and description of this
algorithm, the essence of which can be traced back to Gauss at the
beginning of the nineteenth century, is given in Ref. 9. Here, it will suffice
to note that eqn (1) generates an estimate â(k) of the AR parameter vector
a at the kth instant by updating the estimate â(k − 1) obtained at the
previous (k − 1)th instant in proportion to the prediction error e(k/k − 1), where

    e(k/k − 1) = y(k) − zᵀ(k)â(k − 1)

is the error between the latest sample y(k) and its predicted value
zᵀ(k)â(k − 1), conditional on the estimate â(k − 1) at the (k − 1)th instant.
This update is controlled by the vector g(k), which is itself a function of
the covariance matrix P(k). As a result, the magnitude of the recursive
update is seen to be a direct function of the confidence that the algorithm
associates with the parameter estimates at the (k − 1)th sampling instant:
the greater the confidence, as indicated by a P(k − 1) matrix with
elements having low relative values, the smaller the attention paid to the
prediction error, since this is more likely to be due to the random noise
input e(k) and less likely to be due to estimation error on â(k − 1).
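The update structure just described is easily sketched in Python. The following is a minimal recursive least squares (RLS) routine for an AR(n) model under the assumptions stated in the comments; the simulated AR(2) example is purely illustrative.

```python
# Minimal RLS sketch for an AR(n) model: the estimate is updated in proportion
# to the prediction error, with a gain derived from the covariance matrix P.
import numpy as np

def rls_ar(y, n, sigma2=1.0, p0=1e3):
    y = np.asarray(y, dtype=float)
    a = np.zeros(n)                       # parameter estimates a_hat(k)
    P = np.eye(n) * p0                    # large initial covariance (diffuse prior)
    for k in range(n, len(y)):
        z = -y[k - n:k][::-1]             # regressor: -y(k-1), ..., -y(k-n)
        e = y[k] - z @ a                  # prediction error e(k|k-1)
        denom = sigma2 + z @ P @ z
        g = P @ z / denom                 # gain vector g(k)
        a = a + g * e                     # parameter update (eqn (1))
        P = P - np.outer(P @ z, z @ P) / denom   # covariance update (eqn (2))
    return a

# Example: recover the coefficients of a simulated AR(2) process.
rng = np.random.default_rng(0)
N, a_true = 2000, np.array([-1.5, 0.75])
y = np.zeros(N)
for k in range(2, N):
    y[k] = -a_true[0] * y[k - 1] - a_true[1] * y[k - 2] + rng.standard_normal()
print(np.round(rls_ar(y, 2), 2))          # close to [-1.5, 0.75]
```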
In the RLS algorithm shown above, there is an implicit assumption that
the parameter vector a is time-invariant. In the recursive TVP version of
the algorithm, on the other hand, this assumption is relaxed and the
parameter may vary over the observation interval to reflect some changes
in the statistical properties of the time-series y(k), as described by an
assumed stochastic model of the parameter variations. In the present
chapter, we make extensive use of recursive TVP estimation. In particular,
we exploit the excellent spectral properties of certain recursive TVP esti-
mation and smoothing algorithms to develop a practical and unified
approach to adaptive time-series analysis, forecasting and seasonal adjust-
ment, i.e. where the results of TVP estimation are used recursively to
update the forecasts or seasonal adjustments to reflect any nonstationary
or nonlinear characteristics in y(k).
The approach is based around the well-known 'structural' or 'com-
below; while η_{i1}(k) and η_{i2}(k) represent zero mean, serially uncorrelated,
discrete white noise inputs, with the vector η_i(k) normally characterised by
a covariance matrix Q_i, i.e.

    E{η_i(k)η_iᵀ(j)} = Q_iδ_{kj},        δ_{kj} = 1 for k = j, 0 for k ≠ j

where, unless there is evidence to the contrary, Q_i is assumed to be
diagonal in form with unknown diagonal elements q_{i11} and q_{i22}, respectively.
This GRW model subsumes, as special cases,9 the very well-known
random walk (RW: β = γ = 0; η_{i2}(k) = 0); and the integrated random
walk (IRW: β = γ = 1; η_{i1}(k) = 0). In the case of the IRW, we see that
c_i(k) and d_i(k) can be interpreted as level and slope variables associated
with the variations of the ith parameter, with the random disturbance
entering only through the d_i(k) equation. If η_{i1}(k) is non-zero, however,
then both the level and slope equations can have random fluctuations
defined by η_{i1}(k) and η_{i2}(k), respectively. This variant has been termed the
'Linear Growth Model' by Harrison & Stevens.19,22
The advantage of these random walk models is that they allow, in a very
simple manner, for the introduction of nonstationarity into the regression
model. By introducing a parameter variation model of this type, we are
assuming that the time-series can be characterised by a stochastically
variable mean value, arising from c₀(k) = t(k), and a perturbational com-
ponent with potentially very rich stochastic properties deriving from the
TVP regression terms. The nature of this variability will depend upon the
specific form of the GRW chosen: for instance, the IRW model is par-
ticularly useful for describing large smooth changes in the parameters;
while the RW model provides for smaller scale, less smooth variations.9,24
As we shall see below, these same models can also be used to handle large,
abrupt changes or discontinuities in the level and slope of either the trend
or the regression model coefficients.
The state space representation of this dynamic regression model is
obtained simply by combining the GRW models for the n + I parameters
into the following composite state space form,
    x(k) = Fx(k − 1) + Gη(k − 1)        (6a)
    y(k) = Hx(k) + e(k)        (6b)

where the composite state vector x is composed of the c_i(k) and d_i(k)
parameters, i.e.

    x(k) = [c₀(k) d₀(k) c₁(k) d₁(k) ... c_n(k) d_n(k)]ᵀ
the stochastic input vector η(k) is composed of the disturbance terms to the
GRW models for each of the time-variable regression coefficients, i.e.

    η(k)ᵀ = [η₀₁(k) η₀₂(k) η₁₁(k) η₁₂(k) ... η_{n1}(k) η_{n2}(k)]

while the state transition matrix F, the input matrix G and the observation
matrix H are defined as follows: F and G are block diagonal, with the GRW
sub-model matrices F_i and G_i (i = 0, 1, ..., n) on their respective diagonals,
and

    H = [1  0  x₁(k)  0  x₂(k)  0  ...  x_n(k)  0]
In other words, the observation eqn (6b) represents the regression model
(4); with the state equations in (6a) describing the dynamic behaviour of
the regression coefficients; and the disturbance vector η(k) in eqn (6a)
defined by the disturbance inputs to the constituent GRW sub-models.
We have indicated that one obvious choice for the definition of the
regression variables x;(k) is to set them equal to the past values of y(k), so
allowing the perturbations to be described by a TVP version of the AR(n)
model. This is clearly a sensible choice, since we have seen in Section 2 that
the AR model provides a most useful description of a stationary stochastic
process, and we might reasonably assume that, in its TVP form, it provides
a good basis for describing nonstationary time-series. If the perturbational
component is strongly periodic, however, the spectral analysis in Section 2
suggests an alternative representation in the form of the dynamic harmonic
regression (DHR) model.35 Here t(k) is defined as in the DAR model but
p(k) is now defined as the linear sum of sine and cosine variables at F
different frequencies, suitably chosen to reflect the nature of the seasonal
variations, i.e.
    p(k) = Σ_{i=1}^{F} [θ_{1i}(k)cos(2πf_ik) + θ_{2i}(k)sin(2πf_ik)]        (7)
where the regression coefficients θ_{ji}(k), j = 1, 2 and i = 1, 2, ..., F, are
assumed to be time-variable, so that the model is able to handle any
nonstationarity in the seasonal phenomena. The DHR model is then in the
form of the regression eqn (4), with c₀(k) = t(k), as before, and appro-
priate definitions for the remaining c_i(k) coefficients, in terms of θ_{ji}(k). The
integer n, in this case, has to be set equal to 2F, so that the regression
*Such as the use of centralised moving averaging or smoothing spline functions for
trend estimation; and constant parameter harmonic regression (HR) or equivalent
techniques for modelling the seasonal cycle; see Ref. 39 which uses such procedures
with the HR based on the first two harmonic components associated with the
annual CO2 cycle.
[Fig. 2 appears here: frequency-response amplitude plots against cycles per
interval (0·0-0·5) for NVR values ranging from 0·00001 to 1: (a) the
IRWSMOOTH low-pass characteristics; (b) the DHRSMOOTH band-pass
characteristics.]
has only a single white noise input disturbance. It is clear that this scalar
NVR, which is the only parameter to be specified by the analyst, defines
the 'bandwidth' of the smoothing filter. The phase characteristics are not
shown, since the algorithm is of the 'two-pass' smoothing type and so
exhibits zero phase lag for all frequencies. We see from Fig. 2(a) that the
IRWSMOOTH algorithm is a very effective 'low-pass' filter, with par-
ticularly sharp 'cut-off' properties for low values of the NVR. The rela-
tionship between log₁₀(F₅₀), where F₅₀ is the 50% cut-off frequency,
and log₁₀(NVR) is approximately linear over the useful range of NVR
values, so that the NVR which provides a specified cut-off frequency can
be obtained from the following approximate relationship,35

    NVR = 1605[F₅₀]⁴        (17)

In this manner, the NVR which provides specified low-pass filtering
characteristics can be defined quite easily by the analyst. For an inter-
mediate value of NVR = 0·0001 (cut-off frequency = 0·016 cycles/sample),
for example, the estimated trend reflects the low frequency movements in
the data while attenuating higher frequency components; for NVR = 0
the bandwidth is also zero and the algorithm yields a linear trend with
constant slope; and for large values greater than 10, the estimated 'trend'
almost follows the data and the associated derivative estimate d_i(k) pro-
vides a good smoothed numerical differentiation of the data. The band-
pass nature of the DHR recursive smoothing algorithm (DHRSMOOTH)
is clear from Fig. 2(b) and a similarly simple relationship once again exists
between the bandwidth and the NVR value. These convenient bandwidth-
NVR relationships for IRWSMOOTH and DHRSMOOTH are useful in
the proposed procedure for spectral decomposition discussed below.
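Relationship (17) is trivially turned into a small helper; for example, the intermediate value NVR = 0·0001 quoted above corresponds to a cut-off of about 0·016 cycles/sample.

```python
# Approximate NVR for a desired 50% cut-off frequency F50 (cycles/sample),
# using relationship (17).
def nvr_for_cutoff(f50):
    return 1605.0 * f50 ** 4

print(nvr_for_cutoff(0.016))   # about 1e-4, matching the value quoted in the text
```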
Clearly, smoothing algorithms based on other simple random walk and
TVP models can be developed: for instance, the double integrated random
walk (DIRW, see Refs. 9 and 40) smoothing algorithm has even sharper
cut-off characteristics than the IRW, but its filtering characteristics exhibit
much higher levels of distortion at the ends of the data set. 40
7 PRACTICAL EXAMPLES
[Fig. 3 appears here: an amplitude periodogram, vertical scale 0·00-0·10,
horizontal scale 0·05-0·45 cycles/sample.]
[Fig. 4 panels (a), (b) and (c) appear here.]
Fig. 4. Initial analysis of the Mauna Loa CO2 series: (a) the IRWSMOOTH
estimated trend superimposed on the data; (b) the IRWSMOOTH estimated
CO2 slope or derivative of the trend; (c) the detrended CO2 series data.
[Fig. 5 panels (a)-(e) appear here. Panel (a) shows one-step-ahead forecasts
of the CO2 series beyond the forecasting origin (FO); the remaining panels
are described in the text below.]
Fig. 5. Continued.
ing pass: note the improvement in the residuals in this case compared with
the forward, filtering pass. Smoothed estimates of the Aj(k) amplitudes of
the seasonal components are also available from the analysis, if required.
The seasonally adjusted CO2 series, as obtained using the proposed
method of DHR analysis, is compared with the original CO2 measure-
ments in Fig. 5(c), while Figs 5(d) and (e) show, respectively, the residual
or 'anomaly' series (i.e. the series obtained by subtracting the trend and
total seasonal components from the original data), together with its associ-
ated amplitude periodogram. Here, the trend was modelled as an IRW
process with NVR = 0·0001 and the DHR parameters (associated, in this
case, with all the possible principal harmonics at 12, 6, 4, 3, 2-4 and
2 month periods) were also modelled as IRW processes with NVR =
0·001. These are the 'default' values used for the 'automatic' seasonal
analysis option in the micro CAPT AIN program; they are used here to
show that the analysis is not too sensitive to the selection of the models
and their associated NVR values.
The residual component in Fig. 5(d) can be considered as an 'anomaly'
series in the sense that it reveals the movements of the seasonally adjusted
series about the long term 'smooth' trend. Clearly, even on visual apprai-
sal, this estimated anomaly series has significant serial correlation, which
can be directly associated with the interesting peak at 40 months period on
its amplitude periodogram. This spectral characteristic obviously corres-
ponds to the equivalent peak in the spectrum of the original series, which
we noted earlier on Fig. 3. The seasonal adjustment process has nicely
revealed the potential importance of the relatively low, but apparently
significant power in this part of the spectrum. In this last connection, it
should be noted that Young et al.38 discuss further the relevance of the CO2
anomaly series and consider a potential dynamic, lagged relationship
between this series and a Pacific Ocean Sea Surface Temperature (SST)
anomaly series using multivariate (or vector) time-series analysis.
[A periodogram of the extinctions series appears here (horizontal axis:
cycles per interval ×10⁻¹, 0·0-5·0), with its dominant peak labelled
30·65 Myr.]
[Fig. 7 appears here (horizontal axis: sample number, 7-67).]
Fig. 7. IRWSMOOTH estimated trend superimposed on the log-transformed
Extinctions series.
[Two spectral plots of the detrended extinctions series appear here: a
periodogram analysis with peaks labelled 30·65 and 40·87 Myr, and an AR
spectral analysis with peaks labelled 34·91 and 20·10 Myr; they are
followed by the contour plot of Fig. 10 (horizontal axis: sample number,
0·5-6·5).]
Fig. 10. Contour spectrum based on the smoothed DAR estimates of the
AR(7) parameter values at each of the 63 sample points between 8 and 70.
[Fig. 11 appears here (horizontal axis: sample number, 7-67).]
Fig. 11. Additional smoothing of the detrended, log-transformed series using
the IRWSMOOTH algorithm with NVR = 0·1.
ing further critical evaluation of this important series, rather than provid-
ing an accurate estimate of the cyclic period. The small number of com-
plete cycles in the record, coupled with the uncertainty about the geologic
dating of the events, unavoidably restricts our ability to reach unam-
biguous conclusions about the statistical properties of the series. The
authors hope, however, that the example helps to illustrate the extra
dimension to time-series analysis provided by recursive time variable
parameter estimation.
9 CONCLUSIONS
ACKNOWLEDGEMENTS
Some parts of this chapter are based on the analysis of the Mauna Loa
CO2 data carried out by one of the authors (P.Y.) while visiting the
Institute of Empirical Macroeconomics at the Federal Reserve Bank of
Minneapolis, as reported in Ref. 38; the author is grateful to the Institute
for its support during his visit and to the Journal of Forecasting for
permission to use this material.
REFERENCES
1. Raup, D.M. & Sepkoski, J.J., Periodicity of extinctions in the geologic past.
Proc. Nat. Acad. Sci. USA, 81 (1984) 801-05.
2. Box, G.E.P. & Jenkins, G.M., Time Series Analysis, Forecasting and Control.
Holden-Day, San Francisco, 1970.
3. Young, P.C. & Benner, S., microCAPTAIN Handbook: Version 2.0, Lancaster
University, 1988.
4. Priestley, M.B., Spectral Analysis and Time Series. Academic Press, London,
1981.
5. Young, P.C., Recursive approaches to time-series analysis. Bull. of Inst. of
Math. and its Applications, 10 (1974) 209-24.
6. Akaike, H., A new look at statistical model identification. IEEE Trans. on Aut.
Control, AC-19 (1974) 716-22.
7. Young, P.C., The Differential Equation Error Method of Process Parameter
Estimation. PhD Thesis, University of Cambridge, UK, 1969.
8. Young, P.C., The use of a priori parameter variation information to enhance
the performance of a recursive least squares estimator. Tech. Note 404-90,
Naval Weapons Center, China Lake, CA, 1969.
9. Young, P.C., Recursive Estimation and Time-Series Analysis. Springer-Verlag,
Berlin, 1984.
10. Kopp, R.E. & Orford, R.J., Linear regression applied to system identification
for adaptive control systems. AIAA Journal, 1 (1963) 2300-06.
11. Lee, R.C.K., Optimal Identification, Estimation and Control. MIT Press, Cam-
bridge, MA, 1964.
12. Kalman, R.E., A new approach to linear filtering and prediction problems.
ASME Trans., J. Basic Eng., 83-D (1960) 95-108.
Control and Dynamic Systems, vol. 30, ed. C.T. Leondes. Academic Press, San
Diego, 1989, pp. 119-66.
32. Young, P.C., Ng, C.N. & Armitage, P., A systems approach to economic
forecasting and seasonal adjustment. International Journal on Computers and
Mathematics with Applications, 18 (1989) 481-501.
33. Ng, C.N. & Young, P.C., Recursive estimation and forecasting of nonstation-
ary time-series. J. of Forecasting, 9 (1990) 173-204.
34. Norton, J.P., Optimal smoothing in the identification of linear time-varying
systems. Proc. IEE, 122 (1975) 663-8.
35. Young, T.J., Recursive Methods in the Analysis of Long Time Series in Meteo-
rology and Climatology. PhD Thesis, Centre for Research on Environmental
Systems, University of Lancaster, UK, 1987.
36. Schweppe, F., Evaluation of likelihood function for Gaussian signals. IEEE
Trans. on Inf. Theory, 11 (1965) 61-70.
37. Harvey, A.C. & Peters, S., Estimation procedures for structural time-series
models. London School of Economics, Discussion Paper No. A28 (1984).
38. Young, P.C., Ng, C.N., Lane, K. & Parker, D., Recursive forecasting,
smoothing and seasonal adjustment of nonstationary environmental data. J.
of Forecasting, 10 (1991) 57-89.
39. Bacastow, R.B. & Keeling, C.D., Atmospheric CO2 and the southern oscil-
lation effects associated with recent El Niño events. Proceedings of the WMO/
ICSU/UNEP Scientific Conference on the Analysis and Interpretation of
Atmospheric CO2 Data, Bern, Switzerland, WCP-14, 14-18 Sept. 1981, World
Meteorological Organisation, pp. 109-12.
40. Ng, C.N., Recursive Identification, Estimation and Forecasting of Non-
Stationary Time-Series. PhD Thesis, Centre for Research on Environmental
Systems, University of Lancaster, UK.
41. Box, G.E.P. & Tiao, G.e., Intervention analysis with application to economic
and environmental problems. J. American Stat. Ass., 70 (1975) 70-9.
42. Tsay, R.S., Outliers, level shifts and variance changes in time series. J. of
Forecasting, 7 (1988) 1-20.
43. Young, P.C. & Ng, C.N., Variance intervention. J. of Forecasting, 8 (1989)
399-416.
44. Shiskin, J., Young, A.H. & Musgrave, J.C., The X-11 variant of the Census
Method II seasonal adjustment program. US Dept of Commerce, Bureau of
Economic Analysis, Tech. Paper No. 15.
45. Manley, G., Central England temperatures: monthly means: 1659-1973.
Quart. J. Royal Met. Soc., 100 (1974) 387-405.
46. WCRP, Proceedings of the WMO/ICSU/UNEP Scientific Conference on the
Analysis and Interpretation of Atmospheric CO2 Data, Bern, Switzerland,
WCP-14, 14-18 Sept., World Meteorological Organisation, 1981.
47. Schnelle et al. (1981). In Proceedings of the WMO/ICSU/UNEP Scientific
Conference on the Analysis and Interpretation of Atmospheric CO2 Data, Bern,
Switzerland, WCP-14, 14-18 Sept., World Meteorological Organisation, 1981,
pp. 155-62.
48. Bacastow, R.B., Keeling, C.D. & Whorf, T.P., Seasonal amplitude in atmos-
pheric CO2 concentration at Mauna Loa, Hawaii, 1959-1980. Proceedings of
the WMO/ICSU/UNEP Scientific Conference on the Analysis and Interpre-
1 BASIC IDEAS
TABLE 1
Annual maximum sea levels (cm) in Venice, 1931-1981
maximum likelihood. Routines for least squares have been widely avail-
able for many years and the method provides generally satisfactory
answers over a range of settings. Some loss of sensitivity can result from
their use, however, and with the computational facilities now available,
maximum likelihood estimates-sometimes previously avoided as being
harder to calculate-are used for their good theoretical properties. The
methods coincide in an important class of situations (Section 3.1).
In many situations interest is focused primarily on one variable, whose
variation is to be explained. In problems of correlation, however, the
relationships between different variables to be treated on an equal footing
are of interest. Perhaps as a result, examples of successful correlation
analyses are rarer than successful examples of regression, where the focus
on a single variable gives a potentially more incisive analysis. Another
reason may be the relative scarcity of flexible and analytically tractable
distributions with which to model the myriad forms of multivariate
dependence which can arise in practice. In some cases, however, variables
must be treated on an equal footing and an attempt made to unravel their
joint structure.
1.2 Examples
Some examples are given below to elaborate on these general comments.
[Fig. 1 appears here: a scatterplot of annual maximum sea level (cm),
roughly 80-200, against year, for the Venice data of Table 1.]
TABLE 2
Ozone data for two locations in Texas, 1981-1984
A Number of days per month with exceedances over 0·08 ppm ozone.
B Number of days of data per month.
C Daily maximum temperature (°C) for the month.
plot of the proportion of days on which 0·08 ppm was exceeded against
temperature. There is clearly a relation between the two, but the discrete-
ness of the data and the large number of zero proportions mean that a
simple relation of the same form as eqn (1) is not appropriate.
[Fig. 2 appears here: proportions of days per month over 0·08 ppm
(vertical axis, 0·0-0·5) plotted against temperature.]
Fig. 2. Proportion of days per month on which ozone levels exceed 0·08 ppm
for the years 1981-1984 at Beaumont and North Port Arthur, Texas, plotted
against maximum daily average temperature.
Data were combined from gauges on the San Saba and Colorado Rivers at and near San Saba, data from Llano and near Castell were combined for the Llano River, and data from near Johnson City and Spicewood were combined for the Pedernales River (see Fig. 3).
If prevention of flooding downstream at Austin is of interest, it will be
important to explain joint variation in the peak discharges, which will
have a combined effect at Austin if they arise from the same rainfall
episode.
There are many possible joint distributions which might be fitted to the data. About the simplest is the multivariate normal distribution, whose probability density function is

    f(y; μ, Ω) = (2π)^{-p/2} |Ω|^{-1/2} exp{ -(1/2)(y - μ)^T Ω^{-1}(y - μ) }     (2)

and so on. In this example the vector observation for each year consists of the peak discharges for the p = 4 rivers. There are n = 59 independent copies of this if we assume that discharges are independent in different years.
TABLE 3
Annual peak discharges (cusecs × 10³) for four Texas rivers, 1924-1982

Llano
59·50 21·40 14·40 33·90 27·30 49·00 122·00 92·50 22·80 19·30
388·00 130·00 3·72 110·00 55·50 28·20 26·70 23·40 50·60 10·10
8·50 18·20 8·60 108·00 14·60 7·77 13·90 23·20 16·50 3·46
72·00 1·85 47·20 83·70 35·60 103·00 57·60 1·70 1·81 67·20
2·48 15·40 27·40 44·40 4·52 154·00 28·00 24·50 11·40 154·00
19·40 61·50 67·50 139·00 25·80 210·00 32·90 116·00 5·98

Pedernales
1·36 28·30 16·40 6·94 155·00 36·60 13·90 18·50 2·02 11·40
105·00 85·30 10·00 14·80 2·39 42·90 21·10 26·60 27·00 104·00
25·50 9·68 10·20 8·38 8·17 29·10 11·80 441·00 32·20 5·34
13·60 0·16 125·00 50·20 47·00 142·00 15·60 8·55 5·27 10·60
32·30 7·55 58·30 27·90 12·70 28·90
9·07 35·70 21·40 44·40
90·10 16·80 98·00 127·00 64·20 2·79 49·60 32·30 62·60

San Saba
6·50 4·66 8·64 8·64 8·46 7·46 8·64 44·80 34·00 7·35
27·20 64·00 67·20 5·11 203·00 2·19 5·57 27·20 25·20 20·40
4·67 4·78 14·70 2·49 4·66 6·29 2·72 12·50 70·40 1·50
7·15 41·30 35·60 27·50 32·00 3·16 10·30 11·50 10·20 0·68
20·20 20·40 1·92 4·67 17·40 4·25 36·70 25·60 5·99 1·05
40·50 3·20 10·30 10·50 27·00 1·81 40·70 3·56 3·59

Colorado
17·40 30·30 29·60 189·00 27·20 35·00 31·60 78·90 39·80 26·60
45·30 86·00 179·00 115·00 224·00 20·40 23·40 42·60 25·00 23·20
19·20 32·30 16·30 19·40 34·10 32·80 8·01 22·40 69·00 20·70
24·90 57·20 54·10 66·20 44·40 20·90 43·00 23·40 15·60 12·60
29·90 42·40 16·00 13·20 34·80 15·00 44·50 30·90 11·80 7·08
46·20 13·30 10·70 18·10 28·10 9·11 36·00 4·36 21·40
Fig. 3. Map of the Texas rivers, showing the Llano River, Castell, and Austin.
2 CORRELATION
    ( ω₁₁  ω₁₂ )
    ( ω₂₁  ω₂₂ )
TABLE 4
Correlation matrices for the Texas rivers data: original data and data after log transformation

Original data
Colorado     1
San Saba     0·70    1
Pedernales   0·00   -0·14    1
Llano        0·29   -0·14   -0·08    1
Fig. 5. Scatterplot matrix for Texas rivers data after log transformation.
3 LINEAR REGRESSION
3.1 Basics
Consider again the data on sea levels in Venice pictured in Fig. 1. If we suppose that model (1) is appropriate, we need to estimate the parameters β₀ and β₁. We suppose that the errors ε_t are uncorrelated with mean zero and variance σ², also to be estimated. We write eqn (1) in matrix form as
    (Y₁₉₃₁)     (1   0 )            (ε₁₉₃₁)
    (Y₁₉₃₂)     (1   1 )   (β₀)     (ε₁₉₃₂)
    (Y₁₉₃₃)  =  (1   2 )   (β₁)  +  (ε₁₉₃₃)     (5)
    (  ⋮  )     (⋮   ⋮ )            (  ⋮  )
    (Y₁₉₈₁)     (1  50 )            (ε₁₉₈₁)

or

    Y = Xβ + ε     (6)

in an obvious notation.
More generally we take a model written in form (6) in which Y is an n × 1 vector of responses, X is an n × p matrix of explanatory variables or covariates, β is a p × 1 vector of parameters to be estimated, with p < n, and ε is an n × 1 vector of unobservable random disturbances. The sum of squares which corresponds to any value β is

    (Y − Xβ)^T (Y − Xβ) = Σ_{j=1}^{n} (Y_j − x_j^T β)²     (7)
The least squares estimate of β is

    β̂ = (X^T X)^{-1} X^T Y     (10)

and

    s² = (n − p)^{-1} Σ_{j=1}^{n} (Y_j − x_j^T β̂)²     (11)

is unbiased as an estimate of σ². The divisor n − p can be thought of as compensating for the estimation of p parameters from n observations and is called the degrees of freedom of the model.
The estimators β̂ and s² have many desirable statistical properties.¹¹ The estimators can be derived under the second-order assumptions made above, which concern only the means and covariance structure of the errors. These assumptions are that

    E(ε_j) = 0,   var(ε_j) = σ²,   cov(ε_j, ε_k) = 0   (k ≠ j)

A stronger set of assumptions for the linear model are the normal-theory assumptions, under which the errors ε_j are independent normal variables with mean zero and variance σ². The joint density of the observations is then

    f(y; β, σ²) = (2πσ²)^{-n/2} exp{ −(1/(2σ²)) Σ_{j=1}^{n} (y_j − x_j^T β)² }     (13)

For a given set of data, i.e. having observed values of Y_j and x_j, this can be regarded as a function, the likelihood, of the unknown parameters β and σ². We now seek the value of β which makes the data most plausible in the sense of maximizing eqn (13). It is numerically equivalent but algebraically more convenient to maximize the log likelihood

    L(β, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{j=1}^{n} (Y_j − x_j^T β)²

which is clearly equivalent to minimizing the sum of squares (eqn (7)), and so the maximum likelihood estimate of β equals the least squares estimate β̂. The maximum likelihood estimate of σ² is

    σ̂² = n^{-1} Σ_{j=1}^{n} (Y_j − x_j^T β̂)²
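These estimates are easily computed. The following minimal sketch (Python with numpy, using simulated data in place of the Venice series; all names are illustrative, not the author's code) obtains the least squares estimate, the unbiased estimate s² of eqn (11), and the maximum likelihood estimate σ̂².

```python
# Minimal sketch: least squares fit of the linear trend model
# Y_t = beta0 + beta1*(t - 1931) + e_t, with s^2 and the ML variance estimate.
import numpy as np

years = np.arange(1931, 1982)                      # 51 annual maxima, as in Table 1
rng = np.random.default_rng(1)
y = 100 + 0.6 * (years - 1931) + rng.normal(0, 15, years.size)   # placeholder data

X = np.column_stack([np.ones(years.size), years - 1931])   # design matrix of eqn (5)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)            # least squares = ML estimate of beta
resid = y - X @ beta_hat
n, p = X.shape
s2 = resid @ resid / (n - p)        # unbiased estimate, eqn (11)
sigma2_ml = resid @ resid / n       # maximum likelihood estimate
print(beta_hat, s2, sigma2_ml)
```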
Fig. 6. Linear and cubic polynomials fitted to Venice sea level data.
The prediction variance has an estimated value of 389·78, so the standard error of Y₁₉₉₀ is 19·74. Thus most of the prediction uncertainty for the future value is due to the intrinsic variability of the maximum sea levels. An approximate 95% predictive confidence interval is given by 138·85 ± 1·96 × 19·74 = (100·2, 177·5). This is very wide.
Apart from uncertainty due to the variability of estimated model par-
ameters, and intrinsic variability which would remain in the model even if
the parameters were known, there is phenomenological uncertainty which
makes it dangerous to extrapolate the model outside the range of the data
observed. The annual change in the annual maxima may change over the
years 1981-1990 for reasons that cannot be known from the data. This
makes a prediction based on the data alone possibly meaningless and
certainly risky.
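A sketch of the prediction calculation just described, again using simulated data rather than the Venice series, is given below; the variance of the prediction combines parameter uncertainty and intrinsic variability, and the 1·96 multiplier gives the approximate 95% interval.

```python
# Sketch: approximate 95% prediction interval for a future year from a fitted linear trend.
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1931, 1982)
y = 100 + 0.6 * (years - 1931) + rng.normal(0, 15, years.size)   # placeholder data
X = np.column_stack([np.ones(years.size), years - 1931])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
n, p = X.shape
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)

x_new = np.array([1.0, 1990 - 1931])                              # covariate vector for 1990
var_pred = s2 * (1 + x_new @ np.linalg.inv(X.T @ X) @ x_new)      # intrinsic + parameter uncertainty
y_pred = x_new @ beta_hat
half = 1.96 * np.sqrt(var_pred)
print(y_pred, np.sqrt(var_pred), (y_pred - half, y_pred + half))
```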
Fig. 7. The geometry of least squares. The x-y plane is spanned by the columns of the covariate matrix X, and the least squares estimate minimizes the distance between the fitted value, Ŷ, which lies in the x-y plane, and the data, Y. The x-axis is spanned by a column of ones, and the overall average, Ȳ, minimizes the distance between that axis and the fitted value Ŷ. The extra variation accounted for by the model beyond that in the x-direction is orthogonal to the x-axis.
TABLE 5
Analysis of variance table for linear regression model

Source       df       Sum of squares                Mean square
Regression   p − 1    SSReg = Σ_j (Ŷ_j − Ȳ)²        SSReg/(p − 1)
Residual     n − p    SSRes = Σ_j (Y_j − Ŷ_j)²      SSRes/(n − p)
Mean         1        nȲ²
Total        n        Σ_j Y_j²

The geometry of least squares shows that

    Y^T Y = nȲ² + (Ŷ − Ȳ1)^T (Ŷ − Ȳ1) + (Y − Ŷ)^T (Y − Ŷ)

or equivalently

    Σ_j Y_j² = nȲ² + Σ_j (Ŷ_j − Ȳ)² + Σ_j (Y_j − Ŷ_j)²     (14)

The degrees of freedom (df) of a sum of squares is the number of parameters to which it corresponds, or equivalently the dimension of the subspace spanned by the corresponding columns of X. The degrees of freedom of the terms on the right-hand side of eqn (14) are respectively 1, p − 1, and n − p.
These sums of squares can be laid out in an analysis of variance table as shown in Table 5. The sum of squares for the regression can be further broken down into reductions due to individual parameters or sets of them. An analysis of variance (ANOVA) table displays concisely the relative contributions to the overall variability accounted for by the inclusion of the model parameters and their corresponding covariates. The last two rows of an analysis of variance table are usually omitted, on the grounds that the reduction in sum of squares due to an overall mean is rarely of interest.
A good model is one in which the fitted value Ŷ is close to the observed Y, so that a high proportion of the overall variability is accounted for by the model and the residual sum of squares is small relative to the regression sum of squares. The ratio of the regression sum of squares to the adjusted total sum of squares can be used to express what proportion of the overall variability is accounted for by the fitted regression model. This is numerically equal to the square of the correlation coefficient between the fitted values and the data.
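The decomposition in eqn (14) and the proportion of variability accounted for are easily checked numerically; the short sketch below does so for simulated data (illustrative only).

```python
# Sketch: sum-of-squares decomposition of eqn (14) and R-squared as the squared
# correlation between fitted values and data.
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(51.0)
y = 100 + 0.6 * t + rng.normal(0, 15, t.size)          # placeholder data
X = np.column_stack([np.ones_like(t), t])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

ss_mean = t.size * y.mean() ** 2
ss_reg = np.sum((fitted - y.mean()) ** 2)
ss_res = np.sum((y - fitted) ** 2)
print(np.allclose(np.sum(y ** 2), ss_mean + ss_reg + ss_res))   # eqn (14)
r_squared = ss_reg / (ss_reg + ss_res)                          # regression SS / adjusted total SS
print(r_squared, np.corrcoef(fitted, y)[0, 1] ** 2)             # equal to corr(fitted, y)^2
```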
TABLE 6
Analysis of variance for fit of cubic regression model to Venice sea level data

The standardized residuals are defined by

    R_j = (Y_j − Ŷ_j) / {s(1 − h_jj)^{1/2}}

where h_jj is the jth diagonal element of the hat matrix H = X(X^T X)^{-1} X^T.
If the model is correct, the Rj should have zero means and approximately
unit variance, and should display no forms of non-randomness, the most
usual of which are likely to be
(i) the presence of outliers, sometimes due to a misrecorded or
mistyped data value which may show up as lying out of the pattern
of the rest, and sometimes indicating a region of the space of
covariates in which there are departures from the model. Single
outliers are likely to be detected by any of the plots described below,
whereas multiple outliers may lead to masking difficulties in which
each outlier is concealed by the presence of others;
Figure 8 shows some possible patterns and their causes. Parts (a)-(d) should show random scatter, although allowance should be made for apparently non-random scatter caused by variable density of points along the abscissa. Plots (e) and (f) are designed to check for outliers and non-normal errors. The idea is that if the R_j are roughly a random sample from the normal distribution, a plot of the ordered R_j against approximate normal order statistics Φ^{-1}{(j − 3/8)/(n + 1/2)} should be a straight line of unit gradient through the origin. Outliers manifest themselves as extreme points lying off the line, and skewness of the errors shows up through a nonlinear plot.
The value of x_j may give a case unduly high leverage or influence. The distinction is a somewhat subtle one. An influential observation is one whose deletion changes the model greatly, whereas deletion of an observation with high leverage changes the accuracy with which the model is determined. Figure 9 makes the distinction clearer. The covariates for a point with high leverage lie outwith the covariates for the other observations. The measure of leverage in a linear model is the jth diagonal element of the hat matrix, h_jj, which has average value p/n, so an observation with leverage much in excess of this is worth examining.
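A short numerical sketch of leverage, with one deliberately isolated covariate value, is given below; the design matrix is illustrative only.

```python
# Sketch: leverages h_jj as the diagonal of the hat matrix H = X (X'X)^{-1} X',
# compared with their average value p/n.
import numpy as np

rng = np.random.default_rng(3)
z = np.concatenate([rng.uniform(0, 10, 29), [25.0]])   # one covariate value far from the rest
X = np.column_stack([np.ones(z.size), z])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
n, p = X.shape
print(h.max(), p / n)            # the isolated point has leverage well above p/n
```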
One measure of influence is the overall change in fitted values when an observation is deleted from the data. Let Ŷ_(j) denote the vector of fitted values when Y_j is deleted from the data. Then one simple measure of the change from Ŷ to Ŷ_(j) is Cook's distance,¹³ defined as

    C_j = h_jj R_j² / {p(1 − h_jj)}
Fig. 8. Some possible patterns in residual plots: in (a)-(d) standardized residuals R are plotted against fitted values Ŷ or a covariate x; in (e) and (f) ordered residuals are plotted against normal order statistics.
Fig. 9. The relation between the leverage and the influence of a point. The light line shows the fitted regression with the point x included, and the heavy line shows the fitted regression with it excluded. In (a) the point has little leverage but some influence on the intercept, though not on the estimate of slope; in (b) the point has high leverage but little influence; and in (c) both the leverage and the influence of the point are high.
If one or two values of C_j are large relative to the rest, it is worth refitting the model without such observations in order to see if there are major changes in the interpretation or strength of the regression.
Even though it is an outlier, an observation with high leverage may have a small standardized residual because the regression line passes close to it. This problem can be overcome by use of jackknifed residuals

    R′_j = (Y_j − x_j^T β̂_(j)) / {s_(j)(1 − h_jj)^{1/2}} = R_j {(n − p − 1)/(n − p − R_j²)}^{1/2}

which measure the discrepancy between the observation Y_j and the model obtained when Y_j is not included in the fitting. The R′_j are more useful than the ordinary residuals R_j for detecting outliers.
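The following sketch computes standardized residuals, jackknifed residuals and Cook's distances from the hat matrix for simulated data with one planted outlier; the formulas follow the definitions given above, and the data are illustrative only.

```python
# Sketch: standardized residuals R_j, jackknifed residuals R'_j and Cook's
# distances C_j for a simple linear regression with one planted outlier.
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(51.0)
y = 100 + 0.6 * t + rng.normal(0, 15, t.size)
y[35] += 60.0                                  # plant an outlier (compare the 1966 value)
X = np.column_stack([np.ones_like(t), t])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
fitted = H @ y
resid = y - fitted
s2 = np.sum(resid ** 2) / (n - p)

R = resid / np.sqrt(s2 * (1 - h))                         # standardized residuals
R_jack = R * np.sqrt((n - p - 1) / (n - p - R ** 2))      # jackknifed residuals
cook = h * R ** 2 / (p * (1 - h))                         # Cook's distances
print(np.argmax(np.abs(R_jack)), np.argmax(cook))         # both flag the planted case
```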
Fig. 10. Plot of jackknifed residuals, R′_j, against fitted values, Ŷ_j, for Venice data.
These ideas can be illustrated with the model of linear trend fitted to the sea level data. Figure 10 shows the jackknifed residuals R′_j plotted against the fitted values Ŷ_j. There is one outstandingly large residual, for the year 1966, but the rest seem reasonable compared to a standard normal distribution. Figure 11 shows the values of Cook's distance C_j plotted against case number. The largest value is for observation 36, which corresponds to 1966, but cases 6 (1936) and 49 (1979) also seem to have large influence on the fitted model.
Fig. 11. Plot of Cook distances, C_j, against case numbers, j, for Venice data.
Fig. 12. Plot of leverages, h_jj, against case numbers, j, for Venice data.
For the linear trend model with a single covariate z,

    (X^T X)^{-1} = {n Σ_j (z_j − z̄)²}^{-1} (  Σ_j z_j²   −Σ_j z_j )
                                            ( −Σ_j z_j       n     )

where z̄ = n^{-1} Σ_j z_j, and so h_kk, which is the kth diagonal element of the matrix X(X^T X)^{-1} X^T and so equals x_k^T (X^T X)^{-1} x_k, can be written in the form

    h_kk = x_k^T (X^T X)^{-1} x_k = 1/n + (z_k − z̄)² / Σ_j (z_j − z̄)²

Fig. 13. Plot of ordered residuals against normal order statistics for Venice data.
3.5 Transformations
One requirement of a successful model is consistency with any known
asymptotic behaviour of the phenomenon under investigation. This is
especially important if the model is to be used for prediction or forecast-
ing. Many quantities are necessarily non-negative, for example, so a linear
model for them can lead to logically impossible negative predictions. One
remedy for this is to study the behaviour of the data after a suitable
transformation. Even where considerations such as these do not apply, it
may be sensible to investigate the possibility of transformation to satisfy
model assumptions more closely.
The interpretation and possible usefulness of a linear model may also be complicated by the scale on which the response is measured. A family of power transformations indexed by a parameter λ is¹⁴

    y^{(λ)} = (y^λ − 1)/λ   (λ ≠ 0),        y^{(λ)} = log y   (λ = 0)

If the transformed responses are supposed to satisfy the normal-theory linear model, the density of an observation is

    f(y_j; β, σ², λ) = (2πσ²)^{-1/2} exp{ −(1/(2σ²)) (y_j^{(λ)} − x_j^T β)² } y_j^{λ−1}     (15)

where the Jacobian y_j^{λ−1} is needed for eqn (15) to be a density for y_j rather than y_j^{(λ)}. The optimum value of λ is chosen by maximizing the log likelihood

    L(β, σ², λ) = Σ_{j=1}^{n} log f(y_j; β, σ², λ)

Let ỹ = (Π_{j=1}^{n} y_j)^{1/n} denote the geometric mean of the data, and define z_j^{(λ)} to be equal to y_j^{(λ)}/ỹ^{λ−1}. Then the profile log likelihood for λ is, apart from a constant, −(n/2) times the log of the residual sum of squares obtained from least squares regression of the z_j^{(λ)} on the covariates.
Fig. 14. Profile log likelihood for power transformation for Venice data.
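A minimal sketch of this profile log likelihood calculation, assuming the standard Box-Cox construction and simulated positive responses (not the Venice data), is as follows.

```python
# Sketch: Box-Cox profile log likelihood.  For each lambda the response is
# transformed, rescaled by the geometric mean, regressed on the covariates, and
# -n/2 log(SSRes/n) recorded; the maximizing lambda is reported.
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(51.0)
y = np.exp(4.5 + 0.01 * t + rng.normal(0, 0.15, t.size))   # positive placeholder responses
X = np.column_stack([np.ones_like(t), t])
gm = np.exp(np.mean(np.log(y)))                             # geometric mean

def profile_loglik(lam):
    if abs(lam) < 1e-8:
        z = np.log(y) * gm                                  # limit lambda -> 0
    else:
        z = (y ** lam - 1) / (lam * gm ** (lam - 1))
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    ss = np.sum((z - X @ beta) ** 2)
    return -0.5 * y.size * np.log(ss / y.size)

lams = np.linspace(-2, 2, 81)
ll = [profile_loglik(l) for l in lams]
print(lams[int(np.argmax(ll))])                             # lambda maximizing the profile
```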
Counted responses, for example, can often be modelled by the Poisson distribution with probability function

    f(y; μ) = μ^y e^{−μ} / y!,    y = 0, 1, ...     (16)

the mean and variance of which are both μ. In the case of non-constant variance we might aim to find a transformation h(Y) whose variance is constant. Taylor series expansion gives h(Y) ≈ h(μ) + (Y − μ)h′(μ), so E{h(Y)} ≈ h(μ) and var{h(Y)} ≈ h′(μ)² var(Y) = μ h′(μ)². If this is to be constant, we must have h(Y) ∝ Y^{1/2}. A more refined calculation shows that h(Y) = (Y + 1/4)^{1/2} has approximate variance 1/4. The procedure
would now be to analyse the transformed counts as variables with known
variance. Difficulties of interpretation like those experienced for the
Venice data arise, however, because a linear model for the square root of
the count suggests that the count itself depends quadratically on the
explanatory variables through the linear part of the model. This would
seem highly artificial in most circumstances, and a more satisfactory
approach is given in Section 5.
In weighted least squares with weights w_j the estimate of σ² is

    s² = (n − p)^{-1} Σ_{j=1}^{n} w_j (Y_j − x_j^T β̂)²

but in applications the parameters of direct interest are usually the mean and variance
where Γ(·) is the gamma function. Since the mean and variance κ₁ and κ₂ are generally more capable of physical interpretation than γ₁ and γ₂, it makes sense to model their variation directly. A form whereby κ₁ and κ₂ depend on the covariates may be suggested by exploration of the data, but more satisfactory covariates are likely to be derived if a heuristic argument for their form based on physical considerations can be found. Once suitable combinations of covariates are found, the estimated means and variances κ̂_j1 and κ̂_j2 can be regressed on them with weights m_j. A more refined approach would use the covariance matrices of the estimates and multivariate regression¹⁷ for the pairs (κ̂_j1, κ̂_j2).
This example illustrates two general points. The first is that in combining estimates from different samples with differing sample sizes it is necessary to take the sample sizes into account. The second is that when empirical distributions are replaced by fitted probability models it is wise to present results in a parametrization of the model which is readily interpreted on subject-matter grounds, and this may not coincide with a mathematically convenient parametrization.
4 ANALYSIS OF VARIANCE

Suppose that Y_jr denotes the jth of m measurements made at the rth of p sites. The simplest model for such a one-way layout is

    Y_jr = β_r + ε_jr     (17)

where the ε_jr are supposed to be independent errors with zero means and variances σ², and the parameter β_r represents the mean at the rth site. In matrix form this is

    (Y₁₁, …, Y_m1, Y₁₂, …, Y_m2, …, Y₁p, …, Y_mp)^T = Xβ + ε     (18)

where β = (β₁, …, β_p)^T, ε = (ε₁₁, …, ε_mp)^T, and X is the mp × p matrix of zeros and ones whose rth column indicates the observations made at the rth site.
TABLE 7
Analysis of variance table for a one-way layout

Source                              df          Sum of squares                    Mean square
Between sites (adjusted for mean)   p − 1       SSReg = Σ_jr (Ȳ_·r − Ȳ_··)²       SSReg/(p − 1)
Within sites                        p(m − 1)    SSRes = Σ_jr (Y_jr − Ȳ_·r)²       SSRes/{p(m − 1)}

Here SSReg represents variability between sites adjusted for the overall mean, and SSRes represents variability at sites adjusted for their different means. This corresponds to the decomposition of the total sum of squares shown in Table 7, and the ratio

    {SSReg/(p − 1)} / [SSRes/{p(m − 1)}]

can be compared with the F distribution to assess whether the site means differ. An alternative parametrization of the model is

    Y_jr = α + γ_r + ε_jr     (19)

in which the overall mean is represented by α and the parameters γ₁, …, γ_p represent the differences between the site means and the overall mean. This formulation of the problem contains p + 1 parameters, whereas eqn (17) contains p parameters. Plainly p + 1 parameters cannot be estimated from data at p sites, and at first sight it seems that eqns (17) and (19) are incompatible. This difficulty is resolved by noting that only certain linear combinations of parameters can be estimated from the data; such combinations are called estimable. In eqn (17) the parameters β₁, …, β_p are estimable, but in eqn (19) only the quantities α + γ₁, α + γ₂, …, α + γ_p are estimable, so that estimates of the parameters α, γ₁, …, γ_p cannot all be found without some constraint to ensure that the estimates are unique.
Possible constraints that can be imposed are: (a) α = 0; (b) γ₁ = 0; or (c) Σ_r γ_r = 0. In the parametrization (19) the covariate matrix is the mp × (p + 1) matrix

    X = (1, x₁, …, x_p)     (20)

whose first column consists of ones and whose remaining columns are the site indicator columns of eqn (18),
which has rank p because its first column equals the sum of the other columns. Thus the (p + 1) × (p + 1) matrix X^T X is not invertible and β̂ = (X^T X)^{-1} X^T Y cannot be found. This difficulty can be overcome by the use of generalized inverse matrices, but it is more satisfactory in practice to force X to have rank p by dropping one of its columns. Constraints (a) and (b) above correspond respectively to dropping the first and second columns of eqn (20).
It is important to appreciate that the parametrization in which a model is expressed has no effect on the fitted values obtained from a model, which are the same whatever parametrization is used. Thus the fitted value for an observation at the rth site in the example above is the average at that site, Ȳ_·r, for parametrization (17) or (19) and any of the constraints (a), (b) or (c). The parametrization used affects only the interpretation of the resulting estimates and not any aspect of model fit.
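This equivalence is easily checked directly; the sketch below fits a one-way layout under parametrization (17) and under parametrization (19) with constraint (b), and confirms that the fitted values are the site averages in both cases (simulated data, three sites; illustrative only).

```python
# Sketch: two parametrizations of the one-way layout give identical fitted values.
import numpy as np

rng = np.random.default_rng(6)
p, m = 3, 5
site = np.repeat(np.arange(p), m)
y = np.array([10.0, 14.0, 18.0])[site] + rng.normal(0, 1, p * m)

X17 = (site[:, None] == np.arange(p)).astype(float)          # site indicator columns, eqn (17)
X19 = np.column_stack([np.ones(p * m), X17[:, 1:]])          # overall mean + (p-1) indicators, eqn (19), constraint (b)

fit17 = X17 @ np.linalg.lstsq(X17, y, rcond=None)[0]
fit19 = X19 @ np.linalg.lstsq(X19, y, rcond=None)[0]
print(np.allclose(fit17, fit19))                             # identical fitted values
print(fit17[::m], [y[site == r].mean() for r in range(p)])   # both equal the site averages
```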
TABLE 8
Hourly average toluene levels (ppb) in London street
Start of hour
8 9 10 11 12 13 14 15
Week 1
Sunday 4·56 4·05 4·22 9·82 4·84 6·59 4·95 5·41
Monday 19·55 9·48 7·24
Tuesday 11·86 15·13 12·13 7·67 7·08 9·55 7·21 8·31
Wednesday 21·68 22·13 18·40 12·16 13·35 11·42 13·80 10·13
Thursday 20·01 28·15 19·59 24·39 25·18 20·44 23·06 26·96
Friday 45·42 57·78 42·45 74·78 75·16 61·90 46·25 52·38
Saturday 4·23 8·65 15·64 14·29 17·02 23·22 18·35 14·14
Week 2
Sunday 4·46 5·59 6·48 5·86 6·27 3·68 3·93 6·29
Monday 5·41 5·41 6·37 7·51 6·40 6·01 6·01 6·22
Tuesday 10·80 11·27 10·34 11·62 11·41 11·32 12·43 15·60
Wednesday 5·90 29·20 21·81 32·61 18·19 18·95 19·05 19·41
Thursday 12·62 14·33 10·24 8·66 14·69 10·50 14·65 14·83
Friday 8·46 11·61 11·90 11·94 10·12 11·41 9·28 11·51
Saturday 15·88 14·37 12·81 12·17 12·97 13·09 12·72
One approach to detailed analysis of such data would be via time series
analysis, but this is problematic here since the data consist of a number
of short series with large gaps intervening. The aim instead is to sum-
marize the sources of variability by investigating the relative import-
ance of hourly variation and daily variation. Hourly variation might be
due to variable traffic intensity, as well as diurnal changes in temperature
and other meteorological variables, whereas day-to-day variation may be
due to different traffic intensity on different days of the week as well as the
weather.
Some possible models for the data in Table 8 are that Y_wdt, the toluene level observed at time t of day d in week w, has an expected value given by

    α
    α + β_d
    α + β_d + γ_t     (21)
    α + β_wd + γ_t
TABLE 9
Analysis of variance for toluene data
Fig. 15. Daily average toluene level (ppb). Sundays (days 1 and 8) have the lowest values.
Fig. 16. Residual plot for fit of linear model with normal errors to toluene data.
The tendency for variability to increase with fitted value suggests that the
assumption of constant variance is inappropriate.
5 GENERALIZED LINEAR MODELS

A generalized linear model has three components: (1') the responses Y_j are independent, each with a density or probability function of the exponential family form

    f(y; θ, φ) = exp[{yθ − b(θ)}/a(φ) + c(y; φ)]     (22)

from which it can be shown that the mean and variance of Y are μ = b′(θ) and a(φ)b″(θ); (2') a linear predictor η = x^T β is formed from the covariates; and (3') the mean μ of Y is connected to the linear predictor η by η = g(μ), where the link function g(·) is a differentiable monotonic increasing function.
The Poisson, binomial, gamma and normal distributions are among those which can be expressed in the form (22). These allow for the modelling of data in the form of counts, proportions of counts, positive continuous measurements, or unbounded continuous measurements respectively.
We write a(φ) = φ/w, where φ is the dispersion parameter and w is a weight attached to the observation. The quantity a(φ) is related to the second parameter which appears in the binomial, normal, and gamma
distributions. Below, quantities such as the mean are written in terms of η = x^T β, and θ(η) denotes that θ is regarded here as a function of η and through η of β. This expression arises below in connection with estimation of the parameters β.
For each density which can be written in the form of eqn (22), there is one link function for which the model is particularly simple. This, the canonical link function, for which θ(η) = η, is of some theoretical importance.
The value of the link function is that it can remove the need for the data to be transformed in order for a linear model to apply. Consider data where the response consists of counts, for example. Such data usually have variance proportional to their mean, which suggests that suitable models may be based on the Poisson distribution (eqn (16)). Direct use of the linear model (eqn (6)) would, however, usually be inappropriate for two reasons. Firstly, if the counts varied substantially in size the assumption of constant variance would not apply. This could of course be overcome by use of a variance-stabilizing transformation, such as the square root transformation derived in Section 3.5, but the difficulties of interpretation mentioned there would arise. Second, a linear model fitted to the counts themselves would lead to the unsatisfactory possibility of negative fitted means. The link function can remove these difficulties by use of the Poisson distribution with mean μ = e^η, which is positive whatever the value of η. This model corresponds to the logarithmic link function, for which η = log μ. This is the canonical link function for the Poisson distribution. When this link function is used, the effect of increasing η by one unit is to increase the mean value of the response by a factor e. Such a model is known as a log-linear model.
The two parameters μ and ν are respectively the mean and shape parameter of the gamma distribution, which is flexible enough to take a variety of shapes. For ν < 1 it has the shape of a reversed 'J', but with a pole at y = 0; for ν = 1, the distribution is exponential; and when ν > 1 the distribution is peaked, approaching the characteristic bell-shaped curve of the normal distribution for large ν. The gamma distribution has variance μ²/ν, so its variance function is quadratic, i.e. V(μ) = μ². Comparison with eqn (22) shows that a(φ) = 1/ν, θ = −1/μ, b(θ) = −log(−θ), and c(y; φ) = ν log(νy) − log(y) − log Γ(ν). Various link functions are possible, of which the logarithmic, for which g(μ) = log μ = η, or the canonical link, the inverse, with g(μ) = 1/μ = η, are most common in practice.
Fig. 17. Plot of log variance against log average for each day for toluene data.
The linear form of the plot shows a relation between the average and variance.
The slope is about 2, indicating that the coefficient of variation is roughly
constant.
The conclusion that this source of variation is not important is again reached when models with a gamma error distribution, log link function and the linear predictors in eqn (21) are fitted to the data. Residual plots show that this model gives a better fit to the data than a model with normal errors.
Apart from the normal, Poisson and gamma distributions, the most
common generalized linear models used in practice are based on the
binomial distribution, an application of which is described in Section 5.5.
5.2 Estimation
The parameters β of a generalized linear model are usually estimated by the method of maximum likelihood. The log likelihood is

    L(β) = Σ_{j=1}^{n} log f{Y_j; θ_j(β), φ}     (24)

and the maximum likelihood estimate β̂ satisfies the likelihood equation

    ∂L(β̂)/∂β = 0     (25)

which must be solved iteratively. For this problem and many others,
a convenient approach is the iterative weighted least squares algorithm derived below. By the chain rule the likelihood equation can be written

    (∂η^T/∂β)(∂L/∂η) = 0     (26)

evaluated at the overall maximum likelihood estimates β̂. The jth element of the n × 1 vector of linear predictors, η, is x_j^T β, so η = Xβ, and therefore the p × n matrix ∂η^T/∂β equals X^T. Also the n × 1 vector ∂L/∂η has jth element

    ∂ log f{Y_j; θ_j(η_j), φ}/∂η_j = (∂θ_j/∂η_j) {Y_j − b′(θ_j)}/a(φ)     (27)
Newton's method applied to eqn (25) involves a first-order Taylor series expansion of ∂L/∂β about an initial value of β. Thus

    0 = (∂L/∂β)(β̂) ≈ (∂L/∂β)(β) + {∂²L(β)/∂β∂β^T}(β̂ − β)     (28)

where ∂²L(β)/∂β∂β^T is the p × p matrix whose (r, s) element is

    ∂²L(β)/∂β_r ∂β_s = Σ_{j=1}^{n} x_jr x_js ∂² log f{Y_j; θ_j(η_j), φ}/∂η_j²

and x_jr is the rth element of the p × 1 covariate vector x_j, and so forth. In the derivation of the algorithm, the right-hand side of eqn (28) is rearranged to give

    β̂ ≈ β − {∂²L(β)/∂β∂β^T}^{-1} (∂L/∂β)(β)     (29)

The matrix −∂²L(β)/∂β∂β^T is then replaced by its expected value, which can be written in the form X^T W X, where W is the n × n matrix with zeros off the diagonal and jth diagonal element
    w_j = E[ −∂² log f{Y_j; θ_j(η_j), φ}/∂η_j² ]

For distributions whose density function can be written in the form of eqn (22), it turns out that w_j equals (∂μ_j/∂η_j)²/{a(φ)V(μ_j)} in terms of the mean μ_j and variance function V(μ_j) of the jth observation.
From eqns (26), (27) and (29) we see that

    β̂ ≈ β + (X^T W X)^{-1} X^T ∂L/∂η
       = (X^T W X)^{-1} (X^T W Xβ + X^T ∂L/∂η)
       = (X^T W X)^{-1} X^T W z     (30)

where

    z = Xβ + W^{-1} ∂L/∂η = η + W^{-1}(Y − μ)/a(φ)
is an n × 1 vector known as the adjusted dependent variable. We see that β̂ is obtained as the result of applying the weighted least squares algorithm of Section 3.6 with matrix of covariates X, weight matrix W, and response variable z. The solution to eqn (25) is not usually obtained in a single step, and the value of β̂ is obtained by repeated application of eqn (30). This iterative weighted least squares algorithm can be set out as follows:
(1a) First time through, calculate initial values for the mean vector μ and the linear predictor vector η based on the observed y. Go to (2).
(1b) If not the first time through, calculate μ and η from the current β̂.
(2) Calculate the weight matrix W and the adjusted dependent variable z from μ and η.
(3) Regress z on the columns of X using weights W, to obtain a current vector of parameter estimates β̂.
(4) Decide whether to stop based on the change in L(β̂) or in β̂ from the previous iteration; if continuing, go to (1b), and repeat until the change in L(β̂) or in β̂ between two successive iterations is sufficiently small.
This algorithm gives a flexible and rapid method for maximum likelihood estimation in a wide variety of regression problems.¹⁹
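A minimal sketch of the algorithm for a Poisson log-linear model (canonical log link, a(φ) = 1) is given below; the data and covariates are simulated and the code is illustrative rather than a reproduction of any statistical package.

```python
# Sketch: iterative weighted least squares for a Poisson log-linear model.
# For the log link, W has diagonal elements mu_j and z = eta + (y - mu)/mu.
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 2, 40)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 1.0 * x))

mu = y + 0.5                          # (1a) initial values
eta = np.log(mu)
beta = np.zeros(X.shape[1])
for _ in range(25):
    W = mu                            # (2) weights (d mu/d eta)^2 / V(mu) = mu for the log link
    z = eta + (y - mu) / mu           #     adjusted dependent variable
    XtW = X.T * W
    beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # (3) weighted least squares step, eqn (30)
    if np.max(np.abs(beta_new - beta)) < 1e-8:     # (4) stop when the change is small
        beta = beta_new
        break
    beta = beta_new                   # (1b) recompute mu and eta from the current estimate
    eta = X @ beta
    mu = np.exp(eta)
print(beta)
```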
The overall fit of a model is summarized by its deviance, which compares the maximized log likelihoods l_j(β̂) achieved by the model with the largest attainable values l̃_j. Here l_j(β̂) = log f{Y_j; θ_j(β̂), φ}, from eqn (24), φ is the dispersion parameter, and l̃_j is the biggest possible log likelihood attainable by the jth observation. The deviance is non-negative, and for a normal linear model it equals the residual sum of squares for the model, SSRes. The deviance is used to compare models, and to judge the overall adequacy of a model.
As in Section 4, model M₀ is said to be nested within model M₁ if M₁ can be reduced to M₀ by restrictions on the values of the parameters. Consider for example a model with Poisson error distribution, log link function, and linear predictor η₀ = β₀ + β₁z₁. This is nested within the model with linear predictor η₁ = β₀ + β₁z₁ + β₂z₂, which reduces to η₀ if β₂ = 0. However η₁ cannot be reduced to η_b = β₀ + β₃z₃ by restrictions on β₀, β₁ and β₂. The corresponding models are not nested, although η_b has fewer parameters than η₁.
If there are p₀ unknown parameters in M₀ and p₁ in M₁ and the models are nested, the degrees of freedom n − p₀ of M₀ exceed the degrees of freedom n − p₁ of M₁. General statistical theory then indicates that for binomial and Poisson data the difference in deviances D(M₀) − D(M₁), which is necessarily non-negative, has a chi-squared distribution on p₁ − p₀ degrees of freedom if model M₀ is in fact adequate for the data. A difference D(M₀) − D(M₁) which is large relative to that distribution
may indicate that M₁ fits the data substantially better than M₀. For normal models, inference proceeds based on F-statistics, as outlined in Section 3.
The forward selection, backwards elimination, and stepwise regression
algorithms described in Section 3.4 can be used for selection of a suitable
generalized linear model, though the caveats there continue to apply. The
role of the residual sum of squares is taken by the deviance, and reductions
in deviance are judged to be significant or not, relative to the appropriate
chi-squared distribution.
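The comparison of nested models by differences in deviance can be sketched as follows for Poisson data; the small IRLS helper and the simulated covariates are illustrative assumptions, not part of any particular package.

```python
# Sketch: comparing two nested Poisson log-linear models by the change in deviance,
# referred to a chi-squared distribution on p1 - p0 degrees of freedom.
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

def fit_poisson(X, y, iters=50):
    """IRLS for a Poisson log-linear model; returns beta and the deviance."""
    mu, eta = y + 0.5, np.log(y + 0.5)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        W, z = mu, eta + (y - mu) / mu
        beta = np.linalg.solve((X.T * W) @ X, (X.T * W) @ z)
        eta, mu = X @ beta, np.exp(X @ beta)
    dev = 2 * np.sum(xlogy(y, y / mu) - (y - mu))      # Poisson deviance
    return beta, dev

rng = np.random.default_rng(8)
x1, x2 = np.linspace(0, 2, 60), rng.normal(size=60)
y = rng.poisson(np.exp(0.3 + 0.8 * x1))                # x2 has no real effect
X0 = np.column_stack([np.ones(60), x1])
X1 = np.column_stack([X0, x2])
_, d0 = fit_poisson(X0, y)
_, d1 = fit_poisson(X1, y)
print(d0 - d1, chi2.sf(d0 - d1, df=1))                 # small reduction: x2 not needed
```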
Under some circumstances the deviance has a χ²_{n−p₁} distribution if model M₁ is in fact adequate for the data. For Poisson data the deviance has an approximate chi-squared distribution if a substantial number of the individual counts are fairly large. For binomial data the distribution is approximately chi-squared if the denominators m_j are fairly large and the binomial numerators are not all very close to zero or to the denominator. For normal models with known σ² the deviance has an exact chi-squared distribution when the model is adequate, and the distribution is approximately chi-squared for gamma data when ν is known.
In eqn (32) d_j² is the contribution to the deviance from the jth observation, i.e. d_j² = 2{l̃_j − l_j(β̂)}, and h_jj is given by the jth diagonal element of H = W^{1/2} X(X^T W X)^{-1} X^T W^{1/2}, where W is the diagonal matrix of weights. The quantity

    R_j = (Y_j − μ̂_j) / {a(φ) V(μ̂_j)(1 − h_jj)}^{1/2}     (33)

is known as a standardized Pearson residual.
The R_j can be calibrated by reference to a normal distribution with unit variance and mean zero. Observations whose residuals have values that are unusual relative to the standard normal distribution merit the close scrutiny that such observations would get in a linear model. For most purposes R_j may be used to construct the plots described in Section 3.4.
A useful measure of approximate leverage in a generalized linear model is h_jj as defined above, and the measure of influence is the approximate Cook statistic

    C_j = {h_jj / (p(1 − h_jj))} R_j²

where p is the dimension of the parameter vector β. Both h_jj and C_j may be used as described in Section 3.4.
The definitions of R_j, h_jj and C_j given here reduce to those given in Section 3.4 for normal linear models.
5.5 An application
The data in Table 2 are in the form of counts r_j of the numbers of days out of m_j on which ozone levels exceed 0·08 ppm. One model for this is that the r_j have a binomial distribution with probability π_j and denominator m_j:

    pr(R_j = r_j) = {m_j!/(r_j!(m_j − r_j)!)} π_j^{r_j} (1 − π_j)^{m_j − r_j}

The probabilities π_j can be expected to reflect seasonal and meteorological variations in ozone levels, and a possible form for the dependence is suggested by the following argument.
If exceedances over the threshold during month j occur at random as a Poisson process with a positive constant rate λ_j exceedances per day, the number of exceedances Z_j occurring on any day that month has the Poisson distribution (16) with mean λ_j. Thus the probability of one or more exceedances that day is

    π_j = pr(Z_j ≥ 1) = 1 − exp(−λ_j)

If λ_j = exp(x_j^T β), corresponding to a log-linear model for the rate of the Poisson process, we find that the expected number of days per month on which there are exceedances is

    μ_j = m_j π_j = m_j [1 − exp{−exp(x_j^T β)}]

which corresponds to the complementary log-log link function, η = x_j^T β = log{−log(1 − μ_j/m_j)}. This model automatically makes allowance for the different numbers of days of recorded data in each month, and would enable the data analyst to impute numbers of days with exceedances even for months for which very little or even no data are recorded.²
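The sketch below illustrates this relation numerically: an assumed coefficient vector (purely illustrative, not the fitted values quoted later) is converted into a daily exceedance rate, a daily exceedance probability, and an expected number of exceedance days per month.

```python
# Sketch of the complementary log-log construction: a log-linear daily rate
# lambda_j = exp(x_j' beta) gives mu_j = m_j {1 - exp(-lambda_j)} exceedance days.
import numpy as np

beta = np.array([-5.0, 0.07])                 # hypothetical intercept and temperature slope
temps = np.array([15.0, 25.0, 35.0])          # daily maximum temperature (deg C)
m = np.array([30, 28, 31])                    # days of data in each month

lam = np.exp(beta[0] + beta[1] * temps)       # exceedance rate per day
pi = 1 - np.exp(-lam)                         # pr(one or more exceedances in a day)
mu = m * pi                                   # expected exceedance days per month
print(np.column_stack([lam, pi, mu]))
```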
Other link functions that are useful for the binomial distribution are the probit, for which η = Φ^{-1}(μ/m), and the logistic, for which η = log{μ/(m − μ)}. The canonical link function is the logistic, which is widely used in applications but does not have the clear-cut interpretation that can be given to the complementary log-log model in this context.
The ozone levels in this example are thought to depend on several effects. There is an overall effect due to differences between sites, which vary in their proximity to sources of pollutants. There is possibly a similar effect due to differences between years, which can be explained on the basis of different weather patterns from one year to another. There may be interaction between sites and years, corresponding to the overall ozone levels showing different patterns at the two sites over the four years involved. There is likely to be a strong effect of temperature, here used as a surrogate for insolation. Furthermore an effect for the 'ozone season', the months May-September, may also be required. Of these effects only that due to temperature is quantitative; the remainder are qualitative factors of the type discussed in Section 4.
Table 10 shows the contributions of these different terms towards
explaining the variation observed in the data. The table is analogous to an
analysis of variance table, but with the distinction that the deviance
explained by a term would be approximately distributed as chi-squared on
130 A.C. DAVISON
TABLE 10
Texas ozone data: deviance explained by model terms when entered in the order given

Term                              df      Deviance
Site                               1         0·17
Year                               3         1·00
Site-year interaction              3        13·34
Daily maximum temperature          1       103·16
Year-ozone season interaction      4        18·83
Residual                          74       143·55
the corresponding degrees of freedom if that term were not needed in the
model, and otherwise would tend to be too large. If the data showed no
effect of daily maximum temperature, for example, the deviance explained
in the table would be randomly distributed as chi-squared on one degree
of freedom. In fact 103·16 is very large relative to that distribution:
overwhelming evidence of a strong effect of daily maximum temperature.
The deviance explained by each set of terms except for differences
between sites and years is statistically significant at less than the 5% level.
The site and year effects are retained because of the interaction of site and
year: it makes little sense to allow overall rates of exceedance to vary with
year and site but to require that overall year effects and overall site effects
must necessarily be zero.
The coefficient of the monthly daily maximum temperature is approximately 0·07, which indicates that the rate at which exceedances occur increases by a factor e^{0·07t} for each rise in temperature of t°C. Since the temperature variation over the course of a year is about 15°C, the rate of exceedances varies by a factor 2·7 or so due to annual fluctuations in air temperature, regarded in this context as a surrogate for insolation. Other parameter estimates can be interpreted in a similar fashion.
The size of the residual deviance compared to its degrees of freedom suggests that the model is not adequate for the data. Standard theory would suggest that if the model was adequate, the residual deviance would have a chi-squared distribution on 74 degrees of freedom, but the observed residual deviance of 143·55 is very large compared to that distribution. However, in this example there is doubt about the adequacy of a chi-squared approximation to the distribution of the residual deviance, because many of the observed counts r_j are rather small. Furthermore some lack of fit is to be expected, because of the gross simplifications made in the argument that led to the model.
Fig. 18. Fitted exceedance rate and observed frequency of exceedance for Beaumont, 1981-1984. The fitted rate, π̂_j (solid curve), smooths out fluctuations in the frequencies, r_j/m_j (dots).
Fig. 19. Fitted exceedance rate and observed frequency of exceedance for North Port Arthur, 1981-1984. The fitted rate, π̂_j (solid curve), smooths out fluctuations in the frequencies, r_j/m_j (dots).
over the range of y for which k(y − μ) < ψ, where ψ > 0 and k, μ are shape and location parameters. Suppose that the linear trend model (1) holds for the data, but that the 'errors' ε_t have the Gumbel distribution. The log likelihood for this model is

    L(β) = Σ_{t=1931}^{1981} { −log ψ − (Y_t − η_t)/ψ − exp[−(Y_t − η_t)/ψ] }
Fig. 20. Plot of ordered residuals against normal order statistics for fit of model
with Gumbel errors to Venice data.
The skewness of the Gumbel distribution matches the asymmetry of the data, and this precludes the use of the normal distribution.
6 COMPUTING PACKAGES
Almost every statistical package has facilities for fitting the linear regression models described in Section 3, and most now provide some form of diagnostic aids for the modeller. At the time of writing, only two packages, GLIM and GENSTAT, have facilities for direct fitting of generalized linear models. There is no space here for a comprehensive discussion of
computing aspects of linear models. Some general issues to consider when
choosing a package are:
(a) the flexibility of the package in terms of interactive use, plotting
facilities, model fitting, and control over the level of output;
(b) the ease with which standard and non-standard models can be fitted;
(c) the ease with which models can be checked; and
(d) the quality of documentation and the level of help available, both in
terms of on-line help and in the form of local gurus.
As ever, there is likely to be a trade-off between immediate ease of use and flexibility.
In practice the correct forms of regression models are uncertain, and it is then valuable to detect the ways in which the model departs from the data. Robust and resistant methods can make this harder precisely because of their insensitivity to changes in assumptions or in the data. Such methods are considered in more detail by Green,¹⁹ Li,²⁸ Rousseeuw & Leroy,²⁹ and Hampel et al.³⁰
The full parametric assumption that η = x^T β for all values of x and β can be relaxed in various ways. One possibility is non-parametric smoothing techniques, which aim to fit a smooth curve to the data. This avoids the full parametric assumptions made above, while giving tractable and reasonably fast methods of curve-fitting. Hastie & Tibshirani³¹ give a full discussion of the topic.
ACKNOWLEDGEMENTS
This work was supported by a grant from the Nuffield Foundation. The
author is grateful to L. Tierney for a copy of his statistical package
XLISP-STAT.
REFERENCES
1. Smith, R.L., Extreme value theory based on the r largest annual events. J.
Hydrology, 86 (1986) 27-43.
2. Davison, A.C. & Hemphill, M.W., On the statistical analysis of ambient ozone
data when measurements are missing. Atmospheric Environment, 21 (1987)
629-39.
3. Lindley, D.V. & Scott, W.F., New Cambridge Elementary Statistical Tables.
Cambridge University Press, Cambridge, 1984.
4. Pearson, E.S. & Hartley, H.O., Biometrika Tables for Statisticians, 3rd edn,
vol. I and 2. Biometrika Trust, University College, London, 1976.
5. Crowder, M.J., A multivariate distribution with Weibull connections. J. Roy.
Statist. Soc. B, 51 (1989) 93-107.
6. McCullagh, P. & Nelder, J.A., Generalized Linear Models, 2nd edn. Chapman
& Hall, London, 1989.
7. Edwards, D., Hierarchical interaction models (with Discussion). J. Roy.
Statist. Soc. B, 52 (1990) 3-20, 51-72.
8. Jensen, D.R., Multivariate distributions. In Encyclopedia of Statistical Sci-
ences, vol. 5, ed. S. Kotz, N.L. Johnson & C.B. Read. Wiley, New York, 1985,
pp.43-55.
9. Tawn, J.A., Bivariate extreme value theory: Models and estimation. Bio-
metrika, 75 (1988) 397-415.
10. Tawn, J.A., Modelling multivariate extreme value distributions. Biometrika,
77 (1990) 245-53.
11. Seber, G.A.F., Linear Regression Analysis. Wiley, New York, 1977.
12. Weisberg, S., Applied Linear Regression, 2nd edn. Wiley, New York, 1985.
13. Cook, R.D., Detection of influential observations in linear regression. Tech-
nometrics, 19 (1977) 15-18.
14. Box, G.E.P. & Cox, D.R., An analysis of transformations (with Discussion).
J. Roy. Statist. Soc. B, 26 (1964) 211-52.
15. ApSimon, H.M. & Davison, A.C., A statistical model for deriving probability
distributions of contamination for accidental releases. Atmospheric Environ-
ment, 20 (1986) 1249-59.
16. Holland, D.M. & Fitz-Simons, T., Fitting statistical distributions to air
quality data by the maximum likelihood method. Atmospheric Environment,
16 (1982) 1071-6.
17. Seber, G.A.F., Multivariate Observations, Wiley, New York, 1985.
18. Aitkin, M., Anderson, D., Francis, B. & Hinde, J., Statistical Modelling in
GLIM. Clarendon Press, Oxford, 1989.
19. Green, P.J., Iteratively reweighted least squares for maximum likelihood
estimation and some robust and resistant alternatives (with Discussion). J.
Roy. Statist. Soc. B, 46 (1984) 149-92.
20. Davison, A.C. & Snell, E.J., Residuals and diagnostics. In Statistical Theory
and Modelling: In Honour of Sir David Cox, ed. D.V. Hinkley, N. Reid & E.J.
Snell. Chapman & Hall, London, 1990, pp. 83-106.
21. Jørgensen, B., The delta algorithm and GLIM. Int. Statist. Rev., 52 (1984)
283-300.
22. Draper, N. & Smith, H., Applied Regression Analysis, 2nd edn. Wiley, New
York, 1981.
23. Seber, G.A.F. & Wild, C.J., Nonlinear Regression. Wiley, New York, 1989.
24. Atkinson, A.C., Plots, Transformations, and Regression. Clarendon Press,
Oxford, 1985.
25. Cook, R.D. & Weisberg, S., Residuals and Influence in Regression. Chapman
& Hall, London, 1982.
26. Carroll, R.J. & Ruppert, D., Transformation and Weighting in Regression.
Chapman & Hall, London, 1988.
27. Firth, D., Generalized linear models. In Statistical Theory and Modelling: In
Honour of Sir David Cox, ed. D.V. Hinkley, N. Reid & E.J. Snell. Chapman
& Hall, London, 1990, pp. 55-82.
28. Li, G., Robust regression. In Exploring Data Tables, Trends, and Shapes, ed.
D.C. Hoaglin, F. Mosteller & J.W. Tukey. Wiley, New York, 1985, pp.
281-343.
29. Rousseeuw, P.J. & Leroy, A.M., Robust Regression and Outlier Detection.
Wiley, New York, 1987.
30. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A., Robust
Statistics. Wiley, New York, 1986.
31. Hastie, T. & Tibshirani, R.J., Generalized Additive Models. Chapman & Hall,
London, 1990.
Chapter 4

FACTOR AND CORRELATION ANALYSIS

Philip K. Hopke

1 INTRODUCTION
All of the eigenvector methods have the same basic objective: the compression of data into fewer dimensions and the identification of the structure of interrelationships that exist between the variables measured or the samples being studied. In many chemical studies, the measured properties of the system can be considered to be the linear sum of terms representing the fundamental effects in that system times appropriate weighting factors. For example, the absorbance at a particular wavelength λ of a mixture of compounds for a fixed path length, z, is considered to be a sum of the absorbances of the individual components

    A_λ = z Σ_i ε_iλ c_i     (1)

where ε_iλ is the molar extinction coefficient for the ith compound at wavelength λ, and c_i is the corresponding concentration. Thus, if the absorbances of a mixture of several absorbing species are measured at m various wavelengths, a series of equations can be obtained:

    A_j = z Σ_i ε_ij c_i,    j = 1, …, m     (2)
If we know what components are present and what the molar extinction
coefficients are for each compound at each wavelength, the concentrations
of each compound can be determined using a multiple linear regression fit
to these data. However, in many cases neither the number of compounds
nor their absorbance spectra may be known. For example, several com-
pounds may elute from an HPLC column at about the same retention time
so that a broad elution peak or several poorly resolved peaks containing
these compounds may be observed. Thus, at any point in the elution curve,
there would be a mixture of the same components but in differing propor-
tions. If the absorbance spectrum of each of these different mixtures could
be measured such as by using a diode array system, then the resulting data
set would consist of a number of absorption spectra for a series of n
different mixtures of the same compounds.
    A_jk = z Σ_i ε_ij c_ik,    j = 1, …, m;  k = 1, …, n     (3)
For such a data set, factor analysis can be employed to identify the number
of components in the mixture, the absorption spectra of each component,
and the concentration of each compound for each of the mixtures. Similar
problems are found throughout analytical and environmental chemistry
A similar structure arises, for example, for nuclear magnetic resonance chemical shifts of solutes measured in a series of solvents,

    δ_ia = Σ_j s_ij f_ja     (4)

where δ_ia is the chemical shift of solute i in solvent a, s_ij refers to the jth solute factor of the ith solute, and f_ja refers to the jth solvent factor of the ath solvent, with the summation over all of the physical factors that might give rise to the measured chemical shifts. Similar examples have been found for a variety of chemical problems and are described in Ref. 2.
Finally, similar problems arise in the resolution of environmental mixtures to their source contributions. For example, a sample of airborne particulate matter collected at a specific site is made up of particles of soil, motor vehicle exhaust, secondary sulfate particles, primary emissions from industrial point sources, etc. It may be of interest to determine how much of the total collected mass of particles comes from each source. It is then assumed that the measured ambient concentration of some species, x_i, where i = 1, …, m measured elements, is a linear sum of contributions from p independent sources of particles. These species are normally elemental concentrations such as lead or silicon and are given in μg of element per cubic meter of air. Each kth source emits particles that have a profile of elemental concentrations, a_ik, and the mass contribution per unit volume of the kth source is f_k. When the compositions are measured for a number of samples, an equation of the following form is obtained:

    x_ij = Σ_{k=1}^{p} a_ik f_kj     (5)
The use of factor analysis for this type of study is reviewed in Ref. 3.
Thus, a factor analysis can help compress multivariate data to sufficiently
few dimensions to make visualization possible and assist in identifying the
interrelated variables. Depending on the approach used, the results can be
interpreted statistically or they can be directly related to the structural
variables that describe the system being studied. Examples of both types
of results will be presented in this chapter. However, in all cases, the
investigator must then interpret the interrelationships determined through
the analysis within the context of his problem to provide the more detailed
understanding of the system being studied.
2 EIGENVECTOR ANALYSIS
A common measure of the interrelationship between two variables i and k over the n samples is the correlation coefficient

    r_ik = Σ_{j=1}^{n} (x_ij − x̄_i)(x_kj − x̄_k) / [ {Σ_{j=1}^{n} (x_ij − x̄_i)²}^{1/2} {Σ_{j=1}^{n} (x_kj − x̄_k)²}^{1/2} ]     (6)

which can equivalently be written as the average product of the standardized variables

    z_ij = (x_ij − x̄_i)/σ_i     (7)

The standardized variables have several other benefits to their use. Each standardized variable has a mean value of zero and a standard deviation of 1. Thus, each variable carries 1 unit of system variance and the total variance for a set of measurements of m variables would be m.
There are several other measures of interrelationship that can also be utilized. These measures include covariance about the mean as defined by

    c_ik = Σ_{j=1}^{n} d_ij d_kj     (9)

where

    d_ij = x_ij − x̄_i     (10)

are called the deviations, and x̄_i is the average value of the ith variable. The covariance about the origin is defined in the same way but with the raw values x_ij in place of the deviations. A related normalized variable is

    z_ij = x_ij / ( Σ_{j=1}^{n} x_ij² )^{1/2}     (15)
This normalized variable also carries a variance of 1, but the mean value is not zero. In matrix form, the covariance about the mean is given as

    C = DD′     (17)

where D is the matrix of deviations from the mean whose elements are calculated using eqn (10) and D′ is its transpose. The covariance about the origin is

    C° = XX′     (18)

the simple product of the data matrix by its transpose. As written, these
product matrices would be of dimension m by m and would represent the
pairwise interrelationships between variables. If the order of the multi-
plication is reversed, the resulting n by n dispersion matrices contain the
interrelationships between samples.
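The following sketch (Python with numpy; the data matrix is simulated, with deliberately different scales) forms the deviations, the covariance matrices between variables and between samples, and the correlation matrix between variables.

```python
# Sketch: dispersion matrices for a data matrix X with m variables and n samples.
import numpy as np

rng = np.random.default_rng(9)
m, n = 4, 20
X = rng.normal(size=(m, n)) * np.array([[1.0], [5.0], [0.1], [2.0]])   # differing variable scales

D = X - X.mean(axis=1, keepdims=True)        # deviations from the variable means, eqn (10)
cov_vars = D @ D.T                           # covariance about the mean between variables (m x m)
cov_samples = D.T @ D                        # reversed order: between samples (n x n)
Z = D / D.std(axis=1, keepdims=True)         # standardized variables
corr = Z @ Z.T / n                           # correlation matrix between variables
print(cov_vars.shape, cov_samples.shape, np.round(corr, 2))
```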
The relative merits of these functions to reflect the total information content contained in the data have been discussed in the literature. Rozett & Petersen³ argue that since many types of physical and chemical variables have a real zero, the information regarding the location of the true origin is lost by using the correlation and covariance about the mean, which include only differences from the variable means. The normalization made in calculating the correlations from the covariances causes each variable to have an identical weight in the subsequent analysis. In mass spectrometry, where the variables consist of the ion intensities at the various m/e values observed for the fragments of a molecule, the normalization represents a loss of information because the variable metric is the same for all of the m/e values. In environmental studies where measured species concentrations range from the trace level (sub part per million) to major constituents at the per cent level, the use of covariance may weight the major constituents too heavily in the subsequent analyses. The choice of dispersion function depends heavily on the nature of the variables being measured.
Another use of the correlation coefficient is that it can be interpreted in
a statistical sense to test the null hypothesis as to whether a linear relation-
ship exists between the pair of variables being tested. It is important to
note that the existence of a correlation coefficient that is different from
zero does not prove that a cause and effect relationship exists. Also, it is
important to note that the use of probabilities to determine if correlation
coefficients are 'significant' is very questionable for environmental data. In
the development of those probability relationships, explicit assumptions
are made that the underlying distributions of the variables in the correla-
tion analysis are normal. For most environmental variables, normal distri-
butions are uncommon. Generally, the distributions are positively skewed
and heavy tailed. Thus, great care should be taken in making probability
arguments regarding the significance of pairwise correlation coefficients
between variables measured in environmental samples.
Another problem with interpreting correlation coefficients is that en-
vironmental systems are often truly multivariate systems. Thus, there may
be more than two variables that covary because of the underlying nature
of the processes being studied. Although there can be very strong correla-
tions between two variables, the correlation may arise through a causal
factor in the system that cannot be detected.
For each of the equations previously given in this section, the resulting
dispersion matrix provides a measure of the interrelationship between the
measured variables. Thus, in the use of a matrix of correlations between
the pairs of variables, each variable is given equal weight in the subsequent
eigenvector analysis. This form of factor analysis is commonly referred to
as an R-mode analysis. Alternatively, the order of multiplication could be
reversed to yield covariances or correlations between the samples obtained
in the system. The eigenvector analysis of these matrices would be referred
to as a Q-analysis. The differences between these two approaches will be
discussed further after the approach to eigenvector analysis has been
introduced.
It can be shown that the rank of a product moment matrix is the same as that of the data matrix from which it is formed.
An eigenvector of R is a vector u such that

    Ru = uλ     (23)

where λ is an unknown scalar. The problem then is to find a vector so that the vector Ru is proportional to u. This equation can be rewritten as

    Ru − uλ = 0     (24)

or

    (R − λI)u = 0     (25)

implying that u is a vector that is orthogonal to all of the row vectors of (R − λI). This vector equation can be considered as a set of p equations where p is the order of R:

    u₁(1 − λ) + u₂r₁₂ + u₃r₁₃ + … + u_p r₁p = 0
    u₁r₂₁ + u₂(1 − λ) + u₃r₂₃ + … + u_p r₂p = 0
    ⋮                                                     (26)
positive. These elements are called the singular values of X. The columns of the U matrix are the eigenvectors of XX′. The columns of the V matrix are the eigenvectors of X′X. Zhou et al.⁸ show that the R- and Q-mode factor solutions are interrelated through

    X = U D V′     (38)

where the grouping (UD)(V′) gives the Q-mode loadings A_Q and scores F_Q, and the grouping (U)(DV′) gives the R-mode scores F_R and loadings A_R.
Although there has been discussion of the relative merits of R- and Q-mode analyses in the literature,³,⁹ the direction of multiplication is not the factor that alters the solutions obtained. Different solutions are obtained depending on the direction in which the scaling is performed. Thus, different vectors are derived depending on whether the data are scaled by row, by column, or both. Zhou et al.⁸ discuss this problem in more detail.
By making appropriate choices of A and F in eqn (38), the singular value decomposition is one method to partition any matrix. The singular value decomposition is also a key diagnostic tool in examining collinearity problems in regression analysis.¹⁰ The application of the singular value decomposition to regression diagnostics is beyond the scope of this chapter.
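The relation in eqn (38) can be verified numerically; the sketch below computes the singular value decomposition of a simulated data matrix and checks the groupings described above, together with the link between singular values and the eigenvalues of XX′.

```python
# Sketch: singular value decomposition X = U D V' and the R-/Q-mode groupings.
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(5, 12))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, U @ np.diag(d) @ Vt))              # X = U D V'
print(np.allclose((U * d) @ Vt, X))                      # (UD)(V'): Q-mode grouping
print(np.allclose(U @ (np.diag(d) @ Vt), X))             # (U)(DV'): R-mode grouping
eig_XXt = np.linalg.eigvalsh(X @ X.T)[::-1]
print(np.allclose(np.sort(d ** 2)[::-1], eig_XXt))       # d^2 = eigenvalues of XX'
```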
In the discussion of the dispersion matrix, it becomes necessary to
discuss some of the seman tical problems that arise in 'factor' analysis. If
one consults the social science literature on factor analysis, a major
distinction is made between factor analysis and principal components
analysis. Because there are substantial problems in developing quantita-
tive models analogous to those given in the introduction to this chapter in
eqns (I) to (5), the social sciences want to obtain 'factors' that have
significant values for two or more of the measured variables. Thus, they
are interested in factors that are common to several variables. The model
then being applied to the data is of the form:
$$ z_{ij} = \sum_{k=1}^{p} a_{ik} f_{kj} + d_i u_{ij} \qquad (39) $$

where the standardized variables, z_ij, are given by the products of the
common factor loadings, a_ik, and the common factor scores, f_kj, summed
over the p common factors, plus the product of the unique loading, d_i, and
the unique score, u_ij. The system variance is therefore partitioned into the
common factor variance, the specific variance unique to each variable, and
the error variance.
The multiple correlation coefficients for each variable against all of the
remaining variables are often used as initial estimates of the com-
munalities. Alternatively, the eigenvector analysis is made and the com-
munalities for the initial solution are then substituted into the diagonal
elements of the correlation matrix to produce a communality matrix. This
matrix is then analyzed and the process repeated until stable communality
values are obtained.
The principal components analysis simply decomposes the correlation
matrix and leads to the model outlined in eqn (39) without the d_i u_ij term.
It can produce components that have a strong relationship with only one
variable. The single variable component could also be considered to be the
unique factor. Thus, both principal components analysis and classical
factor analysis really lead to similar solutions although reaching these
solutions by different routes. Since it is quite reasonable for many environ-
mental systems to show factors that produce such single variable behavior,
it is advisable to use a principal components analysis and extend the
number of factors to those necessary to reproduce the original data within
the error limits inherent in the data set.
Typically, this approach to making the eigenvector analysis compresses
the information content of the data set into as few eigenvectors as possible.
Thus, in considering the number of factors to be used to describe the
system, it is necessary to carefully examine the problems of reconstructing
both the variability within the data and reconstructing the actual data
itself.
One measure of the fit between the original and the reconstructed data is the
chi-squared value,

$$ \chi^2 = \sum_{j=1}^{n} \sum_{i=1}^{m} \left( \frac{x_{ij} - \hat{x}_{ij}}{\sigma_{ij}} \right)^2 \qquad (42) $$

where x̂_ij is the reconstructed data point using p factors and σ_ij is the
uncertainty in the value of x_ij. The Exner function16 is a similar measure
and is calculated by

$$ EP = \left[ \frac{\sum_{j=1}^{n} \sum_{i=1}^{m} (x_{ij} - \hat{x}_{ij})^2}{\sum_{j=1}^{n} \sum_{i=1}^{m} (x_{ij} - \bar{x}_i)^2} \right]^{1/2} \qquad (43) $$

$$ RSD = \left[ \frac{\sum_{j=p+1}^{m} \lambda_j}{n(m - p)} \right]^{1/2} \quad \text{for } n > m \qquad (44) $$

$$ IND = \frac{RSD}{(m - p)^2} \qquad (45) $$

$$ RSD = \left[ \frac{\sum_{j=p+1}^{n} \lambda_j}{m(n - p)} \right]^{1/2} \quad \text{for } n < m \qquad (46) $$

$$ IND = \frac{RSD}{(n - p)^2} \qquad (47) $$
where λ_j are the eigenvalues from the diagonalization. This function has
proven very successful with spectroscopy results.17,17a However, it has not
proven to be as useful with other types of environmental data.18 Finally,
the root-mean-square error and the arithmetic average of the absolute
values of the point-by-point errors are also calculated. The data are next
reproduced with both the first and second factors and again compared
point-by-point with the original data. The procedure is repeated, each time
with one additional factor, until the data are reproduced with the desired
precision. If p is the minimum number of factors needed to adequately
reproduce the data, then the remaining n - p factors can be eliminated
from the analysis. These tests do not provide unequivocal indicators of the
number of factors that should be retained. Judgement becomes necessary
in evaluating all of the test results and deciding upon a value of p. In this
manner the dimension of the A and F matrices is reduced from n to p.
The compression of variance into the first factors will improve the ease
with which the number of factors can be determined. However, their
Fig. 2. Illustration of the rotation of a coordinate system (x₁, x₂) to a new
system (y₁, y₂) by an angle θ.
nature has now been mixed by the calculational method. Thus, once the
number of factors has been determined, it is often useful to rotate the axes
in order to provide a more interpretable structure.
Let θ_ij be the angle between the ith original reference axis and the jth new axis. Assuming that
this is a rotation that maintains orthogonal axes (rigid rotation), then, for
two dimensions,
$$ \theta_{12} = \theta_{11} + 90°, \qquad \theta_{21} = \theta_{11} - 90°, \qquad \theta_{22} = \theta_{11} \qquad (50) $$

yielding

$$ \cos\theta_{12} = -\sin\theta_{11}, \qquad \cos\theta_{21} = \sin\theta_{11}, \qquad \cos\theta_{22} = \cos\theta_{11} \qquad (53) $$

A transformation matrix T can then be defined such that its elements are

$$ t_{ij} = \cos\theta_{ij} \qquad (54) $$
Then, for a collection of N row vectors in a matrix X with n columns,
Y = XT (55)
and Y has the coordinates for all N row vectors in terms of the n rotated
axes. For the rotation to be rigid, T must be an orthogonal matrix. Note
that the column vectors in Y can be thought of as new variables made up
by linear combinations of the variables in X with the elements of T being
the coefficients of those combinations. Also a row vector of X gives the
properties of a sample in terms of the original variables while a row vector
of Y gives the properties of a sample in terms of the transformed variables.
3 APPLICATIONS
TABLE 1
Factor loadings of the geological data with a decimal point error
were then examined starting with the two factor solution to determine the
nature of the identified factors. For the two factor solution, one factor of
many highly correlated variables was identified in addition to a second
factor with high correlations with Dy, Ba and Sb. For the three factor
solution, the above factors were again observed in addition to a factor
containing most of the variance in thorium and little of any other variable
(see Table 1). This factor that reflects a high variance for only a single
variable shows the typical identifying characteristics of a possible data
error. To further investigate the nature of this factor, the factor scores of
the three factor solution are calculated (Table 2). As can be seen, there is
TABLE 2
Factor scores of geological data with a decimal point error
TABLE 3
Factor loadings of the geological data with a random variable X1
reason for this variable being unique, then the investigator should explore
his analytical method for possible errors.
TABLE 4
Factor loadings of the geological data with two correlated random variables,
Y1 and Y2
TABLE 5
Factor loadings of the geological data with errors present in two previously
uncorrelated variables, Th and Eu
TABLE 6
Factor loadings of the geological data with errors present in two previously
correlated variables, Th and Sm
either an error has been made in assuming only two phases or there are
errors in the data set. Thus, knowledge of the nature of the system under
study would be needed to find this type of error. The two variables
involved could also be an indication of a problem. If two variables that
would not normally be thought to be interrelated appear together in a
factor, it could indicate a correlated error.
In both of the previous two examples, there were two interfering varia-
bles present, either thorium and samarium or thorium and europium.
Potential problems in the data were recognized after using factor analysis,
i.e. in the samarium and europium values. However, no associated problem was noticed
with thorium in either case because of the relative concentrations of the
two variables. Since thorium was much less sensitive to a change (because
of its large magnitude relative to samarium or europium), the added error
in thorium appeared only as an unnoticeable increase in the variance of the
thorium values. For an actual common interference, consider the Mn-Mg
problem in neutron activation analysis. If the spectral analysis program
used was having a problem properly separating the Mn and Mg peaks, the
factor analysis would usually identify the problem as being in the Mg
values since the sensitivity of Mg to neutron activation analysis is so much
less than that of Mn. However, if the actual levels of Mn were so low that
the Mg peak was relatively the same size as the Mn peak, then the problem
could show up in both variables.
reciprocal of the median and mean grain sizes in millimeters. The squared
diameters provide a measure of the available surface area assuming that
the particles are spherical. The reciprocals of the diameter should help
indicate the surface area per unit volume of the particles. Seventy-nine
samples were available for which there were complete data for all of the
variables.
An interpretation can be made of the factor loading matrix (Table 7) in
terms of both the physical and chemical variables. The variables describ-
ing particle diameter in millimeters and per cent sand, have high positive
loadings for factor one. This factor could be thought of as the amount of
coarse grain source material in the sediment. The amount of sand (coarse-
grained sediment) is positively related only to this first factor. Of the
elements determined, only sodium and hafnium have positive coefficients
for this factor.
The second factor is related to the available specific surface area of the
sedimental particles. The amount of surface area per unit volume would
be given by
$$ \frac{4\pi r^2}{\tfrac{4}{3}\pi r^3} = \frac{3}{r} = \frac{6}{d} $$
The only variable having a high loading for factor-4 is the per cent silt.
Since the silt represents a mixture of size fractions that can be carried by
streams, this factor may represent active sedimentation at or near stream
deltas. The silty material is deposited on the delta and then winnowed
away by the action of lake currents and waves.
The final factor has a high loading for sorting. The sorting parameter
is a measure of the standard deviation of the particle size distribution. A
large standard deviation implies that there is a wide mixture of grain sizes
in the sample. Wave action sorts particles by carrying away fine-grained
particles and leaving the coarse-grained. Therefore, the fifth factor may
represent the wave and current action that transport the sedimental mat-
erial within the lake.
cases where the sources were located very close together. 46 A compilation
of the selected impacted samples was made so that target transformation
factor analysis could be employed to obtain elemental profiles for these
sources at the various receptor sites. 45
Thus, TTFA may be very useful in determining the concentration of
lead in motor vehicle emission as the mix of leaded fuel continues to
change. Multivariate methods can thus provide considerable information
regarding the sources of particles including highway emissions from only
the ambient data matrix. The TTF A method represents a useful approach
when source information for the area is lacking or suspect and if there is
uncertainty as to the identification of all of the sources contributing to the
measured concentrations at the receptor site. TTFA has been performed
using FANTASIA.18a,40
Further efforts have recently been made by Henry & Kim 47 on extending
eigenvector analysis methods. They have been examining ways to incor-
porate the explicit physical constraints that are inherent in the mixture
resolution problem into the analysis. Through the use of linear program-
ming methods, they are better able to define the feasible region in which
the solution must lie. There exists a limited region in the solution space
because the elements of the source profiles must all be greater than or
equal to zero (non-negative source profiles) and the mass contributions of
the identified sources must also be greater than or equal to zero. Although
there have been only limited applications of this expanded method, it offers
an important additional tool to apply to those systems where a priori
source profile data are not available. These methods provide a useful
parallel analysis with CMB to help ensure that the profiles used are
reasonable representations of the sources contributing to a given set of
samples.
3.4.2 Results
The eigenvector analysis provided the results presented in Table 8. Exami-
nation of the eigenvectors suggests the presence of four major sources,
possibly two weak sources, and noise. To begin the analysis, a four-vector
solution was obtained. The iteratively refined source profiles are given in
Table 9. The first three vectors can be easily identified as motor vehicles
(Pb, Br), regional sulfate, and soil/flyash (Si, Al) based on their apparent
elemental composition.

TABLE 8
Results of eigenvector analysis of July and August 1976 fine fraction data at
site 112 in St Louis, Missouri
However, the fourth vector showed high K, Zn, Ba and Sr and was not
initially obvious as to its origin. The resulting mass loadings were then
calculated and the only significant values were for the sampling periods of
TABLE 9
Refined source profiles for the four source solution at RAPS Site 112, July-
August 1976
TABLE 10
Comparison of data with and without samples from 4 and 5 July (ng/m³),
RAPS Station 112, July and August 1976, fine fraction

Element    With 4-5 July    Without 4-5 July
Al         220 ± 30         200 ± 30
Si         440 ± 60         450 ± 60
S          4370 ± 310       4360 ± 320
Cl         90 ± 10          80 ± 9
K          320 ± 130        150 ± 9
Ca         110 ± 10         110 ± 10
Ti         63 ± 13          64 ± 13
Mn         17 ± 3           17 ± 3
Fe         220 ± 20         220 ± 20
Ni         2·3 ± 0·2        2·3 ± 0·2
Cu         16 ± 3           15 ± 3
Zn         78 ± 8           75 ± 8
Se         2·7 ± 0·2        2·7 ± 0·2
Br         140 ± 9          130 ± 8
Sr         5 ± 4            1·1 ± 0·1
Ba         19 ± 5           15 ± 4
Pb         730 ± 50         720 ± 50
TABLE 11
Results of eigenvector analysis of July and August 1976 fine fraction data at
Site 112 in St Louis, Missouri, excluding 4 and 5 July data
TABLE 12
Refined source profiles (mg/g), RAPS Station 112, July and August 1976, fine
fraction without 4 and 5 July data
4 SUMMARY
ACKNOWLEDGEMENTS
REFERENCES
aerosols at rural sites in New York State and their possible sources' by P.
Parekh and L. Husain and 'Seasonal variations in the composition of
ambient sulfur-containing aerosols' by R. Tanner and B. Leaderer. Atmo-
spheric Environ., 16 (1982) 1279-80.
16. Exner, O., Additive physical properties. Collection of Czech. Chem.
Commun., 31 (1966) 3222-53.
17. Malinowski, E. R., Determination of the number of factors and the experi-
mental error in a data matrix. Anal. Chem., 49 (1977) 612-17.
17a. Malinowski, E.R. & McCue, M., Qualitative and quantitative determination
of suspected components in mixtures by target transformation factor analysis
of their mass spectra. Anal. Chem., 49 (1977) 284-7.
18. Hopke, P.K., Target transformation factor analysis as an aerosol mass
apportionment method: A review and sensitivity analysis. Atmospheric
Environ., 22 (1988a) 1777-92.
18a. Hopke, P.K., FANTASIA, A Program for Target Transformation Factor
Analysis, for program availability, contact P.K. Hopke (1988b).
19. Hopke, P.K., Receptor Modeling in Environmental Chemistry. J. Wiley &
Sons, New York, 1985.
20. Roscoe, B.A., Hopke, P.K., Dattner, S.L. & Jenks, J.M., The use of principal
components factor analysis to interpret particulate compositional data sets.
J. Air Pollut. Control Assoc., 32 (1982) 637-42.
21. Bowman, H.R., Asaro, F. & Perlman, I., On the uniformity of composition
in obsidians and evidence for magmatic mixing. J. Geology, 81 (1973) 312-27.
22. Lis, S.A. & Hopke, P.K., Anomalous arsenic concentrations in Chautauqua
Lake. Env. Letters, 5 (1973) 45-55.
23. Ruppert, D.F., Hopke, P.K., Clute, P.R., Metzger, W.J. & Crowley, D.J.,
Arsenic concentrations and distribution of Chautauqua Lake sediments. J.
Radioanal. Chem., 23 (1974) 159-69.
24. Clute, P.R., Chautauqua Lake sediments. MS Thesis, State University
College at Fredonia, NY, 1973.
25. Hopke, P.K., Ruppert, D.F., Clute, P.R., Metzger, W.J. & Crowley, D.J.,
Geochemical profile of Chautauqua Lake sediments. J. Radioanal. Chem., 29
(1976) 39-59.
25a. Hopke, P.K., Gladney, E.S., Gordon, G.E., Zoller, W.H. & Jones, A.G., The
use of multivariate analysis to identify sources of selected elements in the
Boston urban aerosol. Atmospheric Environ., 10 (1976) 1015-25.
26. Folk, R.L., A review of grain-size parameters. Sedimentology, 6 (1964) 73-93.
27. Tesmer, I.H., Geology of Chautauqua County, New York Part I: Strati-
graphy and Paleontology (Upper Devonian). New York State Museum and
Science Service Bull. no. 391: Albany, University of the State of New York,
State Education Department, 1963, p. 65.
28. Muller, E.H., Geology of Chautauqua County, New York, Part II: Pleis-
tocene geology. New York State Museum and Science Service Bull. no. 392:
Albany, the University of the State of New York, The State Education
Department, 1963, p. 60.
29. Prinz, B. & Stratmann, H., The possible use of factor analysis in investigating
air quality. Staub-Reinhalt Luft, 28 (1968) 33-9.
30. Blifford, I.H. & Meeker, G.O., A factor analysis model of large scale pol-
lution. Atmospheric Environ., 1 (1967) 147-57.
31. Colucci, J.M. & Begeman, C.R., The automotive contribution to air-borne
polynuclear aromatic hydrocarbons in Detroit. J. Air Pollut. Control Assoc.,
15 (1965) 113-22.
32. Gaarenstrom, P.D., Perone, S.P. & Moyers, J.P., Application of pattern
recognition and factor analysis for characterization of atmospheric par-
ticulate composition in southwest desert atmosphere. Environ. Sci. Technol.,
11 (1977) 795-800.
33. Thurston, G.D. & Spengler, J.D. A quantitative assessment of source contri-
butions to inhalable particulate matter pollution in metropolitan Boston.
Atmospheric Environ., 19 (1985) 9-26.
34. Parekh, P.P. & Husain, L., Trace element concentrations in summer aerosols
at rural sites in New York State and their possible sources. Atmospheric
Environ., 15 (1981) 1717-25.
35. Gatz, D.F., Identification of aerosol sources in the St Louis area using factor
analysis. J. Appl. Met., 17 (1978) 600-08.
36. Changnon, S.A., Huff, F.A., Schickedanz, P.T. & Vogel, J.L., Summary of
METROMEX, Vol. 1: Weather Anomalies and Impacts, Illinois State Water
Survey Bulletin 62, Urbana, IL, 1977.
37. Ackerman, B., Changnon, S.A., Dzurisin, G., Gatz, D.F., Grosh, R.C.,
Hilberg, S.D., Huff, F.A., Mansell, J.W., Ochs, H.T., Peden, M.E., Schick-
edanz, P.T., Semonin, R.G. & Vogel, J.L., Summary of METROMEX, Vol.
2: Causes of Precipitation Anomalies. Illinois State Water Survey Bulletin 63,
Urbana, IL, 1978.
38. Hopke, P.K., Wlaschin, W., Landsberger, S., Sweet, C. & Vermette, S.J., The
source apportionment of PM10 in South Chicago. In PM-10: Implementation
of Standards, ed. C.V. Mathai & D.H. Stonefield. Air Pollution Control
Association, Pittsburgh, PA, 1988, pp. 484-94.
39. Alpert, D.J. & Hopke, P.K., A quantitative determination of sources in the
Boston urban aerosol. Atmospheric Environ., 14 (1980) 1137-46.
39a. Alpert, D.J. & Hopke, P.K., A determination of the sources of airborne
particles collected during the regional air pollution study. Atmospheric
Environ., 15 (1981) 675-87.
40. Hopke, P.K., Alpert, D.J. & Roscoe, B.A., FANTASIA-A program for
target transformation factor analysis to apportion sources in environmental
samples. Computers & Chemistry, 7 (1983) 149-55.
41. Roscoe, B.A. & Hopke, P.K., Comparison of weighted and unweighted
target transformation rotations in factor analysis. Computers & Chem., 5
(1981) 5-7.
42. Severin, K.G., Roscoe, B.A. & Hopke, P.K., The use of factor analysis in
source determination of particulate emissions. Particulate Science and Tech-
nology, 1 (1983) 183-92.
43. Dzubay, T.G., Stevens, R.K. & Richards, L.W., Composition of aerosols
over Los Angeles freeways. Atmospheric Environ., 13 (1979) 653-9.
44. Harrison, R.M. & Sturges, W.T. The measurement and interpretation of
Br/Pb ratios in airborne particles. Atmospheric Environ., 17 (1983) 311-28.
45. Chang, S.N., Hopke, P.K., Gordon, G.E. & Rheingrover, S.W., Target
1 TYPES OF ERROR
often be inferred by the presence of outliers which are most easily iden-
tified by graphing or producing some pictorial representation of the data.
It is important to realise that gross errors can occur in the best planned
laboratories and a watch should always be kept for their presence.
TABLE 1
The concentration of lead in stream sediment samples, mg Pb kg⁻¹, as deter-
mined by four analysts
Analyst
A B C D
Fig. 1. The variability in the lead concentration data, from Table 1, as deter-
mined by four analysts A, B, C and D. The correct value is illustrated by the
dotted line. Analyst A is accurate and precise, analyst B is accurate but less
precise, analyst C is inaccurate but of high precision and the data from analyst
D has low accuracy and poor precision.
a data set and is a measure of the random errors. It is important that this
distinction is appreciated. No matter how many times analysts C and D
repeat their measurements, because of some inherent bias in their pro-
cedure they cannot improve their accuracy. The precision might be im-
proved by exercising greater care to reduce random errors.
Regarding precision and the occurrence of random errors, two other
terms are frequently encountered in analytical science, repeatability and
reproducibility. If analyst B had performed the measurements sequentially
on a single occasion then a measure of precision would reflect the repeata-
bility of the analysis, the within-batch precision. If the tests were run over,
say, 2 days, then an analysis of data from each occasion would provide the
between-batch precision or reproducibility.
2 DISTRIBUTION OF ERRORS
Whilst the qualitative merits of each of our four analysts can be inferred
immediately from the data, to provide a quantitative assessment a statisti-
cal analysis is necessary and should always be performed on data from any
quantitative analytical procedure. This quantitative treatment requires
that some assumptions are made concerning the measurement process.
Any measure of a variable, mass, size, concentration, etc., is expected to
approximate the true value but it is not likely to be exactly equal to it.
Similarly, repeated measurements of the same variable will provide further
discrepancies between the observed results and the true value, as well as
differences between each practical measure, due to the presence of random
errors. As more repeated measurements are made, a pattern to the scatter
of the data will emerge; some values will be too high and some too low
compared with the known correct result, and, in the absence of any
systematic error, they will be distributed evenly about the true value. If an
infinite number of such measurements could be made then the true distri-
bution of the data about the known value would be known. Of course, in
practice this exercise cannot be accomplished but the presence of some
such parent distribution can be hypothesised. In analytical science this
distribution is assumed to approximate the well-known normal form and
our data are assumed to represent a sample from this idealised parent
population. Whilst this assumption may appear to be taking a big step in
describing our data, many scientists over many years of study have de-
monstrated that in a variety of situations repeat measurements do ap-
proximate well to the normal distribution and the normal error distribu-
tion is accepted as being the most important for use in statistical studies
of data.
$$ p = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{1}{2}\left( \frac{x - \mu}{\sigma} \right)^2 \right] \qquad (1) $$
where (x − μ) represents the measured deviation of any value x from the
population mean, μ, and σ is the standard deviation which defines the
spread of the curve. The formula given in eqn (1) need not be remembered;
its importance is in defining the shape of the normal distribution curve. As
drawn in Fig. 2, the total area under the curve is one, and the curve can be
made the same for all values of μ and σ by standardising the data. This
standard normal transform is achieved by subtracting the mean value from
each data value and dividing by the standard deviation,

$$ z = \frac{x - \mu}{\sigma} \qquad (2) $$
[Fig. 2: the standard normal curve, F(z), plotted over the range −3 to +3 standard deviations.]
and the standard deviation, s_y, associated with the final result can be
obtained by rearranging eqn (6) such that

$$ s_y = \frac{CV \times \bar{x}}{100} \qquad (11) $$
3 SIGNIFICANCE TESTS
Fig. 3. The standardised normal distribution curve and the critical regions
containing the extreme 2·5% and 5% of the curve area. Cumulative probabili-
ties, from statistical tables, are also shown.
of 1·96, then statistics would indicate that there was no reason to assume
any difference.
In this example a so-called two-tailed test was performed: we are not
interested in whether the new sample was significantly more or less con-
centrated than the standards, just different. To indicate that the unknown
sample was significantly more concentrated, then a one-tailed test would
have been appropriate to reject the null hypothesis and, at the 5% signifi-
cance level, the critical test value would have been 1·64 (see Fig. 3).
3.1 t-Test
In the above example it was assumed that all possible samples of the
known water source had been analysed-a clearly impossible situation. If
the true standard deviation of the parent population distribution is not
known, as is usually the case, then the test statistic calculation proceeds in
a similar fashion but depends on the number of samples analysed. If a
large number of samples are analysed (n > 25) then the sample standard
deviation, s, is considered a good estimate of σ and the test statistic, now
denoted as t, is given by
$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad (13) $$
When n is less than 25, s may be a poor estimate of σ and the test statistic
is compared not with the normal curve but with the t-distribution curve
which is more broad and spread out. The t-distribution is symmetric and
similar to the normal distribution but its wider spread is dependent on the
TABLE 2
The concentration of copper determined from ten soil samples by AAS
Sample no.
1 2 3 4 5 6 7 8 9 10
The conclusion is that the soil samples do arise from a site, the copper content of
which is greater than the level of 80 mg kg⁻¹, and copper toxicity may be
expected.
The variance of the difference between two sample means is

$$ \sigma^2_{\bar{x}_1 - \bar{x}_2} = \sigma^2_{\bar{x}_1} + \sigma^2_{\bar{x}_2} \qquad (15) $$

$$ = \sigma^2\left( \frac{1}{n_1} + \frac{1}{n_2} \right) \qquad (16) $$

The standard error can be obtained from the square root of this variance,
i.e. σ√(1/n₁ + 1/n₂), and the test statistic is

$$ z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{1/n_1 + 1/n_2}} \qquad (17) $$
If the variance of the underlying population distribution is not known
then σ is replaced by the sample standard deviation, s, and the t-test
statistic is used,

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{1/n_1 + 1/n_2}} \qquad (18) $$

where s is derived from the standard deviations of each sample set by

$$ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \qquad (19) $$

and in the special case of n₁ = n₂, then

$$ s^2 = \frac{s_1^2 + s_2^2}{2} \qquad (20) $$

If the null hypothesis is true, i.e. the two means are equal, then the
t-statistic can be compared with the t-distribution with (n₁ + n₂ − 2)
degrees of freedom at some selected level of significance.
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{1/10 + 1/10}} = 1{\cdot}27 $$

A two-tailed test is appropriate as we are not concerned whether any
one sample contains more or less CFC, and from statistical tables for a
t-distribution with 18 degrees of freedom (n₁ + n₂ − 2), t₀.₀₂₅,₁₈ = 2·101.
TABLE 3
The analysis of two expanded polyurethane foams, A and B, for CFC content
expressed as mg CFC m⁻³ foam
Sample A Sample B
Our result (t = 1·27) is less than this critical value, therefore we have
no evidence that the two samples came from different sources and the null
hypothesis is accepted.
TABLE 4
The comparison of two students' results for the determination of iron in water
samples
A B
3.2 F-Test
In applying the t-test an assumption is made that the two sets of sample
data have the same variance. Whether this assumption is valid or not can
be evaluated using a further statistical measure, the F-test. This test, to
determine the equality of variance of samples, is based on the so-called
F-distribution. This is a distribution calculated from the ratios of all
possible sample variances from a normal population. As the sample
variance is poorly defined if only a small number of observations or trials
are made, the F-distribution, like the t-distribution discussed above,
is dependent on the sample size, i.e. the number of degrees of freedom. As
with t-values, F-test values are tabulated in standard texts and in this case
the critical value selected is dependent on two values of degrees of
freedom, one associated with each sample variance to be compared, and
the level of significance to be used.
In practice, we assume our two samples are drawn from the same parent
population; the variance of each sample is calculated and the F-ratio of
variances evaluated. The null hypothesis is
H₀: σ₁² = σ₂²

against the alternative

H₁: σ₁² ≠ σ₂²

and we can determine the probability that, by chance, the observed F-ratio
came from two samples taken from a single population distribution.

$$ F = \frac{s_1^2}{s_2^2} \qquad (22) $$
Substituting the data from Table 3, F = 54·84/34·49 = 1·59. From
statistical tables, the critical value for F with 9 degrees of freedom associ-
ated with each data set is F₉,₉,₀.₀₅ = 3·18. Our value is less than this, so the
null hypothesis is accepted and we can conclude that the variances of the two
sample sets are the same and the application of the t-test is valid.
TABLE 5
The concentration of phosphate, determined colorimetrically in six soil
samples, using five replications on each sample. The results are expressed as
mg kg⁻¹ phosphate in dry soil
$$ SS_T = \sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}^2 - \frac{\left( \sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij} \right)^2}{m \times n} \qquad (23) $$

where x_ij represents the ith replicate of the jth sample. The variance
TABLE 6
An ANOVA table"
and similarly for the between-group variance, obtained from totalling the
squares of the sums of each replicate set and dividing by the number of
replicates,

$$ SS_B = \sum_{j=1}^{m} \frac{\left( \sum_{i=1}^{n} x_{ij} \right)^2}{n} - \frac{\left( \sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij} \right)^2}{m \times n} \qquad (24) $$

which, for the data of Table 5, gives SS_B = 37 257 − 36 248 = 1009.
TABLE 7
The completed ANOVA table
tion of experimental data and there are many specialised texts on the
subject.
4 ANALYTICAL MEASURES
amplitudes and phases. It is this random, white noise that will concern us
in these discussions. Flicker noise, or 1/f noise, is characterised by a power
spectrum which is pronounced at low frequencies and is minimised by a.c.
detection and signal processing. Many instruments will also display inter-
ference noise due to pick-up usually from the 50-Hz or 60-Hz power lines.
Most instruments will operate detector systems well away from these
frequencies to minimise such interference. One of the aims of instrument
manufacturers is to produce instruments that can extract an analytical
signal as effectively as possible. However, because noise is a fundamental
characteristic of all instruments, complete freedom from noise can never
be realised in practice. A figure of merit to describe the quality of a
measurement is the signal-to-noise ratio, S/N, which is defined as

$$ S/N = \frac{\text{average signal magnitude}}{\text{rms noise}} \qquad (27) $$

The rms (root mean square) noise is defined as the square root of the
average squared deviation of the signal, s, from its mean value, s̄, i.e.

$$ \text{rms noise} = \sqrt{\frac{\sum (s - \bar{s})^2}{n}} \qquad (28) $$
Comparison with eqns (4) and (5) illustrates the equivalency of the rms
value and the standard deviation of the signal, σ_s. The signal-to-noise
ratio, therefore, can be defined as s̄/σ_s.
The S/N can be measured easily in one of two ways. One method is to
repeatedly measure the analytical signal, determine the mean value and
calculate the rms noise using eqn (28). A second method of estimating S/N
is to record the analytical signal on a strip-chart recorder. Assuming the
noise is random, white noise, then it is 99% likely that the deviations in
the signal lie within ± 2·5σ_s of the mean value. The rms value can thus be
determined by measuring the peak-to-peak deviation of the signal from
the mean and dividing by 5. With both methods it is important that the
signal be monitored for sufficient time to obtain a reliable estimate of the
standard deviation. Note also that it has been assumed that at low analyte
concentrations the noise associated with the analyte signal is the same as
that when no analyte signal is present, the blank noise σ_B, i.e. at low
concentrations the measurement error is independent of the concentration.
Fig. 5. (a) The normal distribution with the 5% critical region marked; (b) two
normally distributed signals overlapping, with the mean of one located at the
5% point of the second; the so-called decision limit; (c) two normally distributed
signals overlapping at their 5% points with their means separated by 3·29σ, the
so-called detection limit; (d) two normally distributed signals with equal
variance and their means separated by 10σ; the so-called determination limit.
then the t-distribution should be used and the definition is now given by

$$ \text{Decision Limit} = t_{0.95}\, s_B \qquad (31) $$

where s_B is an estimate of σ_B and, as before (see Section 2.2), the value of
t depends on the degrees of freedom, i.e. the number of measurements, and
t₀.₉₅ approaches z₀.₉₅ as the number of measurements increases.
This definition of the decision limit addresses our concern with Type I
errors but says nothing about the effects of Type II errors. If an analyte
producing a signal equivalent to the decision limit is repeatedly analysed,
then the distribution of results will appear as illustrated in Fig. 5(b). Whilst
the mean value of this analyte signal, μ_s, is equivalent to the decision limit,
50% of the results are below this critical value and no analyte will be
reported present in 50% of the measurements. This decision limit
therefore must be modified to take account of these Type II errors so as
to obtain a more practical definition of the limit of detection.
If we are willing to accept a 5% chance of committing a Type II error,
the same probability as for a Type I error, then the relationship between
the blank measurements and the sample reading is as indicated in Fig. 5(c).
In such a case
$$ \text{Detection Limit} = 2 z_{0.95}\, \sigma_B \qquad (32) $$

or, if s_B is an estimate of σ_B,

$$ \text{Detection Limit} = 2 t_{0.95}\, s_B \qquad (33) $$
Under these conditions we have a 5% chance of reporting the presence
of the analyte in a blank solution and a 5% chance of missing the analyte
in a true sample. Before we accept this definition of detection limit, it is
worth considering the precision of measurements made at this level.
Repeated measurements on an analytical sample at the detection limit will
lead to the analyte being reported as below the detection limit 50% of the
time. In addition, from eqn (6) the relative standard deviation, or CV, of
the sample measurement is defined as
CV = 100σ_s/μ_s = 100/3·29 = 30·4%
compared with 60% at the decision limit. Thus while quantitative mea-
surements can be made at these low concentrations they do not constitute
the accepted degree of precision for quantitative analysis in which the
relative error should be below 20% or, better still, 10%.
If a minimum relative standard deviation of, say, 10%, is required from
the analysis then a further critical value must be defined. This is sometimes
called the determination limit.
5 CALIBRATION
TABLE 8
Analytical data of concentration and absorbance obtained from AAS for the
determination of copper in aqueous media using standard solutions
Fig. 6. The calibration curves produced by plotting the data from Table 8
using the least-squares regression technique. The upper line uses the original
data and the lower line uses the same data following a correction for a blank
analysis.
$$ \sum x_i A_i = a \sum x_i + b \sum x_i^2 $$

and solving for a and b,

$$ b = \frac{\sum (x_i - \bar{x})(A_i - \bar{A})}{\sum (x_i - \bar{x})^2} \qquad (38) $$

$$ a = \bar{A} - b\bar{x} \qquad (39) $$
The calculated values are presented with Table 8 and for these data the
least squares estimates for a and b are
b = 1·925/48·8 = 0·0395
a = 0·289 - 5·2b = 0·0839
and the best straight line is given by
A = 0·0839 + 0·0395x (40)
The residual standard deviation about the fitted line is given by

$$ s_R = \sqrt{\frac{\sum (A_i - a - b x_i)^2}{n - 2}} \qquad (41) $$
and, using a two-tailed test with the t-distribution, the 95% confidence
limits for the slope b are defined from s_R; for a value x₀ estimated from the
fitted line, the corresponding confidence limits involve the factor

$$ \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]^{1/2} $$
For our example, therefore, the characteristics of the fitted line and their
95% confidence limits are (t₀.₀₂₅,₃ = 3·182),

slope, b = 0·0395 ± 6·2 × 10⁻⁶
intercept, a = 0·0839 ± 2·6 × 10⁻⁴

If an unknown solution is subsequently analysed and gives an average
absorbance reading of 0·182 then, rearranging eqn (40),

x = (A − 0·0839)/0·0395 = 2·48 mg kg⁻¹
TABLE 9
Analysis for potassium using the method of standard additions (all volumes in
cm³)

Solution no.              1    2    3    4    5
Sample volume            20   20   20   20   20
Water volume              5    4    3    2    1
Standard K volume         0    1    2    3    4
Emission response (mV)   36   35   57   69   80
and the 95% confidence limits for the fitted value are

2·48 ± 3·182 s_R [(1/5) + (2·48 − 5·2)²/48·8]^{1/2} = 2·48 ± 1·80 × 10⁻⁴ mg kg⁻¹
[Figure: emission response (mV) plotted against mg K added for the standard-additions data of Table 9.]
If the blank analysis gives a reading of 0·081 absorbance units and this is subtracted from each response, then the
value of b' is calculated as
b' = 7·333/184 = 0·0399
and the new line is shown in Fig. 6. For our unknown solution, of
corrected absorbance (0·182 − 0·081 = 0·102), then

x = A/b′ = 0·102/0·0399 = 2·56 mg kg⁻¹
The results are similar and the use of the blank solution simplifies the
calculations and removes the effect of bias or systematic error due to the
background level from the sample and the instrument.
Differences in matrix composition between the standards and the samples can give rise to a severe systematic error in an analysis. One
technique to overcome the problem is to use the method of standard
additions. By this method the sample to be analysed is split into several
sub-samples and to each are added small volumes of a known standard. A
simple example will serve to illustrate the technique and subsequent cal-
culations.
BIBLIOGRAPHY
Bevington, P.R., Data Reduction and Error Analysis for the Physical Sciences.
McGraw-Hill, New York, 1969.
Caulcutt, R. & Boddy, K., Statistics for Analytical Chemistry. Chapman and Hall,
London, 1983.
Chatfield, C., Statistics for Technology. Chapman and Hall, London, 1975.
Davis, J.C., Statistics and Data Analysis in Geology. J. Wiley and Sons, New York,
1973.
Malmstadt, H.V., Enke, C.G., Crouch, S.R. & Horlick, G., Electronic Measure-
ments for Scientists. W.A. Benjamin, California, 1974.
Chapter 6
1 INTRODUCTION
[Fig. 1: four scatter plots, (a)-(d), with correlation coefficients r = 0·855, 0·666, 0·980 and 0·672.]
Fig. 2. Effect of scale of plot on perception of plot; both plots have correlation
coefficients of 0·8. (Reprinted with permission from W.S. Cleveland, P. Dia-
conis & R. McGill Science 216 1138-1141, copyright 1982 by the American
Association for the Advancement of Science.)
of data sets with correlation coefficients between 0·3 and 0·8. The per-
ceived degree of association was found to shift by 10-15% as a result of
rescaling.
The purpose of visual representation of data is to provide the scientist/
technologist as a data analyst with insights into data behaviour not readily
obtained by nonvisual methods. However, one must be vigilant about the
psychological problems of experimenter bias resulting from anchoring of
one's interpretation onto preconceived notions as to the outcome of an
investigation.5,6 The importance of designing the visual presentation of
data so as to avoid bias in perception of either the presenter or the receiver
of such displays has been emphasized by several groups.7-10 These percep-
tual problems are active areas of collaborative research between statisti-
cians and psychologists. Awareness of the importance of subjective, per-
ceptual elements in the design of effective visual representations and a
willingness to take account of these in data analysis, and in the com-
munication of the results of such analysis, should now be considered as an
essential part of statistical good practice.
distort nonresistant analyses, so that the true extent of their influence may
be grossly distorted, and therefore not apparent when examining the
residuals. Some brief glimpses at various aspects of exploratory data
analysis (EDA) are outlined below and aspects relevant to the theme of
this chapter are discussed in more detail in the appropriate sections. The
reader is encouraged to delve further into this subject by studying texts
listed in the Bibliography and References.
calibration, etc. If the outlier still appears genuine, then it may well
become the starting point for new experiments or surveys.
2 TYPES OF DATA
3 TYPES OF DISPLAY
[Figs 3 and 4: displays of the xylene vapour exposure data, xylene concentration (mg m⁻³) ranging from 71 to 305.]
Fig. 4. (a) Cluttered one-dimensional scatter plot using vertical bars to repre-
sent individual observations of xylene vapour exposure of histopathology
technicians. It is not obvious from this plot that several observations overlap
(generated using Stata). (b) Alternative means of reducing clutter using jitter-
ing of the same data used for Figs. 3 and 4(a) but this method of plotting
reveals that there are several observations with the same value (generated
using Stata).
Fig. 5. Box and whisker plot. Fig. 6. Notched box and whisker plot.
TABLE 1
Relationship between letter values, their tags, tail areas and Gaussian letter
spreads
M    1/2
F    1/4       0·6745    1·349
E    1/8       1·1503    2·301
D    1/16      1·5341    3·068
C    1/32      1·8627    3·725
B    1/64      2·1539    4·308
A    1/128     2·4176    4·835
Z    1/256     2·6601    5·320
Y    1/512     2·8856    5·771
X    1/1024    3·0973    6·195
W    1/2048    3·2972    6·594
Fig. 7. (a) Five and (b) seven number summaries: simple letter value displays
of the xylene vapour exposure data.
218 174 125 170 145 180 124 135 115 148 264
305 107 144 202 239 106 171 201 137 216 224
153 102 173 125 119 154 186 118 204 141 226
105 150 194 129 118 233 224 227 170 128 173
71 185 144 209 155 124 108 124 144
detail in Refs. 24, 27 and 28, and will be briefly discussed here. The
midsummary is the average of a pair of letter values, so we can have the
midfourth, the mideighth, etc., all the way through to the midextreme or
midrange. The median is already a midsummary. The spread is the differ-
ence between the upper and lower letter values. The pseudosigma is
calculated by dividing the letter spread by the Gaussian letter spread.
Letter values and the derived parameters can be used to provide us with
a powerful range of graphical methods for exploring the shapes of data
distributions, especially for the analysis of skewness and elongation. 28
These techniques will be discussed in more detail in Section 4.7.
k = 0, 1, 2, ..., N;  0 < p < 1

Negative binomial:

$$ p_k = \binom{k + m - 1}{m - 1} p^m (1 - p)^k; \qquad k = 0, 1, 2, \ldots; \quad 0 < p < 1, \; m > 0 $$
[Fig. 9: schematic Ord plot showing the slopes expected for the negative binomial (slope > 0), Poisson (slope = 0), binomial (slope < 0) and logarithmic series distributions.]
Logarithmic series:

$$ p_k = \frac{-\phi^k}{k \ln(1 - \phi)}; \qquad k = 1, 2, \ldots; \quad 0 < \phi < 1 $$

where p_k is the relative number of observations in group k in the sample,
n_k is the number of observations in group k in the sample, L is the mean
of the Poisson distribution, p is the proportion of the population with the
particular attribute, and φ is the logarithmic series parameter.
4.6.1.1 Ord's procedure for large samples. Ord's procedure for large
samples30 involves calculating the following parameter:

u_k = k p_k / p_{k−1}

for all observed k values and then plotting u_k versus k for all n_{k−1} > 5.
If this plot appears reasonably linear, one of the models listed in Section
4.6.1 can be chosen by comparison with those in Fig. 9, but this procedure
is regarded as suitable only for use with large samples of data. 30 More
appropriate procedures for smaller samples are described briefly in the
next three sections; they are discussed in detail in Ref. 29.
where k is the number of classes into which the distribution is divided and
n_k is the number of observations in each class. The slope of this plot is
log_e(L) and the intercept is −L. The slope may be estimated by resistantly
fitting a line to the points and then estimating L from L = e^b. This
procedure works even if the Poisson distribution being fitted is truncated
or is a 'no-zeroes' Poisson distribution. Discrepant points in the plot do
not affect the position of other points, so the procedure is reasonably
resistant. A further improvement, which Hoaglin & Tukey29 introduced,
'levels' the Poissonness plot by plotting:
log_e(k! n_k/N) + [L₀ − k log_e(L₀)] versus k
where L₀ is a rough value for the Poisson parameter L, and this new plot
would have a slope of log_e(L) − log_e(L₀) and intercept L₀ − L. If the
original Lo was a reasonable estimate then this plot will be nearly as flat
as it is possible to achieve. If there is systematic curvature in the Poisson-
ness plot then the distribution is not Poisson.
4.6.1.4 Plots for other discrete data distributions. The same ap-
proach can be used for checking binomialness, negative binomialness or
tendency to conform to the logarithmic or the geometric series. The
negative binomial describes the distribution of many contagious insect popula-
tions, its parameter being a measure of their dispersion, and Southwood32 discusses
the problems of estimating the characteristic parameter of the negative
binomial using the more traditional approach. It would be useful to
re-assess data, previously assessed using the methods discussed by South-
wood,32 with the methods proposed in Hoaglin & Tukey.29 However, space
and time constraints limit the author to alerting the interested reader to
studying Ref. 29 on these very useful plots for estimating discrete distribu-
tion parameters, and for hunting out discrepancies in a way that is resis-
tant to misclassification.
Fig. 10. Hanging histobars plot of the xylene vapour exposure data (gener-
ated using Statgraphics PC).
we can calculate other parameters for graphical analysis of shape. The five
plots discussed below are described in more detail in Ref. 37.
regression line and confidence envelopes on the scatter plot but there are
numerous ways of performing regression and of calculating confidence
envelopes. In environmental studies, much data is collected from observa-
tion in the field or by measurements made on field-collected samples as
opposed to laboratory experiments. When regressing such variables
against one another, ordinary least squares regression is inappropriate
because the x variable is assumed to be error free; clearly this is not the
case in practice. Deming39 and Mandel40 have discussed ways of allowing
for errors in x within the framework of ordinary least squares. If outliers
and/or non-Gaussian behaviour are present, any such extreme data will
have a disproportionate effect on the position of the regression line even
for Deming/Mandel regression. Various exploratory, robust and non-
parametric methods are available for regression which are protective
against such behaviour. 41-43 No one method will give a unique best line but
methods are now appearing which enable the calculation of confidence
envelopes around resistant lines to be made. 44.45 Many of these resistant
and robust regression methods are likely to be in reasonable agreement
with one another. Using such methods, one can plot useful regression lines
and confidence envelopes. The scatter plot is also essential as a diagnostic
tool in the examination of regression residuals in guiding us on whether
the data need transformation to improve the regression.
Rousseeuw & Leroy43 make clear, the purpose of such plots is identifica-
tion not rejection. If a point is clearly erroneous, as a result of faulty
measurement, recording or transcription, then we have sound reasons for
rejection. Otherwise the identification of outliers should be a starting point
for the search for the reasons for the unusual behaviour and/or refinement
of the hypothesized model.
Fig. 13. Multiple jittered scatter plots of occupational exposure to the volatile
anaesthetic agent, halothane, in a poorly ventilated operating theatre, mg m⁻³
(data of the author and R. Sithamparanadarajah; generated using Stata).
Fig. 14. Graphical one-way analysis of variance, using multiple notched box
and whisker plots, of the same data as in Fig. 13 (generated using Statgraphics
PC).
Much attention is being devoted to the devising of clear displays and there
is much to be gained from the collaboration between statisticians/data
analysts and psychologists in this area. Kosslyn55 has usefully reviewed
some recent monographs on graphical display methods from the percep-
tual psychologist's viewpoint.
[Fig. 15: scatter plot matrix of halothane blood concentrations and halothane expired air concentrations.]
Fig. 16. Three dimensional scatter plot (same data as Fig. 15; generated using
Statgraphics PC).
Fig. 17. Casement display of same data as Fig. 15 (generated using Stat-
graphics PC).
individual samples. Only the signs of the residuals are used to make the
plot. Mead describes simple sorting routines and illustrates these ideas
with various data sets. 60
Fig. 20. Star symbol plots of the concentrations of various trace metals in 20
sediments collected from streams draining an old mining district in Sweden
(generated using Stata with data from Ref. 58).
of the variables, so this must be kept unvarying from one data vector to
another. The plot is termed a 'profile symbol' and many profiles may be
plotted on the same graph for comparison.
Providing that the density of display of such symbols is not too high
then this should provide a useful approach to spatial cluster analysis. It
could also be quite an effective way of demonstrating or illustrating, for
example, regional variations of species diversity or water quality in a
drainage basin.
A well-known multivariate coded symbol is the weather vane, which
combines quantitative and directional information in a well-tried way.
Varying the size of a symbol to convey quantitative information is not
a straightforward solution, because our perception of such symbols does
not always follow the designer's intentions. For example, using circles of
different areas to convey quantity is not a reliable approach, because in
our perception of circle size, we seem unable to judge their relative sizes
at all accurately, and our judgement of sphere symbols appears to fare
even worse. 75
[Figure: dendrogram of per cent similarity (100-80%) for the grassland species Leontodon, Poterium, Lolium, Trifolium, Ranunculus, Plantago, Helictotrichon, Anthriscus, Taraxacum, Heracleum, Lathyrus, Dactylis, Poa trivialis, Poa pratensis, Holcus, Alopecurus, Arrhenatherum, Agrostis, Anthoxanthum and Festuca.]
REFERENCES
32. Southwood, T.R.E., Ecological Methods with Particular Reference to the Study
of Insect Populations. Chapman and Hall, London, 1978, Chapter 2.
33. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical
Methods for Data Analysis. Wadsworth International Group, Belmont, CA,
1983, Chapter 6.
34. Hoaglin, D.C., In Exploring Data Tables, Trends, and Shapes, ed. D.C.
Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1985,
pp.437-8.
35. Statistical Graphics Corporation. STATGRAPHICS Statistical Graphics
System. User's Guide, Version 2.6. STSC, Inc., Rockville, MD, 1987, pp. 11-13
to 11-14.
36. Velleman, P.F. & Hoaglin, D.C., Applications, Basics, and Computing of
Exploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 9.
37. Hoaglin, D.C., Using quantiles to study shape. In Exploring Data Tables.
Trends. and Shapes, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey, John Wiley
& Sons, New York, 1985, Chapter 10.
38. Hoaglin, D.C., Summarizing shape numerically: the g-and-h distributions. In
Exploring Data Tables. Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller &
J.W. Tukey, John Wiley & Sons, New York, 1985, Chapter 11.
39. Deming, W.E., Statistical Adjustment of Data. John Wiley & Sons, New York,
1943, pp. 178-82.
40. Mandel, J., The Statistical Analysis of Experimental Data. John Wiley & Sons,
New York, 1964, pp. 288-92.
41. Theil, H., A rank invariant method of linear and polynomial regression
analysis, parts I, II and III. Proc. Kon. Nederl. Akad. Wetensch., A, 53 (1950)
386-92, 521-5, 1397-412.
42. Tukey, J.W., Exploratory Data Analysis, Limited Preliminary Edition, Ad-
dison-Wesley, Reading, MA, 1970.
43. Rousseeuw, P.J. & Leroy, A.M., Robust Regression and Outlier Detection.
John Wiley & Sons, New York, 1987, Chapter 2.
44. Lancaster, J.F. & Quade, D., A nonparametric test for linear regression based
on combining Kendall's tau with the sign test. J. Amer. Statist. Assoc., 80
(1985) 393-7.
45. Thompson, J.M., The use of a robust resistant regression method for personal
monitor validation with decay of trapped materials during storage. Analytica
Chimica Acta, 186 (1986), 205-12.
46. Cleveland, W.S. & McGill, R., The many faces of a scatterplot. J. Amer.
Statist. Assoc., 79 (1984) 807-22.
47. Tukey, J.W. & Tukey, P.A., Some graphics for studying four-dimensional
data. Computer Science and Statistics: Proceedings of the 14th. Symposium on
the Interface. Springer-Verlag, New York, 1983, pp. 60-6.
48. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical
Methods for Data Analysis. Duxbury Press, Belmont, CA, 1983, pp. 110-21.
49. Goodall, C., Examining residuals. In Understanding Robust and Exploratory
Data Analysis, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley &
Sons, New York, 1983, Chapter 7.
50. Atkinson, A.C., Plots, Transformations and Regression. An Introduction to
BIBLIOGRAPHY
Cleveland, W.S. & McGill, M.E. (ed.), Dynamic Graphics for Statistics. Wads-
worth, Belmont, CA, 1989.
Davis, J.C., Statistics and Data Analysis in Geology. John Wiley & Sons, New
York, 1973.
Eddy, W.F. (ed.), Computer Science and Statistics: Proceedings of the 13th Sym-
posium on the Interface. Springer-Verlag, New York, 1981.
Green, W.R., Computer-aided Data Analysis. A Practical Guide. John Wiley &
Sons, New York, 1985.
Haining, R., Spatial Data Analysis in the Social and Environmental Sciences.
Cambridge University Press, Cambridge, 1990.
Isaaks, E.H. & Srivastava, R.M., An Introduction to Applied Geostatistics. Oxford
University Press, New York, 1989.
Ripley, B.D., Spatial Statistics. J. Wiley & Sons, New York, 1981.
Ripley, B.D., Statistical Inference for Spatial Processes. Cambridge University
Press, Cambridge, 1988.
Upton, G. & Fingleton, B., Spatial Data Analysis by Example. Volume 1, Point
Pattern and Quantitative Data and Volume 2, Categorical and Directional Data.
John Wiley & Sons, New York, 1985, 1989.
Chapter 7
KATHLEEN A. CARLBERG
29 Hoffman Place, Belle Mead, New Jersey 08502, USA
MITZI S. MILLER
Automated Compliance Systems, 673 Emory Valley Road, Oak Ridge,
Tennessee 37830, USA
1 INTRODUCTION
1.1 Background
Environmental assessment activities may be viewed as being comprised of
four parts: establishment of Data Quality Objectives (DQO); design of the
Sampling and Analytical Plan; execution of the Sampling and Analytical
Plan; and Data Assessment.
During the last 20 years, numerous environmental assessments have
been conducted, many of which have not met the needs of the data users.
In an effort to resolve many of these problems, the National Academy of
Sciences of the United States was requested by the US Environmental
Protection Agency (EPA) to review the Agency's Quality Assurance (QA)
Program.1
This review and the efforts of the EPA Quality Assurance Management
Staff have led to the use of both Total Quality Principles and the DQO
concept. Data quality objectives are interactive management tools used to
interpret and communicate the data users' needs to the data supplier such
that the supplier can develop the necessary objectives for QA and appro-
priate levels of quality control. In the past, it was not considered important
that data users convey to data suppliers what the quality of the data
should be.
Data use objectives are statements relating to why data are needed, what
questions have to be answered, what decisions have to be made, and/or
what decisions need to be supported by the data. It should be recognized
that the highest attainable quality may be unrelated to the quality of data
adequate for a stated objective. The suppliers of data should know what
quality is achievable, respond to the users' quality needs, and eliminate
unnecessary costs associated with providing data of much higher quality
than needed or with producing data of inadequate quality.
Use of the DQO process in the development of the assessment plan
assures that the data adequately support decisions, provides for cost
effective sampling and analysis, prevents unnecessary repetition of work
due to incorrect quality specifications, and assures that decisions which
require data collection are considered in the planning phase.
The discussions presented here assume the DQO process is complete
and the data use objectives are set, thus allowing the sampling and analyti-
cal design process to begin.
The sampling and analytical design process should, at a minimum, address
several basic issues, including: historical information; the purpose of the
sample collection along with the rationale for the number, location, type
of samples and analytical parameters; field and laboratory QA pro-
cedures.
Once these and other pertinent questions are addressed, a QA program
may be designed that is capable of providing the quality specifications
required for each project. By using this approach, the QA program ensures
a project capable of providing the data necessary for meeting the data use
objectives.
2.1.1 Organization
The field and laboratory organizations should be structured such that each
member of the organization has a clear understanding of their role and
responsibilities. To accomplish this, a table of organization indicating
lines of authority, areas of responsibility, and job descriptions for key
personnel should be available. The organization's management must
promote QA ideals and provide the technical staff with a written QA
policy. This policy should require development, implementation, and
maintenance of a documented QA program. To further demonstrate their
2.1.2 Personnel
Current job descriptions and qualification requirements, including educa-
tion, training, technical knowledge and experience, should be maintained
for each staff position. Training shall be provided for each staff member
to enable proper performance of their assigned duties. All such training
must be documented and summarized in the individual's training file.
Where appropriate, evidence demonstrating that the training met minimal
acceptance criteria is necessary to establish competence with testing, sam-
pling, or other procedures. The proportion of supervisory to non-supervis-
ory staff should be maintained at the level needed to ensure production of
quality environmental data. In addition, back-up personnel should be
designated for all senior technical positions, and they should be trained to
provide for continuity of operations in the absence of the senior technical
staff member. There should be sufficient personnel to provide timely and
proper conduct of analytical sampling and analysis in conformance with
the QA program.
and clerical and secretarial services. Each member of the support staff
must be provided with on-the-job training to enable them to perform the
job in conformance with the requirements of the job description and the
QA program. Such training must enable the trainee to meet adopted
performance criteria and it must also be documented in the employee's
training file.
2.1.3 Subcontractors
All subcontractors should be required to meet the same quality standards
as the primary laboratory or sample collection organization. Subcontrac-
tors should be audited against the same criteria as the in-house laboratories,
for example through quality control samples (such as double-blind samples)
and site visits conducted by the QA Officer.
2.2 Facilities
Improperly designed and poorly maintained laboratory facilities can have
a significant effect on the results of analyses, the health, safety and morale
of the analysts, and the safe operation of the facilities. Although the
emphasis here is on the production of quality data, proper facility design
2.2.1 General
Each laboratory should be sized and constructed to facilitate the proper
conduct of laboratory analyses and associated operations. Adequate
bench space or working area per analyst should be provided; 4·6-
7·6 meters of bench space or 14-28 square meters of floor space per analyst
has been recommended.21 Lighting requirements may vary depending
upon the tasks being performed. Lighting levels in the range of 540-
1075 lumens per square meter are usually adequate.24 A stable and reliable
source of power is essential to proper operation of many laboratory
instruments. Surge suppressors are required for computers and other
sensitive instruments, and uninterruptible power supplies, as well as isolated
ground circuits, may be required. The actual requirements depend on the
equipment or apparatus utilized, power line supply characteristics and the
number of operations that are to be performed (many laboratories have
more work stations than analysts) at one time. The specific instrumenta-
tion, equipment, materials and supplies required for performance of a test
method are usually described in the approved procedure. If the laboratory
intends to perform a new test, it must acquire the necessary instrumenta-
tion and supplies, provide the space, and conduct the training necessary
to demonstrate competence with the new test before analyzing any routine
samples.
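As a rough illustration of how the figures quoted above might be applied when planning laboratory space, the short sketch below (in Python) checks a layout against the recommended bench-space, floor-space and lighting ranges; the analyst count and room figures are invented for the example.

# Rough check of a proposed laboratory layout against the guideline
# ranges quoted above (4.6-7.6 m bench and 14-28 m2 floor space per
# analyst; 540-1075 lumens per square metre).  All inputs are hypothetical.

analysts = 6
bench_length_m = 36.0     # total linear bench provided
floor_area_m2 = 120.0     # total working floor area
illuminance = 800.0       # measured lumens per square metre

bench_per_analyst = bench_length_m / analysts
floor_per_analyst = floor_area_m2 / analysts

print(f"Bench per analyst: {bench_per_analyst:.1f} m "
      f"({'OK' if 4.6 <= bench_per_analyst <= 7.6 else 'outside guideline'})")
print(f"Floor per analyst: {floor_per_analyst:.1f} m2 "
      f"({'OK' if 14 <= floor_per_analyst <= 28 else 'outside guideline'})")
print(f"Lighting: {illuminance:.0f} lm/m2 "
      f"({'OK' if 540 <= illuminance <= 1075 else 'outside guideline'})")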
2.3.1 General
The laboratory and field sampling organizations should have available all
items of equipment and instrumentation necessary to perform correctly and
accurately all the testing and measuring services that they provide to their
users. For field sampling activities, the site facilities should be
examined prior to beginning work to ensure that appropriate facilities,
equipment, instrumentation, and supplies are available to accomplish the
objectives of the QA plan. Records should be maintained on all major
items of equipment and instrumentation which should include the name of
the item, the manufacturer and model number, serial number, date of
purchase, date placed in service, accessories, any modifications, updates or
upgrades that have been made, current location of the equipment, as well
as related accessories and manuals, and all details of maintenance. Items
of equipment used for environmental testing must meet certain minimum
requirements. For example, analytical balances must be capable of weigh-
ing to 0·1 mg; pH meters must have scale graduations of at least 0·01 pH
units and must employ a temperature sensor or thermometer; sample
storage refrigerators must be capable of maintaining the temperature in
the range 2-5°C; the laboratory must have access to a certified National
Institute of Standards and Technology (NIST) traceable thermometer; ther-
mometers used should be calibrated and have graduations no larger than
that appropriate to the testing or operating procedure; and probes for
conductivity measurements must be of the appropriate sensitivity.
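A minimal sketch of how the equipment records described above might be structured is given below (Python); the field names and the example instrument are illustrative only, not a prescribed format.

# Minimal sketch of an equipment/instrument record holding the fields
# listed above; field names and the example entry are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EquipmentRecord:
    name: str
    manufacturer: str
    model_number: str
    serial_number: str
    date_of_purchase: str
    date_placed_in_service: str
    current_location: str
    accessories: List[str] = field(default_factory=list)
    modifications: List[str] = field(default_factory=list)    # updates and upgrades
    maintenance_log: List[str] = field(default_factory=list)  # dated maintenance details

balance = EquipmentRecord(
    name="Analytical balance", manufacturer="ExampleCorp", model_number="AB-204",
    serial_number="12345", date_of_purchase="1990-03-01",
    date_placed_in_service="1990-03-15", current_location="Room 101")
balance.maintenance_log.append("1991-03-15: annual service and calibration check")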
must provide clearly defined and written maintenance procedures for each
measurement system and the required support equipment. The program
must also detail the records required for each maintenance activity to
document the adequacy of maintenance schedules and the parts inventory.
All equipment should be properly maintained to ensure protection from
corrosion and other causes of deterioration. A proper maintenance pro-
cedure should be available for those items of equipment which require
periodic maintenance. Any item of equipment which has been subject to
mishandling, which gives suspect results, or has been shown by calibration
or other procedure to be defective, should be taken out of service and
clearly labelled until it has been repaired. After repair, the equipment must
be shown by test or calibration to be performing satisfactorily. All actions
taken regarding calibration and maintenance should be documented in a
permanent record.
Calibration of equipment used for environmental testing usually falls in
either the operational or periodic calibration category. Operational cali-
bration is usually performed prior to each use of an instrumental measure-
ment system. It typically involves developing a calibration curve and
verification with a reference material. Periodic calibration is performed
depending on use, or at least annually on items of equipment or devices
that are very stable in operation, such as balances, weights, thermometers
and ovens. Typical calibration frequencies for instrumentation are either
prior to use, daily or every 12 h. All such calibrations should be traceable
to a recognized authority such as NIST. The calibration process involves:
identifying equipment to be calibrated; identifying reference standards
(both physical and chemical) used for calibration; identifying, where
appropriate, the concentration of standards; use of calibration pro-
cedures; use of performance criteria; use of a stated frequency of calibra-
tion; and the appropriate records and documentation to support the
calibration process.
3.1 Calibration
Calibration criteria should be specified for each test technology and/or
analytical instrument and method in order to verify measurement system
performance. The calibration used should be consistent with the data use
objectives. Such criteria should specify the number of standards necessary
to establish a calibration curve, the procedures to employ for determining
linear fit and linear range of the calibration, and acceptance limits for
(continuing) analysis of calibration curve verification standards. All SOPs
for field and laboratory measurement system operation should specify
system calibration requirements, acceptance limits for analysis of calibra-
tion curve verification standards, and detailed steps to be taken when
acceptance limits are exceeded.
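As an illustration of the operational calibration and verification steps described above, the following sketch (Python with NumPy) fits a linear calibration curve to a set of standards and checks a continuing calibration verification standard against an acceptance window; the concentrations, responses and the 90-110% recovery limit are assumptions for the example, not values taken from any particular method or SOP.

# Fit a linear calibration curve to replicate standards and verify it
# with a mid-range check standard.  All numerical values are illustrative.
import numpy as np

conc = np.array([0.0, 1.0, 2.0, 5.0, 10.0])            # standard concentrations (mg/L)
resp = np.array([0.002, 0.101, 0.198, 0.507, 1.003])   # instrument responses

slope, intercept = np.polyfit(conc, resp, 1)            # least-squares linear fit
r = np.corrcoef(conc, resp)[0, 1]
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, r^2 = {r**2:.4f}")

# Continuing calibration verification: a standard of known concentration
ccv_true = 5.0
ccv_resp = 0.492
ccv_found = (ccv_resp - intercept) / slope
recovery = 100.0 * ccv_found / ccv_true
print(f"CCV recovery = {recovery:.1f}% "
      f"({'acceptable' if 90.0 <= recovery <= 110.0 else 'recalibrate'})")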
Detection and quantitation limits in common use are summarized below (term, definition, method of determination and value):

Detection limit (DL). The concentration which is distinctly detectable above, but close to, a blank. Determined by analysis of replicate standards; the value is two times the standard deviation.35

Limit of detection (LOD). The lowest concentration that can be determined to be statistically different from a blank. Determined by analysis of replicate samples; the value is three times the standard deviation.36

Method detection limit (MDL). The minimum concentration of a substance that can be identified, measured and reported with 99% confidence that the analyte concentration is greater than zero. Determined by analysis of a minimum of seven replicates spiked at 1-5 times the expected detection limit; the value is the standard deviation times the Student t-value at the desired confidence level (for seven replicates, the value is 3·14).37

Instrument detection limit (IDL). The smallest signal above background noise that an instrument can detect reliably. Determined by analysis of three replicate standards at concentrations of 3-5 times the detection limit; the value is three times the standard deviation.38

Method quantitation limit (MQL). The minimum concentration of a substance that can be measured and reported. Determined by analysis of replicate samples; the value is five times the standard deviation.39

Limit of quantitation (LOQ). The level above which quantitative results may be obtained with a specified degree of confidence. Determined by analysis of replicate samples; the value is ten times the standard deviation.36

Practical quantitation limit (PQL). The lowest level that can be reliably determined within specified limits of precision and accuracy during routine laboratory operating conditions. Determined by interlaboratory analysis of check samples; the value is (1) ten times the MDL,39 or (2) the value at which 80% of laboratories are within 20% of the true value.40

Contract required detection limit (CRDL). The reporting limit specified for laboratories under contract to the EPA for Superfund activities; how it is determined, and its value, are not specified.38
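A worked example of the MDL and LOQ computations defined above is sketched below (Python); the seven replicate results are invented for illustration, and the t-value of 3·14 is the 99% confidence value for seven replicates quoted above.

# Worked MDL example: seven replicate analyses of a low-level spike;
# MDL = s * t, with t = 3.14 for n - 1 = 6 degrees of freedom at 99%
# confidence.  The replicate results are illustrative only.
import statistics

replicates = [2.1, 1.8, 2.4, 2.0, 1.9, 2.2, 2.3]   # ug/L, spiked near the expected DL
s = statistics.stdev(replicates)                   # sample standard deviation
t_99_df6 = 3.14
mdl = t_99_df6 * s
print(f"s = {s:.3f} ug/L, MDL = {mdl:.3f} ug/L")

# For comparison, the limit of quantitation (LOQ) as defined above
loq = 10.0 * s
print(f"LOQ = {loq:.3f} ug/L")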
how the results of analysis of these samples will be used in evaluating the
data.
Common types of field and laboratory QC samples are discussed in the
following sections, along with how they are commonly used in evaluating
data quality.
5 DATA ASSESSMENT34,52
Data assessment involves determining whether the data meet the require-
ments of the QA plan and the needs of the data user. Data assessment is
a three-part process involving assessment of the field data, the laboratory
data and the combined field and laboratory data. Both the field and
laboratory assessments involve comparison of the data obtained to the
specifications stated in the QA plan, whereas the combined, or overall,
assessment involves determining data usability: whether the data meet the
DQOs and are appropriate for the intended use.
and trends. The control limits from these data should be used on a
real-time basis to demonstrate measurement system control and sample
data validity.
(3) Performance evaluation sample: the results of PE sample analyses are
used to evaluate and compare different laboratories. In order for such
comparisons to be valid, the PE sample results must have associated
control limits that are statistically valid.
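By way of illustration, the sketch below (Python) derives warning and control limits from a set of historical QC recoveries using the common mean ± 2s and ± 3s convention; the recovery values are invented, and a real chart would use the limits specified in the QA plan.

# Derive Shewhart-style warning and control limits from historical QC
# results (e.g. repeated analyses of a laboratory control sample).
# All recovery values are illustrative only.
import statistics

qc_recoveries = [98.2, 101.5, 99.7, 97.8, 102.1, 100.4, 98.9, 101.0,
                 99.3, 100.8, 97.5, 102.4]   # percent recovery, historical data

mean = statistics.mean(qc_recoveries)
s = statistics.stdev(qc_recoveries)
print(f"centre line = {mean:.1f}%")
print(f"warning limits = {mean - 2*s:.1f}% to {mean + 2*s:.1f}%")
print(f"control limits = {mean - 3*s:.1f}% to {mean + 3*s:.1f}%")

new_result = 103.9
if not (mean - 3*s <= new_result <= mean + 3*s):
    print("out of control: investigate before reporting sample data")
elif not (mean - 2*s <= new_result <= mean + 2*s):
    print("warning: result between warning and control limits")
else:
    print("in control")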
5.2.6 Calibration
Instrument calibration information, including sensitivity checks, instru-
ment calibration and continuing calibration checks, should be evaluated
and compared to historical information and acceptance criteria. Control
charts are a preferred method for documenting calibration performance.
Standard and reference material traceability must be evaluated and documented.
(6) observations, remarks, and name and signature of person recording the
data.
ACKNOWLEDGEMENT
REFERENCES*
*The majority of the references cited are periodically reviewed and updated; the
reader is advised to consult the latest edition of these documents.
tion of criteria for use in the evaluation of testing laboratories and inspection
bodies. ASTM Designation E548, Philadelphia, PA, 1984.
16. American Society for Testing and Materials, Standard guide for laboratory
accreditation systems. ASTM Designation E994, Philadelphia, PA, 1990.
17. Locke, J.W., Quality, productivity, and the competitive position of a testing
laboratory. ASTM Standardization News, July (1985) 48-52.
18. American National Standards Institute, Fire protection for laboratories using
chemicals. National Fire Protection Association, Designation ANSI/NFPA
45, Quincy, MA, 1986.
19. Koenigsberg, J., Building a safe laboratory environment. American Laborat-
ory, 19 June (1987) 96-105.
20. US Code of Federal Regulations, Title 29, Part 1910, Occupational Safety and
Health Standards, Subpart Z, Toxic and Hazardous Substances, Section
.1450, Occupational exposure to hazardous chemicals in laboratories, OSHA,
1990, pp 373-89.
21. American Society for Testing and Materials, Standard guide for good laborat-
ory practices in laboratories engaged in sampling and analysis of water.
ASTM Designation D3856, Philadelphia, PA, 1988.
22. Committee on Industrial Ventilation, Industrial Ventilation: A Manual of
Recommended Practice, 20th edn. American Conference of Governmental
Industrial Hygienists, Cincinnati, OH, 1988.
23. American National Standards Institute, Flammable and combustible liquids
code. National Fire Protection Association, Designation ANSI/NFPA 30,
Quincy, MA, 1987.
24. Kaufman, J.E. (ed.), IES Lighting Handbook: The Standard Lighting Guide,
5th edn. Illuminating Engineering Society, New York, 1972.
25. US Code of Federal Regulations, Title 29, Part 1910, Occupational Safety and
Health Standards, Subpart H, Hazardous Materials, Section .106, Flammable
and combustible liquids, OSHA, 1990, pp. 242-75.
26. American Society for Testing and Materials, Standard guide for documenting
the standard operating procedure for the analysis of water. ASTM Designa-
tion D5172, Philadelphia, PA, 1991.
27. American Society for Testing and Materials, Standard guide for records
management in spectrometry laboratories performing analysis in support of
nonclinical laboratory studies. ASTM Designation E899, Philadelphia, PA,
1987.
28. American Society for Testing and Materials, Standard practice for sampling
chain of custody procedures. ASTM Designation D4840, Philadelphia, PA,
1988.
29. American Society for Testing and Materials, Standard guide for accountabil-
ity and quality control in the chemical analysis laboratory. ASTM Designa-
tion E882, Philadelphia, PA, 1987.
30. American Society for Testing and Materials, Standard guide for quality
assurance of laboratories using molecular spectroscopy. ASTM Designation
E924, Philadelphia, PA, 1990.
31. US Environmental Protection Agency, Handbook for Analytical Quality
Control in Water and Wastewater Laboratories. EPA-600/4-79-019, USEPA,
Environmental Monitoring and Support Laboratory, Cincinnati, OH, 1979.
32. American Society for Testing and Materials, Standard practice for determina-
tion of precision and bias of applicable methods of Committee D-19 on water.
ASTM Designation D2777, Philadelphia, PA, 1986.
33. American Society for Testing and Materials, Standard practice for intralabo-
ratory quality control procedures and a discussion on reporting low-level data.
ASTM Designation D4210, Philadelphia, PA, 1989.
34. US Environmental Protection Agency, Guidance Document for Assessment of
RCRA Environmental Data Quality. DRAFT, USEPA, Office of Solid Waste
and Emergency Response, Washington, DC, 1987.
35. US Environmental Protection Agency, Methods for Chemical Analysis of
Water and Wastes. EPA/600/4-79-020, Environmental Monitoring and
Support Laboratory, Cincinnati, OH, revised 1983.
36. Keith, L.H., Crummett, W., Deegan, J., Jr., Libby, R.A., Taylor, J.K. &
Wentler, G., Principles of environmental analysis. Analytical Chemistry, 55
(1983) 2210-18.
37. US Code of Federal Regulations, Title 40, Part 136, Guidelines establishing
test procedures for the analysis of pollutants, Appendix B-Definition and
procedure for the determination of the method detection limit, Revision 1.11,
USEPA, 1990, pp. 537-9.
38. US Environmental Protection Agency, User's Guide to the Contract Laborat-
ory Program: and Statements of Work for Specific Types of Analysis. Office of
Emergency and Remedial Response, USEPA 9240.0-1, December 1988,
Washington, DC.
39. US Environmental Protection Agency, Test Methods for Evaluating Solid
Waste, SW-846, 3rd edn, Office of Solid Waste (RCRA), Washington, DC,
1990.
40. US Code of Federal Regulations, Title 40, Protection of Environment, Part 141
-National Primary Drinking Water Regulations, Subpart C: Monitoring and
Analytical Requirements, Section .24, Organic Chemicals other than total
trihalomethanes, sampling and analytical requirements, USEPA, 1990, pp.
574-9.
41. Britton, P., US Environmental Protection Agency, Estimation of generic
acceptance limits for quality control purposes in a drinking water laboratory.
Environmental Monitoring and Support Laboratory, Cincinnati, OH, 1989.
42. Britton, P., US Environmental Protection Agency, Estimation of generic
quality control limits for use in a water pollution laboratory. Environmental
Monitoring and Support Laboratory, Cincinnati, OH, 1989.
43. Britton, P. & Lewis, D., US Environmental Protection Agency, Statistical
basis for laboratory performance evaluation limits. Environmental Monitor-
ing and Support Laboratory, Cincinnati, OH, 1986.
44. US Environmental Protection Agency, Data Quality Objectives for Remedial
Response Activities Example Scenario: RI/FS Activities at a Site with Contami-
nated Soils and Ground Water. EPA/540/G-87/004, USEPA, Office of Emer-
gency and Remedial Response and Office of Waste Programs Enforcement,
Washington, DC, 1987.
45. US Environmental Protection Agency, Field Screening Methods Catalog:
User's Guide. EPA/540/2-88/005, USEPA, Office of Emergency and Remedial
Response, Washington, DC, 1988.
INDEX
Tables, 216
  presentation of, 3-7
  cumulative, 6-7
  frequency, 3-6, 15-7, 29
  mean, of, 15-7
  standard deviation, of, 29
Target transformation factor analysis (TTFA), 169-76
t-distribution, 192-3
Technical staff, 264
Third party audits, 281
Three-variable, two-dimensional views, 239
Time-domain time-series, see Correlation analysis
Time-series analysis, see Nonstationary time-series analysis
Uncorrelated variables, 87
Unimodal distribution, 17
Univariate methods, see Nonstationary time-series analysis
Validation, 283-4, 292
Variability, decomposition, 95-9
Variable transformation, 234-5
Variables, see specific types of
Variance, 24-9, 112-9
  see also Standard deviation
Variance analysis, see ANOVA