

Understanding the Structure of Scientific Data
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This is the first in a series of articles that aims to promote the better use of statistics by scientists. The series intends to show everyone from bench chemists to laboratory managers that the application of many statistical methods does not require the services of a 'statistician' or a 'mathematician' to convert chemical data into useful information. Each article will be a concise introduction to a small subset of methods. Wherever possible, diagrams will be used and equations kept to a minimum; for those wanting more theory, references to relevant statistical books and standards will be included. By the end of the series, the scientist should have an understanding of the most common statistical methods and be able to perform the tests while avoiding the pitfalls inherent in their misapplication.

In this article we look at the initial steps in data analysis (i.e., exploratory data analysis) and how to calculate the basic summary statistics (the mean and sample standard deviation). These two processes, which increase our understanding of the data structure, are vital if the correct selection of more advanced statistical methods and interpretation of their results are to be achieved. From that base we will progress to significance testing (t-tests and the F-test). These statistics allow a comparison between two sets of results in an objective and unbiased way. For example, significance tests are useful when comparing a new analytical method with an old method, or when comparing the current day's production with that of the previous day.

Exploratory Data Analysis
Exploratory data analysis is a term used to describe a group of techniques (largely graphical in nature) that sheds light on the structure of the data. Without this knowledge the scientist, or anyone else, cannot be sure they are using the correct form of statistical evaluation.

The statistics and graphs referred to in this first section are applicable to a single column of data (i.e., univariate data), such as the number of analyses performed in a laboratory each month. For small amounts of data (<15 points), a blob plot (also known as a dot plot) can be used to explore how the data set is distributed (Figure 1). Blob plots are constructed simply by drawing a line, marking it off with a suitable scale and plotting the data along the axis.

A stem-and-leaf plot is yet another method for examining patterns in the data set. These are complex to describe and perceived as old-fashioned, especially with the modern graphical packages available today; for the sake of completeness they are described in Box 1.

For larger data sets, frequency histograms (Figure 2(a)) and Box and Whisker plots (Figure 2(b)) may be better options to display the data distribution. Once the data set is entered or, as is more usual with modern instrumentation, electronically imported, most modern PC statistical packages can construct these graph types with a few clicks of the mouse. All of these plots can give an indication of the presence or absence of outliers (1). The frequency histogram, stem-and-leaf plot and blob plot can also indicate the type of distribution the data belong to. It should be remembered that if the data set is from a non-normal (2) distribution (Figure 2(a) and possibly Figure 1(a)), what looks like an outlier may in fact be a good piece of information. The outliers are the most extreme points on the right-hand side of Figures 1(a) and 2(a). Note: outliers, outlier tests and robust methods will be the subject of a later article.

[figure 1 Blob plots of the raw data: panels (a) and (b), each showing the data plotted along a scale with the mean marked.]

Assuming there are no obvious outliers, we still have to do one more plot to make sure we understand the data structure. The individual results should be plotted against a time index (i.e., the order in which the data were

obtained). If any systematic trends are observed (Figures 3(a)–3(c)) then the reasons for this must be investigated. Normal statistical methods assume a random distribution about the mean with time (Figure 3(d)), but if this is not the case the interpretation of the statistics can be erroneous.

Summary Statistics
Summary statistics are used to make sense of large amounts of data. Typically, the mean, sample standard deviation, range, confidence intervals, quantiles (1), and measures of skewness and of the spread/peakedness of the distribution (kurtosis) are reported (2). The mean and sample standard deviation are the most widely used and are discussed below, together with how they relate to the confidence intervals for normally distributed data.

The Mean
The average or arithmetic mean (3) is generally the first statistic everyone is taught to calculate. It is easily found using a calculator or spreadsheet and simply involves summing the individual results (x1, x2, x3, ..., xn) and dividing by the number of results (n):

x̄ = (Σ xi) / n, where Σ xi = x1 + x2 + x3 + … + xn (the sum running over i = 1 to n)

Unfortunately, the mean is often reported as an estimate of the 'true value' (µ) of whatever is being measured, without considering the underlying distribution. This is a mistake. Before any statistic is calculated it is important that the raw data are carefully scrutinized and plotted as described above: an outlying point can have a big effect on the mean (compare Figure 1(a) with 1(b)).

Box 1: Stem-and-leaf plot
A stem-and-leaf plot is another method of examining patterns in the data set. It shows the range, where the values are concentrated, and the symmetry. This type of plot is constructed by splitting the data into the stem (the leading digits; in the plot below, these run from 0.1 to 0.6) and the leaf (the trailing digit). Thus, 0.216 is represented as 2|1 and 0.350 by 3|5. Note that the decimal places are truncated, not rounded, in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on the lines closest to that end of the range); thus, there are 13 points which lie between 0.40 and 0.63. The line containing the middle value is indicated differently, with a count (the number of items in the line) enclosed in parentheses.

Stem-and-leaf plot (units = 0.1, so 1|2 = 0.12; count = 42):
  5   1|22677
 14   2|112224578
(15)  3|000011122333355
 13   4|0047889
  6   5|56669
  1   6|3

[figure 2 Frequency histogram and Box and Whisker plot. Panel (a): frequency histogram (frequency = number of data points in each bar). Panel (b): Box and Whisker plot, marking the upper quartile value, the median, the lower quartile value, whiskers extending 1.5 × the interquartile range* and an outlier beyond them. *The interquartile range is the range which contains the middle 50% of the data when it is sorted into ascending order.]

[figure 3 Time-indexed plots of magnitude against time: (a) n = 7, mean = 6, standard deviation = 2.16; (b) n = 9, mean = 6, standard deviation = 2.65; (c) n = 9, mean = 6, standard deviation = 2.06; (d) n = 9, mean = 6, standard deviation = 1.80. Panels (a)–(c) show systematic trends; panel (d) shows random variation about the mean.]


The Standard Deviation (3)
The standard deviation is a measure of the spread of the data (dispersion) about the mean and can again be calculated using a calculator or spreadsheet. There is, however, a slight added complication: if you look at a typical scientific calculator you will notice there are two types of standard deviation (denoted by the symbols σn and σn−1, or σ and s):

σn = √[ Σ(xi − µ)² / n ]    (population)
s = σn−1 = √[ Σ(xi − x̄)² / (n − 1) ]    (sample)

The correct one to use depends upon how the problem is framed. For example, suppose each batch of a chemical contains 10 sub-units, and you are asked to analyse each sub-unit in a single batch for mercury contamination and report the mean mercury content and standard deviation. Now, if the mean and standard deviation are to be used solely with this analysed batch, then the 10 results represent the whole population (i.e., all are tested) and the correct standard deviation to use is the one for a population (σn). If, however, the intended use of the results is to estimate the mercury contamination for several batches of the chemical, then the 10 results represent a sample from the whole population and the correct standard deviation to use is that for a sample (σn−1). If you are using a statistical package you should always check that the correct standard deviation is being calculated for your particular problem.

Interpreting the mean and standard deviation
If the distribution is normal (i.e., when the data are plotted they approximate to the curve shown in Figure 4) then the mean is located at the centre of the distribution. Sixty-eight per cent of the results will be contained within ±1 standard deviation from the mean, 95% within ±2 standard deviations and 99.7% within ±3 standard deviations.

[figure 4 The relationship between the normal distribution curve, the mean and the standard deviation: 68% of results lie within ±1 standard deviation of the mean, 95% within ±2 and 99.7% within ±3.]

Using the above facts it is possible to estimate a standard deviation from a stated confidence interval and, vice versa, a confidence interval from a standard deviation. For example, if a mean value of 0.72 ±0.02 g/L at the 95% confidence level is quoted, then it follows that the standard deviation = 0.02/2, or 0.01 g/L. If the same figure were quoted at the 99.7% confidence level, the standard deviation would be 0.02/3, or 0.0067 g/L.

[figure 5 Comparison of different data sets. Blob plots of pairs of data sets (i) and (ii):
(a) means probably not different; would 'pass' the t-test (tcrit > tcalculated).
(b) means probably different; would 'fail' the t-test (tcrit < tcalculated).
(c) means could be different, but there are not enough data to say for sure (i.e., would 'pass' the t-test, tcrit > tcalculated).
(d) practically identical means (µ1, µ2), but with so many data points there is a small but statistically significant ('real') difference, so the sets would 'fail' the t-test (tcrit < tcalculated).
(e) spreads in the data, as measured by the variance, are similar; would 'pass' the F-test (Fcrit > Fcalculated).
(f) spreads are different; would 'fail' the F-test (Fcrit < Fcalculated), hence (i) gives more consistent results than (ii).
(g) spreads could be different, but there are not enough data to say for sure; would 'pass' the F-test (Fcrit > Fcalculated).]
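The distinction between the two standard deviations is easy to demonstrate in software. The short Python sketch below is a minimal illustration (Python and numpy are my choice of tool, not the article's, and the ten mercury results are invented for the purpose): the ddof argument switches between the population and sample forms.

```python
import numpy as np

# Ten invented mercury results for a single batch (g/L)
hg = np.array([0.71, 0.73, 0.70, 0.72, 0.74, 0.71, 0.73, 0.72, 0.70, 0.74])

print(f"mean = {hg.mean():.3f} g/L")
print(f"population sd, sigma_n     = {hg.std(ddof=0):.4f}")  # divide by n
print(f"sample sd, sigma_(n-1) = s = {hg.std(ddof=1):.4f}")  # divide by n - 1
```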

Significance Testing
Suppose, for example, we have the following two sets of results for lead content in water: 17.3, 17.3, 17.4, 17.4 and 18.5, 18.6, 18.5, 18.6. It is fairly clear, simply by looking at the data, that the two sets are different. In reaching this conclusion you have probably considered the amount of data, the average for each set and the spread in the results. In many situations, however, the difference between two sets of data is not so clear. The application of significance tests gives us a more systematic way of assessing the results, with the added advantage of allowing us to express our conclusion with a stated degree of confidence.

What does significance mean?
In statistics the words 'significant' and 'significance' have specific meanings. A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences that are unlikely to have occurred because of purely random variation.

As previously mentioned, deciding whether one set of results is significantly different from another depends not only on the magnitude of the difference in the means, but also on the amount of data available and its spread. For example, consider the blob plots shown in Figure 5. For the two data sets shown in Figure 5(a), the means for set (i) and set (ii) are numerically different; from the limited amount of information available, however, they are, from a statistical point of view, the same. For Figure 5(b), the means for set (i) and set (ii) are probably different, but when fewer data points are available (Figure 5(c)) we cannot be sure, with any degree of confidence, that the means are different, even if they are a long way apart. With a large number of data points, even a very small difference can be significant (Figure 5(d)). Similarly, when we are interested in comparing the spread of results, for example when we want to know if method (i) gives more consistent results than method (ii), we have to take note of the amount of information available (Figures 5(e)–(g)).

It is fortunate that tables are published that show how large a difference needs to be before it can be considered not to have occurred by chance: critical t-values for differences between means, and critical F-values for differences between the spread of results (4).

Note: significance is a function of sample size. Comparing very large samples will nearly always lead to a significant difference, but a statistically significant result is not necessarily an important result. For example, in Figure 5(d) there is a statistically significant difference, but does it really matter in practice?

What is a t-test?
A t-test is a statistical procedure that can be used to compare mean values. A lot of jargon surrounds these tests (see Table 1 for definitions of the terms used below), but they are relatively simple to apply using the built-in functions of a spreadsheet like Excel or a statistical software package. Using a calculator is also an option, but you have to know the correct formula to apply (see Table 2) and have access to statistical tables to look up the so-called critical values (4). Three worked examples are shown in Box 2 (5) to illustrate how the different t-tests are carried out and how to interpret the results.

What is an F-test?
An F-test compares the spread of results in two data sets to determine whether they could reasonably be considered to come from the same parent distribution. The test can, therefore, be used to answer questions such as: are two methods equally precise? The measure of spread used in the F-test is the variance, which is simply the square of the standard deviation. The variances are ratioed (i.e., the variance of one set of data is divided by the variance of the other) to get the test value:

F = s1² / s2²

[Note: it is usual to arrange s1 and s2 so that F > 1.] This F value is then compared with a critical value that tells us how big the ratio needs to be to rule out the difference in spread occurring by chance. The Fcrit value is found from tables using (n1 − 1) and (n2 − 1) degrees of freedom, at the appropriate level of confidence. If the standard deviations are to be considered to come from the same population, then Fcrit > F. As an example, we use the data in Example 2 (see Box 2):

F = 2.750² / 1.471² = 3.49

Fcrit = 9.605 for (5 − 1) and (5 − 1) degrees of freedom at the 97.5% confidence level. As Fcrit > Fcalculated, we can conclude that the spread of results in the two data sets is not significantly different and it is, therefore, reasonable to combine the two standard deviations, as is done in Example 2.

Table 1: Definitions of statistical terms used in significance testing.
Alternate hypothesis (H1): a statement describing the alternative to the null hypothesis (i.e., there is a difference between the means [see two-tailed], or mean1 ≥ mean2 [see one-tailed]).
Critical value (tcrit or Fcrit): the value, obtained from statistical tables or statistical packages at a given confidence level, against which the result of applying a significance test is compared.
Null hypothesis (H0): a statement describing what is being tested (i.e., there is no difference between the two means [mean1 = mean2]).
One-tailed: a one-tailed test is performed if the analyst is only interested in the answer when the result is different in one direction, for example (1) a new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value ≤, >, <, or ≥ is used in the alternate hypothesis). In these cases the calculated t-value is the same as that for the two-tailed t-test, but the critical value is different.
Population: a large group of items or measurements under investigation (e.g., 2500 lots from a single batch of a certified reference material).
Sample: a group of items or measurements taken from the population (e.g., 25 lots of a certified reference material taken from a batch containing 2500 lots).
Two-tailed: a two-tailed t-test is performed if the analyst is interested in any change, for example: is method A different from method B (i.e., ≠ is used in the alternate hypothesis)? Under most circumstances two-tailed t-tests should be performed.
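The F-test calculation above is equally short in software. A hedged sketch (scipy is my tooling assumption; scipy.stats.f.ppf stands in for the printed tables used in the article) reproducing the Example 2 numbers:

```python
from scipy import stats

s1, n1 = 2.750, 5   # larger standard deviation on top, so that F > 1
s2, n2 = 1.471, 5

F = s1**2 / s2**2                                     # 3.49
F_crit = stats.f.ppf(0.975, dfn=n1 - 1, dfd=n2 - 1)   # 9.605 at the 97.5% level

print(f"F = {F:.2f}, Fcrit = {F_crit:.3f}")
if F < F_crit:
    print("Spreads not significantly different; pooling is reasonable.")
```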

Using statistical software (what is a p-value?)
When you use statistical software packages, and some spreadsheet functions, the results of performing a significance test are often summarized as a p-value. The p-value represents an inverse index of the reliability of the statistic (i.e., the probability of error in accepting the observed result as valid). Thus, if we are comparing two means to see if they are different, a p-value of 0.10 is equivalent to saying we are 90% certain that the means are different; 0.05 is equivalent to saying we are 95% certain; and 0.01 that we are 99% certain, i.e., the confidence is [(1 − p) × 100]%. It is usual when analysing chemical data (but somewhat arbitrary) to say that p-values ≤ 0.05 are statistically significant.

Some assumptions behind significance testing
In most statistical tests it is assumed that the sample correctly represents the population and that the population follows a normal distribution. Although these assumptions are never complied with precisely, in a large number of situations where laboratory data are being used they are not grossly violated.
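To make the p-value concrete, the fragment below (a sketch using scipy, my tooling assumption) converts the t-value calculated in Example 1 of Box 2 into a two-tailed p-value of about 0.02, i.e., roughly 98% confidence that the means differ.

```python
from scipy import stats

t_value, dof = 2.81, 9             # figures from Example 1 in Box 2
p = 2 * stats.t.sf(t_value, dof)   # two-tailed p-value

print(f"p = {p:.3f}")                        # ~0.02, below the usual 0.05 cut-off
print(f"confidence = {(1 - p) * 100:.0f}%")  # ~98%
```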
Analytical Scientist: A Bench Guide
Conclusions
• Always plot your data and understand the patterns in it before calculating any statistic, even the arithmetic mean.
• Make sure the correct standard deviation is calculated for your particular circumstance. This will nearly always be the sample standard deviation (σn−1).
• Significance tests are used to compare, in an unbiased way, the means or spread (variance) of two data sets.
• The tests are easily performed using statistical routines in spreadsheets and statistical packages.
• The p-value is a measure of confidence in the result obtained when applying a significance test.

Acknowledgement
The preparation of this paper was supported under a contract with the UK Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (6).

References
(1) ISO 3534 part 1: Statistics Vocabulary and Symbols. Part 1: Probability and General Statistical Terms (1993).
(2) BS 2846 part 7: Tests for Departure from Normality (1984).
(3) BS 2846 part 4 (ISO 2854): Techniques of Estimation Relating to Means and Variances (1976).
(4) D.V. Lindley and W.F. Scott, New Cambridge Elementary Statistical Tables, Cambridge University Press (1995) (ISBN 0 521 48485 5).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry (1997) (ISBN 0 85404 442 6).
(6) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist, Teddington, UK (Autumn 1995).

Bibliography
1. G.B. Wetherill, Elementary Statistical Methods, Chapman and Hall, London, UK.
2. J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
3. J.W. Tukey, Exploratory Data Analysis, Addison-Wesley.
4. T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK (1997) (ISBN 0 85404 442 6).

Shaun Burke currently works in the Food Technology Department of RHM Technology Ltd, High Wycombe, Buckinghamshire, UK. However, these articles were produced while he was working at LGC, Teddington, Middlesex, UK (http://www.lgc.co.uk).

Table 2: Summary of statistical formulae.

t-test to use when comparing the long-term average (population mean, µ) with a sample mean:
t = |x̄ − µ| / (s/√n)

The difference between two means (e.g., two analytical methods used on paired samples):
For a two-tailed test: t = (|d̄| × √n) / sd
For a one-tailed test (the sign is important): t = (d̄ × √n) / sd

Difference between independent sample means with equal variances:
t = (x̄1 − x̄2) / [ sc × √(1/n1 + 1/n2) ]

Difference between independent sample means with unequal variances†:
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

where: x̄ is the sample mean; µ is the population mean; s is the standard deviation for the sample; n is the number of items in the sample; |d̄| is the absolute mean difference between pairs; d̄ is the mean difference between pairs; sd is the sample standard deviation for the pairs; x̄1 and x̄2 are two independent sample means; n1 and n2 are the number of items making up each sample; and sc is the combined standard deviation, found using

sc = √[ (s1²(n1 − 1) + s2²(n2 − 1)) / (n1 + n2 − 2) ]

where s1 and s2 are the sample standard deviations.

†Note: the degrees of freedom (υ) used for looking up the critical t-value for independent sample means with unequal variances are given by

1/υ = s1⁴ / [k² n1² (n1 − 1)] + s2⁴ / [k² n2² (n2 − 1)], where k = s1²/n1 + s2²/n2
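The formulae in Table 2 translate directly into code. The following self-contained Python sketch (the function names are mine) implements the two independent-means tests, including the unequal-variance degrees of freedom from the footnote; it reproduces the t-value of Example 2 in Box 2.

```python
import math

def t_equal_variance(x1, s1, n1, x2, s2, n2):
    """Independent sample means, equal variances (pooled sd sc from Table 2)."""
    sc = math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
    t = (x1 - x2) / (sc * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2                     # t-value and degrees of freedom

def t_unequal_variance(x1, s1, n1, x2, s2, n2):
    """Independent sample means, unequal variances, with the footnote dof."""
    t = (x1 - x2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
    k = s1**2 / n1 + s2**2 / n2
    inv_dof = (s1**4 / (k**2 * n1**2 * (n1 - 1))
               + s2**4 / (k**2 * n2**2 * (n2 - 1)))
    return t, 1 / inv_dof

print(t_equal_variance(5.40, 1.471, 5, 4.76, 2.750, 5))   # t ~ 0.459, dof = 8
```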

Box 2

Example 1
A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 µg/L. For the new method the mean is 23.5 µg/L, based on 10 results with a standard deviation of 0.9 µg/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:
• The null hypothesis (H0): new method mean = long-term check sample mean.
• The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.
To test the null hypothesis we calculate the t-value as below. Note that the calculated t-value is the ratio of the difference between the means to a measure of the spread (standard deviation), taking account of the amount of data available (n):

t = (23.5 − 22.7) / (0.9/√10) = 2.81

In the final step of the significance test we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value we need to know three pieces of information:
(i) Are we interested in the direction of the difference between the two means, or only in whether there is a difference, i.e., are we performing a one-sided or two-sided t-test (see Table 1)? In the case above it is the latter; therefore, the two-sided critical value is used.
(ii) The degrees of freedom: this is simply the number of data points minus one (n − 1).
(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research; in these cases the 99% or even the 99.9% confidence level can be chosen.
tcrit = 2.26 at the 95% confidence level for 9 degrees of freedom. As tcalculated > tcrit, we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods. [Note: this does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are 'fit for purpose'. The significance test is only one piece of information to be considered.]

Example 2 (5)
Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3.

Table 3: Results from two methods used to determine concentrations of selenium.
                                          x̄      s
Method 1:  4.2  4.5  6.8  7.2  4.3       5.40   1.471
Method 2:  9.2  4.0  1.9  5.2  3.5       4.76   2.750

Using the t-test for independent sample means, we define the null hypothesis H0 as x̄1 = x̄2; that is, there is no difference between the means of the two methods (the alternative hypothesis is H1: x̄1 ≠ x̄2). If the two methods have sample standard deviations that are not significantly different (see What is an F-test?), then we can combine (or pool) them into the combined standard deviation sc:

sc = √[ (1.471² × (5 − 1) + 2.750² × (5 − 1)) / (5 + 5 − 2) ] = 2.205

If the standard deviations were significantly different, then the t-test for unequal variances would have to be used instead (Table 2). Evaluating the test statistic:

t = (5.40 − 4.76) / (2.205 × √(1/5 + 1/5)) = 0.64 / 1.395 = 0.459

The 95% critical value is 2.306 for 8 (= n1 + n2 − 2) degrees of freedom. This exceeds the calculated value of 0.459; thus the null hypothesis (H0) cannot be rejected and we conclude that there is no significant difference between the means of the results given by the two methods.

Example 3 (5)
Two methods are available for determining the concentration of vitamins in foodstuffs. To compare the methods, several different sample matrices are prepared using the same technique. Each sample preparation is then divided into two aliquots and readings are obtained using the two methods, ideally commencing at the same time to lessen the possible effects of sample deterioration. The results are shown in Table 4.

Table 4: Comparison of two methods used to determine the concentration of vitamins in foodstuffs.
Matrix             1      2      3      4      5      6      7      8
A (mg/g)         2.52   3.13   4.33   2.25   2.79   3.04   2.19   2.16
B (mg/g)         3.17   5.00   4.03   2.38   3.68   2.94   2.83   2.18
Difference (d)  -0.65  -1.87   0.30  -0.13  -0.89   0.10  -0.64  -0.02

The null hypothesis is H0: d̄ = 0 against the alternative H1: d̄ ≠ 0. The test is a two-tailed test, as we are interested in both d̄ < 0 and d̄ > 0. The absolute mean difference is |d̄| = 0.475 and the sample standard deviation of the paired differences is sd = 0.700:

t = (0.475 × √8) / 0.700 = 1.92

The tabulated value of tcrit (with 7 degrees of freedom, at the 95% confidence level) is 2.365. Since the calculated value is less than the critical value, H0 cannot be rejected, and it follows that there is no significant difference between the two techniques.
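The three worked examples can also be run with ready-made routines; below is a hedged scipy rendering (scipy is an assumption on my part — the box itself uses a calculator and printed tables).

```python
from scipy import stats

# Example 1: one-sample t-value from summary statistics
t1 = (23.5 - 22.7) / (0.9 / 10**0.5)   # 2.81; compare with tcrit = 2.26

# Example 2: independent means with pooled (equal) variances
t2 = stats.ttest_ind_from_stats(5.40, 1.471, 5, 4.76, 2.750, 5,
                                equal_var=True)   # t ~ 0.459

# Example 3: paired t-test on the Table 4 data
a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]
b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]
t3 = stats.ttest_rel(a, b)                        # |t| ~ 1.92, p ~ 0.10

print(t1, t2, t3, sep="\n")
```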

Analysis of Variance
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

Statistical methods can be powerful tools for unlocking the information contained in analytical data. This second part in our statistics refresher series looks at one of the most frequently used of these tools: Analysis of Variance (ANOVA). In the previous paper we examined the initial steps in describing the structure of the data and explained a number of alternative significance tests (1). In particular, we showed that t-tests can be used to compare the results from two analytical methods or chemical processes. In this article, we will expand on the theme of significance testing by showing how ANOVA can be used to compare the results from more than two sets of data at the same time, and how it is particularly useful in analysing data from designed experiments.

With the advent of built-in spreadsheet functions and affordable dedicated statistical software packages, Analysis of Variance (ANOVA) has become relatively simple to carry out. This article will therefore concentrate on how to select the correct variant of the ANOVA method, the advantages of ANOVA, how to interpret the results and how to avoid some of the pitfalls. For those wanting more detailed theory than is given in the following sections, several texts are available (2–5).

A bit of ANOVA theory
Whenever we make repeated measurements there is always some variation. Sometimes this variation (known as within-group variation) makes it difficult for analysts to see if there have been significant changes between different groups of replicates. For example, in Figure 1 (which shows the results of four replicate analyses by each of 12 analysts), we can see that the total variation is a combination of the spread of results within groups and the spread between the mean values (between-group variation). The statistic that measures the within- and between-group variations in ANOVA is called the sum of squares, and it often appears in output tables abbreviated as SS. It can be shown that the different sums of squares calculated in ANOVA are equivalent to variances (1). The central tenet of ANOVA is that the total SS in an experiment can be divided into the components caused by random error, given by the within-group (or sample) SS, and the components resulting from differences between means. It is these latter components that are used to test for statistical significance using a simple F-test (1).

Why not use multiple t-tests instead of ANOVA?
Why should we use ANOVA in preference to carrying out a series of t-tests? This is best explained by using an example: suppose we want to compare the results from 12 analysts taking part in a training exercise. If we were to use t-tests, we would need to calculate 66 t-values (one for each possible pair of analysts). Not only is this a lot of work, but the chance of reaching a wrong conclusion increases. The correct way to analyse this sort of data is to use one-way ANOVA.

One-way ANOVA
One-way ANOVA will answer the question: is there a significant difference between the mean values (or levels), given that the means are calculated from a number of replicate observations? 'Significant' refers to an observed spread of means that would not normally arise from the chance variation within groups. We have already seen an example of this type of problem in the form of the data contained in Figure 1, which shows the results from 12 different analysts analysing the same material. Using these data and a spreadsheet, the results obtained from carrying out one-way ANOVA are reported in Example 1. In this example, the ANOVA shows there are significant differences between analysts (Fvalue > Fcrit at the 95% confidence level). This result is obvious from a plot of the data (Figure 1), but in many situations a visual inspection of a plot will not give such a clear-cut result. Notice that the output also includes a 'p-value' (see the Interpretation of the result(s) section, which follows).

Note: ANOVA cannot tell us which individual mean or means are different from the consensus value, nor in what direction they deviate. The most effective way to show this is to plot the data (Figure 1) or, alternatively but less effectively, to carry out a multiple comparison test such as Scheffé's test (2). It is also important to make sure the right questions are being asked and that the right data are being captured. In Example 1, it is possible that the time difference between the analysts carrying out the determinations is the reason for the difference in the mean values. This example shows how good experimental design procedures could have prevented ambiguity in the conclusions.

Example 1: One-way ANOVA carried out in Excel

Four replicate results for each of 12 analysts. (Note: the data table has been split into two sections, A_1 to A_6 and A_7 to A_12, for display purposes; the ANOVA is carried out on a single table.)

              A_1     A_2     A_3     A_4     A_5     A_6
Replicate 1   34.1    35.84   36.67   40.54   41.19   41.22
Replicate 2   34.1    36.58   37.33   40.67   40.29   39.61
Replicate 3   34.69   31.3    36.96   40.81   40.99   37.89
Replicate 4   34.6    34.19   36.83   40.78   40.4    36.67

              A_7     A_8     A_9     A_10    A_11    A_12
Replicate 1   40.71   39.2    42.5    39.75   36.04   44.36
Replicate 2   40.91   39.3    42.3    39.69   37.03   45.73
Replicate 3   40.8    39.3    42.5    39.23   36.85   45.25
Replicate 4   38.42   39.3    42.5    39.73   36.24   45.34

Anova: Single Factor
Source of Variation   SS         df   MS         F          P-value   F crit
Between Groups        438.7988   11   39.8908    40.31545   6.6E-17   2.066606
Within Groups         35.6208    36   0.989467

SS = sum of squares, df = degrees of freedom, MS = mean square (SS/df).
The p-value is < 0.05 (Fvalue > Fcrit at the 95% confidence level, for 11 and 36 degrees of freedom); therefore it can be concluded that there is a significant difference between the analysts' results.
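Outside Excel, the same one-way table can be reproduced in a few lines; the sketch below (scipy assumed, my tooling choice) returns the F- and p-values reported above.

```python
from scipy import stats

# Four replicates for each of the 12 analysts (columns A_1 to A_12 above)
analysts = [
    [34.10, 34.10, 34.69, 34.60], [35.84, 36.58, 31.30, 34.19],
    [36.67, 37.33, 36.96, 36.83], [40.54, 40.67, 40.81, 40.78],
    [41.19, 40.29, 40.99, 40.40], [41.22, 39.61, 37.89, 36.67],
    [40.71, 40.91, 40.80, 38.42], [39.20, 39.30, 39.30, 39.30],
    [42.50, 42.30, 42.50, 42.50], [39.75, 39.69, 39.23, 39.73],
    [36.04, 37.03, 36.85, 36.24], [44.36, 45.73, 45.25, 45.34],
]

F, p = stats.f_oneway(*analysts)
print(f"F = {F:.2f}, p = {p:.2g}")   # F ~ 40.3, p << 0.05: analysts differ
```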
change in level of an individual factor. This
is illustrated in Figure 2 for a process with
two factors (Y and Z) when both factors
are studied at two levels (low and high). In
Figure 2(b), the changes in response
Two-way ANOVA
In a typical experiment things can be more complex than described previously. For example, in Example 2 the aim is to find out if time and/or temperature have any effect on protein yield when analysing samples of tinned ham. When analysing data from this type of experiment we use two-way ANOVA. Two-way ANOVA can test the significance of each of two experimental variables (factors or treatments) with respect to the response, such as an instrument's output. When replicate measurements are made we can also examine whether or not there are significant interactions between the variables. An interaction is said to be present when the response being measured changes more than can be explained from the change in level of an individual factor. This is illustrated in Figure 2 for a process with two factors (Y and Z) when both factors are studied at two levels (low and high). In Figure 2(b), the changes in response caused by Y depend on Z, and vice versa.

In two-way ANOVA we ask the following questions:
• Is there a significant interaction between the two factors (variables)?
• Does a change in any of the factors affect the measured result?
It is important to check the answers in the right order: Figure 3 illustrates the decision process. In the case of Example 2 the questions are:
• Is there an interaction between temperature and time which affects the protein yield?
• Does time and/or temperature affect the protein yield?
Using the built-in functions of a spreadsheet (in this case Excel's data analysis tools: two-factor analysis with replication) we see that there is a significant interaction between time and temperature, and a significant effect of temperature alone (in both cases p-value < 0.05 and F > Fcrit). Following the process outlined in Figure 3, we consider the interaction question first by comparing the mean squares (MS) for the within-group variation with the interaction MS, as reported in the results table of Example 2.

Example 2: Two-way ANOVA
The analysis of tinned ham was carried out at three temperatures (415, 435 and 460 °C) and three times (30, 60 and 90 minutes). Three analyses, determining protein yield, were made at each temperature and time. The measurements and the results of the two-way ANOVA are summarized below.

Time (min) / Temp (°C)    415     435     460
30                        27.13   27.2    27.03
30                        27.2    26.97   27.1
30                        27.13   27.13   27.13
60                        27.29   27.07   27.1
60                        27.13   27.1    27.07
60                        27.23   27.03   27.03
90                        27.03   27.2    27.03
90                        27.13   27.23   27.07
90                        27.07   27.27   26.9

Anova: Two-factor with replication
Source of Variation       SS         df   MS         F          P-value    F crit
Sample (= Time)           0.000867   2    0.000433   0.100429   0.904952   3.554561
Columns (= Temperature)   0.049689   2    0.024844   5.75794    0.011667   3.554561
Interaction               0.087644   4    0.021911   5.078112   0.006437   2.927749
Within                    0.077667   18   0.004315
Total                     0.215867   26

(Note: Excel labels the sources of variation Sample, Columns, Interaction and Within. Here Sample = Time, Columns = Temperature, Interaction is the interaction between temperature and time, and Within is a measure of the within-group variation.)

F = 0.021911/0.004315 = 5.078

If the interaction is significant (F > Fcrit), as in this case, then the individual factors (time and temperature) should each be compared with the MS for the interaction (not the within-group MS), thus:

Ftemp = 0.024844/0.021911 = 1.134


Ftime = 0.000433/0.021911 = 0.020
Fcrit = 6.944, for 2 and 4 degrees of freedom (at the 95% confidence level)

In other words, there is no significant difference between the interaction of time and temperature with respect to either of the individual factors and, therefore, the interaction of temperature with time is worth further investigation. If one or both of the individual factors were significant compared with the interaction, then the individual factor or factors would dominate, and for all practical purposes any interaction could be ignored.

If the interaction term is not significant then it can be considered to be another small error term and can thus be pooled with the within-group (error) sums of squares term. It is the pooled value (s²pooled) that is then used as the denominator in the F-test to determine if the individual factors affect the measured results significantly. To combine the sums of squares the following formula is used:

s²pooled = (SSinter + SSwithin) / (dofinter + dofwithin)

where dofinter and dofwithin are the degrees of freedom for the interaction term and error term, and SSinter and SSwithin are the sums of squares for the interaction term and error term, respectively (dofpooled = dofinter + dofwithin).

Interpretation of the result(s)
To reiterate the interpretation of ANOVA results: a calculated F-value that is greater than Fcrit for a stated level of confidence (typically 95%) means that the difference being tested is statistically significant at that level. As an alternative to using the F-values, the p-value can be used to indicate the degree of confidence we have that there is a significant difference between means (i.e., (1 − p) × 100 is the percentage confidence). Normally a p-value of ≤ 0.05 is considered to denote a significant difference.

Note: extrapolation of ANOVA results is not advisable; in Example 2, for instance, it is impossible to say if a time of 15 or 120 minutes would lead to a measurable effect on protein yield. It is, therefore, always more economic in the long run to design the experiment in advance, in order to cover the likely ranges of the parameter(s) of interest.
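The whole Example 2 table, including the interaction term, can be generated programmatically. This is a sketch only: statsmodels is my tooling assumption and the column names are hypothetical ("yield_" simply avoids Python's reserved word).

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Protein yields from Example 2: three replicates per time/temperature cell,
# listed row by row across the table (temperature cycles fastest)
data = pd.DataFrame({
    "time": [30]*9 + [60]*9 + [90]*9,
    "temp": [415, 435, 460] * 9,
    "yield_": [27.13, 27.2, 27.03, 27.2, 26.97, 27.1, 27.13, 27.13, 27.13,
               27.29, 27.07, 27.1, 27.13, 27.1, 27.07, 27.23, 27.03, 27.03,
               27.03, 27.2, 27.03, 27.13, 27.23, 27.07, 27.07, 27.27, 26.9],
})

# Both factors treated as categorical; '*' includes the interaction term
model = ols("yield_ ~ C(time) * C(temp)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # SS, df, F and p for each term
```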
Selecting the ANOVA method
One-way ANOVA should be used when there is only one factor being considered and replicate data from changing the level of that factor are available. Two-way ANOVA (with or without replication) is used when there are two factors being considered. If no replicate data are collected then the interactions between the two factors cannot be calculated. Higher-level ANOVAs are also available for looking at more than two factors.

Advantages of ANOVA
Compared with using multiple t-tests, one-way and two-way ANOVA require fewer measurements to discover significant effects (i.e., the tests are said to have more power). This is one reason why ANOVA is used frequently when analysing data from statistically designed experiments.
Other ANOVA and multivariate ANOVA (MANOVA) methods exist for more complex experimental situations, but a description of these is beyond the scope of this introductory article. More details can be found in reference 6.
[figure 1 Plot comparing the results from 12 analysts: analyte concentration (ppm) against analyst ID (A1 to A12), showing the four replicates for each analyst, the overall mean and the total standard deviation.]
Avoiding some of the pitfalls using ANOVA
In ANOVA it is assumed that the data for each variable are normally distributed. Usually in ANOVA we don't have a large amount of data, so it is difficult to prove any departure from normality. It has been shown, however, that even quite large deviations do not affect the decisions made on the basis of the F-test.

A more important assumption is that the variance (spread) between groups is homogeneous (homoscedastic). If this is not the case (and this often happens in chemistry, see Figure 1) then the F-test can suggest a statistically significant difference when none is present. The best way to avoid this pitfall is, as ever, to plot the data. There also exist a number of tests for heteroscedasticity (e.g., Bartlett's test (5) and Levene's test (2)); a short software sketch follows the conclusions below. It may be possible to overcome this type of problem in the data structure by transforming it, such as by taking logs (7). If the variability within a group is correlated with its mean value, then ANOVA may not be appropriate and/or it may indicate the presence of outliers in the data (Figure 4). Cochran's test (5) can be used to test for variance outliers.

[figure 2 Interactive factors: response plotted against factor Y (low and high levels) at low and high levels of a second factor Z. (a) Y and Z are independent; (b) Y and Z are interacting.]

Conclusions
• ANOVA is a powerful tool for determining whether there is a statistically significant difference between two or more sets of data.
• One-way ANOVA should be used when comparing several sets of observations.
• Two-way ANOVA is the method to use when there are two separate factors that may be influencing a result.
• Except for the smallest of data sets, ANOVA is best carried out using a spreadsheet or statistical software package.
• You should always plot your data to make sure the assumptions ANOVA is based on are not violated.
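Both heteroscedasticity tests mentioned above are available ready-made; the minimal sketch below (scipy assumed, using two analysts' replicates from Example 1 as convenient input) is one way to run them.

```python
from scipy import stats

group1 = [34.10, 34.10, 34.69, 34.60]   # analyst A_1 from Example 1
group2 = [44.36, 45.73, 45.25, 45.34]   # analyst A_12 from Example 1

print(stats.levene(group1, group2))     # robust to non-normality
print(stats.bartlett(group1, group2))   # assumes normally distributed data
```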
[figure 3 Comparing mean squares in two-way ANOVA with replication. Decision process: start by comparing the within-group mean squares with the interaction mean squares. If the difference is significant (F > Fcrit), compare the interaction mean squares with the individual factor mean squares; if not, pool the within-group and interaction sums of squares and compare the pooled mean squares with the individual factor mean squares.]

[figure 4 A plot of variance versus the mean value. Annotated regions: 'Significantly different means by ANOVA' and 'Unreliable high mean (may contain outliers)'.]

Acknowledgements
The preparation of this paper was supported under a contract with the UK Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (8).

References
(1) S. Burke, Scientific Data Management, 1(1), 32–38, September 1997.
(2) G.A. Milliken and D.E. Johnson, Analysis of Messy Data, Volume 1: Designed Experiments, Van Nostrand Reinhold Company, New York, USA (1984).
(3) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK (ISBN 0 13 030990 7).
(4) C. Chatfield, Statistics for Technology, Chapman & Hall, London, UK (ISBN 0 412 25340 2).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK (ISBN 0 85404 442 6) (1997).
(6) K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate Analysis, Academic Press Inc. (ISBN 0 12 471252 5) (1979).
(7) ISO 4259: 1992, Petroleum Products - Determination and Application of Precision Data in Relation to Methods of Test, Annex E, International Organisation for Standardisation, Geneva, Switzerland (1992).
(8) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist, Teddington, UK (Autumn 1995).

Shaun Burke currently works in the Food Technology Department of RHM Technology Ltd, High Wycombe, Buckinghamshire, UK. However, these articles were produced while he was working at LGC, Teddington, Middlesex, UK (http://www.lgc.co.uk).

Missing Values, Outliers, Robust Statistics & Non-parametric Methods
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This article, the fourth and final part of our statistics refresher series, looks
at how to deal with ‘messy’ data that contain transcription errors or extreme
and skewed results.

This is the last article in a series of short papers introducing basic statistical methods of use in analytical science. In the three previous papers (1–3) we have assumed the data have been 'tidy'; that is, normally distributed with no anomalous and/or missing results. In the real world, however, we often need to deal with 'messy' data, for example, data sets that contain transcription errors or unexpected extreme results, or that are skewed. How we deal with this type of data is the subject of this article.

Transcription errors
Transcription errors can normally be corrected by implementing good quality control procedures before statistical analysis is carried out. For example, the data can be independently checked or, more rarely, the data can be entered, again independently, into two separate files and the files compared electronically to highlight any discrepancies. There are also a number of outlier tests that can be used to highlight anomalous values before other statistics are calculated. These tests do not remove the need for good quality assurance; rather, they should be seen as an additional quality check.

Missing data
No matter how well our experiments are planned, there will always be times when something goes wrong, resulting in gaps in the data. Some statistical procedures will not work as well, or at all, with some data missing. The best recourse is always to repeat the experiment to generate the complete data set. Sometimes, however, this is not feasible, particularly where readings are taken at set times or the cost of retesting is prohibitive, so alternative ways of addressing this problem are needed. Current statistical software packages typically deal with missing data by one of three methods (a fourth, imputation, is described in Box 1).

Casewise deletion excludes all examples (cases) that have missing data in at least one of the selected variables. For example, in ICP–AAS (inductively coupled plasma–atomic absorption spectroscopy) calibrated with a number of standard solutions containing several metal ions at different concentrations, if the aluminium value were missing for a particular test portion, all the results for that test portion would be disregarded (see Table 1). This is the usual way of dealing with missing data, but it does not guarantee correct answers. This is particularly so in complex (multivariate) data sets, where it is possible to end up deleting the majority of your data if the missing data are randomly distributed across cases and variables.

Table 1: Casewise deletion. Statistical analysis is only carried out on the reduced data set.
             Al     B      Fe    Ni
Solution 1          94.5   578   23.1
Solution 2   567    72.1   673   7.6
Solution 3          34.0   674   44.7
Solution 4   234    97.4   429   82.9

After casewise deletion:
             Al     B      Fe    Ni
Solution 2   567    72.1   673   7.6
Solution 4   234    97.4   429   82.9

Pairwise deletion can be used as an alternative to casewise deletion in situations where parameters (correlation coefficients, for example) are calculated on successive pairs of variables. For example, in a recovery experiment we may be interested in the correlations between material recovered and extraction time, temperature, particle size, polarity, etc. With pairwise deletion, if one solvent polarity measurement were missing, only the pairs involving that measurement would be deleted from the correlation, and the correlations for recovery versus extraction time and particle size would be unaffected (see Table 2). Pairwise deletion can, however, lead to serious problems. For example, if there is a 'hidden' systematic distribution of missing points, then a bias may result when calculating a correlation matrix (i.e., different correlation coefficients in the matrix can be based on different subsets of cases).

Table 2: Pairwise deletion. Statistical analysis is unaffected except when one of a pair of data points is missing.
           Recovery (%)   Extraction time (mins)   Particle size (µm)   Solvent polarity (pKa)
Sample 1   93             20                       90
Sample 2   105            120                      150                  1.8
Sample 3   99             180                      50                   1.0
Sample 4   73             10                       500                  1.5

Correlations with recovery (number of data points in each correlation in parentheses):
     vs extraction time   vs particle size   vs solvent polarity
r    0.728886 (4)         -0.87495 (4)       0.033942 (3)
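In code, all three approaches are one-liners. The sketch below (pandas is my tooling assumption; the article predates it) rebuilds Table 1, with NaN marking the missing aluminium values, and applies casewise deletion, pairwise deletion (the default behaviour of DataFrame.corr) and the mean substitution described next.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Al": [np.nan, 567, np.nan, 234],
     "B":  [94.5, 72.1, 34.0, 97.4],
     "Fe": [578, 673, 674, 429],
     "Ni": [23.1, 7.6, 44.7, 82.9]},
    index=["Solution 1", "Solution 2", "Solution 3", "Solution 4"])

print(df.dropna())           # casewise deletion: only Solutions 2 and 4 remain
print(df.corr())             # pairwise deletion: each r uses all available pairs
print(df.fillna(df.mean()))  # mean substitution: missing Al -> (567+234)/2 = 400.5
```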
20 statistics and data analysis LC•GC Europe Online Supplement

Mean substitution replaces all missing data in a variable by the mean value for that variable. Though this makes the data set look complete, mean substitution has its own disadvantages. The variability in the data set is artificially decreased in direct proportion to the number of missing data points, leading to underestimates of dispersion (the spread of the data). Mean substitution may also considerably change the values of some other statistics, such as linear regression statistics (3), particularly where correlations are strong (see Table 3).

Table 3: Mean substitution. Statistical analysis is carried out on pseudo-completed data, with no allowance made for errors in the estimated values (the missing aluminium values are replaced by the column mean, 400.5).
             Al      B      Fe    Ni
Solution 1   400.5   94.5   578   23.1
Solution 2   567     72.1   673   7.6
Solution 3   400.5   34.0   674   44.7
Solution 4   234     97.4   429   82.9

Box 1: Imputation (4,5) is yet another method that is increasingly being used to handle missing data. It is, however, not yet widely available in statistical software packages. In its simplest ad hoc form, an imputed value is substituted for the missing value (e.g., the mean substitution already discussed above is a form of imputation). In its more general, systematic form, however, the imputed missing values are predicted from patterns in the real (non-missing) data. A total of m possible imputed values are calculated for each missing value (using a suitable statistical model derived from the patterns in the data) and then the m possible complete data sets are analysed in turn by the selected statistical method. The m intermediate results are then pooled to yield the final result (statistic) and an estimate of its uncertainty. This method works well providing that the missing data are randomly distributed and the model used to predict the imputed values is sensible.

Examples of these three approaches are illustrated in Figure 1, for the calculation of a correlation matrix where the correlation coefficient (r) (3) is determined for each paired combination of the five variables, A to E. Note how the r value can increase, diminish or even reverse sign depending on which method is chosen to handle the missing data (compare the A,B correlation coefficients).

Extreme values, stragglers and outliers
Extreme values are defined as observations in a sample so far separated in value from the remainder as to suggest that they may be from a different population, or the result of an error in measurement (6). Extreme values can be subdivided into stragglers, extreme values detected between the 95% and 99% confidence levels; and outliers, extreme values detected at greater than the 99% confidence level.

It is tempting to remove extreme values automatically from a data set because they can alter the calculated statistics, e.g., increase the estimate of variance (a measure of spread), or possibly introduce a bias into the calculated mean. There is one golden rule, however: no value should be removed from a data set on statistical grounds alone. 'Statistical grounds' include outlier testing.

Outlier tests tell you, on the basis of some simple assumptions, where you are most likely to have a technical error; they do not tell you that the point is 'wrong'. No matter how extreme a value is in a set of data, the suspect value could nonetheless be a correct piece of information (1). Only with experience, or the identification of a particular cause, can data be declared 'wrong' and removed.

So, given that we understand that the tests only tell us where to look, how do we test for outliers? If we have good grounds for believing our data is normally distributed, then a number of 'outlier tests' (sometimes called Q-tests) are available that identify extreme values in an objective way (7,8). Good grounds for believing the data is normal are:
• past experience of similar data
• passing normality tests, for example, the Kolmogorov–Smirnov–Lilliefors test, Shapiro–Wilk's test, skewness test, kurtosis test (7,9), etc.
• plots of the data, e.g., frequency histograms and normal probability plots (1,7).
Note that the tests used to check

Correlation matrices with different


approaches selected for missing data.
normality usually require a significant amount of data (a minimum of 10–15 results is recommended, depending on the normality test applied). For this reason there will be many examples in analytical science where either it will be impractical to carry out such tests, or the tests will not tell us anything meaningful.

If we are not sure the data set is normally distributed, then robust statistics and/or non-parametric (distribution-independent) tests can be applied to the data. These three approaches (outlier tests, robust estimates and non-parametric methods) are examined in more detail below.

[figure 1 Effect of missing data on a correlation matrix. Correlation coefficients (r) for each paired combination of five variables (A to E) measured on 15 cases, with different approaches selected for the missing data. Values were removed from the full data set to show the effects of missing data; for mean substitution, the missing values were replaced by the variable means (99.2, 92.4, 94.6, 89.4 and 91.7 for A to E, respectively). At the 95% confidence level, significant correlations are indicated at r ≥ 0.514 (n = 15), r ≥ 0.576 (n = 12), r ≥ 0.602 (n = 11), r ≥ 0.632 (n = 10) and r ≥ 0.950 (n = 5).

No missing data (15 cases):
      B      C      D      E
A     0.62   0.68   0.41   0.39
B            0.53   0.47   0.50
C                   0.57   0.59
D                          0.61

Casewise deletion (only 5 cases remain):
      B      C      D      E
A    -0.62   0.11   0.50   0.02
B           -0.21  -0.36   0.17
C                   0.91   0.71
D                          0.66

Pairwise deletion (number of cases in parentheses):
      B           C           D           E
A     0.54 (12)   0.55 (12)   0.27 (12)   0.23 (11)
B                 0.50 (11)   0.47 (11)   0.77 (10)
C                             0.79 (11)   0.70 (10)
D                                         0.71 (10)

Mean substitution (15 cases):
      B      C      D      E
A     0.01  -0.05   0.02   0.36
B            0.40   0.47   0.25
C                   0.47   0.43
D                          0.46]

Outlier tests
In analytical chemistry it is rare that we have large numbers of replicate data, and small data sets often show fortuitous grouping and consequent apparent outliers. Outlier tests should, therefore, be used with care and, of course, identified data points should only be removed if a technical reason can be found for their aberrant behaviour.

Most outlier tests look at some measure of the relative distance of a suspect point from the mean value. This measure is then assessed to see whether the extreme value could reasonably be expected to have arisen by chance. Most of the tests look for single extreme values (Figure 2(a)), but sometimes it is possible for several 'outliers' to be present in the same data set. These can be identified in one of two ways:
• by iteratively applying the outlier test
• by using tests that look for pairs of extreme values, i.e., outliers that are masking each other (see Figures 2(b) and 2(c)).
Note, as a rule of thumb, that if more than 20% of the data are identified as outlying you should start to question your assumptions about the data distribution and/or the quality of the data collected.

[figure 2 Outliers and masking: (a) a single outlier, at either end of the data set; (b) an outlier at each end of the data set; (c) a pair of outliers at one end (either end) masking each other.]

The appropriate outlier tests for the three situations described in Figure 2 are: 2(a) Grubbs 1, Dixon or Nalimov; 2(b) Grubbs 2; and 2(c) Grubbs 3. We will concentrate on the three Grubbs' tests (7). The test values are calculated using the formulae below, after the data have been arranged in ascending order:

G1 = |x̄ − xi| / s
G2 = (xn − x1) / s
G3 = 1 − [(n − 3) × sn−2²] / [(n − 1) × s²]

where s is the standard deviation for the whole data set; xi is the suspected single outlier, i.e., the value furthest away from the mean; | | is the modulus, the value of a calculation ignoring the sign of the result; x̄ is the mean; n is the number of data points; xn and x1 are the most extreme values; and sn−2 is the standard deviation for the data set excluding the suspected pair of outlier values, i.e., the pair of values furthest away from the mean. If the test values (G1, G2, G3) are greater than the critical value obtained from tables (see Table 4), then the extreme value(s) are unlikely to have occurred by chance at the stated confidence level (see Box 2).
22 statistics and data analysis LC•GC Europe Online Supplement

excluding the suspected pair of outlier is not only possible for individual points within a group to be outlying but also for the
values, i.e., the pair of values furthest away group means to have outliers with respect to each other. Another type of ‘outlier’ that can
from the mean. occur is when the spread of data within one particular group is unusually small or large
If the test values (G1, G2, G3) are greater when compared with the spread of the other groups (see Figure 4).
than the critical value obtained from tables • The same Grubbs’ tests that are used to determine the presence of within group
(see Table 4) then the extreme value(s) are outlying replicates may also be used to test for suspected outlying means.
unlikely to have occurred by chance at the • The Cochran’s test can be used to test for the third case, that of a suspected
stated confidence level (see Box 2). outlying variance.
Pitfalls of outlier tests
Figure 3 shows three situations where outlier tests can misleadingly identify an extreme value.

Figure 3(a) shows a situation common in chemical analysis. Because of limited measurement precision (rounding errors) it is possible to end up comparing a result which, no matter how close it is to the other values, is an infinite number of standard deviations away from the mean of the remaining results. This value will therefore always be flagged as an outlier.

In Figure 3(b) there is a genuine long tail on the distribution that may cause successive outlying points to be identified. This type of distribution is surprisingly common in some types of chemical analysis, e.g., pesticide residues.

If there is very little data (Figure 3(c)) an outlier can be identified by chance. In this situation it is possible that the identified point is closer to the 'true value' and it is the other values that are the outliers. This occurs more often than we would like to admit; how many times do your procedures state 'average the best two out of three determinations'?

Outliers by variance
When the data are from different groups (for example, when comparing test methods via an interlaboratory comparison) it is not only possible for individual points within a group to be outlying, but also for the group means to have outliers with respect to each other. Another type of 'outlier' can occur when the spread of data within one particular group is unusually small or large when compared with the spread of the other groups (see Figure 4).

[figure 4 Different types of outlier in grouped data: a box-and-whisker plot of analyte concentration for 16 laboratories, with one laboratory showing an outlying variance and another an outlying mean.]

• The same Grubbs' tests that are used to determine the presence of within-group outlying replicates may also be used to test for suspected outlying means.
• The Cochran's test can be used to test for the third case, that of a suspected outlying variance.

To carry out the Cochran's test, the suspect variance is compared with the sum of all the group variances. (The variance is a measure of spread and is simply the square of the standard deviation (1).)

Cn̄ = s(suspect)² / Σ si²  (summed over i = 1 to g)   where g is the number of groups and n̄ = (n1 + n2 + … + ng) / g

If this calculated ratio, Cn̄, exceeds the critical value obtained from statistical tables (7), then the suspect group's spread is extreme. The value n̄, the average number of results produced per group, is used to enter the table of critical values.

The Cochran's test assumes the numbers of replicates within the groups are the same, or at least similar (±1). It also assumes that none of the data have been rounded and that there are sufficient replicates to obtain a reasonable estimate of the variance. The Cochran's test should not be used iteratively as this could lead to a large percentage of the data being removed (see Box 3).

Robust statistics
Robust statistics include methods that are largely unaffected by the presence of extreme values. The most commonly used of these statistics are as follows:

Median: The median is a measure of central tendency (1) and can be used instead of the mean. To calculate the median (x̃) the data are arranged in order of magnitude; the median is then the central member of the series (or the mean of the two central members when there is an even number of data), i.e., there are equal numbers of observations smaller and greater than the median. For a symmetrical distribution the mean and median have the same value.

x̃ = xm when n is odd (1, 3, 5, …); x̃ = (xm + xm+1)/2 when n is even (2, 4, 6, …), where m = n/2 rounded up.

Box 2: Grubbs' tests (worked example).

Thirteen replicates are arranged in ascending order (x1 … xn):
47.876  47.997  48.065  48.118  48.151  48.211  48.251  48.559  48.634  48.711  49.005  49.166  49.484

n = 13, mean = 48.479, s = 0.498, s(n−2)² = 0.123

G1 = (49.484 − 48.479) / 0.498 = 2.02
G2 = (49.484 − 47.876) / 0.498 = 3.23
G3 = 1 − (10 × 0.123) / (12 × 0.498²) = 0.587

Grubbs' critical values for 13 values are G1 = 2.331 and 2.607, G2 = 4.00 and 4.24, and G3 = 0.6705 and 0.7667 for the 95% and 99% confidence levels respectively. Since the test values are less than their respective critical values, in all cases, it can be concluded there are no outlying values.
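For readers who prefer to script the calculation, the sketch below reproduces Box 2 in Python (numpy only). One assumption is made explicit in the code: the suspected pair for G3 is taken to be the two values furthest from the mean, as in the text's definition of s(n−2).

```python
import numpy as np

def grubbs_values(x):
    """Grubbs' test values G1, G2 and G3 for a one-dimensional data set."""
    x = np.sort(np.asarray(x, dtype=float))
    n, mean, s = len(x), x.mean(), x.std(ddof=1)

    # G1: the single value furthest from the mean
    xi = x[-1] if (x[-1] - mean) >= (mean - x[0]) else x[0]
    g1 = abs(mean - xi) / s

    # G2: the two most extreme values (range over standard deviation)
    g2 = (x[-1] - x[0]) / s

    # G3: exclude the pair of values furthest from the mean, then compare variances
    dev = np.abs(x - mean)
    keep = np.argsort(dev)[:-2]          # drop the two largest deviations
    s_n2 = x[keep].std(ddof=1)
    g3 = 1 - ((n - 3) * s_n2**2) / ((n - 1) * s**2)
    return g1, g2, g3

data = [47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
        48.559, 48.634, 48.711, 49.005, 49.166, 49.484]
print(grubbs_values(data))   # approx. (2.02, 3.23, 0.59); compare with Table 4, n = 13
```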

Median absolute deviation (MAD): The MAD value is an estimate of the spread in the data, similar to the standard deviation. For n values:

MAD = median of |xi − x̃|, i = 1, 2, …, n
MADE = 1.483 × MAD

If the MAD value is scaled by a factor of 1.483 it becomes comparable with a standard deviation; this is the MADE value.

Other robust statistical estimates include the trimmed mean and deviation, the Winsorized mean and deviation, least median of squares (robust regression), Levene's test (heterogeneity in ANOVA), etc.
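As an illustration, here is a minimal numpy sketch of these robust estimates, applied to the Box 2 replicates so the results can be compared with the classical statistics quoted there:

```python
import numpy as np

def robust_summary(x):
    """Median, median absolute deviation (MAD) and scaled MAD (MADE)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)                    # robust centre
    mad = np.median(np.abs(x - med))      # robust spread
    return med, mad, 1.483 * mad          # MADE is comparable with a std deviation

data = [47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
        48.559, 48.634, 48.711, 49.005, 49.166, 49.484]
med, mad, made = robust_summary(data)
print(med, mad, made)   # approx. 48.251, 0.308, 0.457; compare MADE with s = 0.498
```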
A discussion of robust statistics in analytical chemistry can be found elsewhere (10, 11).

Non-parametric tests
Typical statistical tests incorporate assumptions about the underlying distribution of data (such as normality), and hence rely on distribution parameters. 'Non-parametric' tests are so called because they make few or no assumptions about the distributions, and do not rely on distribution parameters. Their chief advantage is improved reliability when the distribution is unknown. There is at least one non-parametric equivalent for each parametric type of test (see Table 5). In a short article, such as this, it is impossible to describe the methodology for all these tests, but more information can be found in other publications (12, 13).

         95% confidence level           99% confidence level
n        G(1)     G(2)     G(3)         G(1)     G(2)     G(3)
3        1.153    2.00     ---          1.155    2.00     ---
4        1.463    2.43     0.9992       1.492    2.44     1.0000
5        1.672    2.75     0.9817       1.749    2.80     0.9965
6        1.822    3.01     0.9436       1.944    3.10     0.9814
7        1.938    3.22     0.8980       2.097    3.34     0.9560
8        2.032    3.40     0.8522       2.221    3.54     0.9250
9        2.110    3.55     0.8091       2.323    3.72     0.8918
10       2.176    3.68     0.7695       2.410    3.88     0.8586
12       2.285    3.91     0.7004       2.550    4.13     0.7957
13       2.331    4.00     0.6705       2.607    4.24     0.7667
15       2.409    4.17     0.6182       2.705    4.43     0.7141
20       2.557    4.49     0.5196       2.884    4.79     0.6091
25       2.663    4.73     0.4505       3.009    5.03     0.5320
30       2.745    4.89     0.3992       3.103    5.19     0.4732
35       2.811    5.026    0.3595       3.178    5.326    0.4270
40       2.866    5.150    0.3276       3.240    5.450    0.3896
50       2.956    5.350    0.2797       3.336    5.650    0.3328
60       3.025    5.500    0.2450       3.411    5.800    0.2914
70       3.082    5.638    0.2187       3.471    5.938    0.2599
80       3.130    5.730    0.1979       3.521    6.030    0.2350
90       3.171    5.820    0.1810       3.563    6.120    0.2147
100      3.207    5.900    0.1671       3.600    6.200    0.1980
110      3.239    5.968    0.1553       3.632    6.268    0.1838
120      3.267    6.030    0.1452       3.662    6.330    0.1716
130      3.294    6.086    0.1364       3.688    6.386    0.1611
140      3.318    6.137    0.1288       3.712    6.437    0.1519

table 4 Grubbs' critical value table (5).
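Although describing the methodology of each test in Table 5 is beyond this article, most of them are single function calls in modern software. As a hedged illustration (the two groups of results are invented, and scipy is assumed to be available), here is the Mann–Whitney U test used in place of the independent-groups t-test:

```python
from scipy import stats

# Results from two methods (illustrative data only)
method_a = [99.2, 101.5, 100.1, 98.7, 102.3, 100.9, 99.8]
method_b = [103.1, 104.8, 102.9, 105.6, 103.7, 104.2, 106.0]

# Parametric comparison (assumes normality)
t_stat, t_p = stats.ttest_ind(method_a, method_b)

# Non-parametric equivalent (no distribution assumed)
u_stat, u_p = stats.mannwhitneyu(method_a, method_b, alternative="two-sided")

print(f"t-test p = {t_p:.4f}, Mann-Whitney p = {u_p:.4f}")
```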

Box 3: Cochran's test (worked example).

An interlaboratory study was carried out by 13 laboratories to determine the amount of cotton in a cotton/polyester fabric; 85 determinations were carried out in total. The standard deviations of the data obtained by each of the 13 laboratories were as follows:

Std dev.  0.202  0.402  0.332  0.236  0.318  0.452  0.210  0.074  0.525  0.067  0.609  0.246  0.198

n̄ = 85/13 = 6.54 ≈ 7

Cn̄ = 0.609² / (0.202² + 0.402² + … + 0.246² + 0.198²) = 0.371 / 1.474 = 0.252

Cochran's critical value for n = 7 and g = 13 is 0.23 at the 95% confidence level (7).

As the test value is greater than the critical value, it can be concluded that the laboratory with the highest standard deviation (0.609) has an outlying spread of replicates, and this laboratory's results therefore need to be investigated further. It is normal practice in interlaboratory comparisons not to test for low-variance outliers, i.e., laboratories reporting unusually precise results.
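The Box 3 arithmetic can be reproduced with a few lines of numpy; the sketch below uses the laboratory standard deviations quoted above.

```python
import numpy as np

def cochran_C(std_devs):
    """Cochran's test value: largest group variance over the sum of all group variances."""
    v = np.asarray(std_devs, dtype=float) ** 2
    return v.max() / v.sum()

# Standard deviations reported by the 13 laboratories in Box 3
sds = [0.202, 0.402, 0.332, 0.236, 0.318, 0.452, 0.210,
       0.074, 0.525, 0.067, 0.609, 0.246, 0.198]

C = cochran_C(sds)
n_bar = 85 / 13          # average number of results per group (85 determinations)
print(round(C, 3), round(n_bar, 2))   # approx. 0.252 and 6.54; critical value 0.23
```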

Conclusions
• Always check your data for transcription errors. Outlier tests can help to identify them as part of a quality control check.
• Delete extreme values only when a technical reason for their aberrant behaviour can be found.
• Missing data can result in misinterpretation of the resulting statistics, so care should be taken with the method chosen to handle the gaps. If at all possible, further experiments should be carried out to fill in the missing points.
• Outlier tests assume the data distribution is known. This assumption should be checked for validity before these tests are applied.
• Robust statistics avoid the need to use outlier tests by down-weighting the effect of extreme values.
• When knowledge about the underlying data distribution is limited, non-parametric methods should be used.

NB: It should be noted that, following a judgement in a US court, the Food and Drug Administration (FDA), in its guide to inspection of pharmaceutical quality control laboratories, has specifically prohibited the use of outlier tests.

Acknowledgement
The preparation of this paper was supported under a contract with the UK's Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (14).

References


(1) S. Burke, Scientific Data Management, 1(1), 32–38 (1997).
(2) S. Burke, Scientific Data Management, 2(1), 36–41 (1998).
(3) S. Burke, Scientific Data Management, 2(2), 32–40 (1998).
(4) J.L. Schafer, Monographs on Statistics and Applied Probability 72: Analysis of Incomplete Multivariate Data, Chapman & Hall (1997), ISBN 0-412-04061-1.
(5) R.J.A. Little and D.B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons (1987), ISBN 0-471-80243-9.
(6) ISO 3534, Statistics: Vocabulary and Symbols, Part 1: Probability and General Statistical Terms, section 2.64, Geneva (1993).
(7) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry (1997), ISBN 0-85404-442-6.
(8) V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd edition, John Wiley (1994).
(9) W.H. Kruskal and J.M. Tanur, International Encyclopaedia of Statistics, Collier Macmillan Publishers (1978), ISBN 0-02-917960-2.
(10) Analytical Methods Committee, Robust Statistics: How Not to Reject Outliers, Part 2, Analyst, 114, 1693–1697 (1989).
(11) D.C. Hoaglin, F. Mosteller and J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley & Sons (1983), ISBN 0-471-09777-2.
(12) M. Hollander and D.A. Wolfe, Non-parametric Statistical Methods, Wiley & Sons, New York (1973).
(13) W.W. Daniel, Applied Non-parametric Statistics, Houghton Mifflin, Boston (1978).
(14) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist (Autumn 1995).

Types of comparison                        Parametric methods                                Non-parametric methods (12, 13)
Differences between independent groups     t-test for independent groups (2);                Wald–Wolfowitz runs test; Mann–Whitney U test;
of data                                    ANOVA/MANOVA (2)                                  Kolmogorov–Smirnov two-sample test;
                                                                                             Kruskal–Wallis analysis of ranks; Median test
Differences between dependent groups       t-test for dependent groups (2);                  Sign test; Wilcoxon's matched pairs test;
of data                                    ANOVA with replication (2)                        McNemar's χ² (chi-square) test;
                                                                                             Friedman's two-way ANOVA; Cochran Q test
Relationships between continuous           Linear regression (3);                            Spearman R; Kendall Tau; coefficient Gamma
variables                                  Correlation coefficient (3)
Relationships between counted variables                                                      χ² (chi-square) test; Phi coefficient;
                                                                                             Fisher exact test; Kendall coefficient of concordance
Homogeneity of variance                    Bartlett's test (7)                               Levene's test; Brown & Forsythe

table 5 Non-parametric alternatives to parametric statistical tests.

Regression and Calibration
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

One of the most frequently used statistical methods in calibration is linear regression. This third paper in our statistics refresher series concentrates on the practical applications of linear regression and the interpretation of the regression statistics.

Calibration is fundamental to achieving consistency of measurement. Often calibration involves establishing the relationship between an instrument response and one or more reference values. Linear regression is one of the most frequently used statistical methods in calibration. Once the relationship between the input value and the response value (assumed to be represented by a straight line) is established, the calibration model is used in reverse; that is, to predict a value from an instrument response. In general, regression methods are also useful for establishing relationships of all kinds, not just linear relationships. This paper concentrates on the practical applications of linear regression and the interpretation of the regression statistics. For those of you who want to know about the theory of regression there are some excellent references (1–6).

For anyone intending to apply linear least-squares regression to their own data, it is recommended that a statistics/graphics package is used. This will speed up the production of the graphs needed to confirm the validity of the regression statistics. The built-in functions of a spreadsheet can also be used if the routines have been validated for accuracy (e.g., using standard data sets (7)).

What is regression?
In statistics, the term regression is used to describe a group of methods that summarize the degree of association between one variable (or set of variables) and another variable (or set of variables). The most common statistical method used to do this is least-squares regression, which works by finding the "best curve" through the data that minimizes the sums of squares of the residuals. The important term here is the "best curve", not the method by which this is achieved. There are a number of least-squares regression models, for example, linear (the most common type), logarithmic, exponential and power. As already stated, this paper will concentrate on linear least-squares regression.

[You should also be aware that there are other regression methods, such as ranked regression, multiple linear regression, non-linear regression, principal-component regression, partial least-squares regression, etc., which are useful for analysing instrument or chemically derived data, but are beyond the scope of this introductory text.]

What do the linear least-squares regression statistics mean?
Correlation coefficient: Whether you use a calculator's built-in functions, a spreadsheet or a statistics package, the first statistic most chemists look at when performing this analysis is the correlation coefficient (r). The correlation coefficient ranges from −1, a perfect negative relationship, through zero (no relationship), to +1, a perfect positive relationship (Figures 1(a–c)). The correlation coefficient is, therefore, a measure of the degree of linear relationship between two sets of data. However, the r value is open to misinterpretation (8) (Figures 1(d) and (e) show instances in which the r values alone would give the wrong impression of the underlying relationship). Indeed, it is possible for several different data sets to yield identical regression statistics (r value, residual sum of squares, slope and intercept), but still not satisfy the linear assumption in all cases (9). It, therefore, remains essential to plot the data in order to check that linear least-squares statistics are appropriate.

As in the t-tests discussed in the first paper (10) in this series, the statistical significance of the correlation coefficient is dependent on the number of data points. To test if a particular r value indicates a statistically significant relationship we can use the Pearson's correlation coefficient test (Table 1). Thus, if we only have four points (for which the number of degrees of freedom is 2) a linear least-squares correlation coefficient of −0.94 will not be significant at the 95% confidence level. However, if there are more than 60 points an r value of just 0.26 (r² = 0.0676) would indicate a significant, but not very strong, positive linear relationship. In other words, a relationship can be statistically significant but of no practical value. Note that the test used here simply shows whether two sets of data are linearly related; it does not "prove" linearity or adequacy of fit.

It is also important to note that a significant correlation between one variable and another should not be taken as an indication of causality. For example, there is a negative correlation between time (measured in months) and catalyst performance in car exhaust systems. However, time is not the cause of the deterioration; it is the build-up of sulfur and phosphorus compounds that gradually poisons the catalyst. Causality is, in fact, very difficult to prove unless the chemist can vary systematically and independently all critical parameters, while measuring the response for each change.
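The significance check in Table 1 is easy to script. The following is a minimal sketch, assuming scipy is available and using invented calibration-style data; it computes r and its p value directly, which is equivalent to comparing |r| against the tabulated critical value.

```python
import numpy as np
from scipy import stats

# Illustrative data pairs (not taken from the article)
x = np.array([0.05, 0.1, 0.2, 0.4, 0.8, 1.6])
y = np.array([0.011, 0.020, 0.044, 0.085, 0.171, 0.340])

r, p = stats.pearsonr(x, y)
print(f"r = {r:.4f}, p = {p:.2e}")   # p < 0.05 indicates a significant linear relationship

# Equivalent hand test: t = r*sqrt(n-2)/sqrt(1-r^2), compared with t-tables (n-2 df)
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
print(f"t = {t:.2f} on {n - 2} degrees of freedom")
```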

Slope and intercept
In linear regression the relationship between the X and Y data is assumed to be represented by a straight line, Y = a + bX (see Figure 2), where Y is the estimated response/dependent variable, b is the slope (gradient) of the regression line and a is the intercept (the Y value when X = 0). This straight-line model is only appropriate if the data approximately fit the assumption of linearity. This can be tested for by plotting the data and looking for curvature (e.g., Figure 1(d)) or by plotting the residuals against the predicted Y values or X values (see Figure 3).

Although the relationship may be known to be non-linear (i.e., follow a different functional form, such as an exponential curve), it can sometimes be made to fit the linear assumption by transforming the data in line with the function, for example, by taking logarithms or squaring the Y and/or X data. Note that if such transformations are performed, weighted regression (discussed later) should be used to obtain an accurate model. Weighting is required because of changes in the residual/error structure of the regression model. Using non-linear regression may, however, be a better alternative to transforming the data when this option is available in the statistical packages you are using.

Residuals and residual standard error
A residual value is calculated by taking the difference between the predicted value and the actual value (see Figure 2). When the residuals are plotted against the predicted (or actual) data values the plot becomes a powerful diagnostic tool, enabling patterns and curvature in the data to be recognized (Figure 3). It can also be used to highlight points of influence (see Bias, leverage and outliers overleaf).

The residual standard error (RSE, also known as the residual standard deviation, RSD) is a statistical measure of the average residual. In other words, it is an estimate of the average error (or deviation) about the regression line. The RSE is used to calculate many useful regression statistics, including confidence intervals and outlier test values:

RSE = s(y) × √[(1 − r²) × (n − 1)/(n − 2)]

where s(y) is the standard deviation of the y values in the calibration, n is the number of data pairs and r is the least-squares regression correlation coefficient.

Confidence intervals
As with most statistics, the slope (b) and intercept (a) are estimates based on a finite sample, so there is some uncertainty in the values. (Note: strictly, the uncertainty arises from random variability between sets of data. There may be other uncertainties, such as measurement bias, but these are outside the scope of this article.) This uncertainty is quantified in most statistical routines by displaying the confidence limits and other statistics, such as the standard error and p values. Examples of these statistics are given in Table 2.

Degrees of freedom (n−2)    95% (α = 0.05)    99% (α = 0.01)
2                           0.950             0.990
3                           0.878             0.959
4                           0.811             0.917
5                           0.754             0.875
6                           0.707             0.834
7                           0.666             0.798
8                           0.632             0.765
9                           0.602             0.735
10                          0.576             0.708
11                          0.553             0.684
12                          0.532             0.661
13                          0.514             0.641
14                          0.497             0.623
15                          0.482             0.606
20                          0.423             0.537
30                          0.349             0.449
40                          0.304             0.393
60                          0.250             0.325
Significant correlation when |r| ≥ table value.

table 1 Pearson's correlation coefficient test. [The table is accompanied by a graph of these critical values of the correlation coefficient (r) against degrees of freedom (n−2), with curves for the 95% and 99% confidence levels.]
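As a quick numerical check of the RSE identity above, the following numpy sketch (with invented data) computes the RSE both directly from the residuals and via the s(y), r and n formula; the two routes agree for a simple unweighted straight-line fit.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0.10, 0.19, 0.28, 0.42, 0.51, 0.58, 0.72, 0.84, 0.92, 1.05])

b, a = np.polyfit(x, y, 1)                  # slope and intercept
residuals = y - (a + b * x)
rse_direct = np.sqrt((residuals**2).sum() / (len(x) - 2))

r = np.corrcoef(x, y)[0, 1]
n = len(x)
rse_formula = y.std(ddof=1) * np.sqrt((1 - r**2) * (n - 1) / (n - 2))

print(rse_direct, rse_formula)              # the two values coincide
```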

The p value is the probability that a value could arise by chance if the true value was zero. By convention, a p value of less than 0.05 indicates a significant non-zero statistic. Thus, examining the spreadsheet's results (Table 2), we can see that there is no reason to reject the hypothesis that the intercept is zero, but there is a significant non-zero positive gradient/relationship. The confidence interval for the regression line can be plotted for all points along the x-axis and is dumbbell in shape (Figure 2). In practice, this means that the model is more certain in the middle than at the extremes, which in turn has important consequences for extrapolating relationships.

When regression is used to construct a calibration model, the calibration graph is used in reverse (i.e., we predict the X value from the instrument response [Y value]). This prediction has an associated uncertainty (expressed as a confidence interval):

Xpredicted = (Ȳ − a) / b

Conf. interval for the prediction: Xpredicted ± (t × RSE / b) × √[1/m + 1/n + (Ȳ − ȳ)² / (b² × (n − 1) × s(x)²)]

where a is the intercept and b is the slope obtained from the regression equation; Ȳ is the mean value of the response (e.g., instrument readings) for m replicates (replicates are repeat measurements made at the same level); ȳ is the mean of the y data for the n points in the calibration; t is the critical value obtained from t-tables for n−2 degrees of freedom; s(x) is the standard deviation of the x data for the n points in the calibration; and RSE is the residual standard error for the calibration.

[figure 2 Calibration graph: the fitted line Y = −0.046 + 0.1124 × X with r = 0.98731; the plot labels the intercept, slope and residuals, and shows the confidence limits for the regression line and the wider confidence limits for the prediction.]

[figure 3 Residuals plot: residuals versus X for the calibration data, with one possible outlier marked.]

[figure 1 Correlation coefficients and goodness of fit: (a) r = −1; (b) r = 0; (c) r = +1; (d) r = 0 for a curved relationship; (e) r = 0.99 where the relationship is nonetheless not linear; (f) r = 0.9 with an offset (bias) point; (g) r = 0.9 with a leverage point.]
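By way of illustration, here is a hedged numpy/scipy sketch of the inverse prediction and its confidence interval, following the two formulas above; the calibration data and replicate readings are invented, and the t value is taken from scipy rather than printed tables.

```python
import numpy as np
from scipy import stats

# Calibration standards (illustrative data) and least-squares fit
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0.10, 0.19, 0.28, 0.42, 0.51, 0.58, 0.72, 0.84, 0.92, 1.05])
b, a = np.polyfit(x, y, 1)
n = len(x)
rse = np.sqrt(((y - (a + b * x))**2).sum() / (n - 2))

# m replicate readings of the unknown
readings = np.array([0.44, 0.46, 0.45])
Y_bar, m = readings.mean(), len(readings)

x_pred = (Y_bar - a) / b
t = stats.t.ppf(0.975, n - 2)       # 95%, two-sided
half_width = (t * rse / b) * np.sqrt(
    1/m + 1/n + (Y_bar - y.mean())**2 / (b**2 * (n - 1) * x.var(ddof=1)))
print(f"x = {x_pred:.3f} +/- {half_width:.3f}")
```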

If we want, therefore, to reduce the size of the confidence interval of the prediction there are several things that can be done.
1. Make sure that the unknown determinations of interest are close to the centre of the calibration (i.e., close to the values x̄, ȳ [the centroid point]). This suggests that if we want a small confidence interval at low values of x then the standards/reference samples used in the calibration should be concentrated around this region. For example, in analytical chemistry, a typical pattern of standard concentrations might be 0.05, 0.1, 0.2, 0.4, 0.8, 1.6 (i.e., only one or two standards are used at higher concentrations). While this will lead to a smaller confidence interval at lower concentrations, the calibration model will be prone to leverage errors (see below).
2. Increase the number of points in the calibration (n). There is, however, little improvement to be gained by going above 10 calibration points unless standard preparation and analysis is rapid and cheap.
3. Increase the number of replicate determinations for estimating the unknown (m). Once again there is a law of diminishing returns, so the number of replicates should typically be in the range 2 to 5.
4. The range of the calibration can be extended, providing the calibration is still linear.

Bias, leverage and outliers
Points of influence, which may or may not be outliers, can have a significant effect on the regression model and, therefore, on its predictive ability. If a point is in the middle of the model (i.e., close to x̄) but outlying on the Y axis, its effect will be to move the regression line up or down. The point is then said to have influence because it introduces an offset (or bias) in the predicted values (see Figure 1(f)). If the point is towards one of the extreme ends of the plot its effect will be to tilt the regression line. The point is then said to have high leverage because it acts as a lever and changes the slope of the regression model (see Figure 1(g)). Leverage can be a major problem if one or two data points are a long way from all the other points along the X axis.

A leverage statistic (ranging between 1/n and 1) can be calculated for each value of x. There is no set value above which this leverage statistic indicates a point of influence; a value of 0.9 is, however, used by some statistical software packages:

Leverage(i) = 1/n + (xi − x̄)² / Σ (xj − x̄)²   (summed over j = 1 to n)

where xi is the x value for which the leverage statistic is to be calculated, n is the number of points in the calibration and x̄ is the mean of all the x values in the calibration.

To test if a data point (xi, yi) is an outlier (relative to the regression model) the following outlier test can be applied:

Test value = residual(max) / (RSE × √[1 − 1/n − (Yi − ȳ)² / ((n − 1) × s(y)²)])

where RSE is the residual standard error, s(y) is the standard deviation of the Y values, Yi is the y value, n is the number of points, ȳ is the mean of all the y values in the calibration and residual(max) is the largest residual value.

For example, the test value for the suspected outlier in Figure 3 is 1.78 and the critical value is 2.37 (Table 3 for 10 data points). Although the point appears extreme, it could reasonably be expected to arise by chance within the data set.

Extrapolation and interpolation
We have already mentioned that the regression line is subject to some uncertainty and that this uncertainty becomes greater at the extremes of the line. If we, therefore, try to extrapolate much beyond the point where we have real data (10%) there may be relatively large errors associated with the predicted value. Conversely, interpolation near the middle of the calibration will minimize the prediction uncertainty. It follows, therefore, that when constructing a calibration graph, the standards should cover a larger range of concentrations than the analyst is interested in. Alternatively, several calibration graphs covering smaller, overlapping, concentration ranges can be constructed.

[figure 4 Plots of typical instrument response versus concentration: (a) response versus concentration; (b) the corresponding residuals versus predicted value.]
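The leverage statistic and the outlier test above translate directly into code. The sketch below (numpy only, invented data) computes the leverage for each standard and the test value for the largest residual, which can then be compared with the Table 3 critical value.

```python
import numpy as np

def leverage(x):
    """Leverage statistic for each x value in a simple linear calibration."""
    x = np.asarray(x, dtype=float)
    return 1/len(x) + (x - x.mean())**2 / ((x - x.mean())**2).sum()

def outlier_test_value(x, y):
    """Test value for the largest residual, per the formula in the text."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b, a = np.polyfit(x, y, 1)
    res = y - (a + b * x)
    rse = np.sqrt((res**2).sum() / (n - 2))
    i = np.argmax(np.abs(res))          # point with the largest residual
    denom = rse * np.sqrt(1 - 1/n - (y[i] - y.mean())**2 / ((n - 1) * y.var(ddof=1)))
    return abs(res[i]) / denom

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0.10, 0.19, 0.28, 0.42, 0.51, 0.58, 0.72, 0.84, 0.92, 1.05])
print(leverage(x).round(3))
print(outlier_test_value(x, y))   # compare with Table 3 (2.37 for n = 10 at 95%)
```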

             Coefficients     Standard Error   t Stat          p value        Lower 95%       Upper 95%
Intercept    -0.046000012     0.039648848      -1.160185324    0.279423552    -0.137430479    0.045430455
Slope        0.112363638      0.00638999       17.58432015     1.11755E-07    0.097628284     0.127098992

*Note the large number of significant figures. In fact none of the values above warrant more than 3 significant figures!

table 2 Statistics obtained using the Excel 5.0 regression analysis function from the data used to generate the calibration graph in Figure 2.
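The same quantities can be generated outside a spreadsheet. The following sketch reproduces the Table 2 statistics for any (x, y) calibration data using numpy and scipy; the example data are invented, so the printed numbers will differ from Table 2 itself.

```python
import numpy as np
from scipy import stats

def regression_table(x, y, conf=0.95):
    """Slope/intercept with standard errors, t statistics, p values and confidence limits."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b, a = np.polyfit(x, y, 1)
    res = y - (a + b * x)
    rse = np.sqrt((res**2).sum() / (n - 2))
    sxx = ((x - x.mean())**2).sum()
    se_b = rse / np.sqrt(sxx)                          # standard error of the slope
    se_a = rse * np.sqrt(1/n + x.mean()**2 / sxx)      # standard error of the intercept
    t_crit = stats.t.ppf(0.5 + conf/2, n - 2)
    for name, est, se in (("Intercept", a, se_a), ("Slope", b, se_b)):
        t_stat = est / se
        p = 2 * stats.t.sf(abs(t_stat), n - 2)
        print(f"{name:9s} {est:10.4f} {se:10.4f} {t_stat:8.2f} {p:10.2e} "
              f"[{est - t_crit*se:8.4f}, {est + t_crit*se:8.4f}]")

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0.10, 0.19, 0.28, 0.42, 0.51, 0.58, 0.72, 0.84, 0.92, 1.05]
regression_table(x, y)
```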

Weighted linear regression and calibration
In analytical science we often find that the precision changes with concentration. In particular, the standard deviation of the data is often proportional to the magnitude of the value being measured (see Figure 4(a)). A residuals plot will tend to show this relationship even more clearly (Figure 4(b)). When this relationship is observed (or if the data have been transformed before regression analysis), weighted linear regression should be used for obtaining the calibration curve (3). The following description shows how weighted regression works. Don't be put off by the equations, as most modern statistical software packages will perform the calculations for you. They are only included in the text for completeness.

Weighted regression works by giving points known to have a better precision a higher weighting than those with lower precision. During method validation, the way the standard deviation varies with concentration should have been investigated. This relationship can then be used to calculate the initial weightings, wi = 1/si², at each of the n standard concentrations in the calibration. These initial weightings can then be standardized, by multiplying by the number of calibration points divided by the sum of all the weights, to give the final weights (Wi):

Wi = wi × n / Σ wj   (summed over j = 1 to n)

The regression model generated will be similar to that for non-weighted linear regression. The prediction confidence intervals will, however, be different. The weighted prediction (xw) for a given instrument reading (y), for the regression model forcing the line through the origin (y = bx), is:

X(w)predicted = Ȳ / b(w)

with

b(w) = Σ Wi xi yi / Σ Wi xi²   (both sums over i = 1 to n)

where Ȳ is the mean value of the response (e.g., instrument readings) for m replicates, and xi and yi are the data pair for the ith point.

By assuming the regression line goes through the origin a better estimate of the slope is obtained, providing that the assumption of a zero intercept is correct. This may be a reasonable assumption in some instrument calibrations. However, in most cases, the regression line will no longer represent the least-squares "best line" through the data.

Sample size (n)    95%     99%
5                  1.74    1.75
6                  1.93    1.98
7                  2.08    2.17
8                  2.20    2.23
9                  2.29    2.44
10                 2.37    2.55
12                 2.49    2.70
14                 2.58    2.82
16                 2.66    2.92
18                 2.72    3.00
20                 2.77    3.06
25                 2.88    3.25
30                 2.96    3.36
35                 3.02    3.40
40                 3.08    3.43
45                 3.12    3.47
50                 3.16    3.51
60                 3.23    3.57
70                 3.29    3.62
80                 3.33    3.68
90                 3.37    3.73
100                3.41    3.78

table 3 Outlier test for simple linear least-squares regression: critical (confidence table) values. [The table is accompanied by a graph of these test values against the number of samples (n) for the 95% and 99% confidence levels.]



The associated uncertainty for the weighted prediction, expressed as a confidence interval, is then:

Conf. interval for the prediction: X(w)predicted ± (t × RSE(w) / b(w)) × √[1/(m × Wi) + Ȳ² / (b(w)² × Σ Wj xj²)]   (sum over j = 1 to n)

where t is the critical value obtained from t-tables for n−2 degrees of freedom at a stated significance level (typically α = 0.05), Wi is the weight associated with the x data for the ith point in the calibration, m is the number of replicates, and RSE(w) is the weighted residual standard error for the calibration:

RSE(w) = √[(Σ Wj yj² − b(w)² × Σ Wj xj²) / (n − 1)]   (both sums over j = 1 to n)
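Pulling the weighted-regression pieces together, here is a hedged numpy/scipy sketch of the through-origin weighted fit and its prediction interval, following the formulas above. The standards, replicate standard deviations and unknown readings are invented, and one choice is an assumption on my part rather than something the text specifies: Wi for the prediction is taken as the weight of the standard nearest the predicted concentration.

```python
import numpy as np
from scipy import stats

# Standards (illustrative): concentrations, mean responses, replicate std devs
x = np.array([0.05, 0.1, 0.2, 0.4, 0.8, 1.6])
y = np.array([0.012, 0.021, 0.043, 0.088, 0.168, 0.345])
s = np.array([0.001, 0.001, 0.002, 0.004, 0.008, 0.017])

n = len(x)
w = 1 / s**2                      # initial weightings, wi = 1/si^2
W = w * n / w.sum()               # standardized (final) weights, Wi

b_w = (W * x * y).sum() / (W * x**2).sum()            # weighted slope through origin
rse_w = np.sqrt(((W * y**2).sum() - b_w**2 * (W * x**2).sum()) / (n - 1))

readings = np.array([0.050, 0.052, 0.049])            # m replicates of the unknown
Y_bar, m = readings.mean(), len(readings)
x_pred = Y_bar / b_w

Wi = W[np.argmin(np.abs(x - x_pred))]                 # assumed: weight of nearest standard
t = stats.t.ppf(0.975, n - 2)
half_width = (t * rse_w / b_w) * np.sqrt(1/(m * Wi) + Y_bar**2 / (b_w**2 * (W * x**2).sum()))
print(f"x = {x_pred:.4f} +/- {half_width:.4f}")
```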
Conclusions
• Always plot the data. Don't rely on the regression statistics to indicate a linear relationship. For example, the correlation coefficient is not a reliable measure of goodness-of-fit.
• Always examine the residuals plot. This is a valuable diagnostic tool.
• Remove points of influence (leverage, bias and outlying points) only if a reason can be found for their aberrant behaviour.
• Be aware that a regression line is an estimate of the "best line" through the data and that there is some uncertainty associated with it. The uncertainty, in the form of a confidence interval, should be reported with the interpolated result obtained from any linear regression calibration.

Acknowledgement
The preparation of this paper was supported under a contract with the Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (11).

References
(1) G.W. Snedecor and W.G. Cochran, Statistical Methods, The Iowa State University Press, USA, 6th edition (1967).
(2) N. Draper and H. Smith, Applied Regression Analysis, John Wiley & Sons Inc., New York, USA, 2nd edition (1981).
(3) BS ISO 11095: Linear Calibration Using Reference Materials (1996).
(4) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
(5) A.R. Hoshmand, Statistical Methods for Environmental and Agricultural Sciences, 2nd edition, CRC Press, ISBN 0-8493-3152-8 (1998).
(6) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK, ISBN 0-85404-442-6 (1997).
(7) Statistical Software Qualification: Reference Data Sets, Eds B.P. Butler, M.G. Cox, S.L.R. Ellison and W.A. Hardcastle, Royal Society of Chemistry, London, UK, ISBN 0-85404-422-1 (1996).
(8) H. Sahai and R.P. Singh, Virginia J. Sci., 40(1), 5–9 (1989).
(9) F.J. Anscombe, Graphs in Statistical Analysis, American Statistician, 27, 17–21, February 1973.
(10) S. Burke, Scientific Data Management, 1(1), 32–38, September 1997.
(11) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist (Autumn 1995).

Shaun Burke currently works in the Food Technology Department of RHM Technology Ltd, High Wycombe, Buckinghamshire, UK. However, these articles were produced while he was working at LGC, Teddington, Middlesex, UK (http://www.lgc.co.uk).
