A Protocol For The Validation of Qualitative Methods of Analysis
is given by

p̄ = (p_1 + p_2 + … + p_L) / L (Equation 1)

Calculate the observed standard deviation s across the values of p_i (i = 1 to L). This includes the contribution from between-laboratory variation in the probability of detection and the uncertainty about the probability of detection in each laboratory which is associated with the finite number of replicate analyses used to estimate each probability.
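The calculation of p̄ and s can be sketched in Python (a sketch, not part of the protocol; the variable and function names are ours):

```python
# Sketch of Equation 1 and the observed standard deviation s, assuming
# x[i] positive results from n[i] replicate analyses in laboratory i.
from statistics import mean, stdev

def pbar_and_s(x, n):
    """Return (pbar, s): the mean and sample standard deviation of the
    per-laboratory probabilities of detection p_i = x_i / n_i."""
    p = [xi / ni for xi, ni in zip(x, n)]
    return mean(p), stdev(p)

# Example: 3 laboratories, 20 replicates each
pbar, s = pbar_and_s([18, 20, 16], [20, 20, 20])
```

Here the three laboratories give p_i of 0.9, 1.0 and 0.8, so p̄ = 0.9 and s = 0.1.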
Prediction interval across laboratories case 1: different values for p_i
The expected range of probabilities of detection across laboratories can be
calculated as follows. First calculate the standard deviation s of the estimates of p_i
across laboratories (if all estimates of p_i are equal to 0 or to 1, use case 2 below).
Then calculate

v_s = p̄ (p̄(1 − p̄)/s² − 1) (Equation 2)

w_s = (1 − p̄) (p̄(1 − p̄)/s² − 1) (Equation 3)

v_H = X + 0.5 (Equation 4)

w_H = N − X + 0.5 (Equation 5)
N is the total number of results and X is the total number of positive results, both
parameters being pooled from all laboratories. v_s and w_s are the parameters of a
beta distribution with mean equal to p̄ and standard deviation equal to s. v_H and w_H
are the parameters of a beta distribution that describes the uncertainty associated
with the average probability of detection [13], assuming no between-laboratory
variation and reflecting exclusively the sampling error.
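The beta parameters can be sketched in Python (a sketch based on the method-of-moments relations between a beta distribution's parameters and its mean and standard deviation; function and variable names are ours):

```python
# Beta parameters for Equations 2-5. pbar and s are the mean and standard
# deviation of the per-laboratory p_i; X and N are the pooled counts of
# positive results and of all results.
def beta_parameters(pbar, s, X, N):
    m = pbar * (1 - pbar) / s**2 - 1      # common factor in Equations 2 and 3
    v_s = pbar * m                        # Equation 2
    w_s = (1 - pbar) * m                  # Equation 3
    v_H = X + 0.5                         # Equation 4
    w_H = N - X + 0.5                     # Equation 5
    return v_s, w_s, v_H, w_H
```

As a self-check, Beta(v_s, w_s) then has mean v_s/(v_s + w_s) = p̄ and variance v_s·w_s/((v_s + w_s)²(v_s + w_s + 1)) = s².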
A prediction interval for the expected range of probabilities of detection across
laboratories can be calculated using the inverse beta distribution with shape
parameters v and w. The inverse beta distribution is available in Excel, in the free
open source software packages R and OpenOffice.org, and in many other packages
(see Appendix 1).
Calculate lower and upper limits for the prediction from the observed between-
laboratory variation using

lowerlimit_s = BETAINV(LC, v_s, w_s) (Equation 6)

upperlimit_s = BETAINV(UC, v_s, w_s) (Equation 7)

and from the pure sampling error using

lowerlimit_H = BETAINV(LC, v_H, w_H) (Equation 8)

upperlimit_H = BETAINV(UC, v_H, w_H) (Equation 9)

where LC is the percentile value of the lower end of a confidence interval (usually
5%) and UC is the percentile value of the upper end of a confidence interval (usually
95%), and then

lowerlimit = min(lowerlimit_s, lowerlimit_H) (Equation 10)

upperlimit = max(upperlimit_s, upperlimit_H) (Equation 11)
Hence upper limits are taken as the maximum of the estimated upper ends of the
confidence intervals (Equation 7 or 9) for the average probability of detection
(calculated by Equation 1) and for the probability of detection in a single laboratory;
lower limits are taken as the minimum of the estimated lower ends of the confidence
intervals (Equation 6 or 8) for the average probability of detection and for the
probability of detection in a single laboratory.
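Equations 6 to 11 can be sketched in Python. The standard library has no inverse beta function (in practice BETAINV in Excel or qbeta in R would be used, see Appendix 1), so this sketch approximates the beta CDF by trapezoidal integration of the density and inverts it by bisection; function names are ours:

```python
# Sketch of Equations 6-11 with a home-made inverse beta function.
import math

def beta_cdf(x, v, w, steps=4000):
    """Regularized incomplete beta I_x(v, w) by trapezoidal integration.
    Accurate enough for moderate v, w > 1; a statistics library should be
    preferred in real use."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    lnB = math.lgamma(v) + math.lgamma(w) - math.lgamma(v + w)
    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp((v - 1) * math.log(t) + (w - 1) * math.log(1 - t) - lnB)
    h = x / steps
    return sum(0.5 * (pdf(i * h) + pdf((i + 1) * h)) * h for i in range(steps))

def betainv(p, v, w):
    """Beta quantile by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, v, w) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def prediction_limits(v_s, w_s, v_H, w_H, LC=0.05, UC=0.95):
    lower_s = betainv(LC, v_s, w_s)   # Equation 6
    upper_s = betainv(UC, v_s, w_s)   # Equation 7
    lower_H = betainv(LC, v_H, w_H)   # Equation 8
    upper_H = betainv(UC, v_H, w_H)   # Equation 9
    return min(lower_s, lower_H), max(upper_s, upper_H)  # Equations 10 and 11
```

For the worked value in Appendix 1 (v = 10, w = 2, 95th percentile) this bisection reproduces 0.9667.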
Range across laboratories case 2: all p_i = 0 or all p_i = 1
All laboratories participating in a trial may achieve 100% positive results for samples
containing higher concentrations of analyte and, for a good method, 100% negative
results for samples that do not contain the analyte. For these sets of results use [13]

for p̄ = 1:

lowerlimit = LC^(1/N) (Equation 10b)

upperlimit = 1 (Equation 11b)

for p̄ = 0:

lowerlimit = 0 (Equation 10c)

upperlimit = 1 − LC^(1/N) (Equation 11c)
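These limiting cases can be sketched in Python (function name is ours, and the one-sided LC^(1/N) form follows the limits given for Equations 10b to 11c):

```python
# Sketch of Equations 10b-11c: prediction limits when every laboratory
# reports all-positive (pbar = 1) or all-negative (pbar = 0) results.
def case2_limits(pbar, N, LC=0.05):
    """Return (lowerlimit, upperlimit) for the all-ones / all-zeros cases."""
    if pbar == 1:
        return LC ** (1.0 / N), 1.0           # Equations 10b and 11b
    if pbar == 0:
        return 0.0, 1.0 - LC ** (1.0 / N)     # Equations 10c and 11c
    raise ValueError("use case 1 when 0 < pbar < 1")
```

With N = 60 and LC = 0.05 this reproduces the 0.9513 lower limit on the probability of detection and the 0.0487 upper limit on the false positive probability quoted in Example 4.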
Plotting the prediction intervals
The performance of the analytical method is described graphically by plotting the
results and the limits for probabilities of detection, and using these to estimate the
values of the limit of detection and false positive probability, as shown in Figures 1
to 3.
1) For each concentration plot the estimates of the probability of detection in
each laboratory p_1, p_2, …, p_L (Figure 1)
2) plot p̄, upperlimit and lowerlimit and join the averages and limits across
concentrations by linear interpolation (Figure 2)
3) The estimated average false positive probability is given by the intercept of
the p̄ line on the y-axis. The upper 95% confidence limit for the false positive
probability in an individual laboratory is given by the intercept of the upperlimit
line on the y-axis (0.032 in Figure 3). The estimated average limit of detection
is given by the concentration at which the p̄ line crosses the required
probability of detection. For instance, at the 95% confidence level, the estimated
average limit of detection is 0.089% MBM (Figure 3). The estimated upper
95% confidence limit for the limit of detection in an individual laboratory is
given by the concentration at which the lowerlimit line crosses the required
probability of detection (0.122% MBM in Figure 3).
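Reading the limit of detection off the interpolated curve, as in step 3, can be sketched in Python (the concentrations and probabilities below are hypothetical, not from the MBM study):

```python
# Sketch of step 3: find the concentration at which a linearly
# interpolated probability-of-detection curve reaches a target probability.
def crossing_concentration(concs, probs, target):
    """Return the first concentration at which the interpolated line
    reaches target, or None if it never does."""
    for (c0, p0), (c1, p1) in zip(zip(concs, probs), zip(concs[1:], probs[1:])):
        if p0 < target <= p1:
            # linear interpolation within the segment
            return c0 + (c1 - c0) * (target - p0) / (p1 - p0)
    return None

# Hypothetical curve: crosses 0.95 between concentrations 0.05 and 0.1
lod = crossing_concentration([0.0, 0.05, 0.1], [0.1, 0.9, 0.97], 0.95)
```

The same routine applied to the lowerlimit curve gives the upper confidence limit for the limit of detection in an individual laboratory.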
Assessment of between laboratory variation
4) The variation displayed by estimates of the probability of detection across
laboratories (s) has two sources: between-laboratory variation (s_L) and the
binomial sampling variation associated with the finite number of analyses
used to estimate p_i (s_B). The use of an analysis of variance to test whether
there is significant between-laboratory variation is demonstrated by Wilrich
[10] and Wehling et al [11].
An alternative graphical approach is used here: an interval is plotted which
gives the variation we can expect to see in results if there is no
between-laboratory variation, the probability of detection in all laboratories is
equal to the average probability of detection across laboratories, and the
observed variation in probability of detection is just what we would expect to
see when using a small number of tests to estimate it in each laboratory. The
size of the interval is estimated using the beta-binomial distribution (Appendix
2), which gives estimates of confidence intervals for the number of positive
results x_L (lower limit) and x_U (upper limit) per laboratory where there is no
between-laboratory variation. Plot x_L/n and x_U/n alongside the observed
probabilities of detection. Where the observed probabilities fall within this
interval, the observed variation can be explained by sampling error alone, and we may
conclude that increasing the number of analyses will lead to better estimates
of the probability of detection.
5) Report a) the scope of the validation, b) estimates of the likely range of false
positive rates and limits of detection when the method is used in practice, and c) an
opinion on whether undertaking further analyses is likely to change the
conclusions of the study.
6) For results from an absolute minimum study design, calculate the mean upper
(95%) and lower (5%) limits of the prediction interval for the false positive
probability and probability of detection at the target limit of detection using
Equations 1 to 11.
Examples
Example 1: Detection of meat and bone meal in animal feed
A collaborative trial (18 laboratories, 20 replicate samples at 7 levels including zero)
of a PCR-based method to detect the presence of meat and bone meal (MBM) in
animal feed yielded the following results (Table 1).
Note that we do not recommend that validation results are reported using the format of Table 1
because information about the performance of individual laboratories is lost. This format is
used here because it is a compact way of giving data which can be reconstituted to allow
readers of this paper to repeat the data analysis methods used in the draft validation protocol.
For example, Table 2 gives reconstituted results for the analysis of samples containing
0.01% MBM.
Table 2 shows an example of individual test results, reconstituted from Table 1,
which can be used as inputs for the candidate validation protocol.
Tables 3 and 4 show the results of applying Equations 1 to 11 from the draft protocol
to results reconstituted from Table 1.
Figures 1 to 3, which we used to illustrate the draft protocol, are based on this
example. Figure 1 shows the observed probabilities of detection for laboratories
across the seven concentrations of MBM used in the validation study. Figure 3
shows that the observed probabilities of detection tell us that we can expect limits of
detection for individual laboratories using this method to be no higher (95%
confidence) than 0.12 % MBM extracted from animal feed and may be as low as
0.036% MBM extracted from animal feed for some laboratories. We can expect that
the false positive probability will be less than 0.032 for 95% of laboratories.
Figure 4 shows that there is more variation in results than can be explained by the
uncertainty about the mean probability of detection and binomial sampling variation
associated with 20 replicates per laboratory for samples containing up to 0.1% MBM.
Hence, the size of the interval within which the limit of detection may lie is likely to be
driven by variation in method performance (probability of detection changing
between laboratories) rather than uncertainty about performance that could be
reduced by further measurements.
Example 2: Detection of peanut protein
A collaborative trial (18 laboratories, 5 replicate samples at 7 levels including zero) of
a dipstick test for an allergen (peanut) in cookies [14] yielded the following results
(Table 5).
Estimates of statistical parameters and prediction intervals for the probability of a
positive response are shown in Tables 6 and 7.
Figure 5 shows that the observed probabilities of detection tell us that we can expect
limits of detection for individual laboratories using this method to be no higher (95%
confidence) than 29 mg/kg peanut in cookies, and may be as low as 13 mg/kg in
some laboratories. We can expect that the false positive probability will be less than
0.14 for 95% of laboratories.
Figure 6 shows that a large part of the observed variation can be explained by the
uncertainty about the mean probability of detection and the binomial sampling
variation associated with 5 replicates per laboratory for most concentrations. Hence,
if the uncertainty associated with the estimated limit of detection is too large then
estimates of method performance may be improved by undertaking further studies
using more replicate samples. This might be the case if the target limit of detection
were around 20 mg/kg.
The effect of the low number of replicates is particularly clear for results produced by
the analysis of cookies that did not contain peanut. Here 2 out of 90 samples
gave a positive response, giving an estimated average false positive probability of
0.022 with a 95% confidence interval of 0.005 to 0.069. The problem is that the
estimated probability of a positive response for those two laboratories that each
produced a single positive response is 0.2, and these high estimates have a big
impact on the upper limit for the estimated probability of detection. In short, the
results are consistent with a low-between-laboratory-variation false positive
probability of less than 1% (probably good enough for a screening method), or a
much higher (estimated upper 95% interval of 0.14) false positive rate for some
laboratories, and we cannot tell the difference because the number of replicates per
laboratory is low. In order to defend against these kinds of outcomes, where
preliminary work leads to the expectation of a few false positive or false negative
results, a larger number of replicate analyses should be undertaken in each
laboratory (see Assessment of draft protocol performance and study design).
Example 3: A collaborative trial for the detection of salmonella in ground beef
A collaborative trial (11 laboratories, 6 replicate samples at 3 levels including zero) of
a method for the detection of salmonella in ground beef [11] yielded the following
results (Table 8), following the removal of results produced by one laboratory whose
performance was judged to be inconsistent with the other laboratories.
Calculation of the upper and lower limits (5% and 95%) for the estimated probability of
detection at each concentration is shown in Tables 9 and 10. We were unable to use
the results to provide an estimate of the limit of detection where the mean probability
of detection was 95% or an upper limit for the expected limit of detection when the
method is applied in a new laboratory (Figure 7). We were able to estimate that the
limit of detection, when the method is applied in a new laboratory, can be expected
(95% confidence) to be higher than 8 cfu/25g and that the false positive probability
was likely (95% confidence) to be less than 0.05.
The observed variation between laboratories can be explained by the uncertainty
about the mean probability of detection and the binomial sampling variation
associated with 6 replicates per laboratory for all concentrations (Figure 8). Hence
estimates of method performance may be improved by undertaking further studies
using more replicate samples, possibly including some higher concentration
samples. However, we cannot be confident that this will be fruitful unless a limit of
detection larger than 8 cfu/25g is considered fit for purpose.
Example 4: The minimum study
Minimum studies may be designed using the relation

N ≥ log(1 − C) / log(1 − F) (Equation 12)

where N is the number of tests undertaken, F is the maximum acceptable false
response probability for a fit-for-purpose method, and C is the confidence required
that the false response probability is below F. The target false positive rate is
confirmed if all negative samples give a negative result, and the target limit of
detection is confirmed if all positive samples give a positive response.
For example, if we require 95% confidence that a false positive probability is less
than 0.05 then

N ≥ log(1 − 0.95) / log(1 − 0.05) = 58.4

which we round up to the nearest whole number of replicates (59), and finally
distribute replicates evenly between laboratories, which may make some further
rounding up necessary.
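Equation 12 and the rounding step can be sketched in Python (function name is ours):

```python
# Sketch of Equation 12: the minimum number of tests N needed for
# confidence C that the false response probability is below F, assuming
# every test gives the expected result.
import math

def minimum_tests(C, F):
    """Smallest whole number of tests satisfying N >= log(1-C)/log(1-F)."""
    return math.ceil(math.log(1 - C) / math.log(1 - F))
```

For C = 0.95 and F = 0.05 this gives 58.4 rounded up to 59, as in the worked example above.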
Results of a fictional study designed to test whether the false positive probability given
by two methods was less than 0.05, and whether the probability of detecting 1 mg/kg of
substance was at least 0.95 (10 laboratories, 6 replicates per laboratory at two
levels: zero and the target limit of detection), are shown in Table 11. Despite the tests
being fictional, one laboratory failed to provide results for negative samples analyzed
using method B and two laboratories observed one false negative result each using
method B.
Probabilities of detection were calculated using Equations 1 to 11 in the usual way
(Tables 11-15). All method A analyses produced expected results giving an
estimated false positive probability of less than (95% confidence) 0.0487 and an
estimated probability of detection for 1 mg/kg of analyte of at least (95% confidence)
0.9513. Hence a limit of detection of no more than 1 mg/kg was estimated.
The missing results from one laboratory using method B mean that the results do not
confirm that the false positive probability is less than 0.05. Similarly the two negative
responses produced by the analysis of the positive samples have a large effect on
the estimated range of values for the probability of detection that we may expect to
see when the method is used in practice (95% confidence interval 0.82 to 1.00).
Hence the absolute minimum design is most useful where both the false positive
probability and the false negative probability at the target limit of detection are thought
to be low enough that unexpected results are unlikely. Low enough
means very close to zero, e.g. less than 0.0005 where 60 samples are analyzed at
the limit of detection and at zero concentration.
Assessment of draft protocol performance and validation study size
We assessed the performance of the draft protocol, and how it is influenced by
validation study size, by examining how well the upper 95% confidence limit of the
estimated probability of a positive response represented the expected 95th percentile
of simulated laboratories. This gave us an estimate of how reliable the upper limit of
the estimated false positive probability would be when assessing the fitness for
purpose of the method. By symmetry, this also gave us an estimate of how reliable
the lower limit of the estimated probability of detection (at high probabilities) would
be when used to estimate an upper limit for the limit of detection.
We used a number of scenarios. Each scenario consisted of an average probability
of a positive response (0.1, 0.05, 0.01) with beta-distributed between-laboratory
variation such that the probability of a positive response at the 95th percentile of
laboratories was double the average (0.2, 0.1, 0.02). The protocol was applied to
each scenario using a range of numbers of laboratories (5, 10, 15, 20, 30, 50, 100)
and replicate analyses within each laboratory (5, 10, 20, 50, 100). Parameters
describing the scenarios are given in Table 16.
The simulation algorithm was:

For each number of laboratories (L = 5, 10, 15, 20, 30, 50, 100), number of
replicates per laboratory (n = 5, 10, 20, 50, 100), and scenario (p̄ = 0.1, 0.05,
0.01):
1) Generate a probability of detection for each laboratory p_i (i = 1 to L) by
selecting a random value from the beta distribution.
2) Generate a number of positive and negative responses from n Bernoulli
trials per laboratory with probability p_i (i = 1 to L).
3) Analyze the responses using the draft protocol.
4) Report the estimated probability of a positive response for the upper
95th percentile of laboratories.
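One iteration of the data-generating steps of this algorithm can be sketched in Python (the beta parameters a and b standing in for a scenario are hypothetical, and the seed is fixed only for reproducibility):

```python
# Sketch of steps 1 and 2 of the simulation algorithm: draw each
# laboratory's probability of detection from a beta distribution, then
# simulate n Bernoulli replicate analyses per laboratory.
import random

def simulate_study(L, n, a, b, seed=1):
    """Return the number of positive results in each of L laboratories."""
    rng = random.Random(seed)
    positives = []
    for _ in range(L):
        p_i = rng.betavariate(a, b)                      # step 1
        x_i = sum(rng.random() < p_i for _ in range(n))  # step 2
        positives.append(x_i)
    return positives

x = simulate_study(L=10, n=20, a=2.0, b=18.0)
```

Steps 3 and 4 then apply Equations 1 to 11 to the simulated counts exactly as for real trial data.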
10000 estimates of the probability of a positive response for the upper 95th percentile of
laboratories were produced for each scenario and study design; these were
summarized by their mean (solid line) and their 5th and 95th percentiles (dotted lines)
(Figures 9 to 11).
The general pattern is that increasing the number of laboratories and the number of
replicate analyses reduces the size of the variation associated with estimates. Also,
estimates of the probability of a positive response for the upper 95th percentile of
laboratories become increasingly conservative where low numbers of replicate
analyses are used. For example, for the scenario with p̄ = 0.2 and the 95th percentile of
laboratories at p = 0.4, the average of the estimates of the 95th percentile of laboratories
lies at approximately 0.60 for 5 replicates per laboratory and at 0.46 for 20 replicates
per laboratory. This is because low numbers of replicates add additional variation to
estimates of the probability of detection.
This means that a finding that a method gives fit-for-purpose results is likely to be
safe, even for studies based on lower numbers of replicate analyses and laboratories, if
we use the estimated upper limit, at 95% confidence, to assess the fitness for purpose of
a method's false positive probability and the lower limit, at 95% confidence, to
estimate the limit of detection. Larger studies (more replicates and laboratories) should
be used to give more assurance that methods which do have fit-for-purpose
performance will be found to produce fit-for-purpose results. For example, suppose we
decide that validation studies should be conservative, but should give results which
demonstrate that a method with an average false response probability (positive at
zero, negative at the limit of detection) of 0.01, and with the upper 95th percentile of
laboratories having a false response probability of 0.02, should on average be
expected to provide an estimate of the false response probability at the upper 95th
percentile of laboratories of 0.05; then 20 replicate analyses per laboratory should be
used, analyzed across as many laboratories as can be managed (Figure 12). In general,
examining performance where the true value is close to zero or one requires the use
of a larger number of replicates within each laboratory, and then the use of as many
laboratories as can be afforded given the use of approximately 20 replicate samples
per laboratory.
Conclusions
We have presented a draft protocol for the validation of qualitative analytical
methods which can be applied to the validation of methods by collaborative trial or
single-laboratory validation. The analysis of results using the draft protocol is based
on the estimation of the average probability of a positive response and the observed
reproducibility standard deviation. Because a single estimate of the probability of
detection is produced in each laboratory (or in each group in a single-laboratory
study) when replicate analyses are undertaken under repeatability conditions, an
analysis of variance such as that used in the IUPAC/ISO/AOAC harmonized protocol
for the validation of quantitative methods is not useful. An estimate of the
reproducibility standard deviation is gained instead by calculating the mean and the
standard deviation of the probabilities of detection across laboratories (or groups, for
single-laboratory studies). Then a simple plot of confidence intervals for the probability
of detection across laboratories, based on the beta distribution, against the
concentration of analyte is used to provide an estimate of the range of limits of
detection and false positive probabilities that we can expect to see when the validated
method is applied in different laboratories.
The draft protocol has been applied to a number of trials based on the analysis of
samples at three to seven levels of analyte concentration, using five to 20 replicate
analyses per laboratory, in 10 to 18 laboratories, for PCR-based detection of DNA in
animal feed, ELISA-based detection of food allergens, and the detection of
salmonella in ground beef.
A simulation study showed that the draft protocol tends to perform conservatively.
This means that if the protocol is applied to a study based on a low number of
laboratories and replicate analyses, and estimates of the limit of detection and false
positive probability are sufficiently low, then the conclusion that the method is fit for
purpose is likely to be safe. Increasing study size, in particular by increasing the
number of replicate analyses per laboratory to 20, gives a better chance that good-
enough methods will give results that lead to a favorable assessment.
Hence, we believe that this protocol strikes the right balance between the three
competing goals for a standard method: to give correct answers; have a broad scope
of application; and be accessible to a wide range of users.
Acknowledgements
We would like to thank Dr Stéphane Pietrevalle for his excellent advice. This
research has been supported by funding from IUPAC (IUPAC project 2005-024-2-
600, Establishment of guidelines for the validation of qualitative and semi-quantitative
(screening) methods by collaborative trial: a harmonized protocol) and the EU 6th
Framework Network of Excellence Project MoniQA (Food-CT-2006-036337,
Monitoring and Quality Assurance in the Food Supply Chain,
http://www.moniqa.org/). Furthermore we are grateful to Gilbert Berben, Olivier
Fumière and Ana Boix for sharing with us the results from a collaborative study for
the validation of a PCR method for the detection of meat and bone meal in
feedingstuffs conducted within the EU 6th Framework project Safeed-PAP (FOOD-
CT-2006-036221, Detection of presence of species-specific processed animal
proteins in animal feed, http://safeedpap.feedsafety.org).
Appendix 1: Calculating the inverse beta distribution function
Example
The value of the inverse beta distribution at the 95th percentile with shape
parameters v = 10, w = 2 is 0.9667.
Excel and OpenOffice
In Excel (Excel 2003 onwards) [15] the value of the inverse beta distribution function,
with shape parameters v and w, at probability x is given by

BETAINV(x,v,w)

For example, if cell A1 contains the probability at which the function is to be
evaluated (x), cell B1 contains the shape parameter v and cell C1 contains the shape
parameter w, then use

=BETAINV(A1,B1,C1)

If an error is generated then instead use the symmetry of the beta distribution:

=1-BETAINV(1-A1,C1,B1)

The same function is also available in the Calc module of OpenOffice [16].
R
In R (2.8.1 onwards) [17] the value of the inverse beta distribution function, with
shape parameters v and w, at probability x is given by

qbeta(x,v,w)
Numerical approximation
If you do not have access to software for calculating the inverse beta distribution an
approximation based on the normal distribution and some arithmetic can be used
(from page 945 of Abramovitz and Stegun[18]):
, ,
. 2
Where
1
2 1
1
2 1
5
6
2
3
,
h
2
1
2v 1
1
2w 1
an
3
6
k is the inverse standard normal distribution function for (1-x). E.g for calculating the
5
th
percentile use k=1.645 and use k=-1.645 for the 95
th
percentile.
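The approximation can be sketched in Python, with the inverse standard normal supplied by the standard library's statistics.NormalDist (function name is ours):

```python
# Sketch of the Abramowitz & Stegun approximation to the inverse beta
# distribution described above.
import math
from statistics import NormalDist

def betainv_approx(x, v, w):
    k = NormalDist().inv_cdf(1 - x)   # inverse standard normal at (1 - x)
    lam = (k * k - 3) / 6
    h = 2 / (1 / (2 * v - 1) + 1 / (2 * w - 1))
    y = (k * math.sqrt(h + lam) / h
         - (1 / (2 * w - 1) - 1 / (2 * v - 1)) * (lam + 5 / 6 - 2 / (3 * h)))
    return v / (v + w * math.exp(2 * y))
```

For the worked example above (v = 10, w = 2, 95th percentile) this returns approximately 0.968, against the exact value 0.9667, illustrating the accuracy of the approximation for moderate shape parameters.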
Appendix 2: calculating the beta binomial distribution function and using it to
examine observed variation
The beta-binomial distribution can be used to estimate the maximum and minimum
expected number of positive results from n tests, given that we have observed X
positive results in N tests, as follows.
If we have observed X positive results in N tests then, assuming that the probability
of a positive result does not change, the natural log of the probability of observing i
positive results in n tests is given by

ln P(i | X, N) = lnΓ(n + 1) − lnΓ(i + 1) − lnΓ(n − i + 1) + lnΓ(X + 0.5 + i)
+ lnΓ(N − X + 0.5 + n − i) − lnΓ(N + 1 + n) + lnΓ(N + 1) − lnΓ(X + 0.5) − lnΓ(N − X + 0.5)

and the cumulative probability by

CP(i | X, N) = Σ_{j=0 to i} exp(ln P(j | X, N))

where lnΓ is the log gamma function.
Hence where X out of N tests are positive we can expect to see at least x_L
positive results out of n tests, where x_L is the lowest integer for which

CP(x_L | X, N) ≥ 0.05,

and no more than x_U + 1 positive results out of n tests, where x_U is the highest integer
for which

CP(x_U | X, N) ≤ 0.95.
For example, if 150 out of 300 tests have given a positive result and there are 10
tests per laboratory, then the probability of seeing no positives in a set of results from
a laboratory is

ln P(0 | 150, 300) = lnΓ(11) − lnΓ(1) − lnΓ(11) + lnΓ(150.5) + lnΓ(160.5) − lnΓ(311)
+ lnΓ(301) − lnΓ(150.5) − lnΓ(150.5)

giving P(0 | 150, 300) = 0.001129. The full set of probabilities is:

i    probability   cumulative probability
0    0.001129      0.001129
1    0.010652      0.011781
2    0.045817      0.057597
3    0.118299      0.175896
4    0.203055      0.378951
5    0.242098      0.621049
6    0.203055      0.824104
7    0.118299      0.942403
8    0.045817      0.988219
9    0.010652      0.998871
10   0.001129      1.000000

Hence a 90% confidence interval for the number of positive responses per laboratory
lies between 2/10 and 8/10 positive responses if the probability of a positive
response does not vary between laboratories.
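The beta-binomial calculation can be sketched in Python using the standard library's math.lgamma (function names are ours):

```python
# Sketch of the beta-binomial calculation in Appendix 2.
from math import lgamma, exp

def log_pmf(i, X, N, n):
    """Natural log of the probability of i positives in n tests, having
    observed X positives in N tests."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + lgamma(X + 0.5 + i) + lgamma(N - X + 0.5 + n - i)
            - lgamma(N + 1 + n)
            + lgamma(N + 1) - lgamma(X + 0.5) - lgamma(N - X + 0.5))

def interval(X, N, n, lc=0.05, uc=0.95):
    """Return (x_L, x_U): x_L is the lowest i at which the cumulative
    probability reaches lc; x_U is the highest i at which it is still
    no more than uc."""
    cum, x_L, x_U = 0.0, None, None
    for i in range(n + 1):
        cum += exp(log_pmf(i, X, N, n))
        if x_L is None and cum >= lc:
            x_L = i
        if cum <= uc:
            x_U = i
    return x_L, x_U

# Worked example above: 150 positives in 300 tests, 10 tests per laboratory
p5 = exp(log_pmf(5, 150, 300, 10))
xL, xU = interval(150, 300, 10)
```

For the worked example this gives x_L = 2 and x_U = 7, i.e. between 2 and x_U + 1 = 8 positive responses per laboratory, matching the table above.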
This approach uses the log gamma function which is available in many software
packages including the following.
Excel and OpenOffice
In Excel (Excel 2003 onwards) [15] the value of the log gamma function of x is given
by

GAMMALN(x)

For example, if cell A1 contains the value at which the function is to be evaluated (x),
then use

=GAMMALN(A1)

The same function is also available in the Calc module of OpenOffice [16].
R
In R (2.8.1 onwards) [17] the value of the log gamma function of x is given by
lgamma(x)
References
1 Lebesi, D., Dimakou, C., Alldrick, A.J., Oreopoulou, V., (2010), QAS, 2(4), 173-181
2 Horwitz, W., (1995), Pure Appl Chem, 67, 331-343
3 EURACHEM (2000) Quantifying Uncertainty in Analytical Measurement, 2nd Ed., http://www.eurachem.org/
4 Feinberg, M., Boulanger, B., Dewé, W., Hubert, P., (2004), Anal Bioanal Chem, 380, 502-514
5 Rose, M., Poms, R., Macarthur, R., Pöpping, B., Ulberth, F., QAS, accepted for publication
6 McCullagh, P., Nelder, J.A., (1989), Generalized Linear Models, 2nd edition, Chapman and Hall, London
7 Langton, S.D., Chevennement, R., Nagelkerke, N., Lombard, B., (2002), Int J Food Microbiol, 79, 171-181
8 van der Voet, H., van Raamsdonk, W.D., (2004), Int J Food Microbiol, 95, 231-234
9 Wilrich, C., Wilrich, P.-T., (2009), J AOAC Int, 92(6), 1763-1772
10 Wilrich, P.-T., (2010), Accred Qual Assur, 15, 439-444
11 Wehling, P., LaBudde, R.A., Brunelle, S.L., Nelson, M.T., (2011), J AOAC Int, 94(1), 335-347
12 LaBudde, R.A., (2009), Coverage Accuracy for Binomial Proportion 95% Confidence Intervals for 12 to 100 Replicates, TR297, Least Cost Formulations, Ltd, Virginia Beach, VA, http://www.lcfltd.com/Documents/tr297%20coverage%20accuracy%20binomial%20proportions.pdf
13 Brown, L.D., Cai, T.T., DasGupta, A., (2001), Stat Sci, 16(2), 101-133
14 van Hengel, A.J., Capelletti, C., Brohee, M., Anklam, E., (2006), J AOAC Int, 89(2), 462-468
15 Microsoft Excel online Help, http://office.microsoft.com/en-us/excel-help/betainv-HP005209001.aspx?CTT=1
16 Open Office, http://www.OpenOffice.org
17 The R project for statistical computing, http://www.r-project.org/
18 Abramowitz, M., Stegun, I., (1972), Handbook of Mathematical Functions, 10th edition