
Methods of Dealing with Values Below the Limit of Detection using SAS

Carry W. Croghan, US-EPA, Research Triangle Park, NC, and Peter P. Egeghy, US-EPA, Las Vegas, NV

ABSTRACT

Due to limitations of chemical analysis procedures, small concentrations cannot be precisely measured. These concentrations are said to be below the limit of detection (LOD). In statistical analyses, these values are often censored and substituted with a constant value, such as half the LOD, the LOD divided by the square root of 2, or zero. These methods for handling below-detection values result in two distributions: a uniform distribution for those values below the LOD, and the true distribution. As a result, they can produce questionable descriptive statistics, depending upon the percentage of values below the LOD. An alternative method uses the characteristics of the distribution of the values above the LOD to estimate the values below the LOD. This can be done with an extrapolation technique or maximum likelihood estimation. An example program using the same data is presented that calculates the mean, standard deviation, t-test, and relative difference in the means for the various methods and compares the results. The extrapolation and maximum likelihood estimation techniques have smaller error rates than all of the standard replacement techniques. Although more computational, these methods produce more reliable descriptive statistics.

INTRODUCTION

A common problem faced by researchers analyzing data in the environmental and occupational health fields is how to appropriately handle observations reported to have nondetectable levels of a contaminant (Hornung and Reed, 1990). Even with technical advances in the areas of field sampling, sample processing protocols, and laboratory instrumentation, the desire to identify and measure contaminants down to extraordinarily low levels means that there will always be a threshold below which a value cannot be accurately quantified (Lambert et al., 1991). As a value approaches zero, there is a point where any signal due to the contaminant cannot be differentiated from the background noise; that is, the precision of the instrument or method is not sufficient to discern the presence of the compound and quantify it to an acceptable level of accuracy. While one can differentiate between an "instrument limit of detection" and a "method limit of detection," it is the method limit of detection that is relevant to this discussion.

By strict definition, the limit of detection (LOD) is the level at which a measurement has a 95% probability of being different from zero (Taylor, 1987). There is no universal procedure for determining a method LOD, but it is broadly defined as the concentration corresponding to the mean blank response (that is, the mean response produced by blank samples) plus three standard deviations of the blank response. Various identification criteria (for example, retention time on a specific analytical column, proper ratio of ions measured by a mass spectrometer, etc.) must also be met to confirm the presence of the compound of interest. While it is almost always possible to calculate values below the method LOD, these values are not considered to be statistically different from zero, and are generally reported as "below detection limit" or "nondetect."

Although statistically different from zero, values near the LOD are generally less accurate and precise (that is, less reliable) than values that are much larger than the LOD. The limit of quantitation (LOQ) is the term often used by laboratories to indicate the smallest amount that they consider to be reliably quantifiable. There is often considerable confusion concerning the distinction between the two, but the LOQ is generally defined as some multiple of the LOD, usually three times the LOD. Estimates of concentrations above the LOD but below the LOQ are potentially problematic due to their higher relative uncertainty (compared to values above the LOQ); these values are rarely treated any differently from values above the LOQ.

Unfortunately, it is not as easy to use below-LOD values in data analysis as below-LOQ values. First, there is the standard laboratory practice of reporting "nondetect" instead of numbers for values below the LOD (often referred to as "censoring"). Second, even if numbers were reported, they may be highly unreliable (though many researchers might prefer highly unreliable over nothing). Thus, a researcher faced with data containing nondetects must decide how to appropriately use them in statistical analyses. One might reason that since the actual values of these observations are extraordinarily small, they are unimportant. However, these observations can still have a large effect on the parameters of the distribution of the entire set of observations. Treating them incorrectly may introduce severe bias when estimating the mean and variance of the distribution (Lyles et al., 2001), which may consequently distort regression coefficients and their standard errors and reduce power in hypothesis tests (Hughes, 2000). Depending upon the severity of the censoring, the distributional statistics may be highly questionable.

The problem of estimating the parameters of distributions with censored data has been extensively studied (Cohen, 1959; Gleit, 1985; Helsel and Gilliom, 1986; Helsel, 1990; Hornung and Reed, 1990; Lambert et al., 1991; Özkaynak et al., 1991; Finkelstein and Verma, 2001). The main approaches to handling left-censored data are: 1) simple replacement, 2) extrapolation, and 3) maximum likelihood estimation. The most common, and easiest, strategy is simple replacement, where censored values are replaced with zero, with some fraction of the detection limit (usually either 1/2 or 1/√2), or with the detection limit itself. Extrapolation strategies use regression or probability-plotting techniques to calculate the mean and standard deviation from the regression line of the observed, above-LOD values on their rank scores. A third strategy, maximum likelihood estimation (MLE), is also based on the statistical properties of the noncensored portion of the data, but uses an iterative process to solve for the mean and variance. Although the latter two methods do not generate replacement data for the censored observations, they can be modified to do so if needed.

The SAS language allows for both the statistical analyses and the data manipulations required to perform these techniques. Using SAS procedures and options for the statistical computations reduces the coding that other languages would require. The flexibility in outputting the information generated by the procedures allows the additional manipulations to be made easily within a datastep. The macro capability of SAS is pivotal for testing and comparing the different methods of handling censored data: macros make repetitive processing of procedures and datasteps possible and allow for inputting parameters. The flexibility and robustness of the SAS language make it a good choice for programming these types of data problems.
METHOD

To test the differences between the various techniques, distributions with known parameters are needed. Therefore, normally distributed datasets were generated with known means and standard deviations using the RANNOR function in SAS, with either 500 or 1000 observations. The mean and standard deviation combinations were arbitrarily chosen to be 4 and 1.5, 10 and 5, and 50 and 5.

Cut-off points were calculated at the 5th, 10th, 25th, 50th, and 75th percentiles of each distribution, determined using PROC MEANS. The cut-off points serve as surrogates for the limit of detection. These datasets were used to test each of the methods of dealing with the censored values.

To test the simple replacement techniques, a series of datasets was generated with the values below the cut-off points replaced by the standard replacement values: zero, the cut-off value divided by two, or the cut-off value divided by the square root of two. The means of the new datasets were compared to those of the parent dataset using a Student t-test. As in Newman et al. (1989), the significance level was set to 0.1. The percentage of times that the p-value was less than 0.1 was calculated for each cut-off level, each replacement technique, and each combination of cut-off level and replacement technique.

The extrapolation method calculates the mean and standard deviation from the regression line of the values on the inverse cumulative normal distribution of a function of the rank scores of the noncensored values. To perform this method, the values must first be rank ordered, including those that are censored. Given n samples of which k are censored, for the n-k values that are not censored, calculate the Blom estimate of the normal score:

   Ai = F^-1((i - 3/8) / (n + 1/4)),

where
   F^-1 is the inverse cumulative normal distribution function,
   i is the rank score for the value Xi,
   n is the size of the sample.

In SAS this translates to:

proc rank data=one out=two normal=blom;
   var X;
   ranks A;
run;

Then regress the X values against A using a simple linear regression model:

   Xi = a + b*Ai + ei.

In SAS use the following code:

proc reg data=two outest=stat2;
   model X = A;
run;

The mean of the noncensored distribution is estimated by the intercept, and the standard deviation is estimated by the slope of the line.

The maximum likelihood estimates (MLE) of the mean and standard deviation are solved in an iterative process from the following system of equations:

   µ = m - σ·(k/(n-k))·f(ε)/F(ε)
   σ² = [S² + (m - µ)²] / [1 + ε·(k/(n-k))·f(ε)/F(ε)]

where:
   ε = (LOD - µ) / σ,
   LOD = the limit of detection,
   f = the probability density function of the standard normal distribution,
   F = the cumulative distribution function of the standard normal distribution,
   m = the mean of all values above the LOD,
   S = the population standard deviation of all values above the LOD,
   n = the total number of samples, of which k are censored.

In SAS this is done using PROC LIFEREG:

proc lifereg data=ttest1 outest=stat3;
   model (lower, Y) = / d=normal;
run;

where:

   if X > cuts then lower = X;
   else lower = .;
   Y = (X > cuts)*X + (X <= cuts)*cuts;

(Y is X if X is greater than the cut-off; otherwise Y is set to the cut-off value.)

The methods are compared using two different error measurements. The first is the percentage of times the derived mean is significantly different from the true mean using a Student t-test at the 0.1 level. The second is the relative difference between the derived and true means. The mean relative difference is calculated over all the distribution parameters.

FINDINGS

If only a small percentage of the values has been censored, little bias is introduced by any of the replacement techniques (Table 1). At the 5th percentile cut-off, the derived mean is never significantly different from the true mean for any of the replacement techniques. At the 10th percentile, some significant differences are observed. The largest error rate (8%) is for replacement with zero for the combination of parameters N=1000, mean=4, and standard deviation=1.5.

Conversely, if the percentage of censored values is large, none of these replacement techniques is adequate. The percentage of times that the true mean is significantly different from the derived mean ranges from 33.87% to 100% over all the methods when the 75th percentile is used as the cut-off. Even at the 50th percentile, the error rate is high for some combinations of sample size, mean, and standard deviation. For N=1000, mean=50, and standard deviation=5, the error rate is 100%, indicating that all of the derived means are significantly different from the true mean.

A difference among the replacement techniques is also obvious at the 25th percentile cut-off level. Replacement with zero has the highest error rate, 33.33%; replacement with the limit of detection divided by the square root of 2 has an error rate of 0. The overall error rate is 15%.
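The testing procedure behind these error rates can be sketched end to end. The following Python illustration is not the paper's SAS macro: the trial count, seed, and the large-sample two-sided 0.10 critical value of 1.645 are assumptions made for a compact, dependency-free sketch. It generates parent datasets, censors them at a percentile cut-off, substitutes a replacement value, and counts significant t-tests:

```python
import random
import statistics

random.seed(2)
N, MU, SD, TRIALS = 500, 4.0, 1.5, 50
T_CRIT = 1.645  # approximate two-sided 0.10 critical value for large df

def pooled_t(x, y):
    """Two-sample pooled-variance Student t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (sp2 / nx + sp2 / ny) ** 0.5

def error_rate(pct, sub_of_lod):
    """Percent of trials whose derived mean differs significantly from the parent mean."""
    hits = 0
    for _ in range(TRIALS):
        parent = [random.gauss(MU, SD) for _ in range(N)]
        lod = sorted(parent)[int(N * pct / 100)]          # cut-off as surrogate LOD
        derived = [v if v >= lod else sub_of_lod(lod) for v in parent]
        hits += abs(pooled_t(parent, derived)) > T_CRIT
    return 100 * hits / TRIALS

# Replacement with zero: harmless at a light (5th percentile) cut-off,
# disastrous at a deep (75th percentile) cut-off -- the pattern in Table 1.
print(error_rate(5, lambda lod: 0.0), error_rate(75, lambda lod: 0.0))
```

The qualitative result matches Table 1 for mean=4, std=1.5: essentially no rejections at the 5th percentile cut-off, and rejection in every trial at the 75th.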
Table 1: Percentage of Student t-test p-values less than 0.10 for the replacement methods

N=500, mean=4, std=1.5
                    Replacement Value
Level      ZERO    LOD/2   LOD/√2  overall
05         0.00     0.00     0.00     0.00
10         0.00     0.00     0.00     0.00
25       100.00     0.40     0.00    33.47
50       100.00   100.00     0.00    66.67
75       100.00   100.00    40.80    80.27
overall   60.00    40.08     8.16    36.08

N=1000, mean=4, std=1.5
                    Replacement Value
Level      ZERO    LOD/2   LOD/√2  overall
05         0.00     0.00     0.00     0.00
10         7.60     0.00     0.00     2.53
25       100.00    69.20     0.00    56.40
50       100.00   100.00     0.00    66.67
75       100.00   100.00    90.00    96.67
overall   61.52    53.84    18.00    44.45

N=500, mean=10, std=5
                    Replacement Value
Level      ZERO    LOD/2   LOD/√2  overall
05         0.00     0.00     0.00     0.00
10         0.00     0.00     0.00     0.00
25         0.00     0.00     0.00     0.00
50        20.40     0.00     0.00     6.80
75       100.00     0.00     1.60    33.87
overall   24.08     0.00     0.32     8.13

N=1000, mean=10, std=5
                    Replacement Value
Level      ZERO    LOD/2   LOD/√2  overall
05         0.00     0.00     0.00     0.00
10         0.00     0.00     0.00     0.00
25         0.00     0.00     0.00     0.00
50       100.00     0.00     0.00    33.33
75       100.00     0.00    94.80    64.93
overall   40.00     0.00    18.96    19.65

N=500, mean=50, std=5
                    Replacement Value
Level      ZERO    LOD/2   LOD/√2  overall
05         0.00     0.00     0.00     0.00
10         0.00     0.00     0.00     0.00
25         0.00     0.00     0.00     0.00
50         0.00   100.00    99.60    66.53
75       100.00   100.00   100.00   100.00
overall   20.00    40.00    39.92    33.31

N=1000, mean=50, std=5
                    Replacement Value
Level      ZERO    LOD/2   LOD/√2  overall
05         0.00     0.00     0.00     0.00
10         0.00     0.00     0.00     0.00
25         0.00     0.00     0.00     0.00
50       100.00   100.00   100.00   100.00
75       100.00   100.00   100.00   100.00
overall   20.00    40.00    39.92    33.30

Overall (all sample types)
                    Replacement Value
Level      ZERO    LOD/2   LOD/√2  overall
05         0.00     0.00     0.00     0.00
10         1.27     0.00     0.00     0.42
25        33.33    11.60     0.00    14.98
50        70.07    66.67    33.27    56.67
75       100.00    66.67    71.20    79.29
overall   40.93    28.99    20.90    30.27

There is a difference in the overall error rates of the different replacement methods. Replacement with the limit of detection divided by the square root of two has the smallest overall error rate (21%), while replacement with zero has almost twice that overall error rate (41%).

The error rate is affected by the choice of mean and standard deviation. The mean of 10 and standard deviation of 5 produce the smallest overall error rate, 14%. Further investigation into the relationship between the distribution parameters and the error rate is needed to assess the cause of this effect and how to interpret it.

The extrapolation method (Table 2) is a marked improvement over the simple replacement techniques. The largest error rate is 18%, at the 75th percentile cut-off level. This method is not as sensitive to the mean and standard deviation of the original distribution. The overall error rate of 1.24% is markedly different from the rate of 20.90% for the best of the replacement techniques, LOD/√2.

Table 2: Percentage of Student t-test p-values less than 0.10 for the extrapolation method

Distribution                       Cut-off Level
parameters                  05    10    25    50    75   overall
N=500, mean=4, std=1.5     0.0   0.0   0.0   0.8  17.6    3.68
N=1000, mean=4, std=1.5    0.0   0.0   0.0   0.4  18.0    3.68
N=500, mean=10, std=5      0.0   0.0   0.0   0.0   0.0    0.0
N=1000, mean=10, std=5     0.0   0.0   0.0   0.0   0.0    0.0
N=500, mean=50, std=5      0.0   0.0   0.0   0.0   0.0    0.0
N=1000, mean=50, std=5     0.0   0.0   0.0   0.0   0.0    0.0
overall                    0.0   0.0   0.0   0.2   6.0    1.24

The method of maximum likelihood estimation (Table 3) has the best overall error rate (0.7%). Only at the 75th percentile cut-off level does the error rate differ from zero. There is, however, an effect of the choice of mean and standard deviation on the error rate, as with the replacement techniques. The highest error rate observed is 11%.

Table 3: Percentage of Student t-test p-values less than 0.10 for the maximum likelihood estimate

Distribution                       Cut-off Level
parameters                  05    10    25    50    75   overall
N=500, mean=4, std=1.5     0.0   0.0   0.0   0.0  10.8    2.2
N=1000, mean=4, std=1.5    0.0   0.0   0.0   0.0  10.8    2.2
N=500, mean=10, std=5      0.0   0.0   0.0   0.0   0.0    0.0
N=1000, mean=10, std=5     0.0   0.0   0.0   0.0   0.0    0.0
N=500, mean=50, std=5      0.0   0.0   0.0   0.0   0.0    0.0
N=1000, mean=50, std=5     0.0   0.0   0.0   0.0   0.0    0.0
overall                    0.0   0.0   0.0   0.0   3.6    0.7

Bias can also be measured with the percent relative difference between the derived and true means. Using this measure, the effect of the choice of parameters is removed. There is little difference between the extrapolation and MLE techniques (Table 4). The replacement techniques performed worse than the computational methods, with replacement by zero having the largest relative difference (Figure 1). The LOD/√2 replacement has the smallest relative difference of all the replacement methods using these data.

Table 4: Percent relative difference between derived and true means

Cut-off              Replacement Technique
Level     MLE      Ext     Zero    LOD/2   LOD/√2
5       -0.01     0.01    -1.67    -0.50    -0.01
10      -0.00     0.01    -4.32    -1.40    -0.18
25      -0.00     0.04   -14.71    -4.95    -0.90
50       0.06     0.13   -37.09   -12.08    -1.72
75       0.34     0.35   -64.73   -19.01    -0.07

Figure 1: Percent relative difference between the derived and true means

CONCLUSION

Normally distributed datasets were generated with three different mean and standard deviation combinations. These data were censored from 5 to 75 percent to test some common techniques for handling left-censored data. Simple substitution techniques are adequate if only a small percentage of the values has been censored. The error rate and the relative difference in the means both increase once the percentage of censored values reaches 25%. Replacement with LOD/√2 appears to be the best choice of replacement value. The reason for the large difference between the LOD/2 and LOD/√2 estimates is not readily apparent. The extrapolation and maximum likelihood techniques produce estimates with smaller bias and error rates than those from the replacement techniques. Furthermore, with both the extrapolation and maximum likelihood techniques, the error rate is less affected by the distribution parameters. The maximum likelihood technique appears to have worked best overall. Since the maximum likelihood technique is an iterative method, there are situations where there may not be a solution; the extrapolation technique, however, has a closed-form solution. In addition, the extrapolation technique is more intuitive and, thus, more transparent. Therefore, the extrapolation technique has some advantages over the maximum likelihood technique.

REFERENCES

Cohen AC Jr. Simplified estimators for the normal distribution when samples are singly censored or truncated. Technometrics 1:217-237, 1959.

Finkelstein MM, Verma DK. Exposure estimation in the presence of nondetectable values: another look. AIHAJ 62:195-198, 2001.

Gleit A. Estimation for small normal data sets with detection limits. Environ Sci Technol 19:1201-1206, 1985.

Helsel DR, Gilliom RJ. Estimation of distributional parameters for censored trace level water quality data: 2. Verification and applications. Water Resources J 22(2):147-155, 1986.

Helsel DR. Less than obvious: statistical treatment of data below the detection limit. Environ Sci Technol 24(12):1766-1774, 1990.

Hornung RW, Reed LD. Estimation of average concentration in the presence of nondetectable values. Appl Occup Environ Hyg 5(1):46-51, 1990.

Lambert D, Peterson B, Terpenning I. Nondetects, detection limits, and the probability of detection. J Am Stat Assoc 86(414):266-276, 1991.

Lyles RH, Fan D, Chuachoowong R. Correlation coefficient estimation involving a left censored laboratory assay variable. Statist Med 20:2921-2933, 2001.

Newman MC, Dixon PM, Looney BB, Pinder JE. Estimating mean and variance for environmental samples with below detection limit observations. Water Resources Bulletin 25(4):905-915, 1989.

Özkaynak H, Xue J, Butler DA, Haroun LA, MacDonell MM, Fingleton DJ. Addressing data heterogeneity: lessons learned from a multimedia risk assessment. Presented at the 84th Annual Meeting of the Air and Waste Management Association, Vancouver, BC, 1991.

Taylor JK. Quality Assurance of Chemical Measurements. Chelsea, MI: Lewis Publishers, 1987.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Author Name: Carry W. Croghan
Company: US-EPA
Address: HEASD (MD E205-1)
City state ZIP: RTP, NC 27709
Work Phone: (919) 541-3184
Fax: (919) 541-0239
Email: [email protected]

This paper has been reviewed in accordance with the United States Environmental Protection Agency's peer and administrative review policies and approved for presentation and publication. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
ATTACHED CODE
***************************;
proc means data=ttest1 noprint;
by group;
var below;
output out=stat1 n=n sum=k;
run;

***************************;
proc rank data=two out=temp2n normal=blom;
var x;
ranks prob;
run;
*****************************************;
proc reg data=temp3 outest=stat2 noprint;
   by group;
   model x = prob;
run;
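For readers outside SAS, the PROC RANK / PROC REG pair above amounts to an ordinary least-squares fit of the ordered data against Blom normal scores, with the intercept estimating the mean and the slope the standard deviation. A Python sketch of the same calculation (data, seed, and censoring fraction here are illustrative, not the paper's):

```python
import random
import statistics
from statistics import NormalDist

random.seed(3)
n = 1000
x = sorted(random.gauss(50, 5) for _ in range(n))
k = n // 4   # treat the lowest 25% as censored, for illustration

# Blom normal scores use the rank in the FULL sample, censored values included;
# only the noncensored (upper n-k) values enter the regression.
pairs = [(NormalDist().inv_cdf((i + 1 - 0.375) / (n + 0.25)), x[i])
         for i in range(k, n)]

# Ordinary least squares for X = a + b*A: a estimates the mean, b the std dev.
a_vals = [a for a, _ in pairs]
x_vals = [v for _, v in pairs]
abar, xbar = statistics.mean(a_vals), statistics.mean(x_vals)
b = (sum((a - abar) * (v - xbar) for a, v in pairs)
     / sum((a - abar) ** 2 for a in a_vals))
a0 = xbar - b * abar
print(f"estimated mean = {a0:.2f}, estimated std = {b:.2f}")
```

Despite 25% of the sample being censored, the fit recovers the generating mean (50) and standard deviation (5) closely, which is the essence of the extrapolation method.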

*****************************************;
*** Since there are only means and standard
    deviations rather than raw data, the
    t-test is calculated within a datastep. ***;

data testit;
   merge stat2
         stat1 (keep=group n k) end=eof;

   replace = substr(group, 1, 1);
   level   = substr(group, 2);

   ** Retain the values from the first record.
      This is the parent distribution to which
      all others are compared. **;
   retain mean1 s1 n1 ss1;

   if _n_ = 1 then do;
      mean1 = intercept;
      s1    = prob**2;       * the slope estimates the std dev, so square it for the variance;
      n1    = n;
      ss1   = s1*(n1 - 1);   * corrected sum of squares;
   end;
   else do;
      mean2 = intercept;
      s2    = prob**2;
      n2    = n;
      ss2   = s2*(n2 - 1);

      * pooled variance;
      sp2 = (ss1 + ss2) / (n1 + n2 - 2);

      *** Student t ***;
      t = (mean1 - mean2) / sqrt(sp2/n1 + sp2/n2);
      p = 2*(1 - probt(abs(t), n1 + n2 - 2));   * two-sided p-value;

      ** relative difference in the means **;
      bias = (mean2 - mean1) / mean1;
   end;
run;
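The datastep's t-test arithmetic can be cross-checked outside SAS. Here is a Python sketch with made-up summary statistics (the means, variances, and sample sizes below are hypothetical, and the p-value uses the large-sample normal approximation to the t distribution rather than PROBT):

```python
from statistics import NormalDist

# Hypothetical summary statistics for a parent and a derived distribution.
mean1, s1, n1 = 4.00, 1.5 ** 2, 500   # parent: mean, variance, n
mean2, s2, n2 = 3.60, 1.2 ** 2, 500   # derived: a censored-then-replaced version

ss1, ss2 = s1 * (n1 - 1), s2 * (n2 - 1)    # corrected sums of squares
sp2 = (ss1 + ss2) / (n1 + n2 - 2)          # pooled variance
t = (mean1 - mean2) / (sp2 / n1 + sp2 / n2) ** 0.5

# Two-sided p-value via the normal approximation (df is large here).
p = 2 * (1 - NormalDist().cdf(abs(t)))
print(f"t = {t:.2f}, p ~ {p:.4f}")
```

With these inputs the derived mean is significantly different from the parent mean at the paper's 0.1 level, the same decision rule used to build Tables 1-3.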
