
7 Linear Regression and Correlation

Simple linear regression analysis is a statistical technique that defines the functional relationship
between two variables, X and Y, by the “best-fitting” straight line. A straight line is described
by the equation, Y = A + BX, where Y is the dependent variable (ordinate), X is the independent
variable (abscissa), and A and B are the Y intercept and slope of the line, respectively (Fig. 7.1).∗
Applications of regression analysis in pharmaceutical experimentation are numerous. This
procedure is commonly used
1. to describe the relationship between variables where the functional relationship is known
to be linear, such as in Beer’s law plots, where optical density is plotted against drug
concentration;
2. when the functional form of a response is unknown, but where we wish to represent a trend
or rate as characterized by the slope (e.g., as may occur when following a pharmacological
response over time);
3. when we wish to describe a process by a relatively simple equation that will relate the
response, Y, to a fixed value of X, such as in stability prediction (concentration of drug
versus time).
In addition to the specific applications noted above, regression analysis is used to define
and characterize dose–response relationships, for fitting linear portions of pharmacokinetic
data, and in obtaining the best fit to linear physical–chemical relationships.
Correlation is a procedure commonly used to characterize quantitatively the relationship
between variables. Correlation is related to linear regression, but its application and interpreta-
tion are different. This topic is introduced at the end of this chapter.

7.1 INTRODUCTION
Straight lines are constructed from sets of data pairs, X and Y. Two such pairs (i.e., two points)
uniquely define a straight line. As noted previously, a straight line is defined by the equation

Y = A + B X, (7.1)

where A is the Y intercept (the value of Y when X = 0) and B is the slope (ΔY/ΔX). ΔY/ΔX is
(Y₂ − Y₁)/(X₂ − X₁) for any two points on the line (Fig. 7.1). The slope and intercept define the
line; once A and B are given, the line is specified. In the elementary example of only two points,
a statistical approach to define the line is clearly unnecessary.
In general, with more than two X, y points,† a plot of y versus X will not exactly describe
a straight line, even when the relationship is known to be linear. The failure of experimental
data derived from truly linear relationships to lie exactly on a straight line is due to errors
of observation (experimental variability). Figure 7.2 shows the results of four assays of drug
samples of different, but known potency. The assay results are plotted against the known
amount of drug. If the assays are performed without error, the plot results in a 45° line (slope
= 1) which, if extended, passes through the origin; that is, the Y intercept, A, is 0 [Fig. 7.2(A)].

∗ The notation Y = A + BX is standard in statistics. We apologize for any confusion that may result from the
reader’s familiarity with the equivalent, Y = mX + b, used frequently in analytical geometry.
† In the rest of this chapter, y denotes the experimentally observed point, and Y denotes the corresponding point
on the least squares “fitted” line (or the true value of Y, according to context).

Figure 7.1 Straight-line plot.

Figure 7.2 Plot of assay recovery versus known amount: theoretical and actual data.

In this example, the equation of the line Y = A + BX is Y = 0 + 1(X), or Y = X. Since there is no
error in this experiment, the line passes exactly through the four X, Y points.
Real experiments are not error free, and a plot of X, y data rarely exactly fits a straight
line, as shown in Figure 7.2(B). We will examine the problem of obtaining a line to fit data that
are not error free. In these cases, the line does not go exactly through all of the points. A “good”
line, however, should come “close” to the experimental points. When the variability is small, a
line drawn by eye will probably be very close to that constructed more exactly by a statistical
approach [Fig. 7.3(A)]. With large variability, the “best” line is not obvious. What single line
would you draw to best fit the data plotted in Figure 7.3(B)? Certainly, lines drawn through
any two arbitrarily selected points will not give the best (or a unique) line to fit the totality
of data.
Given N pairs of variables, X, Y, we can define the best straight line describing the
relationship of X and y as the line that minimizes the sum of squares of the vertical distances of
each point from the fitted line. The definition of “sum of squares of the vertical distances of each
point from the fitted line” (Fig. 7.4) is written mathematically as Σ(y − Y)², where y represents
the experimental points and Y represents the corresponding points on the fitted line. The line
constructed according to this definition is called the least squares line. Applying techniques of

Figure 7.3 Fit of line with variable data.

calculus, the slope and intercept of the least squares line can be calculated from the sample data
as follows:

$$\text{Slope} = b = \frac{\sum (X - \bar{X})(y - \bar{y})}{\sum (X - \bar{X})^2} \quad (7.2)$$

$$\text{Intercept} = a = \bar{y} - b\bar{X} \quad (7.3)$$

Remember that the slope and intercept uniquely define the line.
There is a shortcut computing formula for the slope, similar to that described previously
for the standard deviation:

$$b = \frac{N\sum Xy - (\sum X)(\sum y)}{N\sum X^2 - (\sum X)^2}, \quad (7.4)$$

Figure 7.4 Lack of fit due to (A) experimental error and (B) nonlinearity.

Table 7.1 Raw Data from Figure 7.2(A) to Calculate the Least Squares Line

Drug potency, X     Assay, y     Xy
60                  60            3600
80                  80            6400
100                 100          10,000
120                 120          14,400
ΣX = 360            Σy = 360     ΣXy = 34,400
ΣX² = 34,400

Table 7.2 Raw Data from Figure 7.2(B) Used to Calculate the Least Squares Line

Drug potency, X     Assay, y     Xy
60                  63            3780
80                  75            6000
100                 99            9900
120                 116          13,920
ΣX = 360            Σy = 353     ΣXy = 33,600
ΣX² = 34,400        Σy² = 32,851

where N is the number of X, y pairs. The calculation of the slope and intercept is relatively
simple, and can usually be quickly computed using a computer (e.g., EXCEL) or with a hand
calculator. Some calculators have a built-in program for calculating the regression parameter
estimates, a and b.‡
For the example shown in Figure 7.2(A), the line that exactly passes through the four data
points has a slope of 1 and an intercept of 0. The line, Y = X, is clearly the best line for these data,
an exact fit. The least squares line, in this case, is exactly the same line, Y = X. The calculation of
the intercept and slope using the least squares formulas, Eqs. (7.3) and (7.4), is illustrated below.
Table 7.1 shows the raw data used to construct the line in Figure 7.2(A).
According to Eq. (7.4) (N = 4, ΣX² = 34,400, ΣXy = 34,400, ΣX = Σy = 360),

$$b = \frac{(4)(3600 + 6400 + 10{,}000 + 14{,}400) - (360)(360)}{4(34{,}400) - (360)^2} = 1$$

a is computed from Eq. (7.3); a = ȳ − bX̄ (ȳ = X̄ = 90, b = 1): a = 90 − 1(90) = 0. This represents
a situation where the assay results exactly equal the known drug potency (i.e., there is no error).
The actual experimental data depicted in Figure 7.2(B) are shown in Table 7.2. The slope
b and the intercept a are calculated from Eqs. (7.4) and (7.3). According to Eq. (7.4),
$$b = \frac{(4)(33{,}600) - (360)(353)}{4(34{,}400) - (360)^2} = 0.915.$$

According to Eq. (7.3),

$$a = \frac{353}{4} - 0.915(90) = 5.9.$$
A perfect assay (no error) has a slope of 1 and an intercept of 0, as shown above. The actual
data exhibit a slope close to 1, but the intercept appears to be too far from 0 to be attributed to
random error. Exercise Problem 2 addresses the interpretation of these results as they relate to
assay method characteristics.
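As an illustration (not part of the original text), the slope and intercept computations above can be reproduced in a few lines of Python; numpy is an assumed environment:

```python
import numpy as np

X = np.array([60, 80, 100, 120], dtype=float)  # known drug potency (Table 7.2)
y = np.array([63, 75, 99, 116], dtype=float)   # assay results

N = len(X)
# Shortcut formula, Eq. (7.4)
b = (N * np.sum(X * y) - X.sum() * y.sum()) / (N * np.sum(X**2) - X.sum()**2)
# Intercept, Eq. (7.3)
a = y.mean() - b * X.mean()
print(b, a)  # 0.915 and 5.9, matching the hand computation
```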

‡ a and b are the sample estimates of the true parameters, A and B.



This example suggests several questions and problems regarding linear regression analy-
sis. The line that best fits the experimental data is an estimate of some true relationship between
X and Y. In most circumstances, we will fit a straight line to such data only if we believe that the
true relationship between X and Y is linear. The experimental observations will not fall exactly
on a straight line because of variability (e.g., error associated with the assay). This situation (true
linearity associated with experimental error) is different from the case where the underlying
true relationship between X and Y is not linear. In the latter case, the lack of fit of the data to the
least squares line is due to a combination of experimental error and the lack of linearity of the X,
Y relationship (Fig. 7.4). Elementary techniques of simple linear regression will not differentiate
these two situations: (a) experimental error with true linearity and (b) experimental error and
nonlinearity. (A design to estimate variability due to both nonlinearity and experimental error
is given in App. II.)
We will discuss some examples relevant to pharmaceutical research that make use of
least squares linear regression procedures. The discussion will demonstrate how variability is
estimated and used to construct estimates and tests of the line parameters A and B.

7.2 ANALYSIS OF STANDARD CURVES IN DRUG ANALYSIS: APPLICATION OF LINEAR REGRESSION
The assay data discussed previously can be considered as an example of the construction of a
standard curve in drug analysis. Known amounts of drug are subjected to an assay procedure,
and a plot of percentage recovered (or amount recovered) versus amount added is constructed.
Theoretically, the relationship is usually a straight line. A knowledge of the line parameters A
and B can be used to predict the amount of drug in an unknown sample based on the assay
results. In most practical situations, A and B are unknown. The least squares estimates a and b
of these parameters are used to compute drug potency (X) based on the assay response (y). For
example, the least squares line for the data in Figure 7.2(B) and Table 7.2 is

Assay result = 5.9 + 0.915 (potency). (7.5)

Rearranging Eq. (7.5), an unknown sample that has an assay value of 90 can be predicted
to have a true potency of
$$\text{Potency} = X = \frac{y - 5.9}{0.915}$$

$$\text{Potency} = \frac{90 - 5.9}{0.915} = 91.9.$$
This point (91.9, 90) is indicated in Figure 7.2 by a cross.

7.2.1 Line Through the Origin


Many calibration curves (lines) are known to pass through the origin; that is, the assay response
must be zero if the concentration of drug is zero. The calculation of the slope is simplified if the
line is forced to go through the point (0,0). In our example, if the intercept is known to be zero,
the slope is (Table 7.2)

$$b = \frac{\sum Xy}{\sum X^2} \quad (7.6)$$

$$= \frac{33{,}600}{60^2 + 80^2 + 100^2 + 120^2} = 0.977.$$
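A minimal sketch of the zero-intercept slope, Eq. (7.6), under the same assumptions as the previous snippet:

```python
import numpy as np

X = np.array([60, 80, 100, 120], dtype=float)
y = np.array([63, 75, 99, 116], dtype=float)

# Slope for a line forced through the origin, Eq. (7.6)
b0 = np.sum(X * y) / np.sum(X**2)
print(b0)  # ~0.977, as computed above
```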

The least squares line fitted with the zero intercept is shown in Figure 7.5. If this line
were to be used to predict actual concentrations based on assay results, we would obtain
answers that are different from those predicted from the line drawn in Figure 7.2(B). However,
both lines have been constructed from the same raw data. “Is one of the lines correct?” or “Is
one line better than the other?” Although one cannot say with certainty which is the better
line, a thorough knowledge of the analytical method will be important in making a choice.
Figure 7.5 Plot of data in Table 7.2 with known (0, 0) intercept.

For example, a nonzero intercept suggests either nonlinearity over the range of assays or the
presence of an interfering substance in the sample being analyzed. The decision of which
line to use can also be made on a statistical basis. A statistical test of the intercept can be
performed under the null hypothesis that the intercept is 0 (H0 : A = 0, sect. 7.4.1). Rejection of
the hypothesis would be strong evidence that the line with the positive intercept best represents
the data.

7.3 ASSUMPTIONS IN TESTS OF HYPOTHESES IN LINEAR REGRESSION


Although there are no prerequisites for fitting a least squares line, the testing of statistical
hypotheses in linear regression depends on the validity of several assumptions.

1. The X variable is measured without error. Although not always exactly true, X is often measured
with relatively little error and, under these conditions this assumption can be considered
to be satisfied. In the present example, X is the potency of drug in the “known” sample. If
the drug is weighed on a sensitive balance, the error in drug potency will be very small.
Another example of an X variable that is often used, which can be precisely and accurately
measured, is “time.”

2. For each X, y is independent and normally distributed. We will often use the notation Y.x to show
that the value of Y is a function of X.
3. The variance of y is assumed to be the same at each X. If the variance of y is not constant, but
is either known or related to X in some way, other methods (see sect. 7.7) are available to
estimate the intercept and slope of the line [1].
4. A linear relationship exists between X and Y. Y = A + BX, where A and B are the true parameters.
Based on theory or experience, we have reason to believe that X and Y are linearly related.

These assumptions are depicted in Figure 7.6. Except for location (mean), the distribution
of y is the same at every value of X; that is, y has the same variance at every value of X. In the
example in Figure 7.6, the mean of the distribution of y’s decreases as X increases (the slope
is negative).
LINEAR REGRESSION AND CORRELATION 153

Figure 7.6 Normality and variance assumptions in linear regression.

7.4 ESTIMATE OF THE VARIANCE: VARIANCE OF SAMPLE ESTIMATES OF THE PARAMETERS
If the assumptions noted in section 7.3 hold, the distributions of sample estimates of the slope
and intercept, b and a, are normal with means equal to B and A, respectively.§ Because of this
important result, statistical tests of the parameters A and B can be performed using normal
distribution theory. Also, one can show that the sample estimates are unbiased estimates of
the true parameters (similar to the sample average, X, being an unbiased estimate of the true
mean, μ). The variances of the estimates, a and b, are calculated as follows:

$$\sigma_a^2 = \sigma_{Y,x}^2\left[\frac{1}{N} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2}\right] \quad (7.7)$$

$$\sigma_b^2 = \frac{\sigma_{Y,x}^2}{\sum (X - \bar{X})^2}. \quad (7.8)$$

$\sigma_{Y,x}^2$ is the variance of the response variable, y. An estimate of $\sigma_{Y,x}^2$ can be obtained from the
closeness of the data to the least squares line. If the experimental points are far from the least
squares line, the estimated variability is larger than that in the case where the experimental
points are close to the least squares line. This concept is illustrated in Figure 7.7. If the data
exactly fit a straight line, the experiment shows no variability. In real experiments the chance of
an exact fit with more than two X, y pairs is very small. An unbiased estimate of $\sigma_{Y,x}^2$ is obtained
from the sum of squares of deviations of the observed points from the fitted line as follows:

$$S_{Y,x}^2 = \frac{\sum (y - Y)^2}{N - 2} = \frac{\sum (y - \bar{y})^2 - b^2\left[\sum (X - \bar{X})^2\right]}{N - 2}, \quad (7.9)$$
where y is the observed value and Y is the predicted value of Y from the least squares line
(Y = a + bX) (Fig. 7.7). The variance estimate, $S_{Y,x}^2$, has N − 2 rather than (N − 1) d.f. because
two parameters are being estimated from the data (i.e., the slope and intercept).
When $\sigma_{Y,x}^2$ is unknown, the variances of a and b can be estimated, substituting $S_{Y,x}^2$ for $\sigma_{Y,x}^2$
in the formulas for the variances [Eqs. (7.7) and (7.8)]. Equations (7.10) and (7.11) are used as
the variance estimates, $S_a^2$ and $S_b^2$, when testing hypotheses concerning the parameters A and B.
This procedure is analogous to using the sample estimate of the variance in the t test to compare
sample means.
$$S_a^2 = S_{Y,x}^2 \times \left[\frac{1}{N} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2}\right] \quad (7.10)$$

$$S_b^2 = \frac{S_{Y,x}^2}{\sum (X - \bar{X})^2} \quad (7.11)$$

§ a and b are calculated as linear combinations of the normally distributed response variable, y, and thus can be
shown to be also normally distributed.

Figure 7.7 Variance calculation from least squares line.

7.4.1 Test of the Intercept, A


The background and formulas introduced previously are prerequisites for the construction of
tests of hypotheses of the regression parameters A and B. We can now address the question
of the “significance” of the Y intercept (a) for the line shown in Figure 7.2(B) and Table 7.2.
The procedure is analogous to that of testing means with the t test. In this example, the null
hypothesis is H0: A = 0. The alternative hypothesis is Ha: A ≠ 0. Here the test is two-sided; a
priori, if the intercept is not equal to 0, it could be either positive or negative. A t test is performed
as shown in Eq. (7.12). $S_{Y,x}^2$ and $S_a^2$ are calculated from Eqs. (7.9) and (7.10), respectively.

$$t_{\text{d.f.}} = t_2 = \frac{|a - A|}{\sqrt{S_a^2}} \quad (7.12)$$

where $t_{\text{d.f.}}$ is the t statistic with N − 2 d.f., a is the observed value of the intercept, and A is the
hypothetical value of the intercept. From Eq. (7.10)

$$S_a^2 = S_{Y,x}^2 \times \left[\frac{1}{N} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2}\right]. \quad (7.10)$$

From Eq. (7.9)

$$S_{Y,x}^2 = \frac{1698.75 - (0.915)^2(2000)}{2} = 12.15$$

$$S_a^2 = 12.15\left[\frac{1}{4} + \frac{(90)^2}{2000}\right] = 52.245.$$

From Eq. (7.12)

$$t_2 = \frac{|5.9 - 0|}{\sqrt{52.245}} = 0.82.$$

Note that this t test has 2 (= N − 2) d.f. This is a weak test, and a large intercept must be
observed to obtain statistical significance. To define the intercept more precisely, it would be
necessary to perform a larger number of assays. If there is no reason to suspect a nonlinear
relationship between X and Y, a nonzero intercept, in this example, could be interpreted as
being due to some interfering substance(s) in the product (the “blank”). If the presence of a
nonzero intercept is suspected, one would probably want to run a sufficient number of assays
to establish its presence. A precise estimate of the intercept is necessary if this linear calibration
curve is used to evaluate potency.

7.4.2 Test of the Slope, B


The test of the slope of the least squares line is usually of more interest than the test of the
intercept. Sometimes, we may only wish to be assured that the fitted line has a slope other
than zero. (A horizontal line has a slope of zero.) In our example, there seems to be little doubt
that the slope is greater than zero [Fig. 7.2(B)]. However, the magnitude of this slope has a
special physical meaning. A slope of 1 indicates that the amount recovered (assay) is equal
to the amount in the sample, after correction for the blank (i.e., subtract the Y intercept from
the observed reading of y). An observation of a slope other than 1 indicates that the amount
recovered is some constant percentage of the sample potency. Thus we may be interested in a
test of the slope versus 1.

H0: B = 1    Ha: B ≠ 1

A t test is performed using the estimated variance of the slope, as follows:

$$t = \frac{b - B}{\sqrt{S_b^2}}. \quad (7.13)$$

In the present example, from Eq. (7.11),


$$S_b^2 = \frac{S_{Y,x}^2}{\sum (X - \bar{X})^2} \quad (7.11)$$

$$= \frac{12.15}{2000} = 0.006075.$$

Applying Eq. (7.13), for a two-sided test, we have

$$t = \frac{|0.915 - 1|}{\sqrt{0.006075}} = 1.09.$$
This t test has 2 (= N − 2) d.f. (the variance estimate has 2 d.f.). There is insufficient evidence
to indicate that the slope is significantly different from 1 at the 5% level. Table IV.4 shows that
a t of 4.30 is needed for significance at α = 0.05 and d.f. = 2. The test in this example has very
weak power. A slope very different from 1 would be necessary to obtain statistical significance.
This example again emphasizes the weakness of the statement “nonsignificant,” particularly
in small experiments such as this one. The reader interested in learning more details of the
use and interpretation of regression in analytical methodology is encouraged to read chapter 5
in Ref. [2].
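Both t tests of this section can be reproduced programmatically. The sketch below is an illustration, not from the text; scipy is an assumed dependency, used only for the tabled t value:

```python
import numpy as np
from scipy import stats

X = np.array([60, 80, 100, 120], dtype=float)  # Table 7.2 data
y = np.array([63, 75, 99, 116], dtype=float)

N = len(X)
b = (N * np.sum(X * y) - X.sum() * y.sum()) / (N * np.sum(X**2) - X.sum()**2)
a = y.mean() - b * X.mean()

Sxx = np.sum((X - X.mean())**2)                             # 2000
S2_yx = (np.sum((y - y.mean())**2) - b**2 * Sxx) / (N - 2)  # Eq. (7.9), ~12.15

t_a = abs(a - 0) / np.sqrt(S2_yx * (1/N + X.mean()**2 / Sxx))  # Eq. (7.12), ~0.82
t_b = abs(b - 1) / np.sqrt(S2_yx / Sxx)                        # Eq. (7.13), ~1.09

t_crit = stats.t.ppf(0.975, df=N - 2)  # 4.30 for 2 d.f. (Table IV.4)
print(t_a, t_b, t_crit)                # neither test reaches significance
```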

7.5 A DRUG STABILITY STUDY: A SECOND EXAMPLE OF THE APPLICATION OF LINEAR REGRESSION
The measurement of the rate of drug decomposition is an important problem in drug formu-
lation studies. Because of the significance of establishing an expiration date defining the shelf
life of a pharmaceutical product, stability data are routinely subjected to statistical analysis.
Typically, the drug, alone and/or formulated, is stored under varying conditions of tempera-
ture, humidity, light intensity, and so on, and assayed for intact drug at specified time intervals.
The pharmaceutical scientist is assigned the responsibility of recommending the expiration
date based on scientifically derived stability data. The physical conditions of the stability test
(e.g., temperature, humidity), the duration of testing, assay schedules, as well as the number of
lots, bottles, and tablets that should be sampled must be defined for stability studies. Careful
definition and implementation of these conditions are important because the validity and pre-
cision of the final recommended expiration date depends on how the experiment is conducted.
Drug stability is discussed further in section 8.7.
The rate of decomposition can often be determined from plots of potency (or log potency)
versus storage time, where the relationship of potency and time is either known or assumed to

be linear. The current good manufacturing practices (CGMP) regulations [3] state that statistical
criteria, including sample size and test (i.e., observation or measurement) intervals for each
attribute examined, be used to assure statistically valid estimates of stability (211.166). The
expiration date should be “statistically valid” (211.137, 201.17, 211.62).
The mechanics of determining shelf life may be quite complex, particularly if extreme
conditions are used, such as those recommended for “accelerated” stability studies (e.g., high-
temperature and high-humidity conditions). In these circumstances, the statistical techniques
used to make predictions of shelf life at ambient conditions are quite advanced and beyond
the scope of this book [4]. Although extreme conditions are commonly used in stability testing
in order to save time and obtain a tentative expiration date, all products must eventually
be tested for stability under the recommended commercial storage conditions. The FDA has
suggested that at least three batches of product be tested to determine an expiration date. One
should understand that different batches may show somewhat different stability characteristics,
particularly in situations where additives affect stability to a significant extent. In these cases
variation in the quality and quantity of the additives (excipients) between batches could affect
stability. One of the purposes of using several batches for stability testing is to ensure that
stability characteristics are similar from batch to batch.
The time intervals chosen for the assay of storage samples will depend to a great extent
on the product characteristics and the anticipated stability. A “statistically” optimal design for
a stability study would take into account the planned “storage” times when the drug product
will be assayed. This problem has been addressed in the pharmaceutical literature [5]. How-
ever, the designs resulting from such considerations are usually cumbersome or impractical. For
example, from a statistical point of view, the slope of the potency versus time plot (the rate of
decomposition) is obtained most precisely if half of the total assay points are performed at time

0, and the other half at the final testing time. Note that $\sum (X - \bar{X})^2$, the denominator of the expres-
sion defining the variance of a slope [Eq. (7.8)], is maximized under this condition, resulting
in a minimum variability of the slope. This “optimal” approach to designating assay sampling
times is based on the assumption that the plot is linear during the time interval of the test. In
a practical situation, one would want to see data at points between the initial and final assay
in order to assess the magnitude of the decomposition as the stability study proceeds, as well
as to verify the linearity of the decomposition. Also, management and regulatory requirements
are better satisfied with multiple points during the course of the study. A reasonable sched-
ule of assays at ambient conditions is 0, 3, 6, 9, 12, 18, and 24 months and at yearly intervals
thereafter [6].
The example of the data analysis that will be presented here will be for a single batch. If
the stability of different batches is not different, the techniques described here may be applied to
data from more than one batch. A statistician should be consulted for the analysis of multibatch
data that will require analysis of variance techniques [6,7]. The general approach is described
in section 8.7.
Typically, stability or shelf life is determined from data from the first three production
batches for each packaging configuration (container type and product strength) (see sect. 8.7).
Because such testing may be onerous for multiple strengths and multiple packaging of the
same drug product, matrixing and bracketing techniques have been suggested to minimize the
number of tests needed to demonstrate suitable drug stability [8].
Assays are recommended to be performed at time 0 and 3, 6, 9, 12, 18 and 24 months,
with subsequent assays at 12-month intervals as needed. Usually, three batches of a given
strength and package configuration are tested to define the shelf life. Because many products
have multiple strengths and package configurations, the concept of a “Matrix” design has been
introduced to reduce the considerable amount of testing required. In this situation, a subset of
all combinations of product strength, container type and size, and so on is tested at a given
time point. Another subset is tested at a subsequent time point. The design should be balanced
“such that each combination of factors is tested to the same extent.” All factor combinations
should be tested at time 0 and at the last time point of the study. The simplest such design, called
a “Basic Matrix 2/3 on Time Design,” has two of the three batches tested at each time point,
with all three batches tested at time 0 and at the final testing time, the time equal to the desired
shelf life. Table 7.3 shows this design for a 36-month product. Tables of matrix designs show

Table 7.3 Matrix Design for Three Packages and Three Strengths

                       Package 1                Package 2                Package 3
Batch   Strength   3 6 9 12 18 24 36    3 6 9 12 18 24 36    3 6 9 12 18 24 36
1 5 X X X X X X X X X X X X X X X
1 10 X X X X X X X X X X X X X X X
1 15 X X X X X X X X X X X X X X X
2 5 X X X X X X X X X X X X X X X
2 10 X X X X X X X X X X X X X X X
2 15 X X X X X X X X X X X X X X X
3 5 X X X X X X X X X X X X X X X
3 10 X X X X X X X X X X X X X X X
3 15 X X X X X X X X X X X X X X X

Table 7.3A Matrix Design for Three Batches and Two Strengths

Time points for testing (mo)    0  3  6  9  12  18  24  36

Strength S1    Batch 1    T T T T T T
               Batch 2    T T T T T T
               Batch 3    T T T T T
Strength S2    Batch 1    T T T T T
               Batch 2    T T T T T T
               Batch 3    T T T T T

designs for multiple packages (made from the same blend or batch) and for multiple packages
and strengths. These designs are constructed to be symmetrical in the spirit of optimality for
such designs. For example, this is illustrated in Table 7.3, looking only at the “5” strength for
Package 1. In Table 7.3, each batch is tested twice, each
package from each batch is tested twice, and each package is tested six times at all time points
between 0 and 36 months.
With multiple strengths and packages, other similar designs with less testing have been
described [9].
The risks of applying such designs are outlined in the Guidance [8]. Because of the limited
testing, there is a risk of less precision and shorter dating. If pooling is not allowed, individual
lots will have short dating, and combinations not tested in the matrix will not have dating
estimates. Read the guidance for further details. The FDA guidance gives examples of other
designs.
The analysis of these designs can be complicated. The simplest approach is to analyze
each strength and configuration separately, as one would do if there were a single strength and
package. Another approach is to model all configurations including interactions. The assump-
tions, strengths, and limitations of these designs and analyses are explained in more detail in
Ref. [9].
A Bracketing design [10] is a design of a stability program such that at any point in
time only extreme samples are tested, such as extremes in container size and dosage. This is
particularly amenable to products that have similar composition across dosage strengths and
whose intermediate sizes and strengths are represented by the extremes [10]. (See also FDA
Guideline on Stability for further discussion as to when this is applicable.)
Suppose that we have a product in three strengths and three package sizes. Table 7.4 is an
example of a Bracketing design [10].

Table 7.4 Example of Bracketing Design

Strength          Low            Medium         High
Batch             1   2   3      4   5   6      7   8   9
Container
  Small           T   T   T                     T   T   T
  Medium
  Large           T   T   T                     T   T   T

Table 7.5 Tablet Assays from the Stability Study

Time, X (mo)    Assay, y (mg)a    Average


0 51, 51, 53 51.7
3 51, 50, 52 51.0
6 50, 52, 48 50.0
9 49, 51, 51 50.3
12 49, 48, 47 48.0
18 47, 45, 49 47.0
a Each assay represents a different tablet.

The testing designated by T should be the full testing as would be required for a single
batch. Note that full testing would require nine combinations, or 27 batches. The bracketing design
uses four combinations, or 12 batches.
Consider an example of a tablet formulation that is the subject of a stability study.
Three randomly chosen tablets are assayed at each of six time periods: 0, 3, 6, 9, 12, and
18 months after production, at ambient storage conditions. The data are shown in Table 7.5 and
Figure 7.8.
Given these data, the problem is to establish an expiration date defined as that time when
a tablet contains 90% of the labeled drug potency. The product in this example has a label of
50 mg potency and is prepared with a 4% overage (i.e., the product is manufactured with a
target weight of 52 mg of drug). Note that FDA is currently discouraging the use of overages to
compensate for poor stability.
Figure 7.8 shows that the data are variable. A careful examination of this plot suggests
that a straight line would be a reasonable representation of these data. The application of least
squares line fitting is best justified in situations where a theoretical model exists showing that the
decrease in concentration is linear with time (a zero-order process in this example). The kinetics
of drug loss in solid dosage forms is complex and a theoretical model is not easily derived. In
the present case, we will assume that concentration and time are truly linearly related

C = C0 − K t, (7.14)

Figure 7.8 Plot of stability data from Table 7.5.



where C is the concentration at time t, C0 the concentration at time 0 (Y intercept, A), K the rate
constant (−slope, −B), and t the time (storage time).
With the objective of estimating the shelf life, the simplest approach to the analysis of
these data is to estimate the slope and intercept of the least squares line, using Eqs. (7.4) and
(7.3). (An interesting exercise would be to first try and estimate the slope and intercept by eye
from Fig. 7.8.) When performing the least squares calculation, note that each value of the time
(X) is associated with three values of drug potency (y). When calculating C0 and K, each “time”
value is counted three times and N is equal to 18. From Table 7.5,

ΣX = 144     Σy = 894     ΣXy = 6984
ΣX² = 1782     Σy² = 44,476     N = 18
X̄ = 8     Σ(X − X̄)² = 630     Σ(y − ȳ)² = 74

From Eqs. (7.4) and (7.3), we have


  
$$b = \frac{N\sum Xy - \sum X \sum y}{N\sum X^2 - (\sum X)^2} = \frac{18(6984) - 144(894)}{18(1782) - (144)^2} = \frac{-3024}{11{,}340} = -0.267 \text{ mg/month} \quad (7.4)$$

$$a = \bar{y} - b\bar{X} = \frac{894}{18} - (-0.267)\left(\frac{144}{18}\right) = 51.80. \quad (7.3)$$

The equation of the straight line best fitting the data in Figure 7.8 is

C = 51.8 − 0.267 t. (7.15)


The variance estimate, $S_{Y,x}^2$, represents the variability of tablet potency at a fixed time, and
is calculated from Eq. (7.9)

$$S_{Y,x}^2 = \frac{\sum y^2 - (\sum y)^2/N - b^2 \sum (X - \bar{X})^2}{N - 2} = \frac{44{,}476 - (894)^2/18 - (-0.267)^2(630)}{18 - 2} = 1.825.$$
To calculate the time at which the tablet potency is 90% of the labeled amount, 45 mg,
solve Eq. (7.15) for t when C equals 45 mg.

45 = 51.80 − 0.267t
t = 25.5 months.

The best estimate of the time needed for these tablets to retain 45 mg of drug is 25.5 months
(see the point marked with a cross in Fig. 7.9). The shelf life for the product will be less than
25.5 months if variability is taken into consideration. The next section, 7.6, presents a discussion
of this topic. This is an average result based on the data from 18 tablets. For any single tablet,
the time for decomposition to 90% of the labeled amount will vary, depending, for example, on
the amount of drug present at time zero. Nevertheless, the shelf-life estimate is based on the
average result.
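The fit and the shelf-life estimate can be checked with a short script. This is a sketch under the stated zero-order assumption, using Python/numpy (an assumed environment, not part of the text), with the Table 7.5 data:

```python
import numpy as np

t = np.repeat([0, 3, 6, 9, 12, 18], 3).astype(float)  # months, 3 tablets each
y = np.array([51, 51, 53,  51, 50, 52,  50, 52, 48,
              49, 51, 51,  49, 48, 47,  47, 45, 49], dtype=float)  # mg

N = len(t)  # 18
b = (N * np.sum(t * y) - t.sum() * y.sum()) / (N * np.sum(t**2) - t.sum()**2)
a = y.mean() - b * t.mean()
print(a, b)           # ~51.8 and ~-0.267 mg/month, as in Eq. (7.15)

shelf = (45 - a) / b  # time at which potency falls to 45 mg (90% of label)
print(shelf)          # ~25.5 months
```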

7.6 CONFIDENCE INTERVALS IN REGRESSION ANALYSIS


A more detailed analysis of the stability data is warranted if one understands that 25.5 months
is not the true shelf life, but only an estimate of the true value. A confidence interval for the
estimate of time to 45 mg potency would give a range that probably includes the true value.

Figure 7.9 95% confidence band for “stability” line.

The concept of a confidence interval in regression is similar to that previously discussed for
means. Thus the interval for the shelf life probably contains the true shelf life—that time when
the tablets retain 90% of their labeled potency, on the average. The lower end of this confidence
interval would be considered a conservative estimate of the true shelf life. Before giving the
solution to this problem we will address the calculation of a confidence interval for Y (potency)
at a given X (time). The width of the confidence interval for Y (potency) is not constant, but
depends on the value of X, since Y is a function of X. In the present example, one might wish to
obtain a range for the potency at 25.5 months’ storage time.

7.6.1 Confidence Interval for Y at a Given X


We will construct a confidence interval for the true mean potency (Y) at a given time (X). The
confidence interval can be shown to be equal to


$$Y \pm t(S_{Y,x})\sqrt{\frac{1}{N} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}. \quad (7.16)$$

t is the appropriate value (N − 2 d.f., Table IV.4) for a confidence interval with confidence
coefficient P. For example, for a 95% confidence interval, use t values in the column headed
0.975 in Table IV.4.
In the linear regression model, y is assumed to have a normal distribution with variance
$\sigma_{Y,x}^2$ at each X. As can be seen from Eq. (7.16), confidence limits for Y at a specified value
of X depend on the variance, degrees of freedom, number of data points used to fit the line, and
(X − X̄), the distance of the specified X (time, in this example) from X̄, the average time used in
the least squares line fitting. The confidence interval is smallest for the Y that corresponds to
the value of X equal to X̄ [the term, X − X̄, in Eq. (7.16) will be zero]. As the value of X is
farther from X̄, the confidence interval for Y corresponding to the specified X is wider. Thus the
estimate of Y is less precise, as the X corresponding to Y is farther away from X̄. A plot of the
confidence interval for every Y on the line results in a continuous confidence “band” as shown in
Figure 7.9. The curved, hyperbolic shape of the confidence band illustrates the varying width
of the confidence interval at different values of X, Y. For example, the 95% confidence interval
for Y at X = 25.5 months [Eq. (7.16)] is


$$45 \pm 2.12(1.35)\sqrt{\frac{1}{18} + \frac{(25.5 - 8)^2}{630}} = 45 \pm 2.1.$$

Thus the result shows that the true value of the potency at 25.5 months is probably between
42.9 and 47.1 mg (45 ± 2.1).
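A hedged sketch of this interval calculation, plugging in the fitted constants quoted in the text (so it is illustrative rather than a general routine):

```python
import math

a, b = 51.80, -0.267           # fitted line, Eq. (7.15)
S_yx = math.sqrt(1.825)        # from Eq. (7.9)
N, Xbar, Sxx = 18, 8.0, 630.0  # summary quantities from the text
t_val = 2.12                   # t(0.975), 16 d.f., Table IV.4

X0 = 25.5
Y0 = a + b * X0                                              # ~45 mg
half = t_val * S_yx * math.sqrt(1/N + (X0 - Xbar)**2 / Sxx)  # Eq. (7.16)
print(Y0 - half, Y0 + half)                                  # ~42.9 to 47.1 mg
```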

7.6.2 A Confidence Interval for X at a Given Value of Y


Although the interval for the potency may be of interest, as noted above, this confidence interval
does not directly answer the question about the possible variability of the shelf-life estimate.
A careful examination of the two-sided confidence band for the line (Fig. 7.9) shows that 90%
potency (45 mg) may occur between approximately 20 and 40 months, the points marked “a”
in Figure 7.9. To obtain this range for X (time to 90% potency), using the approach of graphical
estimation as described above requires the computation of the confidence band for a sufficient
range of X. Also, the graphical estimate is relatively inaccurate. The confidence interval for the
true X at a given Y can be directly calculated, although the formula is more complex than that
used for the Y confidence interval [Eq. (7.16)].
This procedure of estimating X for a given value of Y is often called “inverse prediction.”
The complexity results from the fact that the solution for X, X = (Y − a)/b, is a quotient of
variables. (Y − a) and b are random variables; both have error associated with their measurement.
The ratio has a more complicated distribution than a linear combination of variables such as is
the case for Y = a + bX. The calculation of the confidence interval for the true X at a specified
value of Y is
 

$$\frac{(X - g\bar{X}) \pm [t(S_{Y,x})/b]\sqrt{(1 - g)/N + (X - \bar{X})^2/\sum (X - \bar{X})^2}}{1 - g}, \quad (7.17)$$

where

$$g = \frac{t^2(S_{Y,x}^2)}{b^2 \sum (X - \bar{X})^2}$$
t is the appropriate value for a confidence interval with confidence coefficient equal to P; for
example, for a two-sided 95% confidence interval, use values of t in the column headed 0.975
in Table IV.4.
A 95% confidence interval for X will be calculated for the time to 90% of labeled potency.
The potency is 45 mg (Y) when 10% of the labeled amount decomposes. The corresponding time
(X) has been calculated above as 25.5 months. For a two-sided confidence interval, applying
Eq. (7.17), we have

$$g = \frac{(2.12)^2(1.825)}{(-0.267)^2(630)} = 0.183$$

X = 25.5     X̄ = 8     N = 18.

The confidence interval is

$$\frac{[25.5 - 0.183(8)] \pm [2.12(1.35)/(-0.267)]\sqrt{0.817/18 + (17.5)^2/630}}{0.817} = 19.8 \text{ to } 39.0 \text{ months}.$$

Thus, using a two-sided confidence interval, the true time to 90% of labeled potency is
probably between 19.8 and 39.0 months. A conservative estimate of the shelf life would be
the lower value, 19.8 months. If g is greater than 1, a confidence interval cannot be calculated
because the slope is not significantly different from 0.
The Food and Drug Administration has suggested that a one-sided confidence interval
may be more appropriate than a two-sided interval to estimate the expiration date. For most
drug products, drug potency can only decrease with time, and only the lower confidence band of
the potency versus time curve may be considered relevant. (An exception may occur in the case
of liquid products where evaporation of the solvent could result in an increased potency with
time.) The 95% one-sided confidence limits for the time to reach a potency of 45 are computed

using Eq. (7.17). Only the lower limit is computed using the appropriate t value that cuts off
5% of the area in a single tail. For 16 d.f., this value is 1.75 (Table IV.4), “g” = 0.1244. The
calculation is

$$\frac{[25.5 - 0.1244(8)] + [1.75(1.35)/(-0.267)]\sqrt{0.8756/18 + (17.5)^2/630}}{0.8756} = 20.6 \text{ months}.$$

The one-sided 95% interval for X can be interpreted to mean that the time to decompose
to a potency of 45 is probably greater than 20.6 months. Note that the shelf life based on the
one-sided interval is longer than that based on a two-sided interval (Fig. 7.9).
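The inverse interval of Eq. (7.17) is easy to get wrong by hand; the following sketch (same assumed constants as the previous snippet) reproduces the two-sided 95% limits:

```python
import math

a, b = 51.80, -0.267
S_yx = math.sqrt(1.825)
N, Xbar, Sxx = 18, 8.0, 630.0
t_val = 2.12                           # use 1.75 for the one-sided lower limit

X_hat = (45 - a) / b                   # ~25.5 months
g = t_val**2 * S_yx**2 / (b**2 * Sxx)  # ~0.183
root = math.sqrt((1 - g) / N + (X_hat - Xbar)**2 / Sxx)
term = t_val * S_yx / b                # negative here, since b < 0
lo = ((X_hat - g * Xbar) + term * root) / (1 - g)
hi = ((X_hat - g * Xbar) - term * root) / (1 - g)
print(lo, hi)                          # ~19.8 and ~39.0 months
```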

7.6.3 Prediction Intervals


The confidence limits for Y and X discussed above are limits for the true values, having specified
a value of Y (potency or concentration, for example) corresponding to some value of X, or
an X (time, for example) corresponding to a specified value of Y. An important application of
confidence intervals in regression is to obtain confidence intervals for actual future measurements
based on the least squares line.
1. We may wish to obtain a confidence interval for a value of Y to be actually measured at
some value of X (some future time, for example).
2. In the example of the calibration (sect. 7.2), having observed a new value, y, after the
calibration line has been established, we would want to use the information from the fitted
calibration line to predict the concentration, or potency, X, and establish the confidence
limits for the concentration at this newly observed value of y. This is an example of inverse
prediction.
For the example of the stability study, we may wish to obtain a confidence interval for
an actual assay (y) to be performed at some given future time, after having performed the
experiment used to fit the least squares line (case 1 above).
The formulas for calculating a “prediction interval,” a confidence interval for a future
determination, are similar to those presented in Eqs. (7.16) and (7.17), with one modification. In
Eq. (7.16), we add 1 to the sum under the square root portion of the expression. Similarly, for
the inverse problem, Eq. (7.17), the expression (1 − g)/N is replaced by (N + 1)(1 − g)/N. Thus
the prediction interval for Y at a given X is

$$Y \pm t(S_{Y,x})\sqrt{1 + \frac{1}{N} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}. \quad (7.18)$$

The prediction interval for X at a specified Y is


 

$$\frac{(X - g\bar{X}) \pm [t(S_{Y,x})/b]\sqrt{(N + 1)(1 - g)/N + (X - \bar{X})^2/\sum (X - \bar{X})^2}}{1 - g}. \quad (7.19)$$

The following examples should clarify the computations. In the stability study example,
suppose that one wishes to construct a 95% confidence (prediction) interval for an assay to be
performed at 25.5 months. (An actual measurement is obtained at 25.5 months.) This interval
will be larger than that calculated based on Eq. (7.16), because the uncertainty now includes
assay variability for the proposed assay in addition to the uncertainty of the least squares line.
Applying Eq. (7.18) (Y = 45), we have

$$45 \pm 2.12(1.35)\sqrt{1 + \frac{1}{18} + \frac{(17.5)^2}{630}} = 45 \pm 3.55 \text{ mg}.$$

In the example of the calibration line, consider an unknown sample that is analyzed and
shows a value (y) of 90. A prediction interval for X is calculated using Eq. (7.19). X is predicted
to be 91.9 (see sect. 7.2).

$$g = \frac{(4.30)^2(12.15)}{(0.915)^2(2000)} = 0.134$$

$$\frac{[91.9 - 0.134(90)] \pm [(4.3)(3.49)/0.915]\sqrt{5(0.866)/4 + (1.9)^2/2000}}{0.866} = 72.5 \text{ to } 111.9.$$

The relatively large uncertainty of the estimate of the true value is due to the small number
of data points (four) and the relatively large variability of the points about the least squares line
($S_{Y,x}^2$ = 12.15).
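As an illustration, Eq. (7.18) for the stability example can be evaluated as below (constants taken from the text, Python assumed):

```python
import math

S_yx = math.sqrt(1.825)
N, Xbar, Sxx, t_val = 18, 8.0, 630.0, 2.12

X0 = 25.5
half = t_val * S_yx * math.sqrt(1 + 1/N + (X0 - Xbar)**2 / Sxx)  # Eq. (7.18)
print(45 - half, 45 + half)  # ~45 +/- 3.55 mg, wider than the Eq. (7.16) interval
```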

7.6.4 Confidence Intervals for Slope (B) and Intercept (A)


A confidence interval can be constructed for the slope and intercept in a manner analogous to
that for means [Eq. (6.2)]. The confidence interval for the slope is
$$b \pm t(S_b) = b \pm \frac{t(S_{Y,x})}{\sqrt{\sum (X - \bar{X})^2}}. \quad (7.20)$$

A confidence interval for the intercept is




$$a \pm t(S_a) = a \pm t(S_{Y,x})\sqrt{\frac{1}{N} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2}}. \quad (7.21)$$

A 95% confidence interval for the slope of the line in the stability example is [Eq. (7.20)]
$$(-0.267) \pm \frac{2.12(1.35)}{\sqrt{630}} = -0.267 \pm 0.114 = -0.381 \text{ to } -0.153.$$

A 90% confidence interval for the intercept in the calibration line example (sect. 7.2) is
[Eq. (7.21)]

$$5.9 \pm 2.93(3.49)\sqrt{\frac{1}{4} + \frac{(90)^2}{2000}} = 5.9 \pm 21.2 = -15.3 \text{ to } 27.1.$$

(Note that the appropriate value of t with 2 d.f. for a 90% confidence interval is 2.93.)
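A short sketch of Eqs. (7.20) and (7.21) with the values used above (illustrative constants only):

```python
import math

# Slope CI (stability example): b = -0.267, S_yx = sqrt(1.825), Sxx = 630
half_b = 2.12 * math.sqrt(1.825) / math.sqrt(630)  # Eq. (7.20)
print(-0.267 - half_b, -0.267 + half_b)            # ~-0.381 to -0.153

# Intercept CI (calibration example): a = 5.9, S2_yx = 12.15, N = 4, Xbar = 90
half_a = 2.93 * math.sqrt(12.15) * math.sqrt(1/4 + 90**2 / 2000)  # Eq. (7.21)
print(5.9 - half_a, 5.9 + half_a)                  # ~-15.3 to 27.1
```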

7.7 WEIGHTED REGRESSION


One of the assumptions implicit in the applications of statistical inference to regression pro-
cedures is that the variance of y be the same at each value of X. Many situations occur in
practice when this assumption is violated. One common occurrence is the variance of y being
approximately proportional to X2 . This occurs in situations where y has a constant coefficient of
variation (CV) and y is proportional to X (y = BX), commonly observed in instrumental methods
of analysis in analytical chemistry. Two approaches to this problem are (a) a transformation of

Table 7.6 Analytical Data for a Spectrophotometric Analysis

Concentration (X)    Optical density (y), duplicates    CV       Weight (w)
5                    0.105, 0.098                       0.049    0.04
10                   0.201, 0.194                       0.025    0.01
25                   0.495, 0.508                       0.018    0.0016
50                   0.983, 1.009                       0.018    0.0004
100                  1.964, 2.013                       0.017    0.0001

y to make the variance homogeneous, such as the log transformation (see chap. 10), and (b) a
weighted regression analysis.
Below is an example of weighted regression analysis in which we assume a constant CV
and the variance of y proportional to X2 as noted above. This suggests a weighted regression,
weighting each value of Y by a factor that is inversely proportional to the variance, 1/X2 .
Table 7.6 shows data for the spectrophotometric analysis of a drug performed at 5 concentrations
in duplicate.
Equation (7.22) is used to compute the slope for the weighted regression procedure.

   
$$b = \frac{\sum wXy - (\sum wX)(\sum wy)/\sum w}{\sum wX^2 - (\sum wX)^2/\sum w}. \quad (7.22)$$

The computations are as follows:



Σw = 0.04 + 0.04 + ... + 0.0001 + 0.0001 = 0.1042
ΣwXy = (0.04)(5)(0.105) + (0.04)(5)(0.098) + ... + (0.0001)(100)(1.964) + (0.0001)(100)(2.013) = 0.19983
ΣwX = 2(0.04)(5) + 2(0.01)(10) + ... + 2(0.0001)(100) = 0.74
Σwy = (0.04)(0.105) + (0.04)(0.098) + ... + (0.0001)(1.964) + (0.0001)(2.013) = 0.0148693
ΣwX² = 2(0.04)(5)² + 2(0.01)(10)² + ... + 2(0.0001)(100)² = 10
Therefore, the slope is

$$b = \frac{0.19983 - (0.74)(0.0148693)/0.1042}{10 - (0.74)^2/0.1042} = 0.01986.$$

The intercept is

$$a = \bar{y}_w - b\bar{X}_w, \quad (7.23)$$

where $\bar{y}_w = \sum wy/\sum w$ and $\bar{X}_w = \sum wX/\sum w$:

$$a = 0.0148693/0.1042 - 0.01986(0.74/0.1042) = 0.00166. \quad (7.23a)$$

The weighted least squares line is shown in Figure 7.10.
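A sketch of the weighted computation (Python/numpy assumed; weights w = 1/X², per the constant-CV argument above), reproducing Eqs. (7.22) and (7.23) for Table 7.6:

```python
import numpy as np

X = np.repeat([5, 10, 25, 50, 100], 2).astype(float)  # duplicate determinations
y = np.array([0.105, 0.098, 0.201, 0.194, 0.495, 0.508,
              0.983, 1.009, 1.964, 2.013])
w = 1 / X**2                                          # weights, Table 7.6

Sw = w.sum()
b = ((np.sum(w * X * y) - np.sum(w * X) * np.sum(w * y) / Sw)
     / (np.sum(w * X**2) - np.sum(w * X)**2 / Sw))    # Eq. (7.22), ~0.01986
a = np.sum(w * y) / Sw - b * np.sum(w * X) / Sw       # Eq. (7.23), ~0.00166
print(b, a)
```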

7.8 ANALYSIS OF RESIDUALS


Emphasis is placed elsewhere in this book on the importance of carefully examining and
graphing data prior to performing statistical analyses. The approach to examining data in this
context is commonly known as Exploratory Data Analysis (EDA) [11]. One aspect of EDA is the
examination of residuals. Residuals can be thought of as deviations of the observed data from

Figure 7.10 Weighted regression plot for data from Table 7.6.

the fit to the statistical model. Examination of residuals can reveal problems such as variance
heterogeneity or nonlinearity. This brief introduction to the principle of residual analysis uses
the data from the regression analysis in section 7.7.
The residuals from a regression analysis are obtained from the differences between the
observed and predicted values. Table 7.7 shows the residuals from an unweighted least squares
fit of the data of Table 7.6. Note that the fitted values are obtained from the least squares equation
y = 0.001789 + 0.019874(X).
If the linear model and the assumptions in the least squares analysis are valid, the residuals
should be approximately normally distributed, and no trends should be apparent.
Figure 7.11 shows a plot of the residuals as a function of X. The fact that the residuals show a
fan-like pattern, expanding as X increases, suggests the use of a log transformation or weighting
procedure to reduce the variance heterogeneity. In general, the intelligent interpretation of
residual plots requires knowledge and experience. In addition to the appearance of patterns
in the residual plots that indicate relationships and character of data, outliers usually become
readily apparent [12].
Figure 7.12 shows the residual plot after a log (ln) transformation of X and Y. Much of the
variance heterogeneity has been removed.
For readers who desire more information on this subject, the book Graphical Exploratory
Data Analysis [13] is recommended.

Table 7.7 Residuals from Least Squares Fit of Analytical Data (Table 7.6)

              Unweighted                              Log transform
Actual    Predicted value    Residual      Actual    Predicted value    Residual


0.105 0.101 +0.00384 −2.254 −2.298 +0.044
0.201 0.201 +0.00047 −1.604 −1.6073 +0.0033
0.495 0.499 −0.00364 −0.703 −0.695 −0.008
0.983 0.995 −0.0126 −0.017 −0.0004 −0.0166
1.964 1.989 −0.025 +0.675 +0.6863 −0.0113
0.098 0.101 −0.00316 −2.323 −2.298 −0.025
0.194 0.201 −0.00653 −1.640 −1.6073 −0.0033
0.508 0.499 +0.00936 −0.677 −0.6950 +0.018
1.009 0.995 +0.0135 +0.009 −0.0042 +0.0132
2.013 1.989 +0.00238 +0.700 0.6863 +0.0137

Figure 7.11 Residual plot for unweighted analysis of data of Table 7.6.

Figure 7.12 Residual plot for analysis of ln transformed data of Table 7.6.
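The residuals of Table 7.7 and both residual plots come from two ordinary least squares fits, one on the raw data and one after the ln-ln transformation. A hedged sketch (numpy assumed):

```python
import numpy as np

X = np.repeat([5, 10, 25, 50, 100], 2).astype(float)
y = np.array([0.105, 0.098, 0.201, 0.194, 0.495, 0.508,
              0.983, 1.009, 1.964, 2.013])

def lsq(x, v):
    """Unweighted least squares intercept and slope, Eqs. (7.3)/(7.4)."""
    n = len(x)
    b = (n * np.sum(x * v) - x.sum() * v.sum()) / (n * np.sum(x**2) - x.sum()**2)
    return v.mean() - b * x.mean(), b

a, b = lsq(X, y)         # ~0.001789 and ~0.019874
resid = y - (a + b * X)  # residuals fan out with X (Fig. 7.11)

a_ln, b_ln = lsq(np.log(X), np.log(y))
resid_ln = np.log(y) - (a_ln + b_ln * np.log(X))  # more homogeneous (Fig. 7.12)
print(resid, resid_ln)
```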

7.9 NONLINEAR REGRESSION**


Linear regression applies to the solution of relationships where the function of Y is linear in the
parameters. For example, the equation

Y = A + BX

is linear in A and B, the parameters. Similarly, the equation

$Y = A + Be^{-X}$

is also linear in the parameters. One should also appreciate that a linear equation can exist in
more than two dimensions. The equation

$Y = A + BX + CX^2$,

an example of a quadratic equation, is linear in the parameters, A, B, and C. These parameters


can be estimated by using methods of multiple regression (see App. III and Ref. [1]).
An example of a relationship that is nonlinear in this context is

$Y = A + e^{BX}$.

Here the parameter B is not in a linear form.


If a linearizing transformation can be made, then this approach to estimating the param-
eters would be easiest. For example, the simple first-order kinetic relationship

$Y = Ae^{-BX}$

is not linear in the parameters, A and B. However, a log transformation results in a linear
equation

ln Y = ln A − BX.

Using the least squares approach, we can estimate ln A (A is the antilog) and B, where ln
A is the intercept and B is the slope of the straight line when ln Y is plotted versus X. If statistical
tests and other statistical estimates are to be made from the regression analysis, the assumptions
of normality of Y (now ln Y) and variance homogeneity of Y at each X are necessary. If Y is
normal and the variances of Y at each X are homogeneous to start with, the ln transformation
will invalidate the assumptions. (On the other hand, if Y is lognormal with constant CV, the log
transformation will be just what is needed to validate the assumptions.)
Some relationships cannot be linearized. For example, in pharmacokinetics, the one-
compartment model with first order absorption and excretion has the following form

$$C = D(e^{-k_e t} - e^{-k_a t})$$

where D, $k_e$, and $k_a$ are constants (parameters). This equation cannot be linearized. Nonlinear
regression methods can be used to estimate the parameters in these situations as well
as the situations in which Y is normal with homogeneous variance prior to a transformation, as
noted above.
The solutions to nonlinear regression problems require more advanced mathematics rel-
ative to most of the material in this book. A knowledge of elementary calculus is necessary,
particularly the application of Taylor’s theorem. Also, a knowledge of matrix algebra is useful
in order to solve these kinds of problems. A simple example will be presented to demon-
strate the principles. The general matrix solutions to linear and multiple regression will also
be demonstrated.
In a stability study, the data in Table 7.8 were available for analysis. The equation repre-
senting the degradation process is

$$C = C_0 e^{-kt}. \quad (7.24)$$

Table 7.8 Data from a Stability Study

Time (t)    Concentration, mg/L (C)


1 hr 63
2 hr 34
3 hr 22

Figure 7.13 Plot of stability data from Table 7.8.

The concentration values are known to be normal with the variance constant at each value
of time. Therefore, the usual least squares analysis will not be used to estimate the parameters
C0 and k after the simple linearizing transformation:

ln C = ln C0 − kt.

The estimate of the parameters using nonlinear regression as demonstrated here uses
the first terms of Taylor’s expansion, which approximates the function and results in a linear
equation. It is important to obtain good initial estimates of the parameters, which may be
obtained graphically. In the present example, a plot of ln C versus time (Fig. 7.13) results in
initial estimates of 104 for C0 and +0.53 for k. The process then estimates a change in C0 and
a change in k that will improve the equation based on the comparison of the fitted data to
the original data. Typical of least squares procedures, the fit is measured by the sum of the
squares of the deviations of the observed values from the fitted values. The best fit results from
an iterative procedure. The new estimates result in a better fit to the data. The procedure is
repeated using the new estimates, which results in a better fit than that observed in the previous
iteration. When the fit, as measured by the sum of the squares of deviations, is negligibly
improved, the procedure is stopped. Computer programs are available to carry out these tedious
calculations.
The Taylor expansion requires taking partial derivatives of the function with respect to C0
and k. For the equation, $C = C_0 e^{-kt}$, the resulting expression is

$$dC = dC_0'(e^{-k't}) - dk'(C_0')(te^{-k't}). \quad (7.25)$$

In Eq. (7.25), dC is the change in C resulting from small changes in C0 and k evaluated
at the point, C0′ and k′. dC0′ is the change in the estimate of C0, and dk′ is the change in the
estimate of k. $(e^{-k't})$ and $C_0'(te^{-k't})$ are the partial derivatives of Eq. (7.24) with respect to C0 and
k, respectively.
Equation (7.25) is linear in dC0′ and dk′. The coefficients of dC0′ and dk′ are $(e^{-k't})$ and
$-(C_0')(te^{-k't})$, respectively. In the computations below, the coefficients are referred to as X1
and X2, respectively, for convenience. Because of the linearity, we can obtain the least squares
estimates of dC0′ and dk′ by the usual regression procedures.
The computations for two iterations are shown below. The solution to the least squares
equation is usually accomplished using matrix manipulations. The solution for the coefficients
can be proven to have the following form:

$$B = (X'X)^{-1}(X'Y).$$



Table 7.9 Results of First Iteration

Time (t)    C     Ĉ (fitted)    dC       X1        X2
1           63    61.2           1.79    0.5886    −61.2149
2           34    36.0          −2.03    0.3465    −72.0628
3           22    21.2           0.79    0.2039    −63.6248

ΣdC² = 7.94

The matrix B will contain the estimates of the coefficients. With two coefficients, this will
be a 2 × 1 (2 rows and 1 column) matrix.
 
In Table 7.9, the values of X1 and X2 are $(e^{-k't})$ and $-(C_0')(te^{-k't})$, respectively, using the
initial estimates of C0′ = 104 and k′ = +0.53 (Fig. 7.13). Note that the fit is measured by
ΣdC² = 7.94.
The solution of $(X'X)^{-1}(X'Y)$ gives the estimates of the parameters, dC0′ and dk′:

$$\begin{bmatrix} 11.5236 & 0.06563 \\ 0.06563 & 0.00045079 \end{bmatrix}\begin{bmatrix} 0.5296 \\ -16.9611 \end{bmatrix} = \begin{bmatrix} 4.99 \\ 0.027 \end{bmatrix}$$

The new estimates of C0 and k are

C0 = 104 + 4.99 = 108.99


k′ = 0.53 + 0.027 = +0.557.


With these estimates, new values of C are calculated in Table 7.10.
Note that ΣdC² is 5.85, which is reduced from 7.94 in the initial iteration. The
solution of $(X'X)^{-1}(X'Y)$ is

$$\begin{bmatrix} 12.587 & 0.06964 \\ 0.06964 & 0.0004635 \end{bmatrix}\begin{bmatrix} 0.0351 \\ -0.909 \end{bmatrix} = \begin{bmatrix} 0.378 \\ 0.002 \end{bmatrix}$$

Therefore, the new estimates of C0 and k are

C0 = 108.99 + 0.38 = 109.37


k = 0.557 + 0.002 = 0.559.

The reader can verify that the new value of ΣdC² is now 5.74. The process is repeated until
ΣdC² becomes stable. The final solution is C0 = 109.22, k = 0.558.
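The whole iteration is mechanical and is normally left to software. Below is a compact Gauss-Newton sketch of the procedure just described (Python/numpy assumed; the convergence tolerance and iteration cap are arbitrary choices, not from the text):

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0])     # hr, Table 7.8
C = np.array([63.0, 34.0, 22.0])  # mg/L
C0, k = 104.0, 0.53               # initial graphical estimates (Fig. 7.13)

for _ in range(20):
    fit = C0 * np.exp(-k * t)
    dC = C - fit                  # residuals
    # Partial derivatives of C0*exp(-k t): X1 = exp(-k t), X2 = -C0*t*exp(-k t)
    Xmat = np.column_stack([np.exp(-k * t), -C0 * t * np.exp(-k * t)])
    delta = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ dC)  # (X'X)^-1 (X'Y)
    C0, k = C0 + delta[0], k + delta[1]
    if np.all(np.abs(delta) < 1e-8):  # stop when corrections are negligible
        break

print(C0, k)                      # converges near 109.2 and 0.558
```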
Another way of expressing the decomposition is

$$C = e^{\ln C_0 - kt}$$

Table 7.10 Results of Second Iteration

Time (t)    C     Ĉ (fitted)    dC       X1         X2
1           63    62.4           0.6     0.5729     −62.4431
2           34    35.8          −1.8     0.3282     −71.5505
3           22    20.5           1.5     0.18806    −61.4896

ΣdC² = 5.85

or

ln C = ln C0 − kt.

The ambitious reader may wish to try a few iterations using this approach. Note that the
partial derivatives of C with respect to C0 and k are $(1/C_0)(e^{\ln C_0 - kt})$ and $-t(e^{\ln C_0 - kt})$, respectively.

7.10 CORRELATION
Correlation methods are used to measure the “association” of two or more variables. Here,
we will be concerned with two observations for each sampling unit. We are interested in
determining if the two values are related, in the sense that one variable may be predicted from a
knowledge of the other. The better the prediction, the better the correlation. For example, if we
could predict the dissolution of a tablet based on tablet hardness, we say that dissolution and
hardness are correlated. Correlation analysis assumes a linear or straight-line relationship between
the two variables.
Correlation is usually applied to the relationship of continuous variables, and is best
visualized as a scatter plot or correlation diagram. Figure 7.14(A) shows a scatter plot for two
variables, tablet weight and tablet potency. Tablets were individually weighed and then assayed.
Each point in Figure 7.14(A) represents a single tablet (X = weight, Y = potency). Inspection
of this diagram suggests that weight and potency are positively correlated, as is indicated by
the positive slope, or trend. Low-weight tablets are associated with low potencies, and vice
versa. This positive relationship would probably be expected on intuitive grounds. If the tablet
granulation is homogeneous, a larger weight of material in a tablet would contain larger amounts
of drug. Figure 7.14(B) shows the correlation of tablet weights and dissolution rate. Smaller tablet
weights are related to higher dissolution rates, a negative correlation (negative trend).
Inspection of Figure 7.14(A) and (B) reveals what appears to be an obvious relationship.
Given a tablet weight, we can make a good “ballpark” estimate of the dissolution rate and potency.

Figure 7.14 Examples of various correlation diagrams or scatter plots. The correlation coefficient, r, is defined in section 7.10.1.

However, the relationship between variables is not always as apparent as in these examples. The relationship may be partially obscured by variability, or the variables may not be
related at all. The relationship between a patient’s blood pressure reduction after treatment with
an antihypertensive agent and serum potassium levels is not as obvious [Fig. 7.14(C)]. There
seems to be a trend toward higher blood pressure reductions associated with higher potassium
levels—or is this just an illusion? The data plotted in Figure 7.14(D), illustrating the correlation
of blood pressure reduction and age, show little or no correlation.
The various scatter diagrams illustrated in Figure 7.14 should give the reader an intuitive
feeling for the concept of correlation. There are many experimental situations where a researcher
would be interested in relationships among two or more variables. Similar to applications of
regression analysis, correlation relationships may allow for prediction and interpretation of
experimental mechanisms. Unfortunately, the concept of correlation is often misused, and more
is made of it than is deserved. For example, the presence of a strong correlation between
two variables does not necessarily imply a causal relationship. Consider data that show a
positive relationship between cancer rate and consumption of fluoridated water. Regardless of
the possible validity of such a relationship, such an observed correlation does not necessarily
imply a causal effect. One would have to investigate further other factors in the environment
occurring concurrently with the implementation of fluoridation, which may be responsible
for the cancer rate increase. Have other industries appeared and grown during this period,
exposing the population to potential carcinogens? Have the population characteristics (e.g.,
racial, age, sex, economic factors) changed during this period? Such questions may be resolved
by examining the cancer rates in control areas where fluoridation was not implemented.
The correlation coefficient is a measure of the “degree” of correlation, which is often
erroneously interpreted as a measure of “linearity.” That is, a strong correlation is sometimes
interpreted as meaning that the relationship between X and Y is a straight line. As we shall see
further in this discussion, this interpretation of correlation is not necessarily correct.

7.10.1 Correlation Coefficient
The correlation coefficient is a quantitative measure of the relationship or correlation between two variables:

    r = Σ(X − X̄)(y − ȳ) / √[Σ(X − X̄)² Σ(y − ȳ)²].   (7.26)

A shortcut computing formula is

    r = [N ΣXy − (ΣX)(Σy)] / √{[N ΣX² − (ΣX)²][N Σy² − (Σy)²]},   (7.27)

where N is the number of X, y pairs.
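Equation (7.27) translates directly into code. A minimal sketch, assuming NumPy; the function name pearson_r is ours:

    import numpy as np

    def pearson_r(X, y):
        """Correlation coefficient by the shortcut formula, Eq. (7.27)."""
        X, y = np.asarray(X, float), np.asarray(y, float)
        N = len(X)
        num = N * np.sum(X * y) - np.sum(X) * np.sum(y)
        den = np.sqrt((N * np.sum(X ** 2) - np.sum(X) ** 2)
                      * (N * np.sum(y ** 2) - np.sum(y) ** 2))
        return num / den

The result agrees with NumPy's built-in np.corrcoef(X, y)[0, 1], which may be preferred in practice.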


The correlation coefficient, r, may be better understood by its relationship to S²Y,x, the variance calculated from regression line fitting procedures. r² represents the relative reduction in the sum of squares of the variable y resulting from the fitting of the X, y line. For example, the sum of squares Σ(y − ȳ)² for the y values 0, 1, and 5 is equal to 14 [see Eq. (1.4)]:

    Σ(y − ȳ)² = 0² + 1² + 5² − (0 + 1 + 5)²/3 = 14.

If these same y values were associated with X values, the sum of squares of y from the regression of y on X will be equal to or less than Σ(y − ȳ)², or 14 in this example. Suppose that X and y values are as follows (Fig. 7.15):

Figure 7.15 Reduction in sum of squares due to regression.

    X      y      Xy
    0      0       0          Σ(X − X̄)² = 2
    1      1       1
    2      5      10          Σ(y − ȳ)² = 14
    Sum    3      6      11

According to Eq. (7.9), the sum of squares due to deviations of the y values from the regression line is

    Σ(y − ȳ)² − b²Σ(X − X̄)²,   (7.28)

where b is the slope of the regression line (y on X). The term b²Σ(X − X̄)² is the reduction in the sum of squares due to the straight-line regression fit. Applying Eq. (7.28), the sum of squares is

    14 − (2.5)²(2) = 14 − 12.5 = 1.5 (the slope, b, is 2.5).

r² is the relative reduction of the sum of squares:

    r² = (14 − 1.5)/14 = 0.893,   r = √0.893 = 0.945.

The usual calculation of r, according to Eq. (7.27), is as follows:

    r = [3(11) − (3)(6)] / √{[3(5) − (3)²][3(26) − 36]} = 15/√[6(42)] = 0.945.

Thus, according to this notion, r can be interpreted as the relative degree of scatter about the regression line. If X and y values lie exactly on a straight line (a perfect fit), S²Y,x is 0, and r is equal to ±1; +1 for a line of positive slope and −1 for a line of negative slope. For a correlation coefficient equal to 0.5, r² = 0.25. The sum of squares for y is reduced 25%. A correlation coefficient of 0 means that the X, y pairs are not correlated [Fig. 7.14(D)].
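The arithmetic of this reduction is simple to verify. A small sketch for the example above, again assuming NumPy (np.polyfit supplies the least squares slope and intercept):

    import numpy as np

    X = np.array([0.0, 1.0, 2.0])
    y = np.array([0.0, 1.0, 5.0])

    b, a = np.polyfit(X, y, 1)                 # slope 2.5 and intercept -0.5
    ss_total = np.sum((y - y.mean()) ** 2)     # 14
    ss_resid = np.sum((y - (a + b * X)) ** 2)  # 1.5
    r2 = (ss_total - ss_resid) / ss_total
    print(b, r2, np.sqrt(r2))                  # 2.5, 0.893, 0.945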
Although there are no assumptions necessary to calculate the correlation coefficient, sta-
tistical analysis of r is based on the notion of a bivariate normal distribution of X and y. We will
not delve into the details of this complex probability distribution here. However, there are two
interesting aspects of this distribution that deserve some attention with regard to correlation
analysis.

1. In typical correlation problems, both X and y are variable. This is in contrast to the linear
regression case, where X is considered fixed, chosen, a priori, by the investigator.

2. In a bivariate normal distribution, X and y are linearly related. The regression of both X on y
and y on X is a straight line.¶ Thus, when statistically testing correlation coefficients, we are
not testing for linearity. As described below, the statistical test of a correlation coefficient is
a test of correlation or independence. According to Snedecor and Cochran, the correlation
coefficient “estimates the degree of closeness of a linear relationship between two variables,
Y and X, and the meaning of this concept is not easy to grasp” [11].

7.10.2 Test of Zero Correlation


The correlation coefficient is a rough measure of the degree of association of two variables. The
degree of association may be measured by how well one variable can be predicted from another;
the closer the correlation coefficient is to + 1 or − 1, the better the correlation, the better the
predictive power of the relationship. A question of particular importance from a statistical point
of view is whether or not an observed correlation coefficient is “real” or due to chance. If two
variables from a bivariate normal distribution are uncorrelated (independent), the correlation
coefficient is 0. Even in these cases, in actual experiments, random variation will result in a
correlation coefficient different from zero. Thus, it is of interest to test an observed correlation
coefficient, r, versus a hypothetical value of 0. This test is based on an assumption that y is a
normal variable [11]. The test is a t test with (N − 2) d.f., as follows:

H0: ρ = 0    Ha: ρ ≠ 0,

where ρ is the true correlation coefficient, estimated by r.

    t(N−2) = r√(N − 2) / √(1 − r²).   (7.29)

The value of t is referred to a t distribution with (N − 2) d.f., where N is the sample size
(i.e., the number of pairs). Interestingly, this test is identical to the test of the slope of the least
squares fit, Y = a + bX [Eq. (7.13)]. In this context, one can think of the test of the correlation
coefficient as a test of the significance of the slope versus 0.
To illustrate the application of Eq. (7.29), Table 7.11 shows data of diastolic blood pressure and cholesterol levels of 10 randomly selected men. The data are plotted in Figure 7.16. r is calculated from Eq. (7.27):

    r = [N ΣXy − (ΣX)(Σy)] / √{[N ΣX² − (ΣX)²][N Σy² − (Σy)²]}
      = [10(260,653) − (3111)(825)] / √{[10(987,893) − (3111)²][10(69,279) − (825)²]} = 0.809.   (7.30)

r is tested for significance using Eq. (7.29):

    t8 = 0.809√8 / √[1 − (0.809)²] = 3.89.

A value of t equal to 2.31 is needed for significance at the 5% level (see Table IV.4).
Therefore, the correlation between diastolic blood pressure and cholesterol is significant. The
correlation is apparent from inspection of Figure 7.16.
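The same test is quickly scripted. A sketch assuming NumPy and SciPy are available, with the data of Table 7.11:

    import numpy as np
    from scipy import stats

    X = np.array([307, 259, 341, 317, 274, 416, 267, 320, 274, 336])  # cholesterol
    y = np.array([80, 75, 90, 74, 75, 110, 70, 85, 88, 78])           # DBP

    N = len(X)
    r = np.corrcoef(X, y)[0, 1]                   # 0.809
    t = r * np.sqrt(N - 2) / np.sqrt(1 - r ** 2)  # Eq. (7.29): 3.89 with 8 d.f.
    p = 2 * stats.t.sf(abs(t), df=N - 2)          # two-sided p-value
    print(round(r, 3), round(t, 2), round(p, 3))

The p-value of roughly 0.005 is consistent with the comparison against the 5% critical value of 2.31 above.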

¶ The regression of y on X means that X is assumed to be the fixed variable when calculating the line. This line
is different from that calculated when Y is considered the fixed variable (unless the correlation coefficient is ±1,
when both lines are identical). The slope of the line is r Sy /Sx for the regression of y on X and r Sx /Sy for x on Y.

Table 7.11 Diastolic Blood Pressure and Serum Cholesterol of 10 Persons

    Person    Diastolic blood pressure (DBP), y    Cholesterol (C), X    Xy
    1          80                                  307                   24,560
    2          75                                  259                   19,425
    3          90                                  341                   30,690
    4          74                                  317                   23,458
    5          75                                  274                   20,550
    6         110                                  416                   45,760
    7          70                                  267                   18,690
    8          85                                  320                   27,200
    9          88                                  274                   24,112
    10         78                                  336                   26,208
    Σy = 825    ΣX = 3111    ΣXy = 260,653
    Σy² = 69,279    ΣX² = 987,893

Significance tests for the correlation coefficient versus values other than 0 are not very
common. However, for these tests, the t test described above [Eq. (7.29)] should not be used. An
approximate test is available to test for correlation coefficients other than 0 (e.g., H0: ρ = 0.5).
Since applications of this test occur infrequently in pharmaceutical experiments, the procedure
will not be presented here. The statistical test is an approximation to the normal distribution, and
the approximation can also be used to place confidence intervals on the correlation coefficient.
A description of these applications is presented in Ref. [11].

7.10.3 Miscellaneous Comments


Before leaving the topic of correlation, the reader should once more be warned about the
potential misuses of interpretations of correlation and the correlation coefficient. In particular,
the association of high correlation coefficients with a “cause and effect” and “linearity” is not
necessarily valid. Strong correlation may imply a direct causal relationship, but the nature of the
measurements should be well understood before firm statements can be made about cause and
effect. One should be keenly aware of the common occurrence of spurious correlations due to
indirect causes or remote mechanisms.
The correlation coefficient does not test the linearity of two variables. If anything, it is
more related to the slope of the line relating the variables. Linearity is assumed for the routine
statistical test of the correlation coefficient. As has been noted above, the correlation coefficient
measures the degree of correlation, a measure of the variability of a predictive relationship.

Figure 7.16 Plot of data from Table 7.11.



Table 7.12 Two Data Sets Illustrating Some Problems of Interpreting Correlation Coefficients

         Set A              Set B
    X        y         X        y
    −2       0         0        0
    −1       3         2        4
     0       4         4       16
    +1       3         6       36
    +2       0

A proper test for linearity (i.e., do the data represent a straight-line relationship between X
and Y?) is described in Appendix II and requires replicate measurements in the regression
model. Usually, correlation problems deal with cases where both variables, X and y, are variable
in contrast to the regression model where X is considered fixed. In correlation problems, the
question of linearity is usually not of primary interest. We are more interested in the degree of
association of the variables. Two examples will show that a high correlation coefficient does not
necessarily imply “linearity” and that a small correlation coefficient does not necessarily imply
lack of correlation (if the relationship is nonlinear).
Table 7.12 shows two sets of data that are plotted in Figure 7.17. Both data sets A and B show perfect (but nonlinear) relationships between X and y. Set A is defined by Y = 4 − X². Set B is defined by Y = X². Yet the correlation coefficient for set A is 0, an implication of no correlation, and set B has a correlation coefficient of 0.96, a very strong correlation (but not linearity!). These examples should emphasize the care needed in the interpretation of the correlation coefficient, particularly in nonlinear systems.
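Both coefficients are easy to confirm; a brief sketch, again assuming NumPy:

    import numpy as np

    X_a = np.array([-2, -1, 0, 1, 2]); y_a = 4 - X_a ** 2   # Set A: Y = 4 - X^2
    X_b = np.array([0, 2, 4, 6]);      y_b = X_b ** 2       # Set B: Y = X^2

    print(np.corrcoef(X_a, y_a)[0, 1])  # 0.0: a perfect curve, zero correlation
    print(np.corrcoef(X_b, y_b)[0, 1])  # about 0.96: strong, yet nonlinear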
Another example of data for which the correlation coefficient can be misleading is shown in
Table 7.13 and Figure 7.18. In this example, drug stability is plotted versus pH. Five experiments
were performed at low pH and one at high pH. The correlation coefficient is 0.994, a highly
significant result (p < 0.01). Can this be taken to mean that the data in Figure 7.18 are a good fit to a straight line? Without some other source of information, it would take a great deal of
imagination to assume that the relationship between pH and t1/2 is linear over the range of pH
equal to 2.0 to 5.5. Even if the relationship were linear, had data been available for points in
between pH 2.0 and 5.5, the fit may not be as good as that implied by the large value of r in
this example. This situation can occur when one value is far from the cluster of the main body
of data. One should be cautious in “over-interpreting” the correlation coefficient in these cases.
When relationships between variables are to be quantified for predictive or theoretical reasons,
regression procedures, if applicable, are recommended. Correlation, per se, is not as versatile or
informative as regression analysis for describing the relationship between variables.
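The r of 0.994 for this example is quickly reproduced; a sketch assuming NumPy (note that the computed coefficient is negative, since t1/2 falls as pH rises, so it is the magnitude |r| that equals 0.994):

    import numpy as np

    pH  = np.array([2.0, 2.1, 1.9, 2.0, 2.1, 5.5])   # Table 7.13
    t12 = np.array([48, 50, 50, 46, 47, 12])         # stability, t1/2 (wk)

    print(np.corrcoef(pH, t12)[0, 1])  # about -0.994, driven by the lone pH 5.5 point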

7.11 COMPARISON OF VARIANCES IN RELATED SAMPLES


In section 5.3, a test was presented to compare variances from two independent samples. If the
samples are related, the simple F test for two independent samples is not valid [11]. Related, or

Figure 7.17 Plot of data in Table 7.12 showing problems with interpretation of the correlation coefficient.

Table 7.13 Data to Illustrate a Problem that Can Result in Misinterpretation of the Correlation Coefficient

pH      Stability, t1/2 (wk)

2.0 48
2.1 50
1.9 50
2.0 46
2.1 47
5.5 12

Figure 7.18 Plot of data from Table 7.13.

paired-sample tests arise, for example, in situations where the same subject tests two treatments,
such as in clinical or bioavailability studies. To test for the equality of variances in related
samples, we must first calculate the correlation coefficient and the F ratio of the variances. The
test statistic is calculated as follows:

    rds = (F − 1) / √[(F + 1)² − 4r²F],   (7.31)

where F is the ratio of the variances in the two samples and r is the correlation coefficient.
The ratio in Eq. (7.31), rds, can be tested for significance in the same manner as the test for the ordinary correlation coefficient, with (N − 2) d.f., where N is the number of pairs [Eq. (7.29)].
As is the case for tests of the correlation coefficient, we assume a bivariate normal distribution
for the related data. The following example demonstrates the calculations.
In a bioavailability study, 10 subjects were given each of two formulations of a drug
substance on two occasions, with the results for AUC (area under the blood level versus time
curve) given in Table 7.14.
The correlation coefficient is calculated according to Eq. (7.27):

    r = [(10)(64,421) − (781)(815)] / √{[(10)(62,821) − (781)²][(10)(67,087) − (815)²]} = 0.699.

The ratio of the variances (Table 7.14), F, is

    202.8/73.8 = 2.75.

[Note: The ratio of the variances may also be calculated as 73.8/202.8 = 0.36, with the
same conclusions based on Eq. (7.31).]

Table 7.14 AUC Results of the Bioavailability Study (A vs. B)

               Formulation
    Subject    A        B
    1           86      88
    2           64      73
    3           69      86
    4           94      89
    5           77      80
    6           85      71
    7           60      70
    8          105      96
    9           68      84
    10          73      78
    Mean        78.1    81.5
    S²         202.8    73.8

The test statistic, rds, is calculated from Eq. (7.31):

    rds = (2.75 − 1) / √[(2.75 + 1)² − 4(0.699)²(2.75)] = 0.593.

rds is tested for significance using Eq. (7.29):

    t8 = 0.593√8 / √(1 − 0.593²) = 2.08.

Referring to the t table (Table IV.4, 8 d.f.), a value of 2.31 is needed for significance at the
5% level. Therefore, we cannot reject the null hypothesis of equal variances in this example.
Formulation A appears to be more variable, but more data would be needed to substantiate
such a claim.
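The whole procedure can be scripted in a few lines. A sketch assuming NumPy and SciPy are available, with the data of Table 7.14:

    import numpy as np
    from scipy import stats

    A = np.array([86, 64, 69, 94, 77, 85, 60, 105, 68, 73])
    B = np.array([88, 73, 86, 89, 80, 71, 70, 96, 84, 78])

    F = np.var(A, ddof=1) / np.var(B, ddof=1)                # 2.75
    r = np.corrcoef(A, B)[0, 1]                              # 0.699
    rds = (F - 1) / np.sqrt((F + 1) ** 2 - 4 * r ** 2 * F)   # Eq. (7.31): 0.593

    N = len(A)
    t = rds * np.sqrt(N - 2) / np.sqrt(1 - rds ** 2)         # Eq. (7.29): 2.08, 8 d.f.
    p = 2 * stats.t.sf(abs(t), df=N - 2)                     # about 0.07, not significant
    print(round(F, 2), round(r, 3), round(rds, 3), round(t, 2), round(p, 2))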
A discussion of correlation of multiple outcomes and adjustment of the significance level
is given in section 8.2.2.

KEY TERMS
Best-fitting line                    Nonlinear regression
Bivariate normal distribution        Nonlinearity
Confidence band for line             One-sided confidence interval
Confidence interval for X and Y      Prediction interval
Correlation                          Reduction of sum of squares
Correlation coefficient              Regression
Correlation diagram                  Regression analysis
Dependent variable                   Residuals
Fixed value (X)                      Scatter plot
Independence                         Simple linear regression
Independent variable                 Slope
Intercept                            S²Y,x
Inverse prediction                   Trend
Lack of fit                          Variance of correlated samples
Linear regression                    Weighted regression
Line through the origin

EXERCISES
1. A drug seems to decompose in a manner such that appearance of degradation products
is linear with time (i.e., Cd = kt).

t Cd
1 3
2 9
3 12
4 17
5 19

(a) Calculate the slope (k) and intercept from the least squares line.
(b) Test the significance of the slope (test vs. 0) at the 5% level.
(c) Test the slope versus 5 (H0 : B = 5) at the 5% level.
(d) Put 95% confidence limits on Cd at t = 3 and t = 5.
(e) Predict the value of Cd at t = 20. Place a 95% prediction interval on Cd at t = 20.
(f) If it is known that Cd = 0 at t = 0, calculate the slope.
2. A Beer’s law plot is constructed by plotting ultraviolet absorbance versus concentration,
with the following results:

Concentration, X Absorbance, y Xy
1 0.10 0.10
2 0.36 0.72
3 0.57 1.71
5 1.09 5.45
10 2.05 20.50

(a) Calculate the slope and intercept.


(b) Test to see if the intercept is different from 0 (5% level). How would you interpret
a significant intercept with regard to the actual physical nature of the analytical
method?
(c)∗∗ An unknown has an absorbance of 1.65. What is the concentration? Put confidence limits on the concentration (95%).
3. Five tablets were weighed and then assayed with the following results:

Weight (mg) Potency (mg)


205 103
200 100
202 101
198 98
197 98

(a) Plot potency versus weight (weight = X). Calculate the least squares line.
(b) Predict the potency for a 200-mg tablet.
(c) Put 95% confidence limits on the potency for a 200-mg tablet.

∗∗ This is a more advanced topic.



4. Tablets were weighed and assayed with the following results:

Weight Assay Weight Assay


200 10.0 198 9.9
205 10.1 200 10.0
203 10.0 190 9.6
201 10.1 205 10.2
195 9.9 207 10.2
203 10.1 210 10.3

(a) Calculate the correlation coefficient.


(b) Test the correlation coefficient versus 0 (5% level).
(c) Plot the data in the table (scatter plot).
5. Tablet dissolution was measured in vitro for 10 generic formulations. These products
were also tested in vivo. Results of these studies showed the following time to 80%
dissolution and time to peak (in vivo).

Formulation Time to 80% dissolution (min) Tp(hr)


1 17 0.8
2 25 1.0
3 15 1.2
4 30 1.5
5 60 1.4
6 24 1.0
7 10 0.8
8 20 0.7
9 45 2.5
10 28 1.1

Calculate r and test for significance (versus 0) (5% level). Plot the data.
6. Shah et al. [14] measured the percent of product dissolved in vitro and the time to
peak (in vivo) of nine phenytoin sodium products, with approximately the following
results:

Product Time to peak (hr) Percentage dissolved in 30 min


1 6 20
2 4 60
3 2.5 100
4 4.5 80
5 5.1 35
6 5.7 35
7 3.5 80
8 5.7 38
9 3.8 85

Plot the data. Calculate the correlation coefficient and test to see if it is significantly
different from 0 (5% level). (Why is the correlation coefficient negative?)
7. In a study to compare the effects of two pain-relieving drugs (A and B), 10 patients took
each drug in a paired design with the following results (drug effectiveness based on a
rating scale).

Patient Drug A Drug B


1 8 6
2 5 4
3 5 6
4 2 5
5 4 5
6 7 4
7 9 6
8 3 7
9 5 5
10 1 4

Are the drug effects equally variable?


8. Compute the intercept and slope of the least squares line for the data of Table 7.6 after a ln transformation of both X and Y. Calculate the residuals and compare to the data in Table 7.7.
9. In a drug stability study, the following data were obtained:

Time (months) Concentration (mg)


0 2.56
1 2.55
3 2.50
9 2.44
12 2.40
18 2.31
24 2.25
36 2.13

(a) Fit a least squares line to the data.


(b) Predict the time to decompose to 90% of label claim (2.25 mg).
(c) Based on a two-sided 95% confidence interval, what expiration date should be
applied to this formulation?
(d) Based on a one-sided 95% confidence interval, what expiration date should be applied
to this formulation?
10.†† Fit the following data to the exponential y = e^(ax). Use nonlinear least squares.

x y
1 1.62
2 2.93
3 4.21
4 7.86

REFERENCES
1. Draper NR, Smith H. Applied Regression Analysis, 2nd ed. New York: Wiley, 1981.
2. Youden WJ. Statistical Methods for Chemists. New York: Wiley, 1964.
3. U.S. Food and Drug Administration. Current Good Manufacturing Practices (CGMP) 21 CFR.
Washington, DC: Commissioner of the Food and Drug Administration, 2006:210–229.

†† This is an optional, more difficult problem.



4. Davies OL, Hudson HE. Stability of drugs: accelerated storage tests. In: Buncher CR, Tsay J-Y, eds.
Statistics in the Pharmaceutical Industry. New York: Marcel Dekker, 1994:445–479.
5. Tootill JPR. A critical appraisal of drug stability testing methods. J Pharm Pharmacol 1961; 13(suppl):
75T–86T.
6. Davis J. The Dating Game. Washington, DC: Food and Drug Administration, 1978.
7. Norwood TE. Statistical analysis of pharmaceutical stability data. Drug Dev Ind Pharm 1986; 12:553–
560.
8. International Conference on Harmonization. Bracketing and matrixing designs for stability testing of drug substances and drug products (FDA Draft Guidance), Step 2, Nov 9, 2000.
9. Nordbrock ET. Stability matrix designs. In: Chow S-C, ed. Encyclopedia of Pharmaceutical Statistics.
New York: Marcel Dekker, 2000:487–492.
10. Murphy JR. Bracketing Design. In: Chow S-C, ed. Encyclopedia of Pharmaceutical Statistics.
New York: Marcel Dekker, 2000:77.
11. Snedecor GW, Cochran WG. Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.
12. Weisberg S. Applied Linear Regression. New York: Wiley, 1980.
13. duToit SHC, Steyn AGW, Stumpf RH. Graphical Exploratory Data Analysis. New York: Springer, 1986.
14. Shah VP, Prasad VK, Alston T, et al. In vitro in vivo correlation for 100 mg phenytoin sodium capsules.
J Pharm Sci 1983; 72:306.
