Linear Regression and Correlation
Simple linear regression analysis is a statistical technique that defines the functional relationship
between two variables, X and Y, by the “best-fitting” straight line. A straight line is described
by the equation, Y = A + BX, where Y is the dependent variable (ordinate), X is the independent
variable (abscissa), and A and B are the Y intercept and slope of the line, respectively (Fig. 7.1).∗
Applications of regression analysis in pharmaceutical experimentation are numerous. This
procedure is commonly used
1. to describe the relationship between variables where the functional relationship is known
to be linear, such as in Beer’s law plots, where optical density is plotted against drug
concentration;
2. when the functional form of a response is unknown, but where we wish to represent a trend
or rate as characterized by the slope (e.g., as may occur when following a pharmacological
response over time);
3. when we wish to describe a process by a relatively simple equation that will relate the
response, Y, to a fixed value of X, such as in stability prediction (concentration of drug
versus time).
In addition to the specific applications noted above, regression analysis is used to define
and characterize dose–response relationships, for fitting linear portions of pharmacokinetic
data, and in obtaining the best fit to linear physical–chemical relationships.
Correlation is a procedure commonly used to characterize quantitatively the relationship
between variables. Correlation is related to linear regression, but its application and interpreta-
tion are different. This topic is introduced at the end of this chapter.
7.1 INTRODUCTION
Straight lines are constructed from sets of data pairs, X and Y. Two such pairs (i.e., two points)
uniquely define a straight line. As noted previously, a straight line is defined by the equation
Y = A + B X, (7.1)
where A is the Y intercept (the value of Y when X = 0) and B is the slope (ΔY/ΔX). ΔY/ΔX is
(Y₂ − Y₁)/(X₂ − X₁) for any two points on the line (Fig. 7.1). The slope and intercept define the
line; once A and B are given, the line is specified. In the elementary example of only two points,
a statistical approach to define the line is clearly unnecessary.
In general, with more than two X, y points,† a plot of y versus X will not exactly describe
a straight line, even when the relationship is known to be linear. The failure of experimental
data derived from truly linear relationships to lie exactly on a straight line is due to errors
of observation (experimental variability). Figure 7.2 shows the results of four assays of drug
samples of different, but known potency. The assay results are plotted against the known
amount of drug. If the assays are performed without error, the plot results in a 45° line (slope
= 1) which, if extended, passes through the origin; that is, the Y intercept, A, is 0 [Fig. 7.2(A)].
∗ The notation Y = A + BX is standard in statistics. We apologize for any confusion that may result from the
reader’s familiarity with the equivalent, Y = mX + b, used frequently in analytical geometry.
† In the rest of this chapter, y denotes the experimentally observed point, and Y denotes the corresponding point
on the least squares “fitted” line (or the true value of Y, according to context).
Figure 7.2 Plot of assay recovery versus known amount: theoretical and actual data.
Using calculus, the slope and intercept of the least squares line can be calculated from the sample data
as follows:
$$\text{Slope} = b = \frac{\sum(X - \bar{X})(y - \bar{y})}{\sum(X - \bar{X})^2} \qquad (7.2)$$

$$\text{Intercept} = a = \bar{y} - b\bar{X} \qquad (7.3)$$
Remember that the slope and intercept uniquely define the line.
There is a shortcut computing formula for the slope, similar to that described previously
for the standard deviation:

$$b = \frac{N\sum Xy - (\sum X)(\sum y)}{N\sum X^2 - (\sum X)^2}, \qquad (7.4)$$
Figure 7.4 Lack of fit due to (A) experimental error and (B) nonlinearity.
Table 7.1 Raw Data from Figure 7.2(A) to Calculate the Least
Squares Line
Table 7.2 Raw Data from Figure 7.2(B) Used to Calculate the
Least Squares Line
where N is the number of X, y pairs. The calculation of the slope and intercept is relatively
simple, and can usually be quickly computed using a computer (e.g., EXCEL) or with a hand
calculator. Some calculators have a built-in program for calculating the regression parameter
estimates, a and b.‡
For the example shown in Figure 7.2(A), the line that exactly passes through the four data
points has a slope of 1 and an intercept of 0. The line, Y = X, is clearly the best line for these data,
an exact fit. The least squares line, in this case, is exactly the same line, Y = X. The calculation of
the intercept and slope using the least squares formulas, Eqs. (7.3) and (7.4), is illustrated below.
Table 7.1 shows the raw data used to construct the line
in Figure 7.2(A).
According to Eq. (7.4) (N = 4, ΣX² = 34,400, ΣXy = 34,400, ΣX = Σy = 360),

$$b = \frac{4(34{,}400) - (360)(360)}{4(34{,}400) - (360)^2} = 1.$$

The intercept a is computed from Eq. (7.3): a = ȳ − bX̄ (ȳ = X̄ = 90, b = 1), so a = 90 − 1(90) = 0. This represents
a situation where the assay results exactly equal the known drug potency (i.e., there is no error).
The actual experimental data depicted in Figure 7.2(B) are shown in Table 7.2. The slope
b and the intercept a are calculated from Eqs. (7.4) and (7.3). According to Eq. (7.4),
$$b = \frac{(4)(33{,}600) - (360)(353)}{4(34{,}400) - (360)^2} = 0.915.$$

From Eq. (7.3), the intercept is a = ȳ − bX̄ = 88.25 − 0.915(90) = 5.9.
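As a check on the arithmetic, Eqs. (7.3) and (7.4) are easily programmed. The minimal Python sketch below assumes the X values 60, 80, 100, and 120 (inferred from the quoted sums ΣX = 360 and ΣX² = 34,400) and uses the exact-fit assay results of Table 7.1 (y = X); the function itself applies to any set of X, y pairs.

```python
# Minimal sketch of the shortcut least squares formulas, Eqs. (7.3) and (7.4).
# The X values are inferred from the sums quoted in the text; the y values
# assume the exact-fit case of Table 7.1 (assay result = known potency).

def least_squares(x, y):
    """Return the slope b [Eq. (7.4)] and intercept a [Eq. (7.3)]."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return b, a

print(least_squares([60, 80, 100, 120], [60, 80, 100, 120]))  # (1.0, 0.0)
```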
This example suggests several questions and problems regarding linear regression analy-
sis. The line that best fits the experimental data is an estimate of some true relationship between
X and Y. In most circumstances, we will fit a straight line to such data only if we believe that the
true relationship between X and Y is linear. The experimental observations will not fall exactly
on a straight line because of variability (e.g., error associated with the assay). This situation (true
linearity associated with experimental error) is different from the case where the underlying
true relationship between X and Y is not linear. In the latter case, the lack of fit of the data to the
least squares line is due to a combination of experimental error and the lack of linearity of the X,
Y relationship (Fig. 7.4). Elementary techniques of simple linear regression will not differentiate
these two situations: (a) experimental error with true linearity and (b) experimental error and
nonlinearity. (A design to estimate variability due to both nonlinearity and experimental error
is given in App. II.)
We will discuss some examples relevant to pharmaceutical research that make use of
least squares linear regression procedures. The discussion will demonstrate how variability is
estimated and used to construct estimates and tests of the line parameters A and B.
Rearranging Eq. (7.5) (the fitted line, y = 5.9 + 0.915X), an unknown sample that has an assay value of 90 can be predicted
to have a true potency of
$$\text{Potency} = X = \frac{y - 5.9}{0.915} = \frac{90 - 5.9}{0.915} = 91.9.$$
This point (91.9, 90) is indicated in Figure 7.2 by a cross.
The least squares line fitted with the zero intercept is shown in Figure 7.5. If this line
were to be used to predict actual concentrations based on assay results, we would obtain
answers that are different from those predicted from the line drawn in Figure 7.2(B). However,
both lines have been constructed from the same raw data. “Is one of the lines correct?” or “Is
one line better than the other?” Although one cannot say with certainty which is the better
line, a thorough knowledge of the analytical method will be important in making a choice.
For example, a nonzero intercept suggests either nonlinearity over the range of assays or the
presence of an interfering substance in the sample being analyzed. The decision of which
line to use can also be made on a statistical basis. A statistical test of the intercept can be
performed under the null hypothesis that the intercept is 0 (H0 : A = 0, sect. 7.4.1). Rejection of
the hypothesis would be strong evidence that the line with the positive intercept best represents
the data.
The validity of the statistical procedures associated with regression analysis depends on the following assumptions:
1. The X variable is measured without error. Although not always exactly true, X is often measured
with relatively little error and, under these conditions this assumption can be considered
to be satisfied. In the present example, X is the potency of drug in the “known” sample. If
the drug is weighed on a sensitive balance, the error in drug potency will be very small.
Another example of an X variable that is often used, which can be precisely and accurately
measured, is “time.”
2. For each X, y is independent and normally distributed. We will often use the notation Y·x to show
that the value of Y is a function of X.
3. The variance of y is assumed to be the same at each X. If the variance of y is not constant, but
is either known or related to X in some way, other methods (see sect. 7.7) are available to
estimate the intercept and slope of the line [1].
4. A linear relationship exists between X and Y. Y = A + BX, where A and B are the true parameters.
Based on theory or experience, we have reason to believe that X and Y are linearly related.
These assumptions are depicted in Figure 7.6. Except for location (mean), the distribution
of y is the same at every value of X; that is, y has the same variance at every value of X. In the
example in Figure 7.6, the mean of the distribution of y’s decreases as X increases (the slope
is negative).
$\sigma^2_{Y\cdot x}$ is the variance of the response variable, y. An estimate of $\sigma^2_{Y\cdot x}$ can be obtained from the
closeness of the data to the least squares line. If the experimental points are far from the least
squares line, the estimated variability is larger than that in the case where the experimental
points are close to the least squares line. This concept is illustrated in Figure 7.7. If the data
exactly fit a straight line, the experiment shows no variability. In real experiments the chance of
an exact fit with more than two X, y pairs is very small. An unbiased estimate of $\sigma^2_{Y\cdot x}$ is obtained
from the sum of squares of deviations of the observed points from the fitted line as follows:

$$S^2_{Y\cdot x} = \frac{\sum(y - Y)^2}{N - 2} = \frac{\sum(y - \bar{y})^2 - b^2\left[\sum(X - \bar{X})^2\right]}{N - 2}, \qquad (7.9)$$

where y is the observed value and Y is the predicted value of Y from the least squares line
(Y = a + bX) (Fig. 7.7). The variance estimate, $S^2_{Y\cdot x}$, has N − 2 rather than N − 1 d.f. because
two parameters are being estimated from the data (i.e., the slope and intercept).
When $\sigma^2_{Y\cdot x}$ is unknown, the variances of a and b can be estimated by substituting $S^2_{Y\cdot x}$ for $\sigma^2_{Y\cdot x}$
in the formulas for the variances [Eqs. (7.7) and (7.8)]. Equations (7.10) and (7.11) are used as
the variance estimates, $S^2_a$ and $S^2_b$, when testing hypotheses concerning the parameters A and B.
This procedure is analogous to using the sample estimate of the variance in the t test to compare
sample means.

$$S_a^2 = S^2_{Y\cdot x}\left[\frac{1}{N} + \frac{\bar{X}^2}{\sum(X - \bar{X})^2}\right] \qquad (7.10)$$

$$S_b^2 = \frac{S^2_{Y\cdot x}}{\sum(X - \bar{X})^2} \qquad (7.11)$$
§ a and b are calculated as linear combinations of the normally distributed response variable, y, and thus can be
shown to be also normally distributed.
$$t_{d.f.} = t_2 = \frac{|a - A|}{\sqrt{S_a^2}}, \qquad (7.12)$$

where $t_{d.f.}$ is the t statistic with N − 2 d.f., a is the observed value of the intercept, and A is the
hypothetical value of the intercept. From Eq. (7.10),

$$S_a^2 = S^2_{Y\cdot x}\left[\frac{1}{N} + \frac{\bar{X}^2}{\sum(X - \bar{X})^2}\right] = 12.15\left[\frac{1}{4} + \frac{90^2}{2000}\right] = 52.245.$$

$$t_2 = \frac{|5.9 - 0|}{\sqrt{52.245}} = 0.82.$$
Note that this t test has 2 (N − 2) d.f. This is a weak test, and a large intercept must be
observed to obtain statistical significance. To define the intercept more precisely, it would be
necessary to perform a larger number of assays. If there is no reason to suspect a nonlinear
relationship between X and Y, a nonzero intercept, in this example, could be interpreted as
being due to some interfering substance(s) in the product (the “blank”). If the presence of a
nonzero intercept is suspected, one would probably want to run a sufficient number of assays
to establish its presence. A precise estimate of the intercept is necessary if this linear calibration
curve is used to evaluate potency.
H₀: B = 1   Hₐ: B ≠ 1

$$t = \frac{|b - B|}{\sqrt{S_b^2}}. \qquad (7.13)$$

From Eq. (7.11), $S_b^2$ = 12.15/2000 = 0.006075, and

$$t = \frac{|0.915 - 1|}{\sqrt{0.006075}} = 1.09.$$
This t test has 2 (N − 2) d.f. (the variance estimate has 2 d.f.). There is insufficient evidence
to indicate that the slope is significantly different from 1 at the 5% level. Table IV.4 shows that
a t of 4.30 is needed for significance at α = 0.05 and d.f. = 2. The test in this example has very
weak power. A slope very different from 1 would be necessary to obtain statistical significance.
This example again emphasizes the weakness of the statement “nonsignificant,” particularly
in small experiments such as this one. The reader interested in learning more details of the
use and interpretation of regression in analytical methodology is encouraged to read chapter 5
in Ref. [2].
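A minimal sketch of these two t tests follows, using only the summary values quoted for the calibration example (the individual assay results of Table 7.2 are not needed):

```python
import math

# t tests of the intercept (H0: A = 0) and slope (H0: B = 1), per
# Eqs. (7.10)-(7.13); summary values are those quoted in the text.
s2_yx = 12.15                    # variance about the line, S^2(Y.x)
n, x_bar, sxx = 4, 90.0, 2000.0  # N, mean X, sum of squared X deviations
a, b = 5.9, 0.915                # least squares intercept and slope

s2_a = s2_yx * (1 / n + x_bar ** 2 / sxx)    # Eq. (7.10): 52.245
s2_b = s2_yx / sxx                           # Eq. (7.11): 0.006075

t_intercept = abs(a - 0) / math.sqrt(s2_a)   # Eq. (7.12): 0.82
t_slope = abs(b - 1) / math.sqrt(s2_b)       # Eq. (7.13): 1.09
print(round(t_intercept, 2), round(t_slope, 2))
# Each test has N - 2 = 2 d.f.; t = 4.30 is needed at the 5% level.
```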
be linear. The current good manufacturing practices (CGMP) regulations [3] state that statistical
criteria, including sample size and test (i.e., observation or measurement) intervals for each
attribute examined, be used to assure statistically valid estimates of stability (211.166). The
expiration date should be “statistically valid” (211.137, 201.17, 211.62).
The mechanics of determining shelf life may be quite complex, particularly if extreme
conditions are used, such as those recommended for “accelerated” stability studies (e.g., high-
temperature and high-humidity conditions). In these circumstances, the statistical techniques
used to make predictions of shelf life at ambient conditions are quite advanced and beyond
the scope of this book [4]. Although extreme conditions are commonly used in stability testing
in order to save time and obtain a tentative expiration date, all products must eventually
be tested for stability under the recommended commercial storage conditions. The FDA has
suggested that at least three batches of product be tested to determine an expiration date. One
should understand that different batches may show somewhat different stability characteristics,
particularly in situations where additives affect stability to a significant extent. In these cases
variation in the quality and quantity of the additives (excipients) between batches could affect
stability. One of the purposes of using several batches for stability testing is to ensure that
stability characteristics are similar from batch to batch.
The time intervals chosen for the assay of storage samples will depend to a great extent
on the product characteristics and the anticipated stability. A “statistically” optimal design for
a stability study would take into account the planned “storage” times when the drug product
will be assayed. This problem has been addressed in the pharmaceutical literature [5]. How-
ever, the designs resulting from such considerations are usually cumbersome or impractical. For
example, from a statistical point of view, the slope of the potency versus time plot (the rate of
decomposition) is obtained most precisely if half of the total assay points are performed at time
0, and the other half at the final testing time. Note that Σ(X − X̄)², the denominator of the
expression defining the variance of a slope [Eq. (7.8)], is maximized under this condition, resulting
in a minimum variability of the slope. This “optimal” approach to designating assay sampling
times is based on the assumption that the plot is linear during the time interval of the test. In
a practical situation, one would want to see data at points between the initial and final assay
in order to assess the magnitude of the decomposition as the stability study proceeds, as well
as to verify the linearity of the decomposition. Also, management and regulatory requirements
are better satisfied with multiple points during the course of the study. A reasonable sched-
ule of assays at ambient conditions is 0, 3, 6, 9, 12, 18, and 24 months and at yearly intervals
thereafter [6].
The example of the data analysis that will be presented here will be for a single batch. If
the stability of different batches is not different, the techniques described here may be applied to
data from more than one batch. A statistician should be consulted for the analysis of multibatch
data that will require analysis of variance techniques [6,7]. The general approach is described
in section 8.7.
Typically, stability or shelf life is determined from data from the first three production
batches for each packaging configuration (container type and product strength) (see sect. 8.7).
Because such testing may be onerous for multiple strengths and multiple packaging of the
same drug product, matrixing and bracketing techniques have been suggested to minimize the
number of tests needed to demonstrate suitable drug stability [8].
Assays are recommended to be performed at time 0 and 3, 6, 9, 12, 18 and 24 months,
with subsequent assays at 12-month intervals as needed. Usually, three batches of a given
strength and package configuration are tested to define the shelf life. Because many products
have multiple strengths and package configurations, the concept of a “Matrix” design has been
introduced to reduce the considerable amount of testing required. In this situation, a subset of
all combinations of product strength, container type and size, and so on is tested at a given
time point. Another subset is tested at a subsequent time point. The design should be balanced
“such that each combination of factors is tested to the same extent.” All factor combinations
should be tested at time 0 and at the last time point of the study. The simplest such design, called
a “Basic Matrix 2/3 on Time Design,” has two of the three batches tested at each time point,
with all three batches tested at time 0 and at the final testing time, the time equal to the desired
shelf life. Table 7.3 shows this design for a 36-month product. Tables of matrix designs show
Table 7.3 Matrix Design for Three Packages and Three Strengths
Table 7.3A Matrix Design for Three Batches and Two Strengths
designs for multiple packages (made from the same blend or batch) and for multiple packages
and strengths. These designs are constructed to be symmetrical in the spirit of optimality for
such designs. For example, this is illustrated in Table 7.3, looking only at the “5” strength for
Package 1. Table 7.3 shows this design for a 36-month product with multiple packages and
strengths (made from the same blend). For example, in Table 7.3, each batch is tested twice, each
package from each batch is tested twice, and each package is tested six times at all time points
between 0 and 36 months.
With multiple strengths and packages, other similar designs with less testing have been
described [9].
The risks of applying such designs are outlined in the Guidance [8]. Because of the limited
testing, there is a risk of less precision and shorter dating. If pooling is not allowed, individual
lots will have short dating, and combinations not tested in the matrix will not have dating
estimates. Read the guidance for further details. The FDA guidance gives examples of other
designs.
The analysis of these designs can be complicated. The simplest approach is to analyze
each strength and configuration separately, as one would do if there were a single strength and
package. Another approach is to model all configurations including interactions. The assump-
tions, strengths, and limitations of these designs and analyses are explained in more detail in
Ref. [9].
A Bracketing design [10] is a design of a stability program such that at any point in
time only extreme samples are tested, such as extremes in container size and dosage. This is
particularly amenable to products that have similar composition across dosage strengths and
for which intermediate size and strength products are represented by the extremes [10]. (See also FDA
Guideline on Stability for further discussion as to when this is applicable.)
Suppose that we have a product in three strengths and three package sizes. Table 7.4 is an
example of a Bracketing design [10].
The testing designated by T should be the full testing as would be required for a single
batch. Note that full testing would require nine combinations, or 27 batches. The bracketing design
uses four combinations, or 12 batches.
Consider an example of a tablet formulation that is the subject of a stability study.
Three randomly chosen tablets are assayed at each of six time periods: 0, 3, 6, 9, 12, and
18 months after production, at ambient storage conditions. The data are shown in Table 7.5 and
Figure 7.8.
Given these data, the problem is to establish an expiration date defined as that time when
a tablet contains 90% of the labeled drug potency. The product in this example has a label of
50 mg potency and is prepared with a 4% overage (i.e., the product is manufactured with a
target weight of 52 mg of drug). Note that FDA is currently discouraging the use of overages to
compensate for poor stability.
Figure 7.8 shows that the data are variable. A careful examination of this plot suggests
that a straight line would be a reasonable representation of these data. The application of least
squares line fitting is best justified in situations where a theoretical model exists showing that the
decrease in concentration is linear with time (a zero-order process in this example). The kinetics
of drug loss in solid dosage forms is complex and a theoretical model is not easily derived. In
the present case, we will assume that concentration and time are truly linearly related
C = C0 − K t, (7.14)
where C is the concentration at time t, C0 the concentration at time 0 (Y intercept, A), K the rate
constant (− slope, − B), and t the time (storage time).
With the objective of estimating the shelf life, the simplest approach to the analysis of
these data is to estimate the slope and intercept of the least squares line, using Eqs. (7.4) and
(7.3). (An interesting exercise would be to first try and estimate the slope and intercept by eye
from Fig. 7.8.) When performing the least squares calculation, note that each value of the time
(X) is associated with three values of drug potency (y). When calculating C0 and K, each “time”
value is counted three times and N is equal to 18. From Table 7.5,

ΣX = 144   Σy = 894   ΣXy = 6984
ΣX² = 1782   Σy² = 44,476   N = 18
X̄ = 8   Σ(X − X̄)² = 630   Σ(y − ȳ)² = 74
The equation of the straight line best fitting the data in Figure 7.8 is

$$C = 51.80 - 0.267t. \qquad (7.15)$$

Setting C equal to 45 mg (90% of the 50-mg label) gives

$$45 = 51.80 - 0.267t, \qquad t = 25.5 \text{ months}.$$
The best estimate of the time needed for these tablets to retain 45 mg of drug is 25.5 months
(see the point marked with a cross in Fig. 7.9). The shelf life for the product will be less than
25.5 months if variability is taken into consideration. The next section, 7.6, presents a discussion
of this topic. This is an average result based on the data from 18 tablets. For any single tablet,
the time for decomposition to 90% of the labeled amount will vary, depending, for example, on
the amount of drug present at time zero. Nevertheless, the shelf-life estimate is based on the
average result.
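The shelf-life computation can be sketched directly from the summary sums quoted above (the individual tablet values of Table 7.5 are not repeated here):

```python
# Sketch: slope, intercept, and time to 90% of label (45 mg) from the
# stability-study sums quoted in the text (X in months, y in mg).
n = 18
sum_x, sum_y, sum_xy, sum_x2 = 144, 894, 6984, 1782

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # Eq. (7.4)
a = sum_y / n - b * (sum_x / n)                               # Eq. (7.3)
t90 = (45 - a) / b    # solve 45 = a + b*t for t

print(round(b, 3), round(a, 2), round(t90, 1))  # -0.267, 51.8, 25.5 months
```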
The concept of a confidence interval in regression is similar to that previously discussed for
means. Thus the interval for the shelf life probably contains the true shelf life—that time when
the tablets retain 90% of their labeled potency, on the average. The lower end of this confidence
interval would be considered a conservative estimate of the true shelf life. Before giving the
solution to this problem we will address the calculation of a confidence interval for Y (potency)
at a given X (time). The width of the confidence interval for Y (potency) is not constant, but
depends on the value of X, since Y is a function of X. In the present example, one might wish to
obtain a range for the potency at 25.5 months’ storage time.
$$Y \pm t(S_{Y\cdot x})\sqrt{\frac{1}{N} + \frac{(X - \bar{X})^2}{\sum(X - \bar{X})^2}}. \qquad (7.16)$$
t is the appropriate value (N − 2 d.f., Table IV.4) for a confidence interval with confidence
coefficient P. For example, for a 95% confidence interval, use t values in the column headed
0.975 in Table IV.4.
In the linear regression model, y is assumed to have a normal distribution with variance
$\sigma^2_{Y\cdot x}$ at each X. As can be seen from Eq. (7.16), confidence limits for Y at a specified value
of X depend on the variance, degrees of freedom, number of data points used to fit the line, and
(X − X̄), the distance of the specified X (time, in this example) from X̄, the average time used in
the least squares line fitting. The confidence interval is smallest for the Y that corresponds to
the value of X equal to X̄ [the term X − X̄ in Eq. (7.16) will then be zero]. As the value of X is
farther from X̄, the confidence interval for Y corresponding to the specified X is wider. Thus the
estimate of Y is less precise, as the X corresponding to Y is farther away from X̄. A plot of the
confidence interval for every Y on the line results in a continuous confidence “band” as shown in
Figure 7.9. The curved, hyperbolic shape of the confidence band illustrates the varying width
of the confidence interval at different values of X, Y. For example, the 95% confidence interval
for Y at X = 25.5 months [Eq. (7.16)] is
$$45 \pm 2.12(1.35)\sqrt{\frac{1}{18} + \frac{(25.5 - 8)^2}{630}} = 45 \pm 2.1.$$
Thus the result shows that the true value of the potency at 25.5 months is probably between
42.9 and 47.1 mg (45 ± 2.1).
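A short sketch of the Eq. (7.16) computation, using the stability-example values quoted in the text, is given below; ci_y can be evaluated at any X to trace out the confidence band of Figure 7.9.

```python
import math

# Confidence band for Y on the line [Eq. (7.16)], stability example:
# t = 2.12 (16 d.f.), S(Y.x) = 1.35, N = 18, mean X = 8, Sxx = 630.
t_crit, s_yx = 2.12, 1.35
n, x_bar, sxx = 18, 8.0, 630.0
a, b = 51.80, -0.267

def ci_y(x):
    y_hat = a + b * x
    half = t_crit * s_yx * math.sqrt(1 / n + (x - x_bar) ** 2 / sxx)
    return y_hat - half, y_hat + half

print(ci_y(25.5))   # about (42.9, 47.1), i.e., 45 +/- 2.1 mg
```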
$$X = \frac{(\hat{X} - g\bar{X}) \pm \dfrac{t(S_{Y\cdot x})}{b}\sqrt{\dfrac{1 - g}{N} + \dfrac{(\hat{X} - \bar{X})^2}{\sum(X - \bar{X})^2}}}{1 - g}, \qquad (7.17)$$

where X̂ is the point estimate of X obtained from the fitted line and

$$g = \frac{t^2(S^2_{Y\cdot x})}{b^2\sum(X - \bar{X})^2}.$$
t is the appropriate value for a confidence interval with confidence coefficient equal to P; for
example, for a two-sided 95% confidence interval, use values of t in the column headed 0.975
in Table IV.4.
A 95% confidence interval for X will be calculated for the time to 90% of labeled potency.
The potency is 45 mg (Y) when 10% of the labeled amount decomposes. The corresponding time
(X) has been calculated above as 25.5 months. For a two-sided confidence interval, applying
Eq. (7.17), we have
$$g = \frac{(2.12)^2(1.825)}{(-0.267)^2(630)} = 0.183,$$

with X̂ = 25.5, X̄ = 8, and N = 18.
Thus, using a two-sided confidence interval, the true time to 90% of labeled potency is
probably between 19.8 and 39.0 months. A conservative estimate of the shelf life would be
the lower value, 19.8 months. If g is greater than 1, a confidence interval cannot be calculated
because the slope is not significantly greater than 0.
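A sketch of the two-sided computation, assuming the form of Eq. (7.17) given above, is shown below; it reproduces the 19.8- and 39.0-month limits.

```python
import math

# Two-sided confidence interval for X at Y = 45 mg [Eq. (7.17)],
# using the stability-example values quoted in the text.
t_crit, s2_yx = 2.12, 1.825
b, sxx = -0.267, 630.0
n, x_bar, x_hat = 18, 8.0, 25.5

g = t_crit ** 2 * s2_yx / (b ** 2 * sxx)              # 0.183
term = (t_crit * math.sqrt(s2_yx) / b) * math.sqrt(
    (1 - g) / n + (x_hat - x_bar) ** 2 / sxx)         # negative, since b < 0
low = (x_hat - g * x_bar + term) / (1 - g)
high = (x_hat - g * x_bar - term) / (1 - g)
print(round(low, 1), round(high, 1))                  # 19.8 39.0
```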
The Food and Drug Administration has suggested that a one-sided confidence interval
may be more appropriate than a two-sided interval to estimate the expiration date. For most
drug products, drug potency can only decrease with time, and only the lower confidence band of
the potency versus time curve may be considered relevant. (An exception may occur in the case
of liquid products where evaporation of the solvent could result in an increased potency with
time.) The 95% one-sided confidence limits for the time to reach a potency of 45 are computed
using Eq. (7.17). Only the lower limit is computed, using the t value that cuts off
5% of the area in a single tail. For 16 d.f., this value is 1.75 (Table IV.4), and g = 0.1244. The
calculation is

$$X = \frac{[25.5 - 0.1244(8)] + \dfrac{1.75(1.35)}{-0.267}\sqrt{\dfrac{0.8756}{18} + \dfrac{(17.5)^2}{630}}}{0.8756} = 20.6 \text{ months}.$$
The one-sided 95% interval for X can be interpreted to mean that the time to decompose
to a potency of 45 is probably greater than 20.6 months. Note that the shelf life based on the
one-sided interval is longer than that based on a two-sided interval (Fig. 7.9).
$$X = \frac{(\hat{X} - g\bar{X}) \pm \dfrac{t(S_{Y\cdot x})}{b}\sqrt{(1 - g)\left(1 + \dfrac{1}{N}\right) + \dfrac{(\hat{X} - \bar{X})^2}{\sum(X - \bar{X})^2}}}{1 - g}. \qquad (7.19)$$
The following examples should clarify the computations. In the stability study example,
suppose that one wishes to construct a 95% confidence (prediction) interval for an assay to be
performed at 25.5 months. (An actual measurement is obtained at 25.5 months.) This interval
will be larger than that calculated based on Eq. (7.16), because the uncertainty now includes
assay variability for the proposed assay in addition to the uncertainty of the least squares line.
Applying Eq. (7.18) (Y = 45), we have

$$45 \pm 2.12(1.35)\sqrt{1 + \frac{1}{18} + \frac{(17.5)^2}{630}} = 45 \pm 3.55 \text{ mg}.$$
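The same computation in Python, contrasting the Eq. (7.16) confidence half-width with the wider Eq. (7.18) prediction half-width at 25.5 months (stability-example values, as above):

```python
import math

# Confidence vs. prediction half-widths at X = 25.5 months
# [Eqs. (7.16) and (7.18)], stability-example values.
t_crit, s_yx = 2.12, 1.35
n, x_bar, sxx = 18, 8.0, 630.0
x_new = 25.5

base = 1 / n + (x_new - x_bar) ** 2 / sxx
half_ci = t_crit * s_yx * math.sqrt(base)        # about 2.1 [Eq. (7.16)]
half_pred = t_crit * s_yx * math.sqrt(1 + base)  # about 3.55 [Eq. (7.18)]
print(round(half_ci, 2), round(half_pred, 2))
```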
In the example of the calibration line, consider an unknown sample that is analyzed and
shows a value (y) of 90. A prediction interval for X is calculated using Eq. (7.19); X is predicted,
as before, to be (90 − 5.9)/0.915 = 91.9.
The relatively large uncertainty of the estimate of the true value is due to the small number
of data points (four) and the relatively large variability of the points about the least squares line
($S^2_{Y\cdot x}$ = 12.15).
A 95% confidence interval for the slope of the line in the stability example is [Eq. (7.20)]
$$(-0.267) \pm \frac{2.12(1.35)}{\sqrt{630}} = -0.267 \pm 0.114 = -0.381 \text{ to } -0.153.$$
A 90% confidence interval for the intercept in the calibration line example (sect. 7.2) is
[Eq. (7.21)]
$$5.9 \pm 2.93(3.49)\sqrt{\frac{1}{4} + \frac{90^2}{2000}} = 5.9 \pm 21.2 = -15.3 \text{ to } 27.1.$$
(Note that the appropriate value of t with 2 d.f. for a 90% confidence interval is 2.93.)
Two approaches to this problem are (a) a transformation of y to make the variance homogeneous, such as the log transformation (see chap. 10), and (b) a
weighted regression analysis.
Below is an example of weighted regression analysis in which we assume a constant CV,
with the variance of y proportional to X² as noted above. This suggests a weighted regression,
weighting each value of y by a factor that is inversely proportional to the variance, 1/X².
Table 7.6 shows data for the spectrophotometric analysis of a drug performed at five concentrations
in duplicate.
Equation (7.22) is used to compute the slope for the weighted regression procedure:

$$b = \frac{\sum wXy - \dfrac{\sum wX\sum wy}{\sum w}}{\sum wX^2 - \dfrac{(\sum wX)^2}{\sum w}} \qquad (7.22)$$

$$b = \frac{0.19983 - \dfrac{(0.74)(0.0148693)}{0.1042}}{10 - \dfrac{(0.74)^2}{0.1042}} = 0.01986.$$
The intercept is

$$a = \bar{y}_w - b\bar{X}_w, \qquad (7.23)$$

where ȳw = Σwy/Σw and X̄w = ΣwX/Σw.
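A sketch of the weighted computation follows. Because Table 7.6's values are not reproduced here, the concentration–absorbance pairs below are hypothetical stand-ins; only the weighting scheme (w = 1/X², constant CV) follows the text.

```python
# Weighted least squares per Eqs. (7.22)-(7.23), weights w = 1/X^2.
# The (x, y) data below are hypothetical stand-ins for Table 7.6.
def weighted_least_squares(x, y, w):
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    swx2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    b = (swxy - swx * swy / sw) / (swx2 - swx ** 2 / sw)  # Eq. (7.22)
    a = swy / sw - b * (swx / sw)                         # Eq. (7.23)
    return b, a

conc = [5, 5, 10, 10, 25, 25, 50, 50, 100, 100]   # duplicates at 5 levels
absb = [0.105, 0.098, 0.201, 0.194, 0.495, 0.508,
        0.983, 1.009, 1.964, 2.013]
weights = [1 / c ** 2 for c in conc]              # weight inversely to variance
print(weighted_least_squares(conc, absb, weights))
```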
Figure 7.10 Weighted regression plot for data from Table 7.7.
Residual analysis provides a check of the fit to the statistical model. Examination of residuals can reveal problems such as variance
heterogeneity or nonlinearity. This brief introduction to the principle of residual analysis uses
the data from the regression analysis in section 7.7.
The residuals from a regression analysis are obtained from the differences between the
observed and predicted values. Table 7.7 shows the residuals from an unweighted least squares
fit of the data of Table 7.6. Note that the fitted values are obtained from the least squares equation
y = 0.001789 + 0.019874(X).
If the linear model and the assumptions in the least squares analysis are valid, the residuals
should be approximately normally distributed, and no trends should be apparent.
Figure 7.11 shows a plot of the residuals as a function of X. The fact that the residuals show a
fan-like pattern, expanding as X increases, suggests the use of a log transformation or weighting
procedure to reduce the variance heterogeneity. In general, the intelligent interpretation of
residual plots requires knowledge and experience. In addition to revealing patterns that indicate
relationships and the character of the data, residual plots usually make outliers readily
apparent [12].
Figure 7.12 shows the residual plot after a log (ln) transformation of X and Y. Much of the
variance heterogeneity has been removed.
For readers who desire more information on this subject, the book Graphical Exploratory
Data Analysis [13] is recommended.
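The residual computation itself is brief, as sketched below; the fitted line is the one quoted above, and the data are again the hypothetical pairs from the weighted-regression sketch.

```python
# Residuals from an unweighted fit, as in Table 7.7; a fan-like spread of
# residuals with increasing X suggests variance heterogeneity.
a, b = 0.001789, 0.019874   # unweighted least squares line from the text
conc = [5, 5, 10, 10, 25, 25, 50, 50, 100, 100]   # hypothetical data
absb = [0.105, 0.098, 0.201, 0.194, 0.495, 0.508,
        0.983, 1.009, 1.964, 2.013]

for x, y in zip(conc, absb):
    residual = y - (a + b * x)   # observed minus fitted
    print(f"X = {x:5.1f}   residual = {residual:+.4f}")
```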
Table 7.7 Residuals from Least Squares Fit of Analytical Data (Table 7.6)
Figure 7.11 Residual plot for unweighted analysis of data of Table 7.6.
Figure 7.12 Residual plot for analysis of ln-transformed data of Table 7.6.
Y = A + BX

is linear in the parameters A and B. The equation

Y = A + Be^(−X)

is also linear in the parameters. One should also appreciate that a linear equation can exist in
more than two dimensions. The equation

Y = A + BX + CX²,

although quadratic in X, is linear in the parameters A, B, and C. An example of an equation
that is not linear in the parameters is

Y = A + e^(BX).

The equation

Y = Ae^(−BX)
is not linear in the parameters, A and B. However, a log transformation results in a linear
equation
ln Y = ln A − BX.
Using the least squares approach, we can estimate ln A (A is the antilog) and B, where ln
A is the intercept and B is the slope of the straight line when ln Y is plotted versus X. If statistical
tests and other statistical estimates are to be made from the regression analysis, the assumptions
of normality of Y (now ln Y) and variance homogeneity of Y at each X are necessary. If Y is
normal and the variances of Y at each X are homogeneous to start with, the ln transformation
will invalidate the assumptions. (On the other hand, if Y is lognormal with constant CV, the log
transformation will be just what is needed to validate the assumptions.)
Some relationships cannot be linearized. For example, in pharmacokinetics, the one-
compartment model with first order absorption and excretion has the following form
where D, ke, and ka are constants (parameters). This equation cannot be linearized. The use of
nonlinear regression methods can be used to estimate the parameters in these situations as well
as the situations in which Y is normal with homogeneous variance prior to a transformation, as
noted above.
The solutions to nonlinear regression problems require more advanced mathematics rel-
ative to most of the material in this book. A knowledge of elementary calculus is necessary,
particularly the application of Taylor’s theorem. Also, a knowledge of matrix algebra is useful
in order to solve these kinds of problems. A simple example will be presented to demon-
strate the principles. The general matrix solutions to linear and multiple regression will also
be demonstrated.
In a stability study, the data in Table 7.8 were available for analysis. The equation repre-
senting the degradation process is
$$C = C_0e^{-kt}. \qquad (7.24)$$
The concentration values are known to be normal with the variance constant at each value
of time. Therefore, the usual least squares analysis will not be used to estimate the parameters
C0 and k after the simple linearizing transformation:
ln C = ln C0 − kt.
The estimate of the parameters using nonlinear regression as demonstrated here uses
the first terms of Taylor’s expansion, which approximates the function and results in a linear
equation. It is important to obtain good initial estimates of the parameters, which may be
obtained graphically. In the present example, a plot of ln C versus time (Fig. 7.13) results in
initial estimates of 104 for C0 and +0.53 for k. The process then estimates a change in C0 and
a change in k that will improve the equation based on the comparison of the fitted data to
the original data. Typical of least squares procedures, the fit is measured by the sum of the
squares of the deviations of the observed values from the fitted values. The best fit results from
an iterative procedure. The new estimates result in a better fit to the data. The procedure is
repeated using the new estimates, which results in a better fit than that observed in the previous
iteration. When the fit, as measured by the sum of the squares of deviations, is negligibly
improved, the procedure is stopped. Computer programs are available to carry out these tedious
calculations.
The Taylor expansion requires taking partial derivatives of the function with respect to C0
and k. For the equation $C = C_0e^{-kt}$, the resulting expression is

$$dC = dC_0(e^{-k't}) - dk(C_0')(te^{-k't}). \qquad (7.25)$$

In Eq. (7.25), dC is the change in C resulting from small changes in C0 and k evaluated
at the point (C0′, k′); dC0 is the change in the estimate of C0, and dk is the change in the
estimate of k. (e^(−k′t)) and −(C0′)(te^(−k′t)) are the partial derivatives of Eq. (7.24) with respect to C0 and
k, respectively.
Equation (7.25) is linear in dC0 and dk. The coefficients of dC0 and dk are (e^(−k′t)) and
−(C0′)(te^(−k′t)), respectively. In the computations below, the coefficients are referred to as X1
and X2, respectively, for convenience. Because of the linearity, we can obtain the least squares
estimates of dC0 and dk by the usual regression procedures.
The computations for two iterations are shown below. The solution to the least squares
equation is usually accomplished using matrix manipulations. The solution for the coefficients
can be proven to have the following form:

$$B = (X'X)^{-1}(X'Y).$$
Table 7.9 First Iteration (C0′ = 104, k′ = 0.53)

Time (t)   C    Ĉ (fitted)   dC      X1       X2
1          63   61.2          1.79    0.5886   −61.2149
2          34   36.0         −2.03    0.3465   −72.0628
3          22   21.2          0.79    0.2039   −63.6248

ΣdC² = 7.94
The matrix B will contain the estimates of the coefficients. With two coefficients, this will
be a 2 × 1 (2 rows and 1 column) matrix.
In Table 7.9, the values of X1 and X2 are (e^(−k′t)) and −(C0′)(te^(−k′t)), respectively, using the
initial estimates of C0 = 104 and k = +0.53 (Fig. 7.13). Note that the fit is measured by
ΣdC² = 7.94.
The solution of (X′X)⁻¹(X′Y) gives the estimates of the parameters, dC0 and dk:

$$(X'X)^{-1}(X'Y) = \begin{bmatrix} 11.5236 & 0.06563 \\ 0.06563 & 0.00045079 \end{bmatrix} \begin{bmatrix} 0.5296 \\ -16.9611 \end{bmatrix} = \begin{bmatrix} 4.99 \\ 0.027 \end{bmatrix}$$
With these estimates, new values of C are calculated in Table 7.10. Note that ΣdC² is 5.85,
which is reduced from 7.94 in the initial iteration. The solution of (X′X)⁻¹(X′Y) is now

$$\begin{bmatrix} 12.587 & 0.06964 \\ 0.06964 & 0.0004635 \end{bmatrix} \begin{bmatrix} 0.0351 \\ -0.909 \end{bmatrix} = \begin{bmatrix} 0.378 \\ 0.002 \end{bmatrix}$$

The reader can verify that the new value of ΣdC² is now 5.74. The process is repeated until
ΣdC² becomes stable. The final solution is C0 = 109.22, k = 0.558.
Table 7.10 Second Iteration (C0′ = 108.99, k′ = 0.557)

Time (t)   C    Ĉ (fitted)   dC     X1        X2
1          63   62.4          0.6    0.5729    −62.4431
2          34   35.8         −1.8    0.3282    −71.5505
3          22   20.5          1.5    0.18806   −61.4896

ΣdC² = 5.85

Another way of expressing the decomposition is

$$C = e^{\ln C_0 - kt}$$
or
ln C = ln C0 − kt.
The ambitious reader may wish to try a few iterations using this approach. Note that the
partial derivatives of C with respect to C0 and k are (1/C0)(e^(ln C0 − kt)) and −t(e^(ln C0 − kt)), respectively.
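The whole iteration is easy to automate. The Python sketch below applies Gauss–Newton steps to the Table 7.8 data (t = 1, 2, 3; C = 63, 34, 22), solving the 2 × 2 normal equations directly; starting from the graphical estimates used in the text, it reproduces the first-step corrections (4.99, 0.027) and converges near the quoted solution.

```python
import math

# Gauss-Newton fit of C = C0*exp(-k*t), starting from the graphical
# estimates C0 = 104, k = 0.53 used in the text.
t = [1.0, 2.0, 3.0]
c = [63.0, 34.0, 22.0]
c0, k = 104.0, 0.53

for _ in range(25):
    x1 = [math.exp(-k * ti) for ti in t]              # dC/dC0
    x2 = [-c0 * ti * math.exp(-k * ti) for ti in t]   # dC/dk (sign included)
    dc = [ci - c0 * e for ci, e in zip(c, x1)]        # residuals
    # Normal equations (X'X) d = X'dc, solved by Cramer's rule.
    s11 = sum(v * v for v in x1)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s22 = sum(v * v for v in x2)
    g1 = sum(u * v for u, v in zip(x1, dc))
    g2 = sum(u * v for u, v in zip(x2, dc))
    det = s11 * s22 - s12 * s12
    d_c0 = (s22 * g1 - s12 * g2) / det
    d_k = (s11 * g2 - s12 * g1) / det
    c0, k = c0 + d_c0, k + d_k
    if abs(d_c0) < 1e-9 and abs(d_k) < 1e-9:
        break

print(round(c0, 2), round(k, 3))   # close to C0 = 109.2, k = 0.558
```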
7.10 CORRELATION
Correlation methods are used to measure the “association” of two or more variables. Here,
we will be concerned with two observations for each sampling unit. We are interested in
determining if the two values are related, in the sense that one variable may be predicted from a
knowledge of the other. The better the prediction, the better the correlation. For example, if we
could predict the dissolution of a tablet based on tablet hardness, we say that dissolution and
hardness are correlated. Correlation analysis assumes a linear or straight-line relationship between
the two variables.
Correlation is usually applied to the relationship of continuous variables, and is best
visualized as a scatter plot or correlation diagram. Figure 7.14(A) shows a scatter plot for two
variables, tablet weight and tablet potency. Tablets were individually weighed and then assayed.
Each point in Figure 7.14(A) represents a single tablet (X = weight, Y = potency). Inspection
of this diagram suggests that weight and potency are positively correlated, as is indicated by
the positive slope, or trend. Low-weight tablets are associated with low potencies, and vice
versa. This positive relationship would probably be expected on intuitive grounds. If the tablet
granulation is homogeneous, a larger weight of material in a tablet would contain larger amounts
of drug. Figure 7.14(B) shows the correlation of tablet weights and dissolution rate. Smaller tablet
weights are related to higher dissolution rates, a negative correlation (negative trend).
Inspection of Figure 7.14(A) and (B) reveals what appears to be an obvious relationship.
Given a tablet weight, we can make a good “ballpark” estimate of the dissolution rate and
Figure 7.14 Examples of various correlation diagrams or scatter plots. The correlation coefficient, r , is defined
in section 7.10.1.
potency. However, the relationship between variables is not always as apparent as in these
examples. The relationship may be partially obscured by variability, or the variables may not be
related at all. The relationship between a patient’s blood pressure reduction after treatment with
an antihypertensive agent and serum potassium levels is not as obvious [Fig. 7.14(C)]. There
seems to be a trend toward higher blood pressure reductions associated with higher potassium
levels—or is this just an illusion? The data plotted in Figure 7.14(D), illustrating the correlation
of blood pressure reduction and age, show little or no correlation.
The various scatter diagrams illustrated in Figure 7.14 should give the reader an intuitive
feeling for the concept of correlation. There are many experimental situations where a researcher
would be interested in relationships among two or more variables. Similar to applications of
regression analysis, correlation relationships may allow for prediction and interpretation of
experimental mechanisms. Unfortunately, the concept of correlation is often misused, and more
is made of it than is deserved. For example, the presence of a strong correlation between
two variables does not necessarily imply a causal relationship. Consider data that show a
positive relationship between cancer rate and consumption of fluoridated water. Regardless of
the possible validity of such a relationship, such an observed correlation does not necessarily
imply a causal effect. One would have to investigate further other factors in the environment
occurring concurrently with the implementation of fluoridation, which may be responsible
for the cancer rate increase. Have other industries appeared and grown during this period,
exposing the population to potential carcinogens? Have the population characteristics (e.g.,
racial, age, sex, economic factors) changed during this period? Such questions may be resolved
by examining the cancer rates in control areas where fluoridation was not enforced.
The correlation coefficient is a measure of the “degree” of correlation, which is often
erroneously interpreted as a measure of “linearity.” That is, a strong correlation is sometimes
interpreted as meaning that the relationship between X and Y is a straight line. As we shall see
further in this discussion, this interpretation of correlation is not necessarily correct.
7.10.1 THE CORRELATION COEFFICIENT
The correlation coefficient is defined as

$$r = \frac{N\sum Xy - \sum X\sum y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum y^2 - (\sum y)^2]}}. \qquad (7.27)$$
To appreciate the interpretation of r, consider three y values, 0, 1, and 5, whose sum of squares about the mean is

$$\sum(y - \bar{y})^2 = 0^2 + 1^2 + 5^2 - \frac{(0 + 1 + 5)^2}{3} = 14.$$
If these same y values were associated with X values, the sum of squares of y from the
regression of y on X will be equal to or less than Σ(y − ȳ)², or 14 in this example. Suppose that
X and y values are as follows (Fig. 7.15):
       X    y    Xy
       0    0    0
       1    1    1
       2    5    10
Sum    3    6    11

Σ(X − X̄)² = 2,   Σ(y − ȳ)² = 14
According to Eq. (7.9), the sum of squares due to deviations of the y values from the
regression line is

$$\sum(y - \bar{y})^2 - b^2\sum(X - \bar{X})^2, \qquad (7.28)$$

where b is the slope of the regression line (y on X). The term b²Σ(X − X̄)² is the reduction in the
sum of squares due to the straight-line regression fit. Applying Eq. (7.28) (the slope here is b = 2.5), the sum of squares is
14 − (2.5)²(2) = 1.5, so that

$$r^2 = \frac{14 - 1.5}{14} = 0.893, \qquad r = \sqrt{0.893} = 0.945.$$

Applying Eq. (7.27) directly gives the same result:

$$r = \frac{3(11) - (3)(6)}{\sqrt{[3(5) - (3)^2][3(26) - (36)]}} = \frac{15}{\sqrt{6(42)}} = 0.945.$$
Thus, according to this notion, r can be interpreted as reflecting the relative degree of scatter about
the regression line. If X and y values lie exactly on a straight line (a perfect fit), $S^2_{Y\cdot x}$ is 0, and r is
equal to ±1: +1 for a line of positive slope and −1 for a line of negative slope. For a correlation
coefficient equal to 0.5, r² = 0.25; the sum of squares for y is reduced 25%. A correlation
coefficient of 0 means that the X, y pairs are not correlated [Fig. 7.14(D)].
Although there are no assumptions necessary to calculate the correlation coefficient, sta-
tistical analysis of r is based on the notion of a bivariate normal distribution of X and y. We will
not delve into the details of this complex probability distribution here. However, there are two
interesting aspects of this distribution that deserve some attention with regard to correlation
analysis.
1. In typical correlation problems, both X and y are variable. This is in contrast to the linear
regression case, where X is considered fixed, chosen, a priori, by the investigator.
2. In a bivariate normal distribution, X and y are linearly related. The regression of both X on y
and y on X is a straight line.¶ Thus, when statistically testing correlation coefficients, we are
not testing for linearity. As described below, the statistical test of a correlation coefficient is
a test of correlation or independence. According to Snedecor and Cochran, the correlation
coefficient “estimates the degree of closeness of a linear relationship between two variables,
Y and X, and the meaning of this concept is not easy to grasp” [11].
H₀: ρ = 0   Hₐ: ρ ≠ 0

The test statistic is

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}. \qquad (7.29)$$

The value of t is referred to a t distribution with (N − 2) d.f., where N is the sample size
(i.e., the number of pairs). Interestingly, this test is identical to the test of the slope of the least
squares fit, Y = a + bX [Eq. (7.13)]. In this context, one can think of the test of the correlation
coefficient as a test of the significance of the slope versus 0.
To illustrate the application of Eq. (7.29), Table 7.11 shows data of diastolic blood pressure
and cholesterol levels of 10 randomly selected men. The data are plotted in Figure 7.16. r is
calculated from Eq. (7.27)
$$r = \frac{N\sum Xy - \sum X\sum y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum y^2 - (\sum y)^2]}} \qquad (7.30)$$

$$r = \frac{10(260{,}653) - (3111)(825)}{\sqrt{[10(987{,}893) - 3111^2][10(69{,}279) - 825^2]}} = 0.809.$$
According to Eq. (7.29), t = 0.809√8/√(1 − 0.809²) = 3.89 with 8 d.f. A value of t equal to 2.31 is needed for significance at the 5% level (see Table IV.4).
Therefore, the correlation between diastolic blood pressure and cholesterol is significant. The
correlation is apparent from inspection of Figure 7.16.
¶ The regression of y on X means that X is assumed to be the fixed variable when calculating the line. This line
is different from that calculated when Y is considered the fixed variable (unless the correlation coefficient is 1,
when both lines are identical). The slope of the line is r Sy /Sx for the regression of y on X and r Sx /Sy for x on Y.
Table 7.11 Diastolic Blood Pressure and Cholesterol Levels of 10 Randomly Selected Men

Person   Diastolic blood pressure (DBP), y   Cholesterol (C), X   Xy
1         80                                  307                  24,560
2         75                                  259                  19,425
3         90                                  341                  30,690
4         74                                  317                  23,458
5         75                                  274                  20,550
6        110                                  416                  45,760
7         70                                  267                  18,690
8         85                                  320                  27,200
9         88                                  274                  24,112
10        78                                  336                  26,208

Σy = 825   ΣX = 3111   ΣXy = 260,653
Σy² = 69,279   ΣX² = 987,893
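A minimal sketch of the r and t computations for the Table 7.11 data:

```python
import math

# Correlation coefficient [Eq. (7.27)] and its t test [Eq. (7.29)] for the
# blood pressure-cholesterol data of Table 7.11.
y = [80, 75, 90, 74, 75, 110, 70, 85, 88, 78]           # DBP
x = [307, 259, 341, 317, 274, 416, 267, 320, 274, 336]  # cholesterol

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)

r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # 8 d.f.
print(round(r, 3), round(t, 2))   # 0.809, 3.89 (2.31 needed at the 5% level)
```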
Significance tests for the correlation coefficient versus values other than 0 are not very
common. However, for these tests, the t test described above [Eq. (7.29)] should not be used. An
approximate test is available to test for correlation coefficients other than 0 (e.g., H₀: ρ = 0.5).
Since applications of this test occur infrequently in pharmaceutical experiments, the procedure
will not be presented here. The statistical test is an approximation to the normal distribution, and
the approximation can also be used to place confidence intervals on the correlation coefficient.
A description of these applications is presented in Ref. [11].
Table 7.12 Two Sets of Data with Nonlinear X, y Relationships

Set A          Set B
X     y        X     y
−2    0        0     0
−1    3        2     4
0     4        4     16
+1    3        6     36
+2    0
A proper test for linearity (i.e., do the data represent a straight-line relationship between X
and Y?) is described in Appendix II and requires replicate measurements in the regression
model. Usually, correlation problems deal with cases where both variables, X and y, are variable
in contrast to the regression model where X is considered fixed. In correlation problems, the
question of linearity is usually not of primary interest. We are more interested in the degree of
association of the variables. Two examples will show that a high correlation coefficient does not
necessarily imply “linearity” and that a small correlation coefficient does not necessarily imply
lack of correlation (if the relationship is nonlinear).
Table 7.12 shows two sets of data that are plotted in Figure 7.17. Both data sets A and B
show perfect (but nonlinear) relationships between X and y. Set A is defined by Y = 4 − X². Set B
is defined by Y = X². Yet the correlation coefficient for set A is 0, an implication of no correlation,
and set B has a correlation coefficient of 0.96, very strong correlation (but not linearity!). These
examples should emphasize the care needed in the interpretation of the correlation coefficient,
particularly in nonlinear systems.
Another example of data for which the correlation coefficient can be misleading is shown in
Table 7.13 and Figure 7.18. In this example, drug stability is plotted versus pH. Five experiments
were performed at low pH and one at high pH. The correlation coefficient is 0.994, a highly
significant result (p < 0.01). Can this be interpreted that the data in Figure 7.18 are a good
fit to a straight line? Without some other source of information, it would take a great deal of
imagination to assume that the relationship between pH and t1/2 is linear over the range of pH
equal to 2.0 to 5.5. Even if the relationship were linear, had data been available for points in
between pH 2.0 and 5.5, the fit may not be as good as that implied by the large value of r in
this example. This situation can occur when one value is far from the cluster of the main body
of data. One should be cautious in “over-interpreting” the correlation coefficient in these cases.
When relationships between variables are to be quantified for predictive or theoretical reasons,
regression procedures, if applicable, are recommended. Correlation, per se, is not as versatile or
informative as regression analysis for describing the relationship between variables.
Figure 7.17 Plot of data in Table 7.12 showing problems with interpretation of the correlation coefficient.
Table 7.13 Drug Stability (t1/2) as a Function of pH

pH     t1/2
2.0    48
2.1    50
1.9    50
2.0    46
2.1    47
5.5    12
Paired-sample tests arise, for example, in situations where the same subject tests two treatments,
such as in clinical or bioavailability studies. To test for the equality of variances in related
samples, we must first calculate the correlation coefficient and the F ratio of the variances. The
test statistic is calculated as follows:
$$r_{ds} = \frac{F - 1}{\sqrt{(F + 1)^2 - 4r^2F}}, \qquad (7.31)$$
where F is the ratio of the variances in the two samples and r is the correlation coefficient.
The ratio in Eq. (7.31), $r_{ds}$, can be tested for significance in the same manner as the test for
the ordinary correlation coefficient, with (N − 2) d.f., where N is the number of pairs [Eq. (7.29)].
As is the case for tests of the correlation coefficient, we assume a bivariate normal distribution
for the related data. The following example demonstrates the calculations.
In a bioavailability study, 10 subjects were given each of two formulations of a drug
substance on two occasions, with the results for AUC (area under the blood level versus time
curve) given in Table 7.14.
The correlation coefficient is calculated according to Eq. (7.27):

$$r = \frac{(64{,}421)(10) - (781)(815)}{\sqrt{[(62{,}821)(10) - (781)^2][(67{,}087)(10) - (815)^2]}} = 0.699.$$

The F ratio of the variances is

$$F = \frac{202.8}{73.8} = 2.75.$$
[Note: The ratio of the variances may also be calculated as 73.8/202.8 = 0.36, with the
same conclusions based on Eq. (7.31).]
Table 7.14 AUC Values for Two Formulations

Subject   A       B
1          86      88
2          64      73
3          69      86
4          94      89
5          77      80
6          85      71
7          60      70
8         105      96
9          68      84
10         73      78
Mean       78.1    81.5
S²        202.8    73.8
$$r_{ds} = \frac{2.75 - 1}{\sqrt{(2.75 + 1)^2 - 4(0.699)^2(2.75)}} = 0.593,$$

which is tested as an ordinary correlation coefficient [Eq. (7.29)]: t = 0.593√8/√(1 − 0.593²) = 2.08.
Referring to the t table (Table IV.4, 8 d.f.), a value of 2.31 is needed for significance at the
5% level. Therefore, we cannot reject the null hypothesis of equal variances in this example.
Formulation A appears to be more variable, but more data would be needed to substantiate
such a claim.
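The full computation, sketched in Python with the Table 7.14 values (subject 1's formulation A value is taken as 86, the value consistent with the printed sums and the mean of 78.1):

```python
import math

# Test of equal variances in related samples [Eq. (7.31)], AUC data of
# Table 7.14 (formulations A and B, 10 subjects).
auc_a = [86, 64, 69, 94, 77, 85, 60, 105, 68, 73]
auc_b = [88, 73, 86, 89, 80, 71, 70, 96, 84, 78]
n = len(auc_a)

def variance(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

sxy = sum(a * b for a, b in zip(auc_a, auc_b))
r = (n * sxy - sum(auc_a) * sum(auc_b)) / math.sqrt(
    (n * sum(a * a for a in auc_a) - sum(auc_a) ** 2)
    * (n * sum(b * b for b in auc_b) - sum(auc_b) ** 2))    # 0.699

f = variance(auc_a) / variance(auc_b)                       # 2.75
r_ds = (f - 1) / math.sqrt((f + 1) ** 2 - 4 * r ** 2 * f)   # Eq. (7.31)
t = r_ds * math.sqrt(n - 2) / math.sqrt(1 - r_ds ** 2)      # tested with 8 d.f.
print(round(r_ds, 3), round(t, 2))   # 0.593, 2.08 < 2.31: not significant
```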
A discussion of correlation of multiple outcomes and adjustment of the significance level
is given in section 8.2.2.
KEY TERMS
Best-fitting line
Bivariate normal distribution
Confidence band for line
Confidence interval for X and Y
Correlation
Correlation coefficient
Correlation diagram
Dependent variable
Fixed value (X)
Independence
Independent variable
Intercept
Inverse prediction
Lack of fit
Linear regression
Line through the origin
Nonlinear regression
Nonlinearity
One-sided confidence interval
Prediction interval
Reduction of sum of squares
Regression
Regression analysis
Residuals
Scatter plot
Simple linear regression
Slope
S²(Y·x)
Trend
Variance of correlated samples
Weighted regression
EXERCISES
1. A drug seems to decompose in a manner such that appearance of degradation products
is linear with time (i.e., Cd = kt).
t Cd
1 3
2 9
3 12
4 17
5 19
(a) Calculate the slope (k) and intercept from the least squares line.
(b) Test the significance of the slope (test vs. 0) at the 5% level.
(c) Test the slope versus 5 (H0 : B = 5) at the 5% level.
(d) Put 95% confidence limits on Cd at t = 3 and t = 5.
(e) Predict the value of Cd at t = 20. Place a 95% prediction interval on Cd at t = 20.
(f) If it is known that Cd = 0 at t = 0, calculate the slope.
2. A Beer’s law plot is constructed by plotting ultraviolet absorbance versus concentration,
with the following results:
Concentration, X Absorbance, y Xy
1 0.10 0.10
2 0.36 0.72
3 0.57 1.71
5 1.09 5.45
10 2.05 20.50
(a) Plot potency versus weight (weight = X). Calculate the least squares line.
(b) Predict the potency for a 200-mg tablet.
(c) Put 95% confidence limits on the potency for a 200-mg tablet.
Calculate r and test for significance (versus 0) (5% level). Plot the data.
6. Shah et al. [14] measured the percent of product dissolved in vitro and the time to
peak (in vivo) of nine phenytoin sodium products, with approximately the following
results:
Plot the data. Calculate the correlation coefficient and test to see if it is significantly
different from 0 (5% level). (Why is the correlation coefficient negative?)
7. In a study to compare the effects of two pain-relieving drugs (A and B), 10 patients took
each drug in a paired design with the following results (drug effectiveness based on a
rating scale).
x y
1 1.62
2 2.93
3 4.21
4 7.86
REFERENCES
1. Draper NR, Smith H. Applied Regression Analysis, 2nd ed. New York: Wiley, 1981.
2. Youden WJ. Statistical Methods for Chemists. New York: Wiley, 1964.
3. U.S. Food and Drug Administration. Current Good Manufacturing Practices (CGMP) 21 CFR.
Washington, DC: Commissioner of the Food and Drug Administration, 2006:210–229.
4. Davies OL, Hudson HE. Stability of drugs: accelerated storage tests. In: Buncher CR, Tsay J-Y, eds.
Statistics in the Pharmaceutical Industry. New York: Marcel Dekker, 1994:445–479.
5. Tootill JPR. A critical appraisal of drug stability testing methods. J Pharm Pharmacol 1961; 13(suppl):
75T–86T.
6. Davis J. The Dating Game. Washington, DC: Food and Drug Administration, 1978.
7. Norwood TE. Statistical analysis of pharmaceutical stability data. Drug Dev Ind Pharm 1986; 12:553–
560.
8. International Conference on Harmonization. Bracketing and matrixing designs for stability testing of
drug substances and drug products (FDA Draft Guidance), Step 2, Nov 9, 2000.
9. Nordbrock ET. Stability matrix designs. In: Chow S-C, ed. Encyclopedia of Pharmaceutical Statistics.
New York: Marcel Dekker, 2000:487–492.
10. Murphy JR. Bracketing Design. In: Chow S-C, ed. Encyclopedia of Pharmaceutical Statistics.
New York: Marcel Dekker, 2000:77.
11. Snedecor GW, Cochran WG. Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.
12. Weisberg S. Applied Linear Regression. New York: Wiley, 1980.
13. duToit SHC, Steyn AGW, Stumpf RH. Graphical Exploratory Data Analysis. New York: Springer, 1986.
14. Shah VP, Prasad VK, Alston T, et al. In vitro in vivo correlation for 100 mg phenytoin sodium capsules.
J Pharm Sci 1983; 72:306.