
6 Sample Size and Power

The question of the size of the sample, the number of observations to be used in a scientific
experiment, is of extreme importance. Most experiments raise the question of sample size.
Particularly when time and cost are critical factors, one wishes to use the minimum sample size
to achieve the experimental objectives. Even when time and cost are less crucial, the scientist
wishes to have some idea of the number of observations needed to yield sufficient data to answer
the objectives. An elegant experiment will make the most of the resources available, resulting
in a sufficient amount of information from a minimum sample size. For simple comparative
experiments, where one or two groups are involved, the calculation of sample size is relatively
simple. A knowledge of the α level (level of significance), β level (1 − power), the standard
deviation, and a meaningful "practically significant" difference is necessary in order to calculate
the sample size.
Power is defined as 1 − β (i.e., β = 1 − power). Power is the ability of a statistical test to
show significance if a specified difference truly exists. The magnitude of power depends on the
level of significance, the standard deviation, and the sample size. Thus power and sample size
are related.
In this chapter, we present methods for computing the sample size for relatively simple
situations for normally distributed and binomial data. The concept and calculation of power
are also introduced.

6.1 INTRODUCTION
The question of sample size is a major consideration in the planning of experiments, but may not
be answered easily from a scientific point of view. In some situations, the choice of sample size
is limited. Sample size may be dictated by official specifications, regulations, cost constraints,
and/or the availability of sampling units such as patients, manufactured items, animals, and
so on. The USP content uniformity test is an example of a test in which the sample size is fixed
and specified [1].
The sample size is also specified in certain quality control sampling plans such as those
described in MIL-STD-105E [2]. These sampling plans are used when sampling products for
inspection for attributes such as product defects, missing labels, specks in tablets, or ampul
leakage. The properties of these plans have been thoroughly investigated and defined as described
in the document cited above. The properties of the plans include the chances (probability) of
rejecting or accepting batches with a known proportion of rejects in the batch (sect. 12.3).
Sample-size determination in comparative clinical trials is a factor of major importance.
Since very large experiments will detect very small, perhaps clinically insignificant, differences
as being statistically significant, and small experiments will often find large, clinically significant
differences as statistically insignificant, the choice of an appropriate sample size is critical in the
design of a clinical program to demonstrate safety and efficacy. When cost is a major factor in
implementing a clinical program, the number of patients to be included in the studies may be
limited by lack of funds. With fewer patients, a study will be less sensitive. Decreased sensitivity
means that the comparative treatments will be relatively more difficult to distinguish statistically
if they are, in fact, different.
The problem of choosing a “correct” sample size is related to experimental objectives and
the risk (or probability) of coming to an incorrect decision when the experiment and analysis
are completed. For simple comparative experiments, certain prior information is required in

order to compute a sample size that will satisfy the experimental objectives. The following
considerations are essential when estimating sample size.
1. The α level must be specified; it determines, in part, the difference needed to represent a
statistically significant result. To review, the α level is defined as the risk of concluding that
treatments differ when, in fact, they are the same. The level of significance is usually (but
not always) set at the traditional value of 5%.
2. The β error must be specified for some stated treatment difference, Δ. Beta, β, is the risk
(probability) of erroneously concluding that the treatments are not significantly different
when, in fact, a difference of size Δ or greater exists. The assessment of β and Δ, the
"practically significant" difference, prior to the initiation of the experiment is not easy.
Nevertheless, an educated guess is required. β is often chosen to be between 5% and 20%.
Hence, one may be willing to accept a 20% (1 in 5) chance of not arriving at a statistically
significant difference when the treatments are truly different by an amount equal to (or
greater than) Δ. The consequences of committing a β error should be considered carefully.
If a true difference of practical significance is missed and the consequence is costly, β should
be made very small, perhaps as small as 1%. Costly consequences of missing an effective
treatment should be evaluated not only in monetary terms, but should also include public
health issues, such as the possible loss of an effective treatment for a serious disease.
3. The difference to be detected, Δ (the difference considered to have practical significance),
should be specified as described in (2) above. This difference should not be arbitrarily or
capriciously determined, but should be considered carefully with respect to meaningfulness
from both a scientific and a commercial marketing standpoint. For example, when comparing
two formulas for time to 90% dissolution, a difference of one or two minutes might be
considered meaningless. A difference of 10 or 20 minutes, however, may have practical
consequences in terms of in vivo absorption characteristics.
4. A knowledge of the standard deviation (or an estimate) for the significance test is necessary.
If no information on variability is available, an educated guess, or results of studies reported
in the literature using related compounds, may be sufficient to give an estimate of the
relevant variability. The assistance of a statistician is recommended when estimating the
standard deviation for purposes of determining sample size.
To compute the sample size in a comparative experiment, (a) α, (b) β, (c) Δ, and (d) σ
must be specified. The computations to determine sample size are described below (Fig. 6.1).

6.2 DETERMINATION OF SAMPLE SIZE FOR SIMPLE COMPARATIVE EXPERIMENTS FOR NORMALLY DISTRIBUTED VARIABLES
The calculation of sample size will be described with the aid of Figure 6.1. This explanation is
based on normal distribution or t tests. The derivation of sample-size determination may appear
complex. The reader not requiring a “proof” can proceed directly to the appropriate formulas
below.

Figure 6.1 Scheme to demonstrate calculation of sample size based on α, β, Δ, and σ: α = 0.05, β = 0.10,
Δ = 5, σ = 7; H0: Δ = 0, Ha: Δ = 5.

6.2.1 Paired-Sample and Single-Sample Tests


We will first consider the case of a paired-sample test where the null hypothesis is that the
two treatment means are equal: H0: Δ = 0. In the case of an experiment comparing a new
antihypertensive drug candidate and a placebo, an average difference of 5 mm Hg in blood
pressure reduction might be considered of sufficient magnitude to be interpreted as a difference
of "practical significance" (Δ = 5). The standard deviation for the comparison was known, equal
to 7, based on a large amount of experience with this drug.
In Figure 6.1, the normal curve labeled A represents the distribution of differences with
mean equal to 0 and σ equal to 7. This is the distribution under the null hypothesis (i.e., drug
and placebo are identical). Curve B is the distribution of differences when the alternative, Ha:
Δ = 5,* is true (i.e., the difference between drug and placebo is equal to 5). Note that curve B is
identical to curve A except that B is displaced 5 mm Hg to the right. Both curves have the same
standard deviation, 7.
With the standard deviation, 7, known, the statistical test is performed at the 5% level as
follows [Eq. (5.4)]:

Z = (δ − Δ)/(σ/√N) = (δ − 0)/(7/√N).   (6.1)

For a two-tailed test, if the absolute value of Z is 1.96 or greater, the difference is significant.
According to Eq. (6.1), to obtain significance,

δ ≥ σZ/√N = 7(1.96)/√N = 13.7/√N.   (6.2)
Therefore, values of δ equal to or greater than 13.7/√N (or equal to or less than −13.7/√N)
will lead to a declaration of significance. These points are designated as δL and δU in Figure 6.1,
and represent the cutoff points for statistical significance at the 5% level; that is, observed
differences equal to or more remote from the mean than these values result in "statistically
significant differences."
If curve B is the true distribution (i.e., Δ = 5), an observed mean difference greater than
13.7/√N (or less than −13.7/√N) will result in the correct decision; H0 will be rejected and we
conclude that a difference exists. If Δ = 5, observations of a mean difference between 13.7/√N
and −13.7/√N will lead to an incorrect decision, the acceptance of H0 (no difference) (Fig. 6.1).
By definition, the probability of making this incorrect decision is equal to β.
In the present example, β will be set at 10%. In Figure 6.1, β is represented by the area in
curve B below 13.7/√N (δU), equal to 0.10. (This area, β, represents the probability of accepting
H0 if Δ = 5.)
We will now compute the value of δ that cuts off 10% of the area in the lower tail of the normal
curve with a mean of 5 and a standard deviation of 7 (curve B in Figure 6.1). Table IV.2 shows
that 10% of the area in the standard normal curve is below −1.28. The value of δ (mean difference
in blood pressure between the two groups) that corresponds to a given value of Z (−1.28, in this
example) is obtained from the formula for the Z transformation [Eq. (3.14)] as follows:


δ = Δ + Zβ (σ/√N)

Zβ = (δ − Δ)/(σ/√N).   (6.3)

Applying Eq. (6.3) to our present example, δ = 5 − 1.28(7/√N). The value of δ in Eqs. (6.2)
and (6.3) is identical, equal to δU. This is illustrated in Figure 6.1.

* Δ is considered to be the true mean difference, similar to μ. δ will be used to denote the observed mean difference.

Table 6.1 Sample Size as a Function of Beta with Δ = 5 and σ = 7: Paired Test (α = 0.05)

Beta (%)    Sample size, N
1           36
5           26
10          21
20          16


From Eq. (6.2), δU = 13.7/√N, satisfying the definition of α. From Eq. (6.3), δU = 5 −
1.28(7)/√N, satisfying the definition of β. We have two equations in two unknowns (δU and N),
and N is evaluated as follows:

13.7/√N = 5 − 1.28(7)/√N

N = (13.7 + 8.96)²/5² = 20.5 ≅ 21.

In general, Eqs. (6.2) and (6.3) can be solved for N to yield the following equation:

N = (σ/Δ)² (Zα + Zβ)²,   (6.4)

where Zα and Zβ† are the appropriate normal deviates obtained from Table IV.2. In our example,
N = (7/5)² (1.96 + 1.28)² ≅ 21. A sample size of 21 will result in a statistical test with 90% power
(β = 10%) against an alternative of 5, at the 5% level of significance. Table 6.1 shows how the
choice of β can affect the sample size for a test at the 5% level with Δ = 5 and σ = 7.
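The arithmetic of Eq. (6.4) and Table 6.1 is easy to check by machine. The following Python sketch is illustrative (the function name is invented here, and exact normal quantiles from the standard library are used in place of the rounded table values):

```python
from math import ceil
from statistics import NormalDist

def paired_sample_size(alpha, beta, delta, sigma, two_sided=True):
    """Sample size for a paired- or one-sample Z test, Eq. (6.4):
    N = (sigma/delta)^2 * (Z_alpha + Z_beta)^2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - beta)        # beta is based on one tail only
    n = (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2
    return ceil(n)

# Blood-pressure example: alpha = 0.05 (two sided), beta = 0.10, delta = 5, sigma = 7
print(paired_sample_size(0.05, 0.10, 5, 7))   # 21
```

With β = 5% or 20% the same sketch returns 26 and 16, matching the corresponding entries of Table 6.1.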
The formula for computing the sample size if the standard deviation is known [Eq. (6.4)]
is appropriate for a paired-sample test or for the test of a mean from a single population. For
example, consider a test to compare the mean drug content of a sample of tablets to the labeled
amount, 100 mg. The two-sided test is to be performed at the 5% level. Beta is designated as
10% for a difference of −5 mg (95 mg potency or less). That is, we wish to have a power of 90%
to detect a difference from 100 mg if the true potency is 95 mg or less. If σ is equal to 3, how
many tablets should be assayed? Applying Eq. (6.4), we have

N = (3/5)² (1.96 + 1.28)² = 3.8.

Assaying four tablets will satisfy the α and β probabilities. Note that Z = 1.28 cuts off 90%
of the area under curve B (the "alternative" curve) in Figure 6.2, leaving 10% (β) of the area in
the upper tail of the curve. Table 6.2 shows values of Zα and Zβ for various levels of α and β
to be used in Eq. (6.4). In this example, and most examples in practice, β is based on one tail of
the normal curve. The other tail contains an insignificant area relating to β (the right side of the
normal curve, B, in Fig. 6.1).
Equation (6.4) is correct for computing the sample size for a paired- or one-sample test if
the standard deviation is known.
In most situations, the standard deviation is unknown and a prior estimate of the standard
deviation is necessary in order to calculate sample size requirements. In this case, the estimate
of the standard deviation replaces σ in Eq. (6.4), but the calculation results in an answer that is
slightly too small. The underestimation occurs because the values of Zα and Zβ are smaller than

† Zβ is taken as the positive value of Z in this formula.



Table 6.2 Values of Zα and Zβ for Sample-Size Calculations

              Zα
α or β    One sided   Two sided   Zβ^a
1%        2.32        2.58        2.32
5%        1.65        1.96        1.65
10%       1.28        1.65        1.28
20%       0.84        1.28        0.84

a The value of β is for a single specified alternative. For a two-sided test,
the probability of rejecting the alternative, if true (i.e., accepting H0), is virtually
all contained in the tail nearest the alternative mean.

Figure 6.2 Illustration of the calculation of N for tablet assays. X = 95 + σZβ/√N = 100 − σZα/√N.

the corresponding t values that should be used in the formula when the standard deviation is
unknown. The situation is somewhat complicated by the fact that the value of t depends on the
sample size (d.f.), which is yet unknown. The problem can be solved by an iterative method,
but for practical purposes, one can use the appropriate values of Z to compute the sample size
[as in Eq. (6.4)] and add on a few extra samples (patients, tablets, etc.) to compensate for the
use of Z rather than t. Guenther has shown that the simple addition of 0.5Zα², which is equal
to approximately 2 for a two-sided test at the 5% level, results in a very close approximation to
the correct answer [3]. In the problem illustrated above (tablet assays), if the standard deviation
were unknown but estimated as being equal to 3 based on previous experience, a better estimate
of the sample size would be N + 0.5Zα² = 3.8 + 0.5(1.96)² ≅ 6 tablets.
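Guenther's correction can be folded into the same kind of calculation. A minimal sketch (the function name is illustrative, not from the text) reproduces the six-tablet answer:

```python
from math import ceil
from statistics import NormalDist

def sample_size_sd_estimated(alpha, beta, delta, sd):
    """One-sample/paired test with the standard deviation *estimated*:
    Eq. (6.4) plus Guenther's correction of 0.5*Z_alpha^2 [3]."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = z.inv_cdf(1 - beta)
    n = (sd / delta) ** 2 * (z_alpha + z_beta) ** 2
    return ceil(n + 0.5 * z_alpha ** 2)

# Tablet-assay example: s.d. estimated as 3, delta = 5, alpha = 0.05, beta = 0.10
print(sample_size_sd_estimated(0.05, 0.10, 5, 3))   # 6
```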

6.2.2 Determination of Sample Size for Comparison of Means in Two Groups


For a two independent groups test (parallel design), with the standard deviation known and
an equal number of observations per group, the formula for N (where N is the sample size for
each group) is

N = 2 (σ/Δ)² (Zα + Zβ)².   (6.5)

If the standard deviation is unknown and a prior estimate is available (s.d.), substitute the
s.d. for σ in Eq. (6.5) and compute the sample size, but add 0.25Zα² to the sample size for each
group.
Example 1: This example illustrates the determination of the sample size for a two independent
groups (two-sided test) design. Two variations of a tablet formulation are to be compared
with regard to dissolution time. All ingredients except for the lubricating agent were the same
in these two formulations. In this case, a decision was made that if the formulations differed by
10 minutes or more to 80% dissolution, it would be extremely important that the experiment
show a statistically significant difference between the formulations. Therefore, the pharmaceutical
scientist decided to fix the β error at 1% in a statistical test at the traditional 5% level. Data
were available from dissolution tests run during the development of formulations of the drug

and the standard deviation was estimated as 5 minutes. With the information presented above,
the sample size can be determined from Eq. (6.5). We will add 0.25Zα² samples to the answer
because the standard deviation is unknown.

N = 2 (5/10)² (1.96 + 2.32)² + 0.25(1.96)² = 10.1.

The study was performed using 12 tablets from each formulation rather than the 10 or
11 suggested by the answer in the calculation above. Twelve tablets were used because the
dissolution apparatus could accommodate six tablets per run.
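Equation (6.5), with the 0.25Zα² allowance for an estimated standard deviation, can be sketched as follows (the function name is illustrative; exact normal quantiles give 10.15 rather than the 10.1 obtained above with rounded Z values, hence 11 per group before rounding up to 12 for the apparatus):

```python
from math import ceil
from statistics import NormalDist

def two_group_sample_size(alpha, beta, delta, sd):
    """Per-group N for two independent groups, Eq. (6.5), with the
    s.d. estimated (adds 0.25*Z_alpha^2 per group, as in the text)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = z.inv_cdf(1 - beta)
    n = 2 * (sd / delta) ** 2 * (z_alpha + z_beta) ** 2
    return ceil(n + 0.25 * z_alpha ** 2)

# Dissolution example: s.d. ~ 5 min, delta = 10 min, alpha = 0.05, beta = 0.01
print(two_group_sample_size(0.05, 0.01, 10, 5))   # 11
```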
Example 2: A bioequivalence study was being planned to compare the bioavailability of a
final production batch to a previously manufactured pilot-sized batch of tablets that were made
for clinical studies. Two parameters resulting from the blood-level data would be compared:
area under the plasma level versus time curves (AUC) and peak plasma concentration (Cmax ).
The study was to have 80% power (β = 0.20) to detect a difference of 20% or more between the
formulations. The test is done at the usual 5% level of significance. Estimates of the standard
deviations of the ratios of the values of each of the parameters [(final product)/(pilot batch)]
were determined from a small pilot study. The standard deviations were different for the
parameters. Since the researchers could not agree that one of the parameters was clearly critical
in the comparison, they decided to use a “maximum” number of patients based on the variable
with the largest relative variability. In this example, Cmax was most variable, the ratio having a
standard deviation of approximately 0.30. Since the design and analysis of the bioequivalence
study is a variation of the paired t test, Eq. (6.4) was used to calculate the sample size, adding
on 0.5Zα², as recommended previously.
N = (σ/Δ)² (Zα + Zβ)² + 0.5(Zα²)
  = (0.3/0.2)² (1.96 + 0.84)² + 0.5(1.96)² = 19.6.   (6.6)

Twenty subjects were used for the comparison of the bioavailabilities of the two formulations.
For sample-size determination for bioequivalence studies using FDA recommended
designs, see Table 6.5 and section 11.4.4.
Sometimes the sample sizes computed to satisfy the desired α and β errors can be inordinately
large when time and cost factors are taken into consideration. Under these circumstances,
a compromise must be made, most easily accomplished by relaxing the α and β requirements‡
(Table 6.1). The consequence of this compromise is that the probabilities of making an incorrect
decision based on the statistical test will be increased. Other ways of reducing the required
sample size are (a) to increase the precision of the test by improving the assay methodology
or carefully controlling extraneous conditions during the experiment, for example, or (b) to
compromise by increasing Δ, that is, accepting a larger difference as the one considered to be of
practical importance.
Table 6.3 gives the sample size for some representative values of the ratio σ/Δ, α, and β,
where the s.d. (s) is estimated.

6.3 DETERMINATION OF SAMPLE SIZE FOR BINOMIAL TESTS


The formulas for calculating the sample size for comparative binomial tests are similar to those
described for normal curve or t tests. The major difference is that the value of σ², which is
assumed to be the same under H0 and Ha in the two-sample independent groups t or Z tests,
is different for the distributions under H0 and Ha in the binomial case. This difference occurs
because σ² is dependent on P, the probability of success, in the binomial. The value of P will

‡ In practice, α is often fixed by regulatory considerations and β is determined as a compromise.



Table 6.3 Sample Size Needed for Two-Sided t Test with Standard Deviation Estimated

                  One-sample test                         Two-sample test with N units per group
             Alpha = 0.05        Alpha = 0.01        Alpha = 0.05        Alpha = 0.01
Estimated    Beta:               Beta:               Beta:               Beta:
s/Δ          0.01 0.05 0.10 0.20 0.01 0.05 0.10 0.20 0.01 0.05 0.10 0.20 0.01 0.05 0.10 0.20
4.0           296  211  170  128  388  289  242  191  588  417  337  252  770  572  478  376
2.0            76   54   44   34  100   75   63   51  148  106   86   64  194  145  121   96
1.5            44   32   26   20   58   54   37   30   84   60   49   37  110   82   69   55
1.0            21   16   13   10   28   22   19   16   38   27   23   17   50   38   32   26
0.8            14   11    9    8   19   15   13   11   25   18   15   12   33   25   21   17
0.67           11    8    7    6   15   12   11    9   18   13   11    9   24   18   15   13
0.5             7    6    5    4   10    8    8    7   11    8    7    6   14   11   10    8
0.4             6    5    4    4    8    7    6    6    8    6    5    4   10    8    7    6
0.33            5    4    4    3    7    6    6    5    6    5    4    4    8    6    6    5

be different depending on whether H0 or Ha represents the true situation. The appropriate
formulas for determining sample size for the one- and two-sample tests are as follows.
One-sample test:

N = (1/2) [(p0q0 + p1q1)/Δ²] (Zα + Zβ)²,   (6.7)

where Δ = p1 − p0; p1 is the proportion that would result in a meaningful difference, and p0 is
the hypothetical proportion under the null hypothesis.
Two-sample test:

N = [(p1q1 + p2q2)/Δ²] (Zα + Zβ)²,   (6.8)

where Δ = p1 − p2; p1 and p2 are prior estimates of the proportions in the experimental groups.
The values of Zα and Zβ are the same as those used in the formulas for the normal curve or
t tests. N is the sample size for each group. If it is not possible to estimate p1 and p2 prior to
the experiment, one can make an educated guess of a meaningful value of Δ and set p1 and p2
both equal to 0.5 in the numerator of Eq. (6.8). This will maximize the sample size, resulting in
a conservative estimate.
Fleiss [4] gives a fine discussion of an approach to estimating Δ, the practically significant
difference, when computing the sample size. For example, one approach is first to estimate
the proportion for the better-studied treatment group. In the case of a comparative clinical
study, this could very well be a standard treatment. Suppose this treatment has shown a success
rate of 50%. One might argue that if the comparative treatment is additionally successful for 30%
of the patients who do not respond to the standard treatment, then the experimental treatment
would be valuable. Therefore, the success rate for the experimental treatment should be 50% +
0.3(50%) = 65% to show a practically significant difference. Thus, p1 would be equal to 0.5 and
p2 would be equal to 0.65.
Example 3: A reconciliation of quality control data over several years showed that the
proportion of unacceptable capsules for a stable encapsulation process was 0.8% (p0). A sample
size for inspection is to be determined so that if the true proportion of unacceptable capsules
is equal to or greater than 1.2% (Δ = 0.4%), the probability of detecting this change is 80%
(β = 0.2). The comparison is to be made at the 5% level using a one-sided test. According to
Eq. (6.7),

N = (1/2) [(0.008 · 0.992 + 0.012 · 0.988)/(0.008 − 0.012)²] (1.65 + 0.84)²
  = 7670/2
  = 3835.

The large sample size resulting from this calculation is typical of that resulting from
binomial data. If 3835 capsules are too many to inspect, α, β, and/or Δ must be increased. In
the example above, management decided to increase α. This is a conservative decision in that
more good batches would be "rejected" if α is increased; that is, the increase in α results in an
increased probability of rejecting good batches, those with 0.8% unacceptable or less.
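Eq. (6.7) and the capsule example can be checked with a short sketch (the function name is illustrative; with exact normal quantiles the result is about 3824, while the rounded Z values of Table 6.2 reproduce the 3835 computed above):

```python
from math import ceil
from statistics import NormalDist

def binomial_one_sample_size(alpha, beta, p0, p1, one_sided=False):
    """One-sample binomial sample size, Eq. (6.7):
    N = (1/2) * (p0*q0 + p1*q1) / delta^2 * (Z_alpha + Z_beta)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    delta = p1 - p0
    n = 0.5 * (p0 * (1 - p0) + p1 * (1 - p1)) / delta ** 2 * (z_alpha + z_beta) ** 2
    return ceil(n)

# Capsule example: p0 = 0.008, p1 = 0.012, one-sided alpha = 0.05, beta = 0.20
print(binomial_one_sample_size(0.05, 0.20, 0.008, 0.012, one_sided=True))   # 3824
```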
Example 4: Two antibiotics, a new product and a standard product, are to be compared
with respect to the two-week cure rate of a urinary tract infection, where a cure is bacteriological
evidence that the organism no longer appears in urine. From previous experience, the cure rate
for the standard product is estimated at 80%. From a practical point of view, if the new product
shows an 85% or better cure rate, the new product can be considered superior. The marketing

division of the pharmaceutical company felt that this difference would support claims of better
efficacy for the new product. This is an important claim. Therefore, β is chosen to be 1% (power
= 99%). A two-sided test will be performed at the 5% level to satisfy FDA guidelines. The test
is two-sided because, a priori, the new product is not known to be better or worse than the
standard. The calculation of sample size to satisfy the conditions above makes use of Eq. (6.8);
here p1 = 0.8 and p2 = 0.85.

N = [(0.8 · 0.2 + 0.85 · 0.15)/(0.80 − 0.85)²] (1.96 + 2.32)² = 2107.

The trial would have to include 4214 patients, 2107 on each drug, to satisfy the α and
β risks of 0.05 and 0.01, respectively. If this number of patients is greater than can be
accommodated, the β error can be increased to 5% or 10%, for example. A sample size of 1499
per group is obtained for a β of 5%, and 1207 patients per group for β equal to 10%.
Although Eq. (6.8) is adequate for computing the sample size for most situations, the
calculation of N can be improved by considering the continuity correction [4]. This is
particularly important for small sample sizes:

N′ = (N/4) [1 + √(1 + 8/(N |p2 − p1|))]²,

where N is the sample size computed from Eq. (6.8) and N′ is the corrected sample size. In the
example, for α = 0.05 and β = 0.01, the corrected sample size is

N′ = (2107/4) [1 + √(1 + 8/(2107 |0.80 − 0.85|))]² = 2186.
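Eq. (6.8) and the continuity correction can be combined in one sketch (the function name is illustrative; exact normal quantiles give slightly larger values than the 2107 and 2186 obtained with the rounded Z values of Table 6.2):

```python
from math import ceil, sqrt
from statistics import NormalDist

def binomial_two_sample_size(alpha, beta, p1, p2):
    """Per-group N for a two-sample binomial comparison, Eq. (6.8),
    followed by Fleiss's continuity correction [4]."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = z.inv_cdf(1 - beta)
    delta = abs(p2 - p1)
    n = ceil((p1 * (1 - p1) + p2 * (1 - p2)) / delta ** 2 * (z_alpha + z_beta) ** 2)
    n_corr = ceil(n / 4 * (1 + sqrt(1 + 8 / (n * delta))) ** 2)
    return n, n_corr

# Antibiotic example: p1 = 0.80, p2 = 0.85, alpha = 0.05, beta = 0.01
print(binomial_two_sample_size(0.05, 0.01, 0.80, 0.85))
# (2113, 2193) with exact quantiles; the text's rounded Z values give 2107 and 2186
```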

6.4 DETERMINATION OF SAMPLE SIZE TO OBTAIN A CONFIDENCE INTERVAL OF SPECIFIED WIDTH
The problem of estimating the number of samples needed to estimate the mean with a known
precision by means of the confidence interval is easily solved by using the formula for the
confidence interval (see sect. 5.1). This approach has been used as an aid in predicting election
results based on preliminary polls where the samples are chosen by simple random sampling.
For example, one may wish to estimate the proportion of voters who will vote for candidate A
within 1% of the actual proportion.
We will consider the application of this problem to the estimation of proportions. In
quality control, one can estimate the true proportion of defects to any given degree of precision.
In a clinical study, a suitable sample size may be chosen to estimate the true proportion of
successes within certain specified limits. According to Eq. (5.3), a two-sided confidence interval
with confidence coefficient p for a proportion is

p̂ ± Z √(p̂q̂/N).

To obtain a 99% confidence interval with a width of 0.01 (i.e., to construct an interval that is
within ±0.005 of the observed proportion, p̂ ± 0.005),

Z √(p̂q̂/N) = 0.005

or

N = Z² (p̂q̂)/(W/2)²   (6.9)

N = (2.58)² (p̂q̂)/(0.005)².

A more exact formula for the sample size for small values of N is given in Ref. [5].
Example 5: A quality control supervisor wishes to have an estimate of the proportion of
tablets in a batch that weigh between 195 and 205 mg, where the proportion of tablets in this
interval is to be estimated within ±0.05 (W = 0.10). How many tablets should be weighed? Use
a 95% confidence interval.
To compute N, we must have an estimate of p̂ [see Eq. (6.9)]. If p̂ and q̂ are chosen to
be equal to 0.5, N will be at a maximum. Thus, if one has no inkling as to the magnitude of
the outcome, using p̂ = 0.5 in Eq. (6.9) will result in a sufficiently large sample size (probably,
too large). Otherwise, estimate p̂ and q̂ based on previous experience and knowledge. In the
present example from previous experience, approximately 80% of the tablets are expected to
weigh between 195 and 205 mg ( p̂ = 0.8). Applying Eq. (6.9),

N = (1.96)² (0.8)(0.2)/(0.10/2)² = 245.9.

A total of 246 tablets should be weighed. In the actual experiment, 250 tablets were
weighed, and 195 of the tablets (78%) weighed between 195 and 205 mg. The 95% confidence
interval for the true proportion, according to Eq. (5.3), is

p̂ ± 1.96 √(p̂q̂/N) = 0.78 ± 1.96 √((0.78)(0.22)/250) = 0.78 ± 0.051.

The interval is slightly greater than ±5% because p is somewhat less than 0.8 (pq is larger
for p = 0.78 than for p = 0.8). Although 5.1% is acceptable, to ensure a sufficient sample size, in
general, one should estimate p closer to 0.5 in order to cover possible poor estimates of p.
If p̂ had been chosen equal to 0.5, we would have calculated

N = (1.96)² (0.5)(0.5)/(0.10/2)² = 384.2.
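Equation (6.9) is simple enough to wrap in a small helper (the function name is illustrative, not from the text):

```python
from math import ceil
from statistics import NormalDist

def ci_sample_size(confidence, width, p_hat):
    """N so that a two-sided CI for a proportion has total width `width`,
    Eq. (6.9): N = Z^2 * p*q / (W/2)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p_hat * (1 - p_hat) / (width / 2) ** 2)

# Tablet-weight example: 95% CI, total width 0.10, p estimated at 0.8
print(ci_sample_size(0.95, 0.10, 0.8))   # 246
# Worst case, p = 0.5:
print(ci_sample_size(0.95, 0.10, 0.5))   # 385 (the text computes 384.2)
```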

Example 6: A new vaccine is to undergo a nationwide clinical trial. An estimate is desired
of the proportion of the population that would be afflicted with the disease after vaccination. A
good guess of the expected proportion of the population diseased without vaccination is 0.003.
Pilot studies show that the incidence will be about 0.001 (0.1%) after vaccination. What size
sample is needed so that the width of a 99% confidence interval for the proportion diseased in
the vaccinated population should be no greater than 0.0002? To ensure that the sample size is
sufficiently large, the value of p to be used in Eq. (6.9) is chosen to be 0.0012, rather than the
expected 0.0010.

N = (2.58)² (0.9988)(0.0012)/(0.0002/2)² = 797,809.

The trial will have to include approximately 800,000 subjects in order to yield the desired
precision.

6.5 POWER
Power is the probability that the statistical test results in rejection of H0 when a specified
alternative is true. The “stronger” the power, the better the chance that the null hypothesis will
be rejected (i.e., the test results in a declaration of “significance”) when, in fact, H0 is false. The
larger the power, the more sensitive is the test. Power is defined as 1 − β. The larger the β error,
the weaker is the power. Remember that β is the error resulting from accepting H0 when H0 is
false. Therefore, 1 − β is the probability of rejecting H0 when H0 is false.
From an idealistic point of view, the power of a test should be calculated before an experiment
is conducted. In addition to defining the properties of the test, power is used to help
compute the sample size, as discussed above. Unfortunately, many experiments proceed without
consideration of power (or β). This results from the difficulty of choosing an appropriate
value of β. There is no traditional value of β to use, as is the case for α, where 5% is usually
used. Thus, the power of the test is often computed after the experiment has been completed.
Power is best described by diagrams such as those shown previously in this chapter
(Figs. 6.1 and 6.2). In these figures, β is the area of the curve representing the alternative
hypothesis that is included in the region of acceptance defined by the null hypothesis.
The concept of power is also illustrated in Figure 6.3. To illustrate the calculation of power,
we will use the data presented for the test of a new antihypertensive agent (sect. 6.2), a paired-sample
test, with σ = 7 and H0: Δ = 0. The test is performed at the 5% level of significance. Let us
suppose that the sample size is limited by cost. The sponsor of the test had sufficient funds
to pay for a study that included only 12 subjects. The design described earlier in this chapter
(sect. 6.2) used 26 patients with β specified equal to 0.05 (power = 0.95). With 12 subjects,
the power will be considerably less than 0.95. The following discussion shows how power is
calculated.
The cutoff points for statistical significance (which specify the critical region) are defined
by α, N, and σ. Thus, the values of δ that will lead to a significant result for a two-sided test are
as follows:

Z = δ / (σ/√N)

δ = ±Zσ/√N.

In our example, Z = 1.96 (α = 0.05), σ = 7, and N = 12:

δ = ±(1.96)(7)/√12 = ±3.96.

Figure 6.3 Illustration of β and power (1 − β).



Values of δ greater than 3.96 or less than −3.96 will lead to the decision that the products
differ at the 5% level. Having defined the values of δ that will lead to rejection of H0, we obtain
the power for the alternative, Ha: Δ = 5, by computing the probability that an average result, δ,
will be greater than 3.96 if Ha is true (i.e., Δ = 5).
This concept is illustrated in Figure 6.3. Curve B is the distribution with mean equal to 5
and σ = 7. If curve B is the true distribution, the probability of observing a value of δ below
3.96 is the probability of accepting H0 if the alternative hypothesis is true (Δ = 5). This is the
definition of β. This probability can be calculated using the Z transformation:

Z = (3.96 − 5)/(7/√12) = −0.51.

Referring to Table IV.2, the area below +3.96 (Z = −0.51) for curve B is approximately
0.31. The power is 1 − β = 1 − 0.31 = 0.69. The use of 12 subjects results in a power of 0.69 to
“detect” a difference of +5, compared to the power of 0.95 to detect such a difference when 26
subjects were used. A power of 0.69 means that if the true difference were 5 mm Hg, the statistical
test would result in significance with a probability of 69%; 31% of the time, such a test would
result in acceptance of H0.
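As a check on the table lookups, the same calculation can be sketched with Python's standard library (`statistics.NormalDist` stands in for Table IV.2):

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()
sigma, N, alpha, delta_true = 7, 12, 0.05, 5

se = sigma / sqrt(N)                  # standard error of the mean difference
cut = z.inv_cdf(1 - alpha / 2) * se   # critical value, approximately 3.96

# beta = P(observed mean < cut | true difference = 5); the lower tail at -cut is negligible here
beta = z.cdf((cut - delta_true) / se)
print(round(cut, 2), round(1 - beta, 2))  # cutoff ~3.96, power ~0.70
```

The table-based answer in the text (0.69) differs only because Z was rounded to −0.51 before the lookup.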
A power curve is a plot of the power, 1 − β, versus alternative values of Δ. Power curves can
be constructed by computing β for several alternatives and drawing a smooth curve through
these points. For a two-sided test, the power curve is symmetrical around the hypothetical
mean, Δ = 0 in our example. The power is equal to α when the alternative equals the
hypothetical mean under H0; thus, the power curve reaches its minimum of 0.05 at Δ = 0
(Fig. 6.4). The power curve for the present example is shown in Figure 6.4.
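A power curve can be tabulated the same way. The sketch below counts both tails of the critical region, so it returns exactly α at Δ = 0:

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()
sigma, N = 7, 12
se = sigma / sqrt(N)
cut = z.inv_cdf(0.975) * se  # ±3.96 for alpha = 0.05

def power(delta):
    # P(reject H0 | true difference = delta), summing both tails of the critical region
    return z.cdf((-cut - delta) / se) + (1 - z.cdf((cut - delta) / se))

for d in (0, 2, 4, 6, 8):
    print(d, round(power(d), 2))
```

Plotting these points against Δ reproduces the shape of Figure 6.4, rising from 0.05 at Δ = 0 toward 1 as the alternative moves away from the null.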
The following conclusions may be drawn concerning the power of a test if α is kept
constant:

1. The larger the sample size, the larger the power.
2. The larger the difference to be detected (Ha), the larger the power. A large sample size will
be needed in order to have strong power to detect a small difference.
3. The larger the variability (s.d.), the weaker the power.
4. If α is increased, power is increased (β is decreased) (Fig. 6.3). An increase in α (e.g., to 10%)
results in a smaller Z; the cutoff points move closer to the hypothetical mean, and the area of
curve B below the cutoff point is smaller.

Power is a function of N, Δ, σ, and α.

Figure 6.4 Power curve for N = 12, α = 0.05, σ = 7, and H0: Δ = 0.



A simple way to compute the approximate power of a test is to use the formula for sample
size [Eqs. (6.4) and (6.5), for example] and solve for Zβ. In the previous example, a single-sample
or paired test, Eq. (6.4) is appropriate:

N = (σ/Δ)² (Zα + Zβ)²   (6.4)

Zβ = (Δ/σ)√N − Zα.   (6.10)

Once Zβ has been calculated, the probability determined directly from Table IV.2 is equal to
the power, 1 − β. See the discussion and examples below.
In the problem discussed above, applying Eq. (6.10) with Δ = 5, σ = 7, N = 12, and
Zα = 1.96,

Zβ = (5/7)√12 − 1.96 = 0.51.

According to the notation used for Z (Table 6.2), β is the area above Zβ; power is the area
below Zβ (power = 1 − β). In Table IV.2, the area above Z = 0.51 is approximately 31%;
therefore, the power is 69%.§
If N is small and the variance is unknown, appropriate values of t should be used in place
of Zα and Zβ. Alternatively, we can adjust N by subtracting 0.5Zα² or 0.25Zα² from the actual
sample size for a one- or two-sample test, respectively. The following examples should make
the calculations clearer.
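Equation (6.10) reduces the whole power calculation to one line; a sketch, with Φ from the standard library standing in for Table IV.2:

```python
from math import sqrt
from statistics import NormalDist

def power_eq_6_10(delta, sigma, n, z_alpha=1.96):
    """Approximate power: Z_beta = (delta/sigma) * sqrt(n) - Z_alpha; power = Phi(Z_beta)."""
    z_beta = (delta / sigma) * sqrt(n) - z_alpha
    return NormalDist().cdf(z_beta)

print(round(power_eq_6_10(5, 7, 12), 2))  # ~0.70, i.e., the 69% found from the table
```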
Example 7: A bioavailability study has been completed in which the ratio of the AUCs for
two comparative drugs was submitted as evidence of bioequivalence. The FDA asked for the
power of the test as part of its review of the submission. (Note that this analysis is different
from that presently required by FDA.) The null hypothesis for the comparison is H0: R = 1,
where R is the true average ratio. The test was two-sided with α equal to 5%. Eighteen subjects
took each of the two comparative drugs in a paired-sample design. The standard deviation was
calculated from the final results of the study, and was equal to 0.3. The power is to be determined
for a difference of 20% for the comparison. This means that if the test product is truly more than
20% greater or smaller than the reference product, we wish to calculate the probability that the
ratio will be judged to be significantly different from 1.0. The value of Δ to be used in Eq. (6.10)
is 0.2.

Zβ = (0.2/0.3)√16 − 1.96 = 0.707.

Note that the value of N is taken as 16. This is the inverse of the procedure for determining
sample size, where 0.5Zα² was added to N. Here we subtract 0.5Zα² (approximately 2) from N:
18 − 2 = 16. According to Table IV.2, the area corresponding to Z = 0.707 is approximately 0.76.
Therefore, the power of this test is 76%. That is, if the true difference between the formulations
is 20%, a significant difference will be found between the formulations 76% of the time. This
is very close to the 80% power that was recommended before current FDA guidelines were
implemented for bioavailability tests (where Δ = 0.2).
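Example 7 can be reproduced in a few lines (a sketch; the sample-size adjustment follows the text's rule of subtracting 0.5Zα², about 2):

```python
from math import sqrt
from statistics import NormalDist

n_actual, z_alpha = 18, 1.96
n_adj = n_actual - round(0.5 * z_alpha ** 2)  # paired test: subtract ~2, leaving 16

z_beta = (0.2 / 0.3) * sqrt(n_adj) - z_alpha  # Eq. (6.10) with delta = 0.2, sigma = 0.3
power = NormalDist().cdf(z_beta)
print(n_adj, round(power, 2))  # 16, ~0.76
```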
Example 8: A drug product is prepared by two different methods. The average tablet
weights of the two batches are to be compared, weighing 20 tablets from each batch. The average
weights of the two 20-tablet samples were 507 and 511 mg. The pooled standard deviation was
calculated to be 12 mg. The director of quality control wishes to be “sure” that if the average
weights truly differ by 10 mg or more, the statistical test will show a significant difference. When
he was asked, “How sure?”, he said 95% sure. This can be translated into a β of 5% or a power
of 95%. This is a two-independent-groups test. Solving for Zβ from Eq. (6.5), we have

Zβ = (Δ/σ)√(N/2) − Zα

   = (10/12)√(19/2) − 1.96 = 0.609.   (6.11)

As discussed above, the value of N is taken as 19 rather than 20, by subtracting 0.25Zα²
from N for the two-sample case. Referring to Table IV.2, we note that the power is approximately
73%. The experiment does not have sufficient power according to the director’s standards. To
obtain the desired power, we can increase the sample size (i.e., weigh more tablets). (See Exercise
Problem 10.)

§ The value corresponding to Z in Table IV.2 gives the power directly. In this example, the area
in the table corresponding to a Z of 0.51 is approximately 0.69.
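Example 8 follows the two-sample form of the same shortcut; as a sketch:

```python
from math import sqrt
from statistics import NormalDist

delta, s, n_per_group, z_alpha = 10, 12, 20, 1.96
n_adj = n_per_group - round(0.25 * z_alpha ** 2)  # two-sample case: 20 - 1 = 19

z_beta = (delta / s) * sqrt(n_adj / 2) - z_alpha  # Eq. (6.11)
power = NormalDist().cdf(z_beta)
print(n_adj, round(power, 2))  # 19, ~0.73
```

Raising `n_per_group` until `power` reaches 0.95 answers the director's question (and Exercise Problem 10).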

6.6 SAMPLE SIZE AND POWER FOR MORE THAN TWO TREATMENTS
(ALSO SEE CHAP. 8)
The problem of computing power or sample size for an experiment with more than two treat-
ments is somewhat more complicated than the relatively simple case of designs with two
treatments. The power will depend on the number of treatments and the form of the null
and alternative hypotheses. Dixon and Massey [5] present a simple approach to determining
power and sample size. The following notation will be used in presenting the solution to this
problem.
Let M1, M2, M3, . . . , Mk be the hypothetical population means of the k treatments. The null
hypothesis is M1 = M2 = M3 = · · · = Mk. As for the two-sample case, we must specify alternative
values of Mi. The alternative means are expressed as a grand mean, Mt, plus or minus some
deviation, Di, where Σ Di = 0. For example, if three treatments are compared for pain, Active A,
Active B, and Placebo (P), the values for the alternative hypothesized means, based on a VAS
scale for pain relief, could be 75 + 10 (85), 75 + 10 (85), and 75 − 20 (55) for the two actives and
placebo, respectively. The sum of the deviations from the grand mean, 75, is 10 + 10 − 20 = 0.
The power is computed based on the following equation:

ψ² = [Σ(Mi − Mt)²/k] / (S²/n),   (6.12)

where n is the number of observations in each treatment group (n is the same for each treatment)
and S² is the common variance. The value of ψ (the square root of ψ²) is referred to Table 6.4 to
estimate the power or the required sample size.
Consider the following example of three treatments in a study measuring the analgesic
properties of two actives and a placebo as described above. Fifteen subjects are in each treatment
group and the variance is 1000. According to Eq. (6.12),

 
ψ² = {[(85 − 75)² + (85 − 75)² + (55 − 75)²]/3} / (1000/15) = 3.0.

Table 6.4 gives the approximate power for various values of ψ, at the 5% level, as a function
of the number of treatment groups and the d.f. for error, for 3 and 4 treatments. (More detailed
tables, in addition to graphs, are given in Dixon and Massey [5].) Here, we have 42 d.f. and three
treatments with ψ = √3 = 1.73. The power is approximately 0.72 by simple linear interpolation
(42 d.f., ψ = 1.73). The correct answer with more extensive tables is closer to 0.73.
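The ψ statistic of Eq. (6.12) is simple to compute directly; the power is then read from Table 6.4 (or from a noncentral-F routine) using ψ and the error d.f. A sketch:

```python
from math import sqrt

means = [85, 85, 55]   # hypothesized treatment means (two actives and placebo)
k = len(means)
n, s2 = 15, 1000       # subjects per group, common variance

grand = sum(means) / k                                        # 75
psi2 = (sum((m - grand) ** 2 for m in means) / k) / (s2 / n)  # Eq. (6.12)
df_error = k * (n - 1)                                        # error d.f. for one-way ANOVA

print(round(psi2, 2), round(sqrt(psi2), 2), df_error)  # 3.0 1.73 42
```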

Table 6.4 Factors for Computing Power for Analysis of Variance

               Alpha = 0.05, k = 3
d.f. error     ψ        Power
10             1.6      0.42
               2.0      0.76
               2.4      0.80
               3.0      0.984
20             1.6      0.62
               1.92     0.80
               2.00     0.83
               3.0      >0.99
30             1.6      0.65
               1.9      0.80
               2.0      0.85
               3.0      >0.99
60             1.6      0.67
               1.82     0.80
               2.0      0.86
               3.0      >0.99
inf            1.6      0.70
               1.8      0.80
               2.0      0.88
               3.0      >0.99

               Alpha = 0.05, k = 4
10             1.4      0.48
               2.0      0.80
               2.6      0.96
20             1.4      0.56
               2.0      0.88
               2.6      0.986
30             1.4      0.59
               2.0      0.90
               2.6      >0.99
60             1.4      0.61
               2.0      0.92
               2.6      >0.99
inf            1.4      0.65
               2.0      0.94
               2.6      >0.99

Table 6.4 can also be used to determine sample size. For example, how many patients
per treatment group are needed to obtain a power of 0.80 in the above example? Applying
Eq. (6.12),

ψ² = {[(85 − 75)² + (85 − 75)² + (55 − 75)²]/3} / (1000/n).

Solving for ψ²,

ψ² = 0.2n.

We can calculate n by trial and error. For example, with n = 20,

0.2n = 4 = ψ², and ψ = 2.

For ψ = 2 and n = 20 (d.f. = 57), the power is approximately 0.86 (for d.f. = 60, the power is
0.86). For n = 15 (d.f. = 42, ψ = √3), we have calculated (above) that the power is approximately
0.72. A sample size of between 15 and 20 patients per treatment group would give a power of
0.80. In this example, we might guess that 17 patients per group would result in approximately
80% power. Indeed, more exact tables show that a sample size of 17 (ψ = √(0.2 × 17) = 1.84)
corresponds to a power of 0.79.
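The trial-and-error search can be written as a short loop that prints ψ and the error d.f. for each candidate n; the corresponding power is then read from Table 6.4:

```python
from math import sqrt

# psi**2 = 0.2 * n for this example (k = 3 groups, variance 1000)
for n in range(14, 21):
    psi = sqrt(0.2 * n)
    df_error = 3 * (n - 1)
    print(n, round(psi, 2), df_error)
# Table 6.4 gives power ~0.72 at n = 15 (psi = 1.73) and ~0.86 at n = 20 (psi = 2.0),
# so n = 17 (psi = 1.84) lands near the desired 0.80.
```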
The same approach can be used for two-way designs, using the appropriate error term
from the analysis of variance.

6.7 SAMPLE SIZE FOR BIOEQUIVALENCE STUDIES (ALSO SEE CHAP. 11)
In its early evolution, bioequivalence was based on the acceptance or rejection of a hypothesis
test. Sample sizes could then be determined by conventional techniques as described in section
6.2. Because of inconsistencies in the decision process based on this approach, the criterion for
acceptance was changed to a two-sided 90% confidence interval or, equivalently, two one-sided
t tests, where the null hypotheses are (μ1/μ2) ≤ 0.8 and (μ1/μ2) ≥ 1.25 versus the alternative of
0.8 < (μ1/μ2) < 1.25. This test is based on the antilog of the difference between the averages of
the log-transformed parameters (the geometric means). The test is equivalent to requiring that a
two-sided 90% confidence interval for the ratio of means fall within the interval 0.80 to 1.25 in
order to accept the hypothesis of equivalence. Again, for the currently accepted log-transformed
data, the 90% confidence interval for the antilog of the difference between means must lie
between 0.80 and 1.25, that is, 0.8 < μ1/μ2 < 1.25. The sample-size determination in this case is
not as simple as the conventional determination of sample size described earlier in this chapter.
The method for sample-size determination for nontransformed data has been published by
Phillips [6], along with plots of power as a function of sample size, relative standard deviation
(computed from the ANOVA), and treatment differences. Although the theory behind this
computation is beyond the scope of this book, Chow and Liu [7] give a simple way of
approximating the power and sample size. The sample size for each sequence group is
approximately

N = (tα,2N−2 + tβ,2N−2)² [CV/(V − δ)]²,   (6.13)

where N is the number of subjects per sequence, t the appropriate value from the t distribution, α
the significance level (usually 0.10), 1 − β the power (usually 0.8), CV the coefficient of variation,
V the bioequivalence limit, and δ the difference between products.
One would have to have an approximation of the magnitude of the required sample size
in order to approximate the t values. For example, suppose that RSD = 0.20, δ = 0.10, the power is
0.8, and an initial approximation of the sample size is 20 per sequence (a total of 40 subjects).
Applying Eq. (6.13),

N = (1.69 + 0.85)² [0.20/(0.20 − 0.10)]² = 25.8.

Round up to 26 subjects per sequence, a total of 52 subjects. This agrees closely with Phillips'
more exact computations.
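Equation (6.13) is easy to wrap in a helper (the function name is an invention; the t values 1.69 and 0.85 are the ones the text reads from tables for roughly 38 d.f.):

```python
def n_per_sequence(cv, v, delta, t_alpha=1.69, t_beta=0.85):
    """Chow-Liu approximation, Eq. (6.13): subjects per sequence of a 2x2 crossover."""
    return (t_alpha + t_beta) ** 2 * (cv / (v - delta)) ** 2

n = n_per_sequence(cv=0.20, v=0.20, delta=0.10)
print(round(n, 1))  # 25.8 -> round up to 26 per sequence, 52 subjects in total
```

Because the t values themselves depend on N, a more careful calculation would recompute them from the returned N and iterate until the answer stabilizes.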
Dilletti et al. [8] have published a method for determining sample size based on the log-
transformed variables, which is the currently preferred method. Table 6.5, showing sample sizes
for various values of CV, power, and product differences, is taken from their publication.
Based on these tables, using log-transformed estimates of the parameters would result in
a sample-size estimate of 38 for a power of 0.8, a ratio of 0.9, and CV = 0.20. If the assumed ratio
is 1.1, the sample size is estimated as 32.
Equation (6.13) can also be used to approximate these sample sizes using log values for V
and δ: N = (1.69 + 0.85)² [0.20/(0.223 − 0.105)]² = 19 per sequence, or 38 subjects in total, where
0.223 is the log of 1.25 and 0.105 is the absolute value of the log of 0.9.
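On the log scale the same approximation reproduces both of the text's cases, with V = ln 1.25 and δ = |ln(ratio)|; a sketch:

```python
from math import log

def n_per_sequence(cv, v, delta, t_alpha=1.69, t_beta=0.85):
    # Eq. (6.13) with the text's table values t_alpha = 1.69, t_beta = 0.85
    return (t_alpha + t_beta) ** 2 * (cv / (v - delta)) ** 2

v = log(1.25)  # 0.223
for ratio in (0.90, 1.10):
    n = n_per_sequence(cv=0.20, v=v, delta=abs(log(ratio)))
    print(ratio, round(n), 2 * round(n))  # 0.9 -> 19 per sequence (38 total); 1.1 -> 16 (32 total)
```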

Table 6.5 Sample Sizes for Given CV, Power, and Ratio (μT/μR) for Log-Transformed Parameters^a

                         Ratio (μT/μR)
CV (%)  Power (%)  0.85  0.90  0.95  1.00  1.05  1.10  1.15  1.20
 5.0       70        10     6     4     4     4     4     6    16
 7.5                 16     6     6     4     6     6    10    34
10.0                 28    10     6     6     6     8    16    58
12.5                 42    14     8     8     8    12    24    90
15.0                 60    18    10    10    10    16    32   128
17.5                 80    22    12    12    12    20    44   172
20.0                102    30    16    14    16    26    56   224
22.5                128    36    20    16    20    30    70   282
25.0                158    44    24    20    22    38    84   344
27.5                190    52    28    24    26    44   102   414
30.0                224    60    32    28    32    52   120   490
 5.0       80        12     6     4     4     4     6     8    22
 7.5                 22     8     6     6     6     8    12    44
10.0                 36    12     8     6     8    10    20    76
12.5                 54    16    10     8    10    14    30   118
15.0                 78    22    12    10    12    20    42   168
17.5                104    30    16    14    16    26    56   226
20.0                134    38    20    16    18    32    72   294
22.5                168    46    24    20    24    40    90   368
25.0                206    56    28    24    28    48   110   452
27.5                248    68    34    28    34    58   132   544
30.0                292    80    40    32    38    68   156   642
 5.0       90        14     6     4     4     4     6     8    28
 7.5                 28    10     6     6     6     8    16    60
10.0                 48    14     8     8     8    14    26   104
12.5                 74    22    12    10    12    18    40   162
15.0                106    30    16    12    16    26    58   232
17.5                142    40    20    16    20    34    76   312
20.0                186    50    26    20    24    44   100   406
22.5                232    64    32    24    30    54   124   510
25.0                284    78    38    28    36    66   152   626
27.5                342    92    44    34    44    78   182   752
30.0                404   108    52    40    52    92   214   888

a Source: From Ref. [8].

For a ratio of 1.10 (log = 0.0953), the sample size is N = (1.69 + 0.85)² [0.20/(0.223 − 0.0953)]² =
16 per sequence, or 32 subjects in total.
If the difference between products is specified as zero (ratio = 1.0), the value for tβ,2N−2
in Eq. (6.13) should be two-sided (Table 6.2). For example, for 80% power (and a large sample
size), use 1.28 rather than 0.84. In the example above with a ratio of 1.0 (zero difference between
products), a power of 0.8, and a CV = 0.2, use a value of (approximately) 1.34 for tβ,2N−2.

N = (1.75 + 1.34)² [0.2/0.223]² = 7.7 per sequence, or 16 total subjects.

An Excel program to calculate the number of subjects required for a crossover study under
various conditions of power and product differences, for both parametric and binary (binomial)
data, is available on the disk accompanying this volume.
This approach to sample-size determination can also be used for studies where the out-
come is dichotomous, often used as the criterion in clinical studies of bioequivalence (cured or
not cured) for topically unabsorbed products or unabsorbed oral products such as sucralfate.
This topic is presented in section 11.4.8.
SAMPLE SIZE AND POWER 145

KEY TERMS
Alpha level              Power curve
Attribute                “Practical” significance
Beta error               Sample size
Confidence interval      Sampling plan
Delta                    Sensitivity
Power                    Z transformation

EXERCISES
1. Two diets are to be compared with regard to weight gain of weanling rats. If the weight
gain due to the diets differs by 10 g or more, we would like to be 80% sure that we obtain
a significant result. How many rats should be in each group if the s.d. is estimated to be 5
and the test is performed at the 5% level?
2. How many rats per group would you use if the standard deviation were known to be equal
to 5 in Problem 1?
3. In Example 3, where two antibiotics are being compared, how many patients would be
   needed for a study with α = 0.05, β = 0.10, using a parallel design, and assuming that the
   new product must have a cure rate of 90% to be acceptable as a better product than the
   standard? (Cure rate for standard = 80%.)
4. It is hypothesized that the difference between two drugs with regard to success rate is 0
   (i.e., the drugs are not different). What size sample is needed to show a difference of 20%
   significant at the 5% level with a β error of 10%? (Assume that the response rate is about
   50% for both drugs, a conservative estimate.) The study is a two-independent-samples
   design (parallel groups).
5. How many observations would be needed to estimate a response rate of about 50% within
± 15% (95% confidence limits)? How many observations would be needed to estimate a
response rate of 20 ± 15%?
6. Your boss tells you to make a new tablet formulation that should have a dissolution
   time (90% dissolution) of 30 minutes. The previous formulation took 40 minutes to 90%
   dissolution. She tells you that she wants an α level of 5% and that if the new formulation
   really has a dissolution time of 30 minutes or less, she wants to be 99% sure that the
   statistical comparison will show significance. (This means that the β error is 1%.) The s.d.
   is approximately 10. What size sample would you use to test the new formulation?
7. In a clinical study comparing the effect of two drugs on blood pressure, 20 patients were
to be tested on each drug (two groups). The change in blood pressure from baseline mea-
surements was to be determined. The s.d., measured as the difference among individuals’
responses, is estimated from past experience to be 5.
   (a) If the statistical test is done at the 5% level, what is the power of the test against an
   alternative of a 3 mm Hg difference between the drugs (H0: μ1 = μ2, or μ1 − μ2 = 0)? This
   means: what is the probability that the test will show significance if the true difference
   between the drugs is 3 mm Hg or more (Ha: μ1 − μ2 = 3)?
   (b) What is the power if there are 50 people per group? α is 5%.
8. A tablet is produced with a labeled potency of 100 mg. The standard deviation is known
to be 10. What size sample should be assayed if we want to have 90% power to detect a
difference of 3 mg from the target? The test is done at the 5% level.
9. In a bioequivalence study, the ratio of AUCs is to be compared. A sample size of 12 subjects
   is used in a paired design. The standard deviation resulting from the statistical test is 0.25.
   What is the power of this test against a 20% difference if α is equal to 0.05?
10. How many samples would be needed to have 95% power for Example 8?

11. In a bioequivalence study, the maximum blood level is to be compared for two drugs. This
    is a crossover study (paired design) where each subject takes both drugs. Eighteen subjects
    entered the study, with the following results: the observed difference is 10 μg/mL, and the
    s.d. (from this experiment) is 40. A practical difference is considered to be 15 μg/mL. What
    is the power of the test for a 15-μg/mL difference for a two-sided test at the 5% level?
12. How many observations would you need to estimate a proportion within ±5% (95%
confidence interval) if the expected proportion is 10%?
13. A parallel design is used to measure the effectiveness of a new antihypertensive drug. One
    group of patients receives the drug and the other group receives placebo. A difference of
    6 mm Hg is considered to be of practical significance. The standard deviation (difference
    from baseline) is unknown but is estimated as 5 based on some preliminary data. Alpha is
    set at 5% and β at 10%. How many patients should be used in each group?
14. From Table 6.3, find the number of samples needed to determine the difference between the
    dissolution of two formulations for α = 0.05, β = 0.10, S = 25, for a “practical” difference
    of 25 (minutes).

REFERENCES
1. United States Pharmacopeia, 23rd rev, and National Formulary, 18th ed. Rockville, MD: USP
Pharmacopeial Convention, Inc., 1995.
2. U.S. Department of Defense Military Standard. Military Sampling Procedures and Tables for Inspection
by Attributes (MIL-STD-105E). Washington, DC: U.S. Government Printing Office, 1989.
3. Guenther WC. Sample size formulas for normal theory tests. Am Stat 1981; 35:243.
4. Fleiss J. Statistical Methods for Rates and Proportions, 2nd ed. New York: Wiley, 1981.
5. Dixon WJ, Massey FJ Jr. Introduction to Statistical Analysis, 3rd ed. New York: McGraw-Hill, 1969.
6. Phillips KE. Power of the two one-sided tests procedure in bioequivalence. J Pharmacokinet Biopharm
1991; 18:137.
7. Chow S-C, Liu J-P. Design and Analysis of Bioavailability and Bioequivalence Studies. New York:
Marcel Dekker, 1992.
8. Dilletti E, Hauschke D, Steinijans VW. Sample size determination: extended tables for the multiplicative
model and bioequivalence ranges of 0.9 to 1.11 and 0.7 to 1.43. Int J Clin Pharmacol Toxicol 1991; 29:1.
