Randomized Block Experimental Designs Can Increase the Power and Reproducibility of Laboratory Animal Experiments

Michael F. W. Festing

Abstract

Randomized block experimental designs have been widely used in agricultural and industrial research for many decades. Usually they are more powerful, have higher external validity, are less subject to bias, and produce more reproducible results than the completely randomized designs typically used in research involving laboratory animals. Reproducibility can be further increased by using time as a blocking factor. These benefits can be achieved at no extra cost. A small experiment investigating the effect of an antioxidant on the activity of a liver enzyme in four inbred mouse strains, which had two replications (blocks) separated by a period of two months, illustrates this approach. The failure to use these designs more widely in research involving laboratory animals has probably led to a substantial waste of animals, money, and scientific resources and slowed down the development of new treatments for human and animal diseases.

Key Words: experimental design; randomized block; repeatability; animal experiments; reproducibility

Introduction

A fundamental assumption in experimental biology is that if an experiment is well designed, correctly executed, properly analyzed, and adequately documented, the results should be reproducible, apart from the occasional type I error (false positive) associated with the chosen significance level. However, several recent publications have found that excessive numbers of animal experiments are unreproducible (i.e., the results could not be repeated by different investigators). For example, Begley and Ellis (2012) attempted to repeat 53 landmark experiments concerned with cancer research but were only able to do so with six of them. In some cases, the original authors were unable to repeat their own experiments. In another paper, investigators (Scott et al. 2008) noted that there were more than 50 reports of drugs that alleviated the symptoms of amyotrophic lateral sclerosis (ALS) in the standard transgenic mouse model of this disease, but only one had any effect in humans. A detailed study showed that there are a number of confounding factors that need to be controlled when using this model. The authors devised a better protocol, which controlled these confounding factors, and then rescreened these 50 drugs plus another 20. They found that none of the drugs was effective in the mouse model. Similarly, Prinz and colleagues (2011) were able to reproduce only 20 to 25% of the results of 67 published studies and claimed that, within the pharmaceutical industry, it is accepted anecdotally that less than half of academic papers give reproducible results.

In many cases, lack of repeatability is due to failure to adhere to some of the most basic requirements of good experimental design. A survey of 271 animal experiments showed that 87% of papers did not report randomization of treatments to subjects (although this does not necessarily mean that it was not done), and 86% did not report blinding in situations where it would be appropriate (Kilkenny et al. 2009). Such failures can lead to biased results and unrepeatable (in the same laboratory) or unreproducible (in a different laboratory) experiments. Lack of reproducibility in other laboratories may also be caused by treatment × environment interactions. For example, animal houses may differ in the physical environment, management, or microflora in such a way as to alter the relative treatment differences. Results may also be unrepeatable or unreproducible because the wrong strain of animals was used. There is no effective genetic quality control of outbred stocks. In one study, for example, the investigators obtained 26 weekly samples of 30 Sprague-Dawley rats from a commercial supplier and tested them for response to a synthetic polypeptide, a response controlled by a single gene in the major histocompatibility complex (Simonian et al. 1968). On average, about 80% of the rats were responders and, for the first 12 weeks, the percentage of responders in each sample varied about this mean. However, in weeks 13, 17, 18, 19, and 20, only about 5% of the rats were responders. These rats cannot have come from the same colony and may have responded differently to other experimental treatments, but there was no indication from the breeder that different rats had been supplied. There have been other examples of the wrong animals being supplied by commercial breeders (Festing 1982). And, of course, any single experiment has a 5% chance of getting a false positive result due to statistical

Michael F. W. Festing, D.Sc. (retired) was a Senior Scientist at the MRC Toxicology Unit in Leicester, UK.
Address correspondence and reprint requests to Michael F. W. Festing, c/o MRC Toxicology Unit, Hodgkin Building, PO Box 138, Lancaster Road, Leicester LE1 9HN or email [email protected].

Downloaded from http://ilarjournal.oxfordjournals.org/ by guest on August 3, 2016

ILAR Journal, Volume 55, Number 3, doi: 10.1093/ilar/ilu045


© The Author 2014. Published by Oxford University Press on behalf of the Institute for Laboratory Animal Research. All rights reserved.
For permissions, please email: [email protected]
sampling, assuming a 5% significance level is chosen. Finally, some results may be unreproducible because the authors detected serious errors and later withdrew the paper, unknown to other investigators, or because the paper is fraudulent (Steen 2011). For all these reasons, it is legitimate to require evidence that the results of important experiments are reproducible. However, repeating experiments is time consuming and expensive both financially and in the use of animals. Because the work is not new, it may also be difficult to obtain funding and get the results published. An alternative could be to design better experiments with built-in repeatability. This could be done using randomized block designs, with the blocking factor being time (i.e., the experiment is split up over a period of hours, days, or months).

Randomized Block Experimental Designs

The “randomized block” (RB) design is a generic name for a family of experimental designs in which the experimental material is split up into a number of “mini-experiments” that are recombined in the final statistical analysis. Typically, in each block there is a single experimental unit to which each treatment is assigned (although there can be more than one). They include crossover designs, within-subjects designs, matched designs, and Latin square designs. These have a number of useful properties and should be more widely used in research involving laboratory animals. They can be used to:

(1) Spread the experiment over a period of time and/or space. So if there are, say, four treatments and the cage of animals is the experimental unit, each block will consist of four cages. Block 1 might be started this week, block 2 next week, and so on. If six blocks are needed, then the experiment will be extended over a 6-week period. Each block may involve a different batch of animals, which could be of a slightly different age or weight, with possible differences in the batch of diet or in its age since it was manufactured. Cages may be placed at different levels in a rack. None of these variables is of interest, so they are removed as a block effect in the statistical analysis. If the relationship between the observations alters from block to block, then treatment effects will become statistically less significant. If the relative values remain unchanged, then this implies a good level of repeatability, with real treatment effects being more likely to be detected.
(2) Increase the power of an experiment by matching the experimental units in each block, say, on age, weight, or location in the animal house. This means that powerful experiments can often be done even though the experimental units are somewhat heterogeneous, providing that matching is possible. This is particularly important with large experiments in which it is often difficult to obtain a sufficiently homogeneous group of animals.
(3) Take account of material which has a natural structure. Within-litter experiments are an obvious example.
(4) Split the experiment up into smaller bits in order to make it more manageable. This would be useful with large experiments and should help to minimize measurement errors because the work can be done under less time pressure.
(5) Increase the external validity of an experiment because each block samples a different environment and/or time period.

Sample Size Determination

Either a power analysis or, for smaller more fundamental studies, the resource equation method can be used to set sample size. A power analysis requires an estimate of the standard deviation (for measurement variables). As the magnitude of the likely block effect and the standard deviation are usually unknown, the estimate will probably have to come from an unblocked (completely randomized) experiment. The payoff will be taken in increased power (experience has shown that RB designs are nearly always more powerful than completely randomized designs of the same size, with the possible exception of very small experiments). However, when similar blocked experiments are done frequently, the estimate of the standard deviation from these could be used in future power analyses to estimate the sample size.

The resource equation method (Mead 1988) is E = (total number of experimental units) − (number of treatments), and E should be between about 10 and 20, but with some leeway. In a blocked design the degrees of freedom for blocks are subtracted as well, so that E is the error degrees of freedom in the analysis of variance. The method depends on the law of diminishing returns and aims to ensure that there is an adequate estimate of the error variance. It is particularly useful for small fundamental studies, for more complex designs with many treatments, such as factorial designs, and where there is no estimate of the standard deviation, which prevents the use of a power analysis.

The Statistical Analysis of Randomized Block Designs

In an RB design, typically each observation can be classified by two factors, one “fixed” (usually called the treatment, which is deliberately varied and is of scientific interest) and the other “random” (which may be called a “block” or replicate), which is of no scientific interest but which could cause noise if not removed in the statistical analysis. There can be any number of replications (blocks). Randomization is done separately within each block. An individual observation will be made up of a grand mean μ, a deviation “t” due to the treatment it receives, a deviation “b” due to the block, and an individual deviation “e”:

$$Y_{ij} = \mu + t_i + b_j + e_{ij}$$

where $Y_{ij}$ is the individual observation, $\mu$ represents the overall mean, $t_i$ represents a deviation due to the ith treatment, $b_j$ represents a deviation due to the jth block, and $e_{ij}$ represents a random deviation associated with the individual experimental unit. In this case, i = 1 … t, where t is the number of treatments, and j = 1 … b, where b is the number of blocks.

Assuming a single treatment factor and a single block factor, the experiment is analyzed by a two-way analysis of variance
without interaction. However, a factorial treatment structure can also be used (as in the example below), so there can be two or more treatment factors. A Latin square design has two blocking factors, often designated “rows” and “columns.” Note that with these designs no two observations come from the same block and treatment, so the estimate of the standard deviation is obtained as the square root of the error mean square in the analysis of variance.

Randomized block designs can sometimes have within-block replication (e.g., two or more experimental units per block and treatment combination), but this is not discussed here.

An Example

The example given below comes from a series of studies that were aimed at exploring the effect of antioxidants on susceptibility to cancer. In this example, diallyl sulphide (DS), a substance found in garlic, was administered by gavage in three daily doses of 0.2 mg/g to 8-week-old female mice of four inbred strains, and the activity of a number of liver enzymes was compared in treated and vehicle-treated controls. This work was carried out in 1993 as part of MAFF Project FS1710 entitled “Mechanisms of modulation of carcinogens by antioxidants: genetic control of the anticarcinogenic response in mice.” The work was done under UK legislation and all animals were humanely euthanized as directed under the Animals (Scientific Procedures) Act 1986. Were this an original research paper rather than an example, then full details would need to be given according to the Animal Research: Reporting of in vivo Experiments (ARRIVE) guidelines (Kilkenny and Altman 2010).

The purpose of the experiment was to assess whether the activity of the liver enzymes was altered by the DS treatment and to see if there were any important strain differences in response. Altogether there were eight treatment combinations: 4 inbred strains × 2 treatments in a factorial arrangement, in two blocks. Both treatment and strain were regarded as fixed effects, block being a random effect. Note that strain is a classification that cannot be randomized. So in this experiment the only randomization of treatments was the decision of which of two mice of each strain within a block would receive the treatment and which would be the control. However, the order in which the animals were sacrificed in each block was also randomized.

The work was started when the MRC Toxicology Unit was being relocated from South London to Leicester. The new animal house was not yet ready, and the first block of the experiment was done with the mice housed in a plastic film isolator. The second block was done approximately two months later with the animals housed in one of the new animal rooms, so the two blocks had different environmental conditions, although these were not quantified. The determinations of enzyme activity were done separately for each block using freshly made up solutions.

The raw data showing the activity of one of the liver enzymes, glutathione-S-transferase (Gst), assayed using the chlorodinitrobenzene (CDNB) method, are shown in Table 1. The units are nmol conjugate formed per minute per mg of protein.

Table 1 Gst levels (nmol conjugate formed per minute per mg of protein) in individual mice in an RB experiment in two blocks separated by approximately two months. Note that all Block 2 values are higher than the corresponding Block 1 values

Strain      Treatment(a)   Block 1   Block 2
NIH         C                  444       764
NIH         T                  614       831
BALB/c      C                  423       586
BALB/c      T                  625       782
A/J         C                  408       609
A/J         T                  856      1002
129/Ola     C                  447       606
129/Ola     T                  719       766
Block means                    567       743

(a) C = vehicle control; T = treated with DS.

There was a large block effect, with all block 1 values being lower than the corresponding block 2 values. Why was this? The protocols were identical. It could have been due to slight differences in calibration of instruments or minor differences in reagents and solutions used to assess the enzyme activity. Possibly the animals supplied were of a slightly different age, had a different microflora, were on a different batch of diet, or perhaps the different environment of isolator versus animal room altered their response. There are many variables that can influence such results, and it is impossible to identify or control them all. What is important is the relative magnitude of each observation. This was maintained, as shown by the strong correlation of 0.88 between the two blocks (Figure 1). Large block effects are common in this type of design. They highlight the importance of having concurrent controls and randomization, as well as the danger of using historical data where differences of the sort seen here between blocks might be mistaken for treatment effects.

The analysis of variance (ANOVA; Table 2) shows a large treatment effect, no significant difference between strains (p = 0.091), but some evidence of a strain by treatment interaction (p = 0.028). When a significant two-way interaction is observed, the individual means need to be looked at separately. These are shown graphically in Figure 2, with strain NIH being less responsive to treatment with DS than the other strains. The residual mean square (2957) provides an estimate of the pooled within-group variance, so the standard deviation is 54.4 units. The experiment is a bit small according to the resource equation, with E = 7. But power has probably been increased by using inbred strains and the randomized block design.

An ANOVA makes three assumptions about the data: (1) The experimental units are independent (i.e., the treatments
have been individually assigned to the experimental units by randomization); (2) The residuals (deviation of each observation from its group mean) have a normal distribution; and (3) The variances are homogeneous. The first assumption is met in this case by randomization.

Figure 1 Plot of Gst levels in Block A versus Block B (Blocks 1 and 2 of Table 1) for the randomized block experiment. The correlation between the blocks of r = 0.88 is large and statistically highly significant (p < 0.01).

Table 2 The analysis of variance of the data shown in Table 1. By convention, p values of less than 0.05 are considered “statistically significant,” so there is no evidence of strain differences in mean Gst levels, but some evidence of strain differences in response, but also see text

Source               Df    Sum Sq   Mean Sq       F        P
Blocks                1    124256    124256   42.01   <0.001
Strain                3     28613      9538    3.22    0.091
Treatment             1    227529    227529   76.93   <0.001
Strain × treatment    3     49591     16530    5.58    0.028
Residuals             7     20701      2957

Abbreviations: Df, degrees of freedom; Sum Sq, sum of squares; Mean Sq, mean square; F, the test statistic; P, the p value, that is, the probability that the observed differences could have arisen by chance in the absence of a true treatment or strain effect.

Figure 2 Bar plot showing mean Gst levels in control (left) and treated (right) mice of each strain. Each bar is the mean of two observations taken approximately two months apart. Error bars are ± the least significant difference (α = 0.05), so if they overlap there is no significant difference between the means, and if they do not there is a significant difference (p < 0.05). All error bars are the same length because sample sizes are identical and a pooled standard deviation has been used. Note that the control values are reasonably similar, but there are slight strain differences in response (strain × treatment interaction, p < 0.05, with strain A/J responding most and NIH not significantly responding). But see text for discussion.

It is good practice to investigate the second and third assumptions using “residual model diagnostic plots,” as shown in Figure 3. The top plot is concerned with studying the homogeneity of variances. There should be a scattering of points with no pattern, as is the case here. The lower plot is a normal probability plot of the residuals. If they have a normal distribution, points should lie on a straight line. In this case, there are three points lying off the line, showing that the assumption of a normal distribution of the residuals is slightly marginal. The ANOVA is quite robust to deviations from these assumptions, but if there is a serious deviation from normality, a transformation of the scale of measurement can often be used to see if the fit is better. In this case, a log transformation of the raw data provides a slightly better fit (not shown). The analysis of variance of the log-transformed data differs in that there is no statistically significant strain × treatment interaction. It is a matter of judgment whether to use the raw or the log-transformed scale. However, in this case it makes little difference to the conclusions. There is clearly a strong treatment effect with no large strain differences. Even if the strains differed slightly in response, that magnitude of difference is unlikely to be of much biological significance. It is also worth noting that the ANOVA provides a single overall test of whether there is a treatment, strain, and interaction effect. But that is three tests. Applying a Bonferroni correction would imply that a p value of 0.05/3 = 0.017 should be used, in which case the interaction would not be “significant.” But it is the overall interpretation that is important. That rarely depends on such minute details of the statistical analysis. Even a well-designed experiment will give slightly different results if it is repeated. If it were important to find strain differences in Gst levels or in the Gst response to chemicals, which was not a purpose in this case,
then further work would be needed with a different experimental design and a wider choice of strains.

Figure 3 Residuals diagnostic plots for the example experiment using the raw data. The top plot of residuals as a function of fitted values is used to assess homogeneity of variances. If these are homogeneous, there should be a scattering of points with no pattern, as is the case here. The lower plot is a normal Q-Q plot and should give a straight-line fit to the points if the residuals have a normal distribution. In this case, there is some deviation from that ideal associated with points 1, 8 and 9 (see text for discussion).

Discussion

The purpose of this paper is to bring randomized block experimental designs to the attention of scientists using laboratory animals. Because of their valuable properties, these designs have been widely used in agricultural and industrial research for many decades. Mead (1988), who has experience in medicine, agriculture, and industry, complains that about 85% of all experiments are randomized complete block designs, and suggests that investigators should be more flexible in their choice of design. Yet these designs are rarely used in experiments involving laboratory animals. This cannot be because they are unsuitable. There is nothing about research with laboratory animals that sets it apart from all other disciplines. The reason must be that scientists are unfamiliar with these designs. Mead also says that a statistician should be fully involved with a research scientist in designing his/her experiments and that “this is the only efficient approach to designing experiments.” Yet in the last 50 years few statisticians of stature have been closely involved (e.g., published papers or books) in this area of research. The failure over such a long period of time to use the most efficient designs must surely have led to a serious waste of animals, time, and other scientific resources.

The example experiment, to determine the effect of DS on Gst levels, could have been done on a single occasion using 16 outbred CD-1 mice assigned at random to the two treatments. But it would have given neither an indication of whether there was a genetic component to the response nor whether the results were repeatable. The good agreement between the two blocks, separated by two months and a different animal house environment, suggests that the results are likely to be robust. There was no evidence of strain differences in Gst levels, although there was a hint of strain differences in the treatment response, depending on the scale of measurement. Altogether, the randomized block design gave extra information and had higher external validity at virtually no extra cost, with some assurance that the results should be reproducible.

Conclusions

Randomized block experimental designs include within-subject, crossover, and matched designs in which the experimental material is split up into a number of mini-experiments that are combined in the statistical analysis. They are widely used in many other research disciplines and, because of their useful properties, should be more widely used in laboratory animal research. They can be more convenient, more powerful, and can use fewer animals than completely randomized designs. If the blocks are separated in time and there is good agreement between them, then this gives some assurance that the experiment is reproducible. Their more widespread use would save money, animals, and other scientific resources and would speed up the development of new treatments for diseases of humans and animals.

References

Begley CG, Ellis LM. 2012. Drug development: Raise standards for preclinical cancer research. Nature 483:531–533.
Festing MFW. 1982. Genetic contamination of laboratory animal colonies: an increasingly serious problem. ILAR News 25:6–10.
Kilkenny C, Altman DG. 2010. Improving bioscience research reporting: ARRIVE-ing at a solution. Lab Anim 44:377–378.
Kilkenny C, Parsons N, Kadyszewski E, Festing MF, Cuthill IC, Fry D, Hutton J, Altman DG. 2009. Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS One 4:e7824.
Mead R. 1988. The design of experiments. Cambridge, New York: Cambridge University Press.
Prinz F, Schlange T, Asadullah K. 2011. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 10:712.
Scott S, Kranz JE, Cole J, Lincecum JM, Thompson K, Kelly N, Bostrom A, Theodoss J, Al-Nakhala BM, Vieira FG, Ramasubbu J, Heywood JA. 2008. Design, power, and interpretation of studies in the standard murine model of ALS. Amyotroph Lateral Scler 9:4–15.
Simonian SJ, Gill TJ III, Gershoff SN. 1968. Studies on synthetic polypeptide antigens. XX. Genetic control of the antibody response in the rat to structurally different synthetic polypeptide antigens. J Immunol 101:730–742.
Steen RG. 2011. Misinformation in the medical literature: what role do error and fraud play? J Med Ethics 37:498–503.
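
Worked example (illustrative sketch, not part of the original analysis)

The two-way analysis of variance described in the text can be checked by hand from the Table 1 data. The sketch below is one possible way to do that in plain Python, assuming one observation per block and treatment combination, as in the example experiment. The names `GST` and `rb_factorial_anova` are hypothetical and chosen only for this illustration. It computes sums of squares, mean squares, and F ratios but not p values, which would need the F distribution from a statistics library. The F ratios and the residual mean square can be compared with Table 2 and the text, where they should agree up to rounding.

```python
# Gst activity from Table 1: (strain, treatment) -> (Block 1, Block 2).
GST = {
    ("NIH", "C"): (444, 764), ("NIH", "T"): (614, 831),
    ("BALB/c", "C"): (423, 586), ("BALB/c", "T"): (625, 782),
    ("A/J", "C"): (408, 609), ("A/J", "T"): (856, 1002),
    ("129/Ola", "C"): (447, 606), ("129/Ola", "T"): (719, 766),
}


def rb_factorial_anova(data):
    """ANOVA for a strain x treatment factorial in randomized blocks.

    Returns {source: (df, sum_sq, mean_sq, F)}; F is None for residuals.
    Assumes one observation per block and treatment combination.
    """
    strains = sorted({s for s, _ in data})
    treatments = sorted({t for _, t in data})
    n_blocks = len(next(iter(data.values())))
    values = [y for ys in data.values() for y in ys]
    n = len(values)
    correction = sum(values) ** 2 / n  # (grand total)^2 / N

    def ss_from_totals(totals):
        # Every total is a sum over the same number of observations.
        per_total = n // len(totals)
        return sum(t * t for t in totals) / per_total - correction

    block_tot = [sum(ys[b] for ys in data.values()) for b in range(n_blocks)]
    strain_tot = [sum(sum(ys) for (s, _), ys in data.items() if s == k)
                  for k in strains]
    treat_tot = [sum(sum(ys) for (_, t), ys in data.items() if t == k)
                 for k in treatments]
    cell_tot = [sum(ys) for ys in data.values()]

    ss = {
        "Blocks": ss_from_totals(block_tot),
        "Strain": ss_from_totals(strain_tot),
        "Treatment": ss_from_totals(treat_tot),
    }
    # Interaction = variation among the 8 combinations minus main effects.
    ss["Strain x treatment"] = (ss_from_totals(cell_tot)
                                - ss["Strain"] - ss["Treatment"])
    total_ss = sum(y * y for y in values) - correction
    ss["Residuals"] = total_ss - sum(ss.values())

    a, t = len(strains), len(treatments)
    df = {"Blocks": n_blocks - 1, "Strain": a - 1, "Treatment": t - 1,
          "Strain x treatment": (a - 1) * (t - 1)}
    df["Residuals"] = (n - 1) - sum(df.values())

    resid_ms = ss["Residuals"] / df["Residuals"]
    table = {}
    for source in ss:
        ms = ss[source] / df[source]
        f_ratio = None if source == "Residuals" else ms / resid_ms
        table[source] = (df[source], ss[source], ms, f_ratio)
    return table


anova = rb_factorial_anova(GST)
for source, (dof, sum_sq, mean_sq, f_ratio) in anova.items():
    f_text = "" if f_ratio is None else f"{f_ratio:8.2f}"
    print(f"{source:20s}{dof:3d}{sum_sq:12.1f}{mean_sq:12.1f}{f_text}")

error_df = anova["Residuals"][0]          # E in the resource equation
pooled_sd = anova["Residuals"][2] ** 0.5  # square root of error mean square
print(f"E = {error_df}, pooled SD = {pooled_sd:.1f}")
```

Because the block sum of squares is removed before the residual is formed, the large difference between the two blocks does not inflate the error mean square, which is the sense in which blocking increases power in this design.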
