(Psychometrics) Roger E. Kirk - Experimental Design-Procedures for the Behavioral Sciences-Sage Publication, Inc. (2012)
1.1 Introduction
Sir Ronald Fisher, the statistician, eugenicist, evolutionary biologist, geneticist, and father of
modern experimental design, observed that experiments are “only experience carefully planned
in advance, and designed to form a secure basis of new knowledge” (Fisher, 1935a, p. 8). The
design of experiments to investigate scientific or research hypotheses involves a number of
interrelated activities.
Experimental design, the subject of this book, refers to a plan for assigning subjects to
experimental conditions and the statistical analysis associated with the plan. Selecting an
appropriate plan and performing the correct statistical analysis are important facets of scientific
research. However, the most important facet—identifying relevant research questions—is
outside the scope of this book. The reader should remember that a carefully conceived and
executed design is of no value if the scientific hypothesis that led to the experiment is without
merit. Careful planning should always precede the data collection phase of an experiment.
Data collection is usually the most costly and time-consuming aspect of an experiment.
Advance planning helps to ensure that the data can be used to maximum advantage. No
amount of statistical wizardry can salvage a badly designed experiment.
Chapters 1 to 3 provide an overview of important design concepts and analysis tools that are
used throughout the remainder of the book. Chapter 3 describes a procedure developed by
Ronald A. Fisher called the analysis of variance. The procedure is used to decompose the
total variation displayed by a set of observations into two or more identifiable sources of
variation. Analysis of variance enables researchers to interpret the variability in designed
experiments. Fisher showed that by comparing the variability among subjects treated differently
to the variability among subjects treated alike, researchers can make informed choices between
competing hypotheses in science and technology. A detailed examination of each analysis of
variance design begins in Chapter 4. This examination includes a description of the design,
conditions under which the design is appropriate, assumptions associated with the design, a
computational example, and advantages and disadvantages of the design.
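The logic of Fisher's decomposition can be sketched in a few lines of Python. The food-consumption scores and group labels below are invented for illustration; the point is only that the total variation splits exactly into a between-groups part (subjects treated differently) and a within-groups part (subjects treated alike).

```python
# Toy illustration of the ANOVA sum-of-squares decomposition.
# All scores below are invented for this sketch.
groups = {
    "control":   [10.0, 12.0, 11.0, 13.0],
    "low_dose":  [9.0, 8.0, 10.0, 9.0],
    "high_dose": [6.0, 7.0, 5.0, 8.0],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = sum(all_scores) / len(all_scores)

def mean(xs):
    return sum(xs) / len(xs)

# Between-groups sum of squares: variability among subjects treated differently.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

# Within-groups sum of squares: variability among subjects treated alike.
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

# The total sum of squares splits exactly into the two identifiable sources.
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
print(ss_between, ss_within, ss_total)  # 50.0 12.0 62.0
```

Comparing the between-groups variability to the within-groups variability is what lets the researcher choose between competing hypotheses.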
Two kinds of computational algorithms are provided for the designs. The first, referred to as the
classical sum-of-squares approach, uses scalar algebra and is suitable for calculators. The
second, called the cell means model approach, uses matrix algebra and is more suitable for
computers.

Some questions currently cannot be subjected to scientific investigation. For example, the
questions “Can three or more angels dance on the head of a pin?” and “Does life exist in more
than one galaxy in the universe?” cannot be answered because no procedures now exist for
observing either angels or life in other galaxies. Scientists confine their research hypotheses to
questions that can be answered by procedures that are available or that can be developed.
Thus, the question concerning the existence of life in other galaxies currently cannot be
investigated, but with continuing advances in space science, it is likely that eventually the
question will be answered.
Much research departs from this pattern because nature rather than the researcher
manipulates the independent variable. It would be unethical, for example, to study the effects
of prenatal malnutrition on IQ by deliberately providing pregnant women with inadequate diets.
Instead, the question is investigated by locating children whose mothers were malnourished
during pregnancy and then comparing their IQs with those of children whose mothers were not
malnourished. Research strategies in which the independent variable is not manipulated by the
researcher include surveys, case studies, and naturalistic observation. These research
strategies pose special problems for researchers who want to make causal inferences, as I
discuss in Section 1.3.
In the radiation example cited earlier, the presence or absence of radiation is the independent
variable—the variable that is manipulated by the researcher. More generally, an independent
variable is any suspected causal event that is under investigation. The terms independent
variable and treatment are interchangeable. A dependent variable is the measurement that is
used to assess the effects, if any, of manipulating the independent variable. In the radiation
example, the dependent variable is the amount of food consumed by the rats.
The independent variable in the radiation example is the presence or absence of radiation. The
treatment has two levels. If the researcher is interested in the nature of the relationship
between the radiation dose and food consumption, three or more levels of radiation must be
used. The levels could be 0 microwatts, 20,000 microwatts, 40,000 microwatts, and 60,000
microwatts of radiation. This treatment is an example of a quantitative independent variable
in which different treatment levels are different amounts of the independent variable.
When the independent variable is quantitative, the levels of the variable are generally chosen
so that they are equally spaced. Usually there is little interest in the exact values of the
treatment levels used in the experiment. In the radiation example, the research hypothesis also
could be investigated using three other levels of radiation—say, 25,000, 50,000, and 75,000
microwatts in addition to the 0 microwatt control level. The treatment levels should cover a
sufficiently wide range so that the effects of the independent variable can be detected if such
effects exist. In addition, the number and spacing of the levels should be sufficient to define the
shape of the function that relates the independent and dependent variables. Selection of the
appropriate levels of the independent variable can be based on the results of previous
experiments or on theoretical considerations. It may be beneficial to carry out a small pilot
experiment to identify the most appropriate treatment levels. A pilot experiment also is useful for
determining the number of experimental units required to test the statistical hypothesis.
Under the conditions described in Chapters 3 and 4, the levels of a quantitative independent
variable can be selected randomly from a population of treatment levels. If this procedure is
followed, a researcher can extrapolate from the results of the experiment to treatment levels
that are not included in the experiment. If the treatment levels are not randomly sampled, the
results of an experiment apply only to the specific levels included in the experiment.
Often a different type of independent variable is used. For example, if the treatment levels are
unmodulated radiation, amplitude-modulated radiation, and pulse-modulated radiation, the
treatment is called a qualitative independent variable. The different treatment levels
represent different kinds rather than different amounts of the independent variable. The
particular levels of a qualitative independent variable used in an experiment are generally of
specific interest to a researcher. And the levels chosen are usually dictated by the research
hypothesis.
Several independent variables can be used in an experiment, but the designs described in this
book are limited to the assessment of one dependent variable at a time. If it is necessary to
evaluate two or more dependent variables simultaneously, a multivariate analysis of variance
design can be used.3 The selection of the most fruitful variables to measure should be
determined by a consideration of the sensitivity, reliability, distribution, and practicality of the
possible dependent variables. From previous experience, a researcher may know that one
dependent variable is more sensitive than another to the effects of a treatment or that one
variable is more reliable—that is, gives more consistent results—than another variable.
Because behavioral research generally involves a sizable investment in time and material
resources, the dependent variable should be reliable and maximally sensitive to the
phenomenon under investigation. Choosing a dependent variable that possesses these two
characteristics can minimize the amount of time and effort required to investigate a research
hypothesis.
Other factors to consider in selecting a dependent variable are whether the population
distributions for the various treatment levels are approximately normal and whether the
populations have equal variances. I have more to say about these factors in Chapter 3 when I
discuss the assumptions underlying the analysis of variance (ANOVA). If theoretical
considerations do not dictate the selection of a dependent variable and if several alternative
variables are equally sensitive and reliable, in addition to being normally distributed with equal
variances, a researcher should select the variable that is most easily measured.
Nuisance Variables
In addition to independent and dependent variables, all experiments include one or more
nuisance variables. Nuisance variables are undesired sources of variation in an experiment
that affect the dependent variable. As the name implies, the effects of nuisance variables are of
no interest per se. There are many potential sources of nuisance variables. For example, the
calibration of equipment may change during the course of an experiment; the presentation of
instructions may vary slightly from subject to subject; errors may occur in measuring or
recording a subject's response; environmental factors such as room illumination, noise level,
and room temperature may not be constant for all subjects; and subjects may experience
lapses in attention, concentration, and interest.
In the radiation example, potential nuisance variables include the sex of the rats, differences in
the weights of the rats prior to the experiment, presence of infectious diseases in one or more
cages where the rats are housed, temperature variation among the cages, and differences in
previous feeding experiences of the rats. If not controlled, nuisance variables can affect the
outcome of an experiment. For example, if rats in the radiated groups suffer from some
undetected disease, differences among the groups will reflect the effects of the disease in
addition to radiation effects—if such effects exist.
The effect of a nuisance variable can take several forms. For example, a nuisance variable can
systematically distort results in a particular direction, in which case the effect is called bias.
Alternatively, a nuisance variable can increase the variability of the phenomenon being
measured and thereby increase the error variance. Error variance is variability among
observations that cannot be attributed to the effects of the independent variable. You also can
think of error variance as differences in the performance of subjects who are treated alike.
Sometimes a nuisance variable both systematically distorts results in a particular direction and
increases the error variance.
Nuisance variables are undesired sources of variation and hence are threats to drawing valid
inferences from research. Other threats to valid inference making are described in Sections 1.5
and 1.6.
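The distinction between bias and error variance can be illustrated with a small simulation. The numbers and nuisance effects below are invented: a biasing nuisance variable shifts every observation in the same direction, whereas one that merely inflates error variance leaves the mean roughly intact but increases the spread.

```python
import random

random.seed(0)     # reproducible illustration
true_value = 5.0   # the quantity we are trying to measure
n = 1000

# A biasing nuisance variable shifts every observation the same way (+2.0 here).
biased = [true_value + 2.0 + random.gauss(0, 1) for _ in range(n)]

# A nuisance variable that inflates error variance adds only unsystematic noise.
noisy = [true_value + random.gauss(0, 3) for _ in range(n)]

mean_biased = sum(biased) / n  # centered near 7.0, not 5.0: systematic distortion
mean_noisy = sum(noisy) / n    # still centered near 5.0, but more spread out

var_biased = sum((x - mean_biased) ** 2 for x in biased) / (n - 1)
var_noisy = sum((x - mean_noisy) ** 2 for x in noisy) / (n - 1)
```

The biased sample misleads about the true value no matter how many subjects are run; the noisy sample merely requires more subjects to detect an effect.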
Research is performed for the following purposes: (1) to explore, (2) to describe or classify, (3)
to establish relationships, or (4) to establish causality. Over the years, researchers have
developed a variety of research strategies to accomplish these purposes. These strategies
include the experiment, quasi-experiment, survey, case study, and naturalistic observation.
Experiments
As noted earlier, this book is concerned with two aspects of experiments: the plan for assigning
subjects to experimental conditions and the statistical analysis associated with the plan.
Because the statistical analysis procedures for experiments also are applicable to other
research strategies, I briefly describe some of these strategies next.
Quasi-Experiments
Quasi-experiments are similar to experiments except that the subjects are not randomly
assigned to the independent variable. Quasi-experiments are used instead of experiments
when random assignment is not possible or when, for practical or ethical reasons, it is
necessary to use preexisting or naturally occurring groups such as subjects with a particular
characteristic.

There is always the possibility that some variable other than the higher fluoride level was
responsible for the observed difference in tooth decay between Newburgh and Kingston.
However, every effort, short of random assignment, was made to eliminate other variables as
explanations for the observed difference. Random assignment is the best safeguard against
undetected nuisance variables. As a general principle, the difficulty of unambiguously
interpreting the outcome of research varies inversely with the degree of control that a
researcher is able to exercise over randomization.
Surveys
Surveys rely on the technique of self-report to obtain information about such variables as
people's attitudes, opinions, behaviors, and demographic characteristics. The data are
collected by means of an interview or a questionnaire. Although surveys cannot establish
causality, they can explore, describe, classify, and establish relationships among variables.
Case Studies
In a case study, a researcher observes selected aspects of a subject's behavior over a period
of time. The subject is usually a person, but it may be a setting such as a business, school, or
neighborhood. Often the subject possesses an unusual or noteworthy condition. For example,
Luria (1968) studied the Russian mnemonist Shereshevskii, who used mnemonic tricks and
devices to remember phenomenal amounts of material. Significant discoveries also may result
from studying less remarkable subjects. Jean Piaget's theory of intellectual development, for
example, evolved from his intensive observation of his own three children. He presented tasks
in a nonstandard manner to one child at a time in informal settings and observed the child's
verbal and motor responses. Piaget did not attempt to systematically manipulate preselected
independent variables, nor did he focus on just one or two dependent variables. Instead, his
approach was quite flexible, which allowed him to alter his procedures and follow up on new
hypotheses. His flexible case study approach uncovered knowledge about children's cognitive
development that might not have been discovered by a more rigid experimental approach.
Case studies can lead to interesting insights that merit further investigation. However, case
studies are particularly susceptible to the effects of nuisance variables. Furthermore, questions
arise about the degree to which the findings generalize to other populations.
Naturalistic Observation
Naturalistic observation is one of the oldest methods for studying individuals and events. In
some sciences, most notably astronomy, the strategy has led to extremely accurate predictions.
Classic examples of naturalistic observation are Charles Darwin's voyages on the HMS Beagle
as he compiled the data that led to the theory of evolution and Jane Goodall's (1971, 1986)
study of chimpanzees in their natural habitat in Tanzania, which gave us a new appreciation for
this highly social animal.
As a research strategy, naturalistic observation has two advantages over more controlled
strategies such as the experiment. First, findings from naturalistic observations generalize
readily to other real-life situations. Second, the strategy avoids the reactive arrangements
problem that is described in Section 1.5. This problem is avoided because subjects are
unaware that their behavior is being studied; hence, they do not react in an unnatural way as
they might if they were aware that they were being studied. Unfortunately, there are some
serious limitations associated with naturalistic observation. Although the strategy is useful for
describing what happened, it does not yield much information about why something happened.
To find out why something happened, it is necessary to tamper with the natural course of
events. Also, the strategy is an inefficient way to answer “What if?” questions because the
event of interest may occur infrequently or not at all in a natural setting.
In this section, I described five widely used research strategies. The strategies are presented in
order of decreasing control of the independent and dependent variables. Research always
involves a series of trade-offs—a theme I return to time and again. As our control of the
independent and dependent variables decreases, our ability to unambiguously interpret the
outcome of the research decreases, but our ability to generalize the results to the real world
increases.
The classification scheme for research strategies that I have described is widely used, but it is
not exhaustive. There are numerous other ways of classifying research strategies. Each
discipline tends to develop its own nomenclature and categories. This section describes some
other ways of categorizing research strategies.
The term ex post facto study (after-the-fact study) refers to any nonexperimental research
strategy in which subjects are singled out because they have already been exposed to a
particular condition or because they exhibit a particular characteristic. In such studies, the
researcher does not manipulate the independent variable or assign the experimental conditions
to the subjects. The retrospective cohort study and the case-control study, described in the
following section, are examples of ex post facto studies.
Retrospective and prospective studies are nonexperimental research strategies in which the
independent and dependent variables occur before or after, respectively, the beginning of the
study. Retrospective studies use historical records to look backward in time; prospective studies
look forward in time. A retrospective study is particularly useful for studying the relationship
between variables that occur infrequently or variables whose occurrence is difficult to predict.
For example, much of our knowledge about the health effects of ionizing radiation came from
studying persons exposed to the World War II bombings of Hiroshima and Nagasaki. A
retrospective study also is useful when there is a long time interval between a presumed cause
and effect. For example, a decade or more can pass between exposure to a carcinogen and the
clinical detection of cancer.
There are two types of retrospective studies: retrospective cohort studies and case-control
studies. In a retrospective cohort study, also called a historical cohort study, records are
used to identify two groups of subjects: those who have and those who have not been exposed
to the independent variable. Once the exposed and nonexposed groups have been identified,
they are compared in terms of the frequency of occurrence of the dependent variable.
Consider, for example, McMichael, Spirtas, and Kupper's (1974) study of workers in the rubber
industry. Employment records were used to identify 6678 workers who were alive on January 1,
1964. The mortality experience of these workers over the following 9-year period was compared
with the mortality experience of persons in the same age and gender categories in the U.S.
population. The researchers found that the rubber workers had much higher death rates from
cancers of the stomach, prostate, and hematopoietic tissues.
In a case-control study, also called a case-referent study, records are used to identify two
groups of subjects: those who exhibit evidence of the dependent variable, called cases, and
those who do not, called controls. The cases and controls are then compared in terms of their
previous exposure to the independent variable. Consider, for example, the study by Clarke,
Morgan, and Newman (1982), who investigated the relationship between cigarette smoking and
cancer of the cervix. One hundred eighty-one women with cervical cancer (cases) and 905
women without cervical cancer (controls) were interviewed to determine their smoking histories.
The researchers found that a much larger proportion of the cases than the controls had
smoked cigarettes.
Neither the retrospective cohort study nor the case-control study can establish a causal
relationship. However, the research strategies can suggest interesting relationships that warrant
experimental investigation. In the retrospective cohort study, more than one dependent variable
can be investigated, but only one independent variable can be studied at a time. In the case-
control study, multiple independent variables can be investigated, but only one dependent
variable can be studied at a time. Despite these and other limitations, both research strategies
have been particularly useful in the health sciences.
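Because the cases in a case-control study are sampled directly, incidence rates cannot be computed; the standard measure of association is the odds ratio. The sketch below uses an invented 2 × 2 table of smoking by disease status, not the actual Clarke, Morgan, and Newman (1982) counts.

```python
# Hypothetical 2x2 table for a case-control study. The counts are
# invented for illustration, NOT the actual Clarke et al. (1982) data.
#                 cases  controls
smoked       = (120,   450)
never_smoked = (61,    455)

# Odds ratio: the cross-product ratio of the 2x2 table. A value above 1
# suggests that exposure is more common among cases than controls.
odds_ratio = (smoked[0] * never_smoked[1]) / (smoked[1] * never_smoked[0])
print(round(odds_ratio, 2))  # 1.99
```

An odds ratio near 2 in these invented data would suggest a relationship worth investigating with a stronger research strategy; it would not, by itself, establish causality.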
As noted earlier, a prospective study, also called a follow-up study, longitudinal study, or
cohort study, is a nonexperimental research strategy in which the independent and dependent
variables are observed after the onset of the investigation. Subjects are classified as exposed or
nonexposed based on whether they have been exposed to a naturally occurring independent
variable. The exposed and nonexposed groups are then followed for a period of time, and the
incidence of the dependent variable is recorded. A classic example is the Framingham Study (T.
Gordon & Kannel, 1970), which attempted to identify factors related to the dependent variable
of cardiovascular disease. In the study, more than 5000 persons living in Framingham,
Massachusetts, who did not have clinical evidence of atherosclerotic heart disease were
examined at 2-year intervals for more than 30 years. The study identified several factors,
including hypertension, elevated serum cholesterol, and cigarette smoking, that were related to
cardiovascular disease.
Prospective studies have advantages over retrospective studies. First, the purported cause
(independent variable) clearly precedes the effect (dependent variable); second, the amount
and quality of information are not limited by the availability of historical records or the
recollections of subjects; and third, measures of the incidence of the dependent variable can
be computed. But prospective studies have some serious limitations, too. If the dependent
variable is a rare event, a prohibitively large sample may be required to find a sufficient number
of subjects who develop the rare event. Also, the investigation of a chronic process using a
prospective study may require years to complete. Unfortunately, lengthy studies often suffer
from logistic problems such as keeping in touch with the subjects and turnover of the research
staff. The distinguishing features of retrospective and prospective studies are summarized in
Table 1.4-1.
The term longitudinal study refers to any research strategy in which the same individuals are
observed at two or more times. Usually the time interval between observations is fairly long. For
example, in the Framingham Study mentioned earlier, subjects were examined at 2-year
intervals for more than 30 years in an attempt to identify factors related to cardiovascular
disease.
Identifying changes in
individuals over time is not difficult, but identifying the cause of the changes can be a problem
because it is difficult to control all nuisance variables over an extended period of time. As a
result, a researcher is often faced with competing explanations for the observed changes. The
longer the study, the more numerous the competing explanations. There are other problems
with longitudinal studies. Over the course of a long study, subjects move, die, or decide to drop
out of the study. Often the attrition rates for the groups being followed are different, which
introduces another source of bias. Also, longitudinal studies tend to be more expensive and
require a longer commitment of a researcher's time than cross-sectional studies, which are
described next.
A cross-sectional study is any research strategy in which two or more cohorts are observed at
the same time. As used here, a cohort denotes a group of people who have
experienced a significant life event such as a birth, marriage, or illness during a given time
interval—say, a calendar year or a decade. The Newburgh-Kingston Caries-Fluorine Study
mentioned earlier involved several cohort comparisons: children living in Newburgh versus
those living in Kingston and 6- to 9-year-olds versus older children.
Cross-sectional studies tend to be less expensive than longitudinal studies, and they provide
more immediate results. Also, attrition of subjects is less likely to be a problem in cross-
sectional studies. However, as mentioned earlier in discussing the Newburgh-Kingston Caries-
Fluorine Study, there is always the possibility that even in a well-designed cross-sectional
study, variables other than those under investigation are responsible for the observed
difference in the dependent variable. As noted earlier, random assignment is the best
safeguard against undetected nuisance variables.
The two research strategies described in this section combine features of longitudinal and
cross-sectional studies. A longitudinal-overlapping study, also called a sequential study,
can be used to compress the time required to perform a longitudinal study. Suppose that a
researcher wants to observe children at 2-year intervals from ages 5 through 13. A longitudinal
study would require 8 years. This time can be compressed to 4 years by observing a group of
5-year-olds and a second group of 9-year-olds. The 5-year-old children are observed at ages 5,
7, and 9; the 9-year-old children are observed at ages 9, 11, and 13. Note the overlapping age:
Both groups include 9-year-olds. The layout for this study is diagrammed in Figure 1.4-1, where
O1, O2, and O3 denote the first, second, and third observations of the children in each group,
respectively. In addition to cutting the length of the study in half, this research strategy enables
a researcher to compare 5- and 9-year-olds after completing the first set of observations. This
comparison would be delayed for 4 years in a longitudinal study. The earlier discussion of the
advantages and disadvantages of cross-sectional studies is applicable to a longitudinal-
overlapping study.
Figure 1.4-1 ▪ Simplified layout for a longitudinal-overlapping study, where O1, O2, and
O3 denote, respectively, the first, second, and third observations (Obs.) on the children
In a time-lag study, observations are made at two or more times but different subjects
(cohorts) are measured at each time. Consider, for example, the annual administration of the
Scholastic Aptitude Test to high school juniors and seniors. For a number of years, the test
score means for seniors have been declining. This example of a time-lag study shares some of
the characteristics of longitudinal and cross-sectional studies. The test scores are obtained at
two or more times, as in a longitudinal study, but as in a cross-sectional study, different senior
classes are observed at each testing period. The layout for this study is diagrammed in Figure
1.4-2, where the groups represent five senior classes that are each observed once and Oi
denotes one of the i = 1, …, 5 observations.
A time-series study involves making multiple observations of one or more subjects or cohorts
before and after the introduction of an independent variable. The independent variable may or
may not be controlled by the researcher. Consider a study to determine the effect of banning
the importation of assault rifles in 2005 on the incidence of homicides and suicides. One way to
evaluate the effect of the ban is to compare the number of homicides and suicides in 2004 with
the number in 2005. Suppose that the data in Figure 1.4-3(a) are obtained. Because of the
reduction from 2004 to 2005, one might infer that the ban reduced the number of homicides
and suicides. However, other nuisance variables such as an unusually cool summer could have
been responsible for the reduction. A time-series study would provide stronger evidence for or
against the effectiveness of banning the importation of assault rifles. Following this approach, a
researcher would record the number of homicides and suicides for several years before and
after the ban and note trends in the data. Consider the hypothetical data in Figures 1.4-3(b–d).
Figure 1.4-3(b) suggests that the decrease in the number of homicides and suicides from 2004
to 2005 reflected nothing more than random year-to-year variation. Figure 1.4-3(c) suggests
that the ban had only a temporary effect. Figure 1.4-3(d) suggests that the ban had no effect
because similar reductions were observed during the years prior to and after the ban. These
hypothetical examples illustrate the importance of obtaining multiple observations so that
change can be viewed within a context.
Figure 1.4-3 ▪ Part (a) shows the decrease in the number of homicides and suicides
following a ban on the importation of assault rifles in 2005. A time-series study can place
the data in perspective. The hypothetical data in part (b) suggest that the decrease in the
number of homicides and suicides from 2004 to 2005 reflected random year-to-year
variation, part (c) suggests that the ban had a temporary effect, and part (d) suggests that
the ban had no effect.
A single-case study, not to be confused with the case studies described in Section 1.3, has
many of the characteristics of a time-series study. However, in a single-case study, multiple
observations of a single subject are made before and after the introduction of an independent
variable, and the researcher controls the independent variable.
Single-case studies were widely used in the behavioral sciences in the late 1800s and early
1900s. Examples include the pioneering work of Ebbinghaus (1850–1909) on forgetting,
Wundt's (1832–1920) research on sensory and perceptual processes, and Titchener's (1867–
1927) measurement of sensory thresholds. Researchers began to use large samples and
random assignment in the 1920s and 1930s, primarily because of the influence of R. A. Fisher
(1890–1962). B. F. Skinner's (1904–1990) research on schedules of reinforcement in the 1940s
and 1950s rekindled an interest in single-case studies. This research strategy has proven to be
particularly useful in assessing the effects of an intervention in clinical psychology research.6
The simplest single-case study uses an A-B design. The letter A denotes a baseline phase
during which no intervention is in effect; the letter B denotes the intervention phase. The
baseline phase serves three purposes: It provides data about a subject's performance prior to
instituting an intervention, it provides a basis for predicting a subject's future performance in the
absence of an intervention, and it indicates the normal variability in the subject's performance.
Consider an experiment to reduce the occurrence of thumb sucking of a 6-year-old named Bill.
Bill usually sucked his thumb at bedtime while his mother read to him. During the baseline
phase that lasted 3 days, Bill's mother read to him while an older sibling recorded the percent
of story-reading time during which Bill sucked his thumb. During the treatment phase, when Bill
began sucking his thumb, his mother would stop reading and remain quiet until Bill removed
his thumb from his mouth. By the end of the seventh treatment day, Bill had stopped sucking
his thumb when his mother read to him. The layout for this study is diagrammed in Figure 1.4-
4, where Oi denotes one of the i = 1, …, n observations of the dependent variable.
Figure 1.4-4 ▪ Simplified layout for a single-case study, where O1, O2, …, Oi denote
observations on a subject during the baseline period (A phase) and Oi+1, Oi+2, …, On
denote observations during the treatment period (B phase). Any difference between the A
and B phases in the mean of the observations or change in the trend of the observations
is attributed to the intervention.
In this example, the treatment appears to be related to the cessation of thumb sucking. But
there is always the possibility that coincidental changes in a nuisance variable were completely
or partly responsible for the cessation of thumb sucking. Statistical regression, which is
described in Section 1.5, is a potential nuisance variable in this kind of research because the
behavior that is to be altered is one that occurs frequently. Because of statistical regression,
there is a tendency for the frequency of behaviors that have a high rate of occurrence to
decrease in the absence of any intervention, as well as a tendency for the frequency of
behaviors that have a low rate of occurrence to increase. In the thumb-sucking example, a
stronger case for the efficacy of the treatment could be made if thumb sucking reappears when
the treatment is withdrawn—that is, when Bill's mother continues reading even though Bill
sucks his thumb. This modified design, with the sequence of events baseline (A), treatment (B), and baseline (A),
is diagrammed in Figure 1.4-5. Note that there are two opportunities to observe the effects of
the treatment: the transition from the baseline to the treatment (A-B) and the transition from the
treatment to the baseline (B-A). The presence of two transitions in the A-B-A design decreases
the probability that changes in the dependent variable are the result of coincidental changes in
a nuisance variable. A problem with this design is that the experiment ends on a baseline
phase—a phase during which thumb sucking is expected to reappear. The solution to this
problem is to reintroduce the B phase following the second A phase so that the experiment
ends with the intervention phase. The design is called an A-B-A-B design and is shown in
Figure 1.4-6. This design has the added advantage of providing three transitions: from A to B,
from B to A, and from A to B. Hence there are three opportunities to evaluate the efficacy of the
treatment. In a single-subject study, the use of one or more reversals in which a treatment is
withdrawn to see whether the dependent variable returns to the baseline level can raise ethical
questions. For example, if a treatment is successful in stopping an autistic child from
repeatedly hitting his or her head against a wall, the withdrawal of the treatment and the
subsequent return to head banging could result in physical injury to the child. In this example,
the withdrawal of the treatment would be unacceptable.
Figure 1.4-5 ▪ Simplified layout for a single-case study, where O1, O2, …, On denote observations on a subject during the successive baseline (A), treatment (B), and baseline (A) phases of an A-B-A design.
Figure 1.4-6 ▪ Simplified layout for a single-case study, where O1, O2, …, On denote observations on a subject during the successive phases of an A-B-A-B design.
I have described a variety of research strategies in Section 1.3 and this section. In the next two
sections, I briefly examine some threats to drawing valid inferences from research. In Section
1.7, I describe some general approaches to controlling nuisance variables and minimizing
threats to valid inference making.
Two goals of research are to draw valid conclusions about the effects of an independent
variable and to make valid generalizations to populations and settings of interest. Shadish,
Cook, and Campbell (2002), drawing on the earlier work of Campbell and Stanley (1966), identified four categories of threats to valid inference making:
1.Statistical conclusion validity is concerned with threats to valid inference making that
result from random error and the ill-advised selection of statistical procedures.
2.Internal validity is concerned with correctly concluding that an independent variable is, in
fact, responsible for variation in the dependent variable.
3.Construct validity of causes and effects is concerned with the possibility that operations
that are meant to represent the manipulation of a particular independent variable can be
construed in terms of other variables.
4.External validity is concerned with the generalizability of research findings to and across
populations of subjects and settings.
This book is concerned with three of the threats to valid inference making: threats to statistical
conclusion validity, internal validity, and external validity. In the discussion that follows, I focus
on the three threats. The reader is encouraged to consult the original sources: Campbell and
Stanley (1966) and Shadish et al. (2002). The latter book should be read by all researchers
who, for one reason or another, are unable to randomly assign subjects to treatment
conditions.
1.Low statistical power. A researcher may fail to reject a false null hypothesis because the
sample size is inadequate, irrelevant sources of variation are not controlled or isolated, or
inefficient test statistics are used.
2.Violated assumptions of statistical tests. Test statistics require the tenability of certain
assumptions. If these assumptions are not met, incorrect inferences may result. This threat
is discussed in Section 3.5.
3.Fishing for significant results and the error rate problem. With certain test statistics, the
probability of drawing one or more erroneous conclusions increases as a function of the
number of tests performed. This threat to valid inference making is discussed in detail in
Chapter 5.
4.Reliability of measures. The use of a dependent variable that has low reliability may inflate
the estimate of the error variance and result in not rejecting a false null hypothesis.
5.Reliability of treatment implementation. Failure to standardize the administration of
treatment levels may inflate the estimate of the error variance and result in not rejecting a
false null hypothesis.
6.Random irrelevancies in the experimental setting. Variation in the environment (physical,
social, etc.) in which a treatment level is administered may inflate the estimate of the error
variance and result in not rejecting a false null hypothesis.
1.History. Events other than the administration of a treatment level that occur between the
time a subject is assigned to the treatment level and the time the dependent variable is
measured may affect the dependent variable.
2.Maturation. Processes not related to the administration of a treatment level that occur
within subjects simply as a function of the passage of time (growing older, stronger, larger,
more experienced, etc.) may affect the dependent variable.
3.Testing. Repeated testing of subjects may result in familiarity with the testing situation or
acquisition of information that can affect the dependent variable.
4.Instrumentation. Changes in the calibration of a measuring instrument, shifts in the criteria
used by observers and scorers, or unequal intervals in different ranges of a measuring
instrument can affect the measurement of the dependent variable.
5.Statistical regression. When the measurement of the dependent variable is not perfectly
reliable, there is a tendency for extreme scores to regress or move toward the mean.
Statistical regression operates to (a) increase the scores of subjects originally found to
score low on a test, (b) decrease the scores of subjects originally found to score high on a
test, and (c) not affect the scores of subjects at the mean of the test. The amount of
statistical regression is inversely related to the reliability of the test.
6.Selection. Differences among the dependent-variable means may reflect prior differences
among the subjects assigned to the various levels of the independent variable.
7.Mortality. The loss of subjects in the various treatment conditions may alter the distribution
of subject characteristics across the treatment groups.
8.Interactions with selection. Some of the foregoing threats to internal validity may interact
with selection to produce effects that are confounded with or indistinguishable from
treatment effects. Among these are selection history effects and selection maturation
effects. For example, selection maturation effects occur when subjects with different
maturation schedules are assigned to different treatment levels.
9.Ambiguity about the direction of causal influence. In some types of research—for
example, correlational studies—it may be difficult to determine whether X is responsible for
the change in Y or vice versa. This ambiguity is not present when X is known to occur
before Y.
10. Diffusion or imitation of treatments. Sometimes the independent variable involves
information that is selectively presented to subjects in the various treatment levels. If the
subjects in different levels can communicate with one another, differences among the
treatment levels may be compromised.
11. Compensatory rivalry by respondents receiving less desirable treatments. When
subjects in some treatment levels receive goods or services generally believed to be
desirable and this becomes known to subjects in treatment levels that do not receive those
goods and services, social competition may motivate the subjects in the latter group, the
control subjects, to attempt to reverse or reduce the anticipated effects of the desirable
treatment levels. Saretsky (1972) named this the “John Henry” effect in honor of the steel
driver who, upon learning that his output was being compared with that of a steam drill,
worked so hard that he outperformed the drill and died of overexertion.
12. Resentful demoralization of respondents receiving less desirable treatments. If
subjects learn that the treatment level to which they have been assigned received less
desirable goods or services, they may experience feelings of resentment and
demoralization. Their response may be to perform at an abnormally low level, thereby
increasing the magnitude of the difference between their performance and that of subjects
assigned to the desirable treatment level.
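Regression toward the mean (threat 5 above) is easy to demonstrate by simulation. The sketch below is mine, with an invented population: subjects selected for extremely low scores on an unreliable test score higher, on average, when retested, even though no intervention has occurred.

```python
import random
import statistics

def simulate_regression_to_mean(n=10_000, error_sd=10.0, seed=1):
    """Select the lowest-scoring 10% on an unreliable test, then retest them.

    Hypothetical population: true scores ~ N(50, 10); each observed score
    adds independent measurement error ~ N(0, error_sd)."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    test1 = [t + rng.gauss(0, error_sd) for t in true_scores]
    low = sorted(range(n), key=lambda i: test1[i])[: n // 10]  # bottom 10%
    test2 = [true_scores[i] + rng.gauss(0, error_sd) for i in low]
    m1 = statistics.mean(test1[i] for i in low)
    m2 = statistics.mean(test2)
    return m1, m2  # m2 lies between m1 and the population mean of 50

m1, m2 = simulate_regression_to_mean()
```

The larger the error variance relative to the true-score variance (that is, the lower the reliability), the larger the regression effect, consistent with the statement that the amount of statistical regression is inversely related to the reliability of the test.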
differently than subjects who are not aware that they are being observed.
6. Multiple-treatment interference. When subjects are exposed to more than one treatment, the results may generalize only to populations that have been exposed to the same
combination of treatments.
Experimenter-Expectancy Effect
Demand Characteristics
Subject-Predisposition Effects
Cooperative-subject effect. The first predisposition is that of the cooperative subject whose
main concern is to please the researcher and be a “good subject.” Cooperative subjects are
particularly susceptible to the experimenter-expectancy effect. They try, consciously or
unconsciously, to provide data that support the researcher's hypothesis. This subject
predisposition is called the cooperative-subject effect.
Screw you effect. A second group of subjects tends to be uncooperative and may even try to
sabotage the experiment. Masling (1966) has called this predisposition the “screw you effect.”
It can result from resentment over being required to participate in an experiment, from a bad
experience in a previous experiment such as being deceived or made to feel inadequate, or
from a dislike for the course or the professor associated with the experiment. Uncooperative
subjects may try, consciously or unconsciously, to provide data that do not support the
researcher's hypothesis.
Evaluation apprehension. A third group of subjects are apprehensive about being evaluated.
Subjects with evaluation apprehension (Rosenberg, 1965) aren't interested in the
experimenter's hypothesis, much less in sabotaging the experiment. Instead, their primary
concern is to gain a positive evaluation from the researcher. The data they provide are colored
by a desire to appear intelligent, well adjusted, and so on and to avoid revealing characteristics
that they consider undesirable.
Faithful subjects. A fourth group of subjects have been labeled faithful subjects (Fillenbaum,
1966). Faithful subjects try to put aside their own hypotheses about the purpose of an
experiment and to follow the researcher's instructions to the letter. Often they are motivated by
a desire to advance scientific knowledge. The data produced by overly cooperative or
uncooperative subjects or by subjects with evaluation apprehension can cause a researcher to
draw a wrong conclusion. The data of faithful subjects, however, are not contaminated by such
predispositions; faithful subjects simply try to do exactly what they are told to do.
Placebo Effect
The last source of bias that I describe is the placebo effect. A placebo is an inert substance or neutral stimulus that is administered to subjects as if it were the actual treatment condition.
When subjects begin an experiment, they are not entirely naive. They have ideas,
understandings, and perhaps a few misunderstandings about what will happen. If subjects
expect that an experimental condition will have a particular effect, they are likely to behave in a
manner consistent with their expectation. For example, subjects who believe that a medication
will relieve a particular symptom may report feeling better even though they have received a
chemically inert substance instead of the medication. Any change in the dependent variable
attributable to receiving a placebo is called the placebo effect.
In the previous sections, I described a variety of threats to valid inference making: threats to
statistical conclusion validity, internal validity, and external validity; the experimenter-expectancy
effect; demand characteristics; subject-predisposition effects; and the placebo effect. This list
of threats is far from complete. For a fuller discussion of threats to valid inference making, the
reader should consult Shadish et al. (2002) and Rosenthal (1979). In the following section, I
describe some procedures for controlling nuisance variables and minimizing threats to valid
inference making.
1.7 Controlling Nuisance Variables and Minimizing Threats to Valid Inference Making
Four general approaches are used to control nuisance variables. One approach is to hold the
nuisance variable constant for all subjects. Examples are using only male rats of the same
weight and presenting all instructions to subjects by means of an iPad, computer, or DVD
player. Although a researcher may attempt to hold all nuisance variables constant, inevitably
some variable will escape attention.
A second approach—one that is used in conjunction with the first—is to assign subjects
randomly to the experimental conditions. Then known as well as unsuspected sources of
variation are distributed over the entire experiment and thus do not selectively affect just one or
a limited number of treatment levels. Random assignment has two other purposes. It permits
the computation of an unbiased estimate of error effects—those effects not attributable to the
manipulation of the independent variable—and it helps to ensure that the error effects are
statistically independent. Through random assignment, a researcher creates two or more
groups of subjects that at the time of assignment are probabilistically similar on the average.
When random assignment is used, a researcher accepts greater random variation among observations in exchange for minimizing bias, which is the distortion of results in a particular direction. Random variation can be taken into account in evaluating the outcome of an experiment; bias is much more difficult to account for.
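The random assignment just described can be sketched as follows. This is a minimal illustration under my own assumptions (equal group sizes, a seeded generator for reproducibility); it is not the book's notation:

```python
import random

def randomly_assign(n_subjects, n_groups, seed=None):
    """Randomly assign subject IDs 1..n_subjects to n_groups equal-sized groups."""
    assert n_subjects % n_groups == 0, "equal group sizes assumed"
    rng = random.Random(seed)
    ids = list(range(1, n_subjects + 1))
    rng.shuffle(ids)  # every ordering of subjects is equally likely
    k = n_subjects // n_groups
    return [sorted(ids[i * k:(i + 1) * k]) for i in range(n_groups)]

groups = randomly_assign(30, 2, seed=42)
```

Because the shuffle gives every possible split the same probability, known as well as unsuspected nuisance variables are distributed over the groups rather than concentrated in one of them.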
A third approach to controlling nuisance variables is to include the variable as one of the factors
in the experimental design. This approach is illustrated in Section 2.2.
The three approaches for controlling nuisance variables illustrate the application of
experimental control as opposed to the fourth approach, which is statistical control. In some
experiments, it may be possible through the use of regression procedures (see Chapter 13) to
statistically remove the effects of a nuisance variable. This use of statistical control is referred to
as the analysis of covariance.
In addition to the four general approaches just described, a variety of other procedures are
used to control nuisance variables and minimize threats to valid inference making.
Single-blind procedure. In a single-blind experiment, subjects are not informed about the
nature of their treatment or, when feasible, the purpose of the experiment. A single-blind
procedure helps to minimize the effects of demand characteristics. Sometimes the purpose of
an experiment cannot be withheld from subjects because of informed consent requirements
that are imposed on the researcher. (Informed consent requirements are discussed in Section
1.8.)
Partial-blind procedure. Many treatments are of such a nature that they are easily identified
by a researcher. In this case, a partial-blind procedure can be used in which the researcher
does not know until just before administering the treatment level which level will be
administered. In this way, experimenter-expectancy effects are minimized until the
administration of the treatment level.
Deception. Deception occurs when subjects are not told the relevant details of an experiment
or when they are told that the experiment has one purpose when in fact the purpose is really
something else. Deception is used to direct a subject's attention away from the purpose of an
experiment so as to minimize the effects of demand characteristics. Deception should never be
used without a prior careful analysis of the ethical ramifications. (Ethical issues are discussed in
Section 1.8.)
Quasi-control group. This procedure uses a second control group, called a quasi-control
group, to assess the effects of demand characteristics. The quasi-control group is exposed to
all of the instructions and conditions that are given to the experimental group except that the
treatment condition of interest is not administered. This group, unlike a regular control group,
does not receive a placebo. Following the presentations of the instructions, the quasi-control
subjects are asked to produce the data that they would have produced if they had actually
received the treatment condition.
In a double-blind experiment, the quasi-control procedure can be carried one step further:
Subjects can be asked to pretend that they have received the treatment condition and to
behave accordingly—that is, to be simulators. At the conclusion of the experiment, the
researcher is asked to identify the real subjects, control subjects, and simulators. Comparisons
among the groups can be useful in detecting experimenter-expectancy effects and demand
characteristics.
In recent years, the research community has witnessed a renewed resolve to protect the rights
and interests of humans and animals. Codes of ethics for research with human subjects have
been adopted by a number of professional societies. Of particular interest are those of the
American Educational Research Association (2011), American Evaluation Association (2008),
American Psychological Association (2002), American Sociological Association (1999), and
American Statistical Association (1999). These codes specify what is required and what is
forbidden. In addition, they point out the ideal practices of the profession as well as ethical
pitfalls. The 1970s saw the passage of laws to govern the conduct of research with human
subjects. One law, which was originally enforced by the U.S. Department of Health, Education,
and Welfare (HEW), now the Department of Health and Human Services (HHS), requires that
all research funded by HHS involving human subjects be reviewed by an institutional review
board (Weinberger, 1974, 1975). As a result, most institutions that conduct research have
human subjects committees that screen all research proposals. These committees can
disapprove research proposals or require additional safeguards for the welfare of subjects.
In addition to codes of ethics of professional societies, legal statutes, and peer review, perhaps
the most important regulatory force within society is the individual researcher's ethical code.
Researchers should be familiar with the codes of ethics and statutes relevant to their research
areas and incorporate them into their personal codes of ethics.
Space does not permit an extensive examination of ethical issues here. For this the reader
should consult the references above and the thorough and balanced treatment by Diener and
Crandall (1978). However, I cannot leave the subject without listing some general guidelines.
1.A researcher should be knowledgeable about issues of ethics and values, take these into
account in making research decisions, and accept responsibility for decisions and actions
that have been taken. The researcher also is responsible for the ethical behavior of
collaborators, assistants, and employees who have parallel obligations.
2.Subjects should be informed of aspects of research that might be expected to influence
their willingness to participate. Failure to make full disclosure places an added
responsibility on the researcher to protect the welfare and dignity of the subject. Subjects
should understand that they have the right to decline to participate in an experiment and to
withdraw at any time; pressure should not be used to gain cooperation.
3.Research subjects should be protected from physical and mental discomfort, harm, and
danger. If risk of such consequences exists, a researcher must inform the subject of this. If
harm does befall a subject, the researcher has an obligation to remove or correct the
consequences.
4.Special care should be taken to protect the rights and interests of less powerful subjects
such as children, minorities, patients, the poor, and prisoners.
5.Research deception should never be used without a prior careful ethical analysis. When the
methodological requirements of a study demand concealment or deception, the researcher
should take steps to ensure the subject's understanding of the reason for this action and
afterward restore the quality of the relationship that existed. Where scientific or other
compelling reasons require that this information be withheld, the researcher acquires a
special responsibility to ensure that there are no damaging consequences for the subject.
6. Private information about subjects may be collected only with their consent. All such research information is confidential. Publication of research results should be in a form that protects the anonymity of individual subjects.
A number of guides for research with animals have been published. Those engaged in such
research should be familiar with the American Psychological Association's (1996) Guidelines for
Ethical Conduct in the Care and Use of Animals.
1.Terms to remember:
a.statistical hypothesis (1.1)9
c. Most clairvoyant people are able to communicate with beings from outer space.
d.Rats are likely to fixate an incorrect response if it is followed by an intense noxious
stimulus.
*4.[1.2] For each of the following studies, identify the (i) independent variable, (ii) dependent
variable, and (iii) possible nuisance variables.
*a.Televised scenes portraying physical, cartoon, and verbal violence were shown to 20
preschool children. The facial expressions of the children were videotaped and then
classified by judges.
*b.Power spectral analyses of changes in cortical electroencephalogram (EEG) were made
during a 1- to 5-hour period of morphine administration in 10 female Sprague-Dawley
rats.
c. The effects of four amounts of flurazepam on hypnotic suggestibility in men and women
were investigated.
d.The keypecking rates of 20 female Silver King pigeons on fixed ratio reinforcement
schedules of FR10, FR50, and FR100 were recorded.
*5.[1.2] For the independent variables in Exercise 4, indicate (i) which are quantitative and (ii)
which are qualitative.
6.[1.3]
(a)List the ways in which experiments and quasi-experiments differ.
(b)Why wasn't the Newburgh-Kingston Caries-Fluorine Study an experiment?
7.[1.3] Describe how you would design the study described in Exercise 4a (a) as an
experiment and (b) as a naturalistic observation study.
*8.[1.3–1.4] (i) Classify each of the following according to the most descriptive or definitive
category in Sections 1.3 and 1.4; use only one classification. (ii) What features of the
studies prompted your classification?
*a.The effect of participation in the Boy Scouts, the independent variable, on the
propensity for assuming leadership roles as an adult was investigated for a random
sample of 400 men who were lifelong residents of Columbus, Ohio, and between the
ages of 30 and 60. The subjects were classified as having held or not held a leadership
role during the previous 5-year period. Records were then used to determine those men
who had participated in the Boy Scouts.
*b.In a study of 86 lonely people, it was found that they display some of the characteristics
of shy people: Lonely people disclose less personal information about themselves to
opposite-sex friends than do nonlonely people, and they use inappropriate levels (too
intimate or too impersonal) of self-disclosure in initial interactions.
*c.Two hundred thirty-two sixth graders took a test that measured arithmetic achievement.
Two hundred of the students were matched on the basis of their achievement scores.
than that for the nation at large, χ2(30) = 45.44, p < .05.
g.A survey by the Centers for Disease Control and Prevention in Atlanta, Georgia, found
that 27.6% of 15-year-old girls in 1999 had had premarital sex at least once. The
comparable percentages for 2003 and 2008 were 53% and 77.2%, respectively.
h.The relationship between birth order and participation in dangerous sports such as
hang gliding, auto racing, and boxing was investigated. Records at Florida State
University were screened to obtain 50 men who were first-born, second-born, and so on
and to identify their recreational activities while at the university.
i. Pediatricians in Oklahoma provided the names of 421 new mothers. The mothers’ infant
feeding practices were subsequently determined. Eight years later, elementary school
records for 372 of these children indicated that the breast-fed babies had a higher level
of performance in school than did those who had been bottle-fed.
j. Employment records were used to identify 86 men who had worked for a company in
Cleveland, Ohio, that manufactured chemicals used as fire retardants. A second group
of men, n = 89, was identified who worked for two other companies in Cleveland and
had no exposure to the chemicals. Evidence of primary thyroid dysfunction was found in
four of the exposed men; none of the unexposed men showed evidence of thyroid
dysfunction.
*9.[1.5] Identify potential threats to internal validity for these studies.
*a.Exercise 8a
*b.Exercise 8b
c. Exercise 8d
d.Exercise 8g
e.Exercise 8h
f. Exercise 8i
*10. [1.5] Identify potential threats to external validity in these studies.
*a.Exercise 8c
*b.Exercise 8e
c. Exercise 8f
d.Exercise 8j
*11. [1.6] For the experiments in Exercise 8, indicate those for which the following are potential
threats to valid inference making.
*a.Experimenter-expectancy effect
b.Demand characteristics
c. Subject-predisposition effects
12. [1.7] Two approaches to controlling nuisance variables and minimizing threats to valid
inference making are holding the nuisance variable constant and using random assignment
or random sampling. Indicate which experiments in Exercise 8 used these approaches and
which approach was used.
13. [1.8] Section 1.8 lists eight general guidelines for the ethical treatment of subjects.
Recognizing that all of the guidelines are important, select the five that you think are the
most important and rank order them (assign 1 to the most important guideline). What do
your selection and rankings reveal about your own ethical code?
independently of other entities. An experimental unit may contain several observational units.
For example, in an educational experiment, the experimental unit is often the classroom, but
the individual students are the observational units. Administering an educational intervention to
a classroom can result in nonindependence of the observational units. For a discussion of this
problem, see Levin (1992).
2Readers who are interested only in the traditional approach to the analysis of variance can,
without loss of continuity, omit the material on the cell means model.
3For a discussion of these designs, see R. J. Harris (2001); Lattin, Carroll, and Green (2003);
Meyers, Gamst, and Guarino (2006); Stevens (2002); and Todman and Dugard (2007).
4Causality is a complex concept. For other definitions and views of causality, see Pearl (2000);
Shadish (2010); Shadish, Cook, and Campbell (2002); Sobel (2008); and West and Thoemmes
(2010).
6Barlow, Nock, and Hersen (2009); Kazdin (1982); and Morgan and Morgan (2009) provide in-depth discussions of single-case studies.
7The list of categories and threats to valid inference making are taken from Campbell and
Stanley (1966) and Shadish et al. (2002). Responsibility for the interpretation of items in their
lists is mine.
8Problems or portions thereof for which answers are given in Appendix F are denoted by *.
9The numbers in parentheses indicate the section in which the term is first described.
http://dx.doi.org/10.4135/9781483384733.n1
2.1 Introduction
A variety of research strategies were described in Chapter 1. In this chapter, I describe some
basic experimental designs that are used with these research strategies. Recall from Section
1.1 that an experimental design is a plan for assigning subjects to experimental conditions and
the statistical analysis associated with the plan. This chapter focuses on the assignment of
subjects to experimental conditions and on the general features of some basic designs. The
statistical analysis associated with the designs is presented in Chapters 4 to 16.
One of the simplest experimental designs is the randomization and analysis plan that is used
with a t test for independent samples. A two-sample t statistic is often used to test the null
hypothesis that the difference between two population means is equal to some value, usually
zero. Consider an experiment to help cigarette smokers break the habit. The independent
variable is two kinds of therapy; the dependent variable is the number of cigarettes smoked per
day 6 months after therapy. For notational convenience, the two kinds of therapy are called
treatment A. The levels of treatment A that correspond to the specific therapies are denoted by
the lowercase letter a and a subscript: a1 denotes cognitive behavioral therapy, and a2 denotes
hypnosis. A particular but unspecified level of treatment A is denoted by aj, where j ranges over
the values 1 and 2. The number of cigarettes smoked per day 6 months after therapy by
subject i in treatment level j is denoted by Yij.
The null and alternative hypotheses for the cigarette smoking experiment are, respectively, H0: μ1 − μ2 = 0 and H1: μ1 − μ2 ≠ 0,
where μ1 and μ2 denote the mean cigarette consumption of the respective populations. The
Greek letter μ (mu) is pronounced “mew.” Assume that 30 smokers who want to stop smoking
are available to participate in the experiment. I want to assign n = 15 smokers to each of the p =
2 treatment levels so that each possible assignment has the same probability. This can be done
by numbering the smokers from 1 to 30 and drawing numbers from the random numbers table
in Appendix Table E.1. The first n numbers drawn between 1 and 30 are assigned to treatment
level a1; the remaining subjects are assigned to a2. The layout for the experiment is shown in
Figure 2.2-1. The subjects who received treatment level a1 are called Group1; those who
received treatment level a2 are called Group2. The two sample means, Ῡ1 and Ῡ2, reflect the
effects of the treatment levels that were administered to the subjects.1 The computational
procedures and assumptions associated with a t test for independent samples are discussed in
Section 4.2.
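As an illustration only (this sketch is not part of the original text; the seed and function are hypothetical), the random assignment and the two-sample t statistic can be expressed in Python. A pseudorandom generator stands in for Appendix Table E.1:

```python
import math
import random
from statistics import mean, variance

def independent_t(y1, y2):
    """Two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(y1), len(y2)
    pooled = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(y1) - mean(y2)) / se

# Random assignment: number the smokers 1 to 30 and draw n = 15
# without replacement for treatment level a1; the rest form a2.
random.seed(1)  # arbitrary seed, for reproducibility only
subjects = list(range(1, 31))
group1 = sorted(random.sample(subjects, 15))
group2 = [s for s in subjects if s not in group1]
```

Because the draw is without replacement, every possible split of the 30 subjects into two groups of 15 has the same probability, which is the defining property of the randomization plan.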
Figure 2.2-1 ▪ Layout for a t test for independent-samples design. The treatment level is
denoted by Treat. Level; the dependent variable is denoted by Dep. Var. Thirty subjects
are randomly assigned to two levels of treatment A with the restriction that 15 subjects
are assigned to each level. The mean cigarette consumptions for subjects in treatment
levels a1 and a2 are denoted by Ῡ1 and Ῡ2, respectively.
The t test for independent-samples design involves randomly assigning subjects to two levels
of a treatment. A completely randomized analysis of variance design, described next, extends
this design strategy to two or more treatment levels. Consider an experiment to evaluate the
effectiveness of three therapies in helping cigarette smokers break the habit. The null and alternative hypotheses for the experiment are, respectively, H0: μ1 = μ2 = μ3 and H1: μj ≠ μj′ for some j and j′.
The null hypothesis can be tested by using a completely randomized analysis of variance
design. This design is denoted by the letters CR-p, where CR stands for completely
randomized and p is the number of levels of the treatment. In this example, p is equal to 3. For
convenience, the three kinds of therapy are called treatment A. The three levels of treatment A
are denoted by the lowercase letter a and a subscript: a1 is behavioral therapy, a2 is hypnosis,
and a3 is a medication delivered by means of a patch applied to a smoker's back.
Assume that 45 smokers who want to stop smoking are available to participate in the
experiment. The subjects are randomly assigned to the treatment levels with the restriction that
15 subjects are assigned to each therapy. The phrase randomly assigned is important.
Randomization distributes the idiosyncratic characteristics of the subjects over the three
treatment levels so that they do not selectively bias the outcome of the experiment. And
randomization helps to prevent the experimenter's personal biases from being introduced into
the experiment. As I discuss in Chapter 3, randomization also helps to obtain an unbiased
estimate of the random error variation in the experiment, and it helps to ensure that the error
effects are statistically independent. The layout for the experiment is shown in Figure 2.2-2. A
comparison of the layout in this figure with that in Figure 2.2-1 for a t test for independent-
samples design reveals that they are the same except that a completely randomized design
can have more than two treatment levels. When p = 2, the layouts for the designs are identical.
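The F statistic that a CR-p analysis produces (derived in Chapter 3) can be sketched numerically; this example is my own hypothetical illustration, not a procedure from the text:

```python
from statistics import mean

def anova_f(groups):
    """F statistic for a completely randomized (CR-p) design:
    between-groups mean square over within-groups mean square."""
    all_y = [y for g in groups for y in g]
    grand = mean(all_y)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_between = len(groups) - 1
    df_within = len(all_y) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Three hypothetical therapy groups (cigarettes per day at follow-up).
f_ratio = anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

With p = 2 groups, the resulting F equals the square of the independent-samples t statistic, which is one way to see that the CR-p design generalizes the two-group design.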
Figure 2.2-2 ▪ Layout for a completely randomized design (CR-3 design). The treatment
level is denoted by Treat. Level; the dependent variable is denoted by Dep. Var. Forty-five
subjects are randomly assigned to three levels of treatment A, with the restriction that 15
subjects are assigned to each level. The mean cigarette consumptions for subjects in
treatment levels a1, a2, and a3 are denoted by Ῡ1, Ῡ2, and Ῡ3, respectively.
Thus far, I have identified the null hypothesis that I want to test, μ1 = μ2 = μ3, and described
the manner in which the subjects are assigned to the three treatment levels. In the following
paragraphs, I discuss the composite nature of an observation, describe the experimental design
model equation for a CR-p design, and examine the meaning of the terms treatment effect and
error effect.
An observation is a composite that reflects the effects of (1) the independent variable, (2) individual characteristics of the subject or experimental unit, (3)
chance fluctuations in the subject's performance, (4) measurement and recording errors that
occur during data collection, and (5) any other nuisance variables such as environmental
conditions that have not been controlled. Consider the cigarette consumption of subject 2 in
treatment level a2 in Figure 2.2-2. Suppose that 6 months after therapy, this subject is smoking
three cigarettes a day (Y22 = 3). What factors have affected the value of Y22? One factor is the
effectiveness of the therapy received—hypnosis in this case. Other factors are the subject's
cigarette consumption prior to therapy, the subject's level of motivation to stop smoking, and
the weather during the previous 6 months, to mention only a few. In summary, Y22 is a
composite that reflects (1) the effects of treatment level a2, (2) effects unique to the subject, (3)
effects attributable to chance fluctuations in the subject's behavior, (4) errors in measuring and
recording the subject's cigarette consumption, and (5) any other effects that have not been
controlled.
My conjectures about Y22 or any other observation can be expressed more formally by an
experimental design model equation. The model equation for the smoking experiment is Yij = μ + αj + εi(j), where Yij is the observation for subject i in treatment level j, μ is the population grand mean, αj is the treatment effect for population j, and εi(j) is the error effect associated with Yij.
According to the model equation for this completely randomized design, each observation is the
sum of three parameters μ, αj, and εi(j). The values of the parameters are unknown, but in
Section 3.2, I show how they can be estimated from sample data.
The meanings of the terms grand mean, μ, and treatment effect, αj, in the model equation seem
fairly clear; the meaning of error effect, εi(j), requires a bit more explanation. Why do
observations, Yijs, in the same treatment level vary from one subject to the next? This variation
must be due to differences among the subjects and to other uncontrolled variables because
the parameters μ and αj in the model equation are constants for all subjects in the same
treatment level. To put it another way, observations in the same treatment level are different
because the error effects, εi(j)s, for the observations are different. Recall that error effects
reflect idiosyncratic characteristics of the subjects—those characteristics that differ from one
subject to another—and any other variables that have not been controlled. Researchers
attempt to minimize the size of error effects by holding constant sources of variation that might
contribute to the error effects and by the judicious choice of an experimental design. Designs
described in the following sections permit a researcher to isolate and remove some sources of
variation that would ordinarily be included in the error effect.
An experimental design model is an example of a linear model. A linear model consists of two
parts: a linear model equation, for example, Yij = μ + αj + εi(j), and assumptions about the
model parameters. The assumptions for this model are described in Section 3.3. The model is
called a linear model because the observation, Yij, is equal to a linear combination of the model
parameters: μ + αj + εi(j).
The two designs just described require the use of independent samples. Two samples are
independent if, for example, a researcher samples randomly from two populations or uses a
random procedure to assign subjects to two groups. Dependent samples, on the other hand,
can be obtained by any of the following procedures:
1. Observing each subject under each treatment level in the experiment—that is, obtaining repeated measures on the subjects.
2. Forming sets of subjects who are similar with respect to a variable that is correlated with the dependent variable. This procedure is called subject matching.
3. Obtaining sets of identical twins or littermates and assigning one member of the pair randomly to one treatment level and the other member to the other treatment level.
4. Obtaining pairs of subjects who are matched by mutual selection—for example, husband and wife pairs or business partners.
In behavioral, medical, and educational research, the subjects are often people whose
aptitudes and experiences differ markedly. Individual differences are inevitable, but it is often
possible to isolate or partition out a portion of these effects so that they do not appear in
estimates of the error effects. One design for accomplishing this is a t test for dependent
samples. As the name suggests, the design uses dependent samples. A t test for dependent
samples also uses a more complex randomization and analysis plan than a t test for
independent samples, but the added complexity is usually accompanied by greater power3—a
point that I develop when I discuss a randomized block analysis of variance design in the next
section.
Let's reconsider the cigarette smoking experiment. It is reasonable to assume that the difficulty
in breaking the smoking habit is related to the number of cigarettes that a person smokes per
day. The design of the experiment can be improved by isolating this variable. Suppose that
instead of randomly assigning 30 subjects to the treatment levels, I form pairs of subjects such
that the subjects in each pair have similar cigarette consumptions prior to the experiment. The
subjects in each pair constitute a block of matched subjects. A simple way to match the
subjects is to rank them in terms of the number of cigarettes they smoke per day. The subjects
ranked 1 and 2 are assigned to block 1, those ranked 3 and 4 are assigned to block 2, and so
on. In this example, 15 blocks of matched subjects can be formed. After all of the blocks have
been formed, the two subjects in each block are randomly assigned to the two kinds of therapy.
The layout for this experiment is shown in Figure 2.2-3. If my hunch is correct—that the
difficulty in breaking the smoking habit is related to the number of cigarettes that a person
smokes per day—this design should result in a greater likelihood of rejecting the null hypothesis, H0: μ1 − μ2 = 0,
Figure 2.2-3 ▪ Layout for a t test for dependent samples, where each block contains two
subjects whose cigarette consumptions prior to the experiment were similar. The two
subjects in a block are randomly assigned to the treatment levels. The mean cigarette
consumptions for subjects in treatment levels a1 and a2 are denoted by Ῡ1 and Ῡ2,
respectively.
than does a t test for independent samples. Later I show that the increased power to reject the
null hypothesis results from isolating the nuisance variable of number of cigarettes smoked per
day and thereby reducing the size of the error effects.
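The matching procedure and the dependent-samples t statistic can be sketched as follows; the baseline data and seed are hypothetical, and the statistic is computed, as is standard, as a one-sample t on the within-block difference scores:

```python
import math
import random
from statistics import mean, stdev

def dependent_t(y1, y2):
    """t statistic for dependent samples: a one-sample t computed
    on the within-block difference scores."""
    d = [a - b for a, b in zip(y1, y2)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Matching by ranks: sort 30 hypothetical baseline consumptions,
# pair adjacent ranks into 15 blocks, then randomly split each
# block between the two therapies.
random.seed(3)  # arbitrary seed
baseline = {i: random.randint(5, 40) for i in range(1, 31)}
ranked = sorted(baseline, key=baseline.get)
blocks = [ranked[k:k + 2] for k in range(0, 30, 2)]
for b in blocks:
    random.shuffle(b)            # random assignment within a block
a1 = [b[0] for b in blocks]
a2 = [b[1] for b in blocks]
```

Because each block contributes one subject to each therapy, between-block differences in baseline consumption cancel out of the difference scores, which is the source of the added power discussed above.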
Earlier you learned that the layout and randomization procedures for a t test for independent
samples and a completely randomized analysis of variance design are the same except that a
completely randomized design can have more than two treatment levels. The same comparison
can be drawn between a t test for dependent samples and a randomized block analysis of
variance design. A randomized block analysis of variance design is denoted by the letters
RB-p, where RB stands for randomized block and p is the number of levels of the treatment.
Suppose that in the cigarette smoking experiment, I want to evaluate the effectiveness of three
kinds of therapy in helping smokers break the habit. The three kinds are behavioral therapy,
denoted by a1; hypnosis, denoted by a2; and a medication, denoted by a3. I suspect that the
difficulty in breaking the smoking habit is related to the number of cigarettes that a person
smokes per day. I can use the blocking procedure described in connection with a t test for
dependent samples to isolate and control this nuisance variable. If a sample of 45 smokers is
available, I can form 15 blocks that contain three subjects who have had similar consumptions
of cigarettes prior to the experiment. The dependent variable for a subject in block i and
treatment level j is denoted by Yij. The layout for the experiment is shown in Figure 2.2-4. A
comparison of the layout in this figure with that in Figure 2.2-3 for a t test for dependent-
samples design reveals that they are the same except that the randomized block design has p
= 3 treatment levels. When p = 2, the layouts for the designs are identical.
Figure 2.2-4 ▪ Layout for a randomized block design (RB-3 design), where each block
contains three matched subjects whose cigarette consumptions prior to the experiment
were similar. The subjects in a block are randomly assigned to the treatment levels. The
mean cigarette consumptions for subjects in treatment levels a1, a2, and a3 are denoted
by Ῡ1, Ῡ2, and Ῡ3, respectively; the mean cigarette consumptions for subjects in Block1, Block2, …, Block15 are denoted by the corresponding block means.
The null hypotheses for the experiment are H0: μ.1 = μ.2 = μ.3 (treatment A) and H0: μ1. = μ2. = ⋯ = μ15. (blocks). The first hypothesis states that the population means for the three therapies are equal. The
second hypothesis, which is usually of little interest, states that the population means for the 15
levels of the nuisance variable, cigarette consumption prior to the experiment, are equal. I
expect a test of this null hypothesis to be significant. If the nuisance variable of the number of
cigarettes smoked prior to the experiment does not account for an appreciable proportion of the
total variation in the experiment, little has been gained by isolating the effects of the variable.
Before exploring this point, I describe the experimental design model equation for an RB-p
design.
Experimental design model equation. The model equation for the smoking experiment is Yij = μ + αj + πi + εij, where
Yij is the observation for the subject in block i and treatment level j.
μ is the population grand mean of μ11, μ21, …, μ15,3. You can think of μ as the average value around which the treatment and block means vary; μ is a constant for the 45 observations in the experiment.
αj is the treatment effect for population j and is equal to μ.j – μ, the deviation of the grand mean from the jth population treatment mean. The treatment effect reflects the effects of the jth therapy and is a constant for the 15 observations in treatment level aj.
πi (pi) is the block effect for population i and is equal to μi. – μ, the deviation of the grand mean from the ith population block mean. The block effect reflects the effects of smoking a certain number of cigarettes per day prior to therapy.
εij is the error effect associated with Yij and is equal to Yij – μ – αj – πi. The error effect represents effects unique to subject i in treatment level j, effects attributable to chance fluctuations in subject i's behavior, and any other effects that have not been controlled such as environmental conditions—in other words, all effects not attributable to treatment level j and block i.
According to the equation for this randomized block design, each observation is the sum of four
parameters: µ, αj, πi, and εij. The error effect is that portion of an observation that remains after
the grand mean, treatment effect, and block effect have been subtracted from it; that is, εij = Yij
– μ – αj – πi. The sum of the squared error effects for this randomized block design, Σi Σj ε²ij = Σi Σj (Yij − μ − αj − πi)², will be smaller than the sum for the completely randomized design, Σi Σj ε²i(j) = Σi Σj (Yij − μ − αj)², if πi is greater than zero for one or more blocks. As I show in Section 3.3, the F statistic that is used to test the null hypothesis in analysis of variance can be thought of as the ratio of error and treatment effects, F = [f(error effects) + f(treatment effects)] / f(error effects),
where f() denotes a function of the effects in parentheses. It is apparent from an examination of
this ratio that the smaller the sum of the squared error effects, the larger the F statistic and,
hence, the greater the probability of rejecting a false null hypothesis. Thus, by isolating a
nuisance variable that accounts for an appreciable portion of the total variation in a randomized
block design, a researcher is rewarded with a more powerful test of a false null hypothesis.
As you have seen, blocking with respect to the nuisance variable, the number of cigarettes
smoked per day, enables me to isolate this variable and remove it from the error effects. But
what if the nuisance variable does not account for any of the variation in the experiment? In
other words, what if all of the block effects are equal to zero (µi. – µ = 0 for all i)? Then the sum
of the squared error effects for the randomized block and the completely randomized designs
will be equal, and the effort used to form blocks of matched subjects in the randomized block
design will be for naught. The larger the correlation between the nuisance variable and the
dependent variable, the more likely it is that the block effects account for an appreciable
proportion of the total variation in the experiment.
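The shrinkage of the squared-error sum can be verified on a small example. The RB-2 data below are hypothetical and chosen so that the blocks are strongly related to the dependent variable:

```python
from statistics import mean

# Hypothetical RB-2 data, Y[i][j] for block i and treatment level j;
# heavier smokers (later blocks) also smoke more after therapy.
Y = [[2, 4], [5, 7], [9, 12], [14, 15]]
n, p = len(Y), len(Y[0])

mu = mean(y for row in Y for y in row)
alpha = [mean(Y[i][j] for i in range(n)) - mu for j in range(p)]
pi = [mean(Y[i]) - mu for i in range(n)]

# Squared error effects without and with the block effects removed.
sse_cr = sum((Y[i][j] - mu - alpha[j]) ** 2
             for i in range(n) for j in range(p))
sse_rb = sum((Y[i][j] - mu - alpha[j] - pi[i]) ** 2
             for i in range(n) for j in range(p))
```

For these data the block effects absorb nearly all of the error variation, so sse_rb is far smaller than sse_cr; when every πi is zero, the two sums coincide, exactly as the text states.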
The Latin square design described in this section derives its name from an ancient puzzle that
was concerned with the number of different ways that Latin letters can be arranged in a square
matrix so that each letter appears once in each row and once in each column. An example of a
3 × 3 Latin square is shown in Figure 2.2-5. I use the letter a and subscripts in place of Latin
letters. The Latin square design is denoted by the letters LS-p, where LS stands for Latin
square and p is the number of levels of the treatment. A Latin square design enables a
researcher to isolate the effects of not one but two nuisance variables. The levels of one
nuisance variable are assigned to the rows of the square; the levels of the other nuisance
variable are assigned to the columns. The levels of the treatment are assigned to the cells of
the square.
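The defining property of a Latin square, each level once per row and once per column, is easy to state in code. The cyclic construction below is one standard way to build such a square (an illustration of the property, not the randomization procedure of Chapter 14):

```python
# A cyclic construction: cell (k, l) receives level (k + l) mod p,
# which guarantees each level appears once per row and once per column.
def latin_square(p):
    return [[(k + l) % p + 1 for l in range(p)] for k in range(p)]

square = latin_square(3)
for row in square:
    assert sorted(row) == [1, 2, 3]          # each level once per row
for col in zip(*square):
    assert sorted(col) == [1, 2, 3]          # each level once per column
```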
Figure 2.2-5 ▪ Three-by-three Latin square, where aj denotes one of the j = 1, …, p levels of treatment A.
Let's return to the cigarette smoking experiment. With a Latin square design, I can isolate the
effects of cigarette consumption and the effects of a second nuisance variable—say, the length
of time in years that a person has smoked. The advantage of being able to isolate two nuisance
variables comes at a price. The randomization procedures for a Latin square design, which are
described in Chapter 14, are more complex than those for a randomized block design. Also, the
number of rows and columns of a Latin square must each equal the number of treatment
levels, which is three in this example. I can assign three levels of cigarette consumption to the
rows of the Latin square: b1 is less than one pack per day, b2 is one to three packs per day,
and b3 is more than three packs per day. The other nuisance variable, the duration of the
smoking habit in years, can be assigned to the columns of the square: c1 is less than 1 year,
c2 is 1 to 5 years, and c3 is more than 5 years. The dependent variable for the ith subject in the
jth treatment level, k th row, and lth column is denoted by Yijkl. The layout of the design is
shown in Figure 2.2-6.
Figure 2.2-6 ▪ Layout for a Latin square design (LS-3 design) that is based on the Latin
square in Figure 2.2-5. The treatment combination is denoted by Treat. Comb. Treatment
A represents three kinds of therapy, nuisance variable B represents the number of
cigarettes smoked per day, and nuisance variable C represents the length of time in years
that a person has smoked. Subjects in Group1, for example, received behavioral therapy
(a1), smoked less than one pack of cigarettes per day (b1), and had smoked for less than
1 year (c1). The mean cigarette consumptions for the subjects in the nine groups are denoted by Ῡ1, Ῡ2, …, Ῡ9, respectively.
The first hypothesis states that the population means for the three therapies are equal. The
second and third hypotheses make similar assertions about the population means for the two
nuisance variables: number of cigarettes smoked per day and duration of the smoking habit in
years. Tests of these nuisance variables are expected to be significant. As discussed earlier, if
the nuisance variables do not account for an appreciable proportion of the total variation in the
experiment, little has been gained by isolating the effects of the variables.
Experimental design model equation. The model equation for this version of our smoking experiment is Yijkl = μ + αj + βk + γl + εPooled, where
Yijkl is the observation for subject i in treatment level j, row k, and column l.
μ is the population grand mean of μ111, μ123, …, μ331. You can think of μ as the average value around which the treatment, row, and column means vary; μ is a constant for all scores in the experiment.
αj is the treatment effect for population j and is equal to μ.j – μ, the deviation of the grand mean from the jth population treatment mean. The treatment effect reflects the effects of the jth therapy and is a constant for the scores in treatment level aj.
βk (beta) is the row effect for population k and is equal to μ.k. – μ, the deviation of the grand mean from the kth population row mean. The row effect reflects the effects of smoking a certain number of cigarettes per day prior to therapy.
γl (gamma) is the column effect for population l and is equal to μ.l – μ, the deviation of the grand mean from the lth population column mean. The column effect reflects the effects of smoking for a certain number of years prior to therapy.
εPooled is the pooled error effect associated with Yijkl and is equal to Yijkl – μ – αj – βk – γl. The nature of this pooled error effect is discussed in Chapter 14.
According to the equation for this Latin square design, each observation is the sum of five
parameters: μ, αj, βk, γl, and εPooled. The sum of the squared error effects for this Latin square design, Σ ε²Pooled = Σ (Yijkl − μ − αj − βk − γl)², will be smaller than the sum for the randomized block design if the row or column effects, βk or γl, are greater than zero for one or more rows or columns.
Building block designs. Thus far, I have described three of the simplest analysis of variance
designs: completely randomized design, randomized block design, and Latin square design. I
call these three ANOVA designs building block designs because all complex ANOVA designs
can be constructed by modifying or combining these simple designs. Furthermore, the
randomization procedures, data analysis procedures, and assumptions for complex ANOVA
designs are extensions of those for the three building block designs. The following section
provides a preview of a factorial design that is constructed from two completely randomized
designs.
Factorial designs differ from those described previously because two or more treatments can be
evaluated simultaneously in a single experiment.5 The simplest factorial design from the
standpoint of randomization procedures, data analysis, and assumptions is based on a
completely randomized analysis of variance design and hence is called a completely
randomized factorial design. A two-treatment, completely randomized factorial design is
denoted by the letters CRF-pq, where the letters CR denote the building block design, F
indicates that the design is a factorial design, and p and q stand for the number of levels of
treatments A and B, respectively.
Consider an experiment to evaluate the effects of two treatments on the speed of reading. Let
treatment A consist of two levels of room illumination: a1 is 15 foot-candles and a2 is 30 foot-candles. Treatment B consists of three levels of type size: b1 is 9-point type, b2 is 12-point
type, and b3 is 15-point type. This design is designated by the letters CRF-23, where 2 is the
number of levels of treatment A and 3 is the number of levels of treatment B. The layout for the
design is obtained by combining the treatment levels of a CR-2 design with those of a CR-3
design so that each treatment level of the CR-2 design appears once with each level of the CR-
3 design. The resulting CRF-23 design has 2 × 3 = 6 treatment combinations as follows:
a1b1, a1b2, a1b3, a2b1, a2b2, a2b3. When treatment levels are combined in this way, the
treatments are said to be completely crossed. Completely crossed treatments are a
characteristic of all completely randomized factorial designs. Assume that 30 sixth-graders are
available to participate in the experiment. The children are randomly assigned to the six
treatment combinations, with the restriction that five children receive each combination. The layout for the experiment is shown in Figure 2.2-7.
Figure 2.2-7 ▪ Layout for a completely randomized factorial design (CRF-23 design) where
30 subjects are randomly assigned to six combinations of treatments A and B, with the
restriction that five subjects are assigned to each combination. Treatment A represents
two levels of room illumination; treatment B represents three levels of type size. Subjects
in Group1, for example, read in a room with 15 foot-candles of illumination (a1), and the
material was typed using 9-point type (b1). The mean reading speeds for the subjects in the six groups are denoted by Ῡ1, Ῡ2, …, Ῡ6, respectively.
Experimental design model equation. The model equation for the experiment is Yijk = μ + αj + βk + (αβ)jk + εi(jk), where μ is the grand mean, αj and βk are the treatment effects for levels aj and bk, (αβ)jk is the joint effect of the two treatments, and εi(jk) is the error effect for subject i in combination ajbk. The null hypotheses state that the αjs are all zero, that the βks are all zero, and that the (αβ)jks are all zero.
The last hypothesis is unique to factorial designs. It states that the joint effects (interaction) of
treatments A and B are equal to zero for all combinations of the two treatments. Two treatments
are said to interact if differences in performance under the levels of one treatment are different
at two or more levels of the other treatment. Figure 2.2-8 illustrates two possible outcomes of
the reading experiment: Part (a) illustrates the presence of an interaction, and part (b)
illustrates the absence of an interaction. When two treatments interact as in Figure 2.2-8(a), a
graph of the population means always reveals at least two nonparallel lines for one or more
segments of the lines. I say more about interactions in Chapter 9.
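The nonparallelism criterion can be made concrete: with two rows of cell means, the treatments interact exactly when the row differences are not constant across the levels of B. The cell means below are invented for illustration:

```python
# Hypothetical 2 x 3 tables of population cell means. The simple
# effect of A at each level of B is the difference between the two
# rows; treatments interact when these differences are not all equal,
# i.e., when the profile lines are not parallel.
def interacts(cell_means):
    a1_row, a2_row = cell_means
    diffs = [x - y for x, y in zip(a1_row, a2_row)]
    return any(abs(d - diffs[0]) > 1e-9 for d in diffs)

parallel = [[10, 12, 14], [7, 9, 11]]    # constant difference of 3
crossed = [[10, 12, 14], [14, 12, 10]]   # differences vary: -4, 0, 4
```

The first table corresponds to Figure 2.2-8(b), the second to the crossing lines of Figure 2.2-8(a).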
Figure 2.2-8 ▪ Two possible outcomes of the reading experiment. Part (a) illustrates an
interaction between treatments A and B; part (b) illustrates the absence of an interaction.
Nonparallelism of the lines for one or more segments of the lines signifies interaction.
Comparison of CR-p and CRF-pq designs. Earlier I observed that a completely randomized
design is the building block design for a completely randomized factorial design. The
similarities between the two designs become apparent when you compare the randomization
procedures for the designs. For both designs, the subjects are randomly assigned to the
treatment levels (combinations), with the restriction that the same number of subjects is
assigned to each level (combination).6 When I describe the assumptions for the two designs in
Chapters 3 and 9, you will find additional similarities.
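The completely crossed layout and its randomization can be sketched directly; the subject numbering and seed here are hypothetical:

```python
import itertools
import random

# Completely crossed treatments: every level of A appears with every
# level of B, giving p x q = 2 x 3 = 6 combinations (a CRF-23 design).
combos = list(itertools.product(['a1', 'a2'], ['b1', 'b2', 'b3']))

# Randomly assign 30 hypothetical subjects, five per combination.
random.seed(9)  # arbitrary seed
subjects = list(range(1, 31))
random.shuffle(subjects)
groups = {c: subjects[k * 5:(k + 1) * 5] for k, c in enumerate(combos)}
```

Note that the randomization treats the six combinations exactly as a CR-6 design would treat six levels of a single treatment, which is the sense in which the completely randomized design is the building block here.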
In the last section, I described four of the simpler analysis of variance (ANOVA) designs. As you
will see, this discussion only scratched the surface. There is a bewildering array of analysis of
variance designs available to researchers. Furthermore, there is no universally accepted
nomenclature for analysis of variance designs; some designs have as many as five different
names.7 And most of the design nomenclatures do not indicate which ANOVA designs share
similar randomization plans and layouts. My design nomenclature in Table 2.3-1 is based on
the concept of building block designs. Recall that all complex designs can be constructed by
modifying or combining three simple designs: completely randomized design (CR-p),
randomized block design (RB-p), and Latin square design (LS-p). These three designs provide
the organizational structure for the design nomenclature and classification system that is
outlined in Table 2.3-1. The letter p in the table denotes the number of levels of a treatment. If a
design includes a second or third treatment, the number of their levels is denoted by q and r,
respectively.
The category systematic designs in Table 2.3-1 is of historical interest only. According to
Leonard and Clark (1939), agricultural field research using systematic designs on a practical
scale dates back to 1834. Prior to the work of R. A. Fisher in the 1920s and 1930s, as well as
that of J. Neyman and E. S. Pearson on the theory of statistical inference, investigators used
systematic schemes instead of random procedures to assign plots of land or other suitable
experimental units to treatment levels—hence the designation systematic designs for these
early field experiments. Over the past 100 years, systematic designs have fallen into disuse
because designs that use random assignment are more likely to provide valid estimates of
treatment effects and error effects, and they can be analyzed with the powerful tools of
statistical inference such as the analysis of variance.
The impetus for the development of better research procedures came from the need to improve
agricultural techniques. Today the experimental design nomenclature is replete with terms from
agriculture. Modern principles of experimental design, particularly the random assignment of
experimental units to treatment levels, received general acceptance as a result of the work of
Fisher (1935b) and Fisher and MacKenzie (1922, 1923). Experimental designs that use the
randomization principle are called randomized designs. The randomized designs in Table 2.3-
1 are subdivided into several distinct categories based on (1) the number of treatments, (2)
whether the subjects are subdivided into homogeneous blocks or groups prior to assigning
them to treatment levels, (3) the nature of any confounding, (4) the use of crossed versus
nested treatments, and (5) the use of a covariate.
A quick perusal of Table 2.3-1 reveals why researchers sometimes have difficulty selecting an
appropriate ANOVA design; there are a lot of designs from which to choose. Because of the
wide variety of designs available, it is important to identify them clearly in research reports. One
often sees statements such as “a two-treatment factorial design was used.” It should be evident
that a more precise description is required. This description could refer to any of the 11 factorial
designs in Table 2.3-1.
In Section 2.2, I briefly described 4 of the 34 types of ANOVA designs in Table 2.3-1. These
descriptions highlighted the ways in which the designs differ: (1) randomization, (2)
experimental design model equation, (3) number of treatments, (4) inclusion of a nuisance
variable as a factor in the experiment, and (5) power. In Chapters 4 and 8 to 16, I discuss other
ways in which designs differ: (1) use of crossed or nested treatments or a combination of the
two, (2) presence or absence of confounding, and (3) use of a covariate. I also identify the
following common threads that run through the various designs:
1. All complex designs can be constructed from the three building block designs: completely randomized design, randomized block design, and Latin square design.
2. There are only four kinds of variation in ANOVA: total variation, between-groups variation, within-groups variation, and interaction variation.
3. All error terms involve either within-groups variation or interaction variation.
4. The numerator of an F statistic should always estimate one more source of variation than the denominator, and that source of variation should be the one that is being tested.
Considering the variety of analysis of variance designs available, it is not surprising that some
researchers approach the selection of an appropriate design with trepidation. Selection of the
best design for a particular research problem requires a familiarity with (1) the research area
and (2) the designs that are available. In selecting a design, the following questions need to be
considered.
1. Does the design permit the calculation of a valid estimate of the experimental effects and the error effects?
2. Does the data collection procedure produce reliable results?
3. Does the design possess sufficient power to permit an adequate test of the statistical hypotheses?
4. Does the design provide maximum efficiency within the constraints imposed by the experimental situation?
5. Does the experimental procedure conform to accepted practices and procedures used in the research area? Other things being equal, a researcher should use procedures that offer an opportunity for comparing the findings with the results of other investigations.
The question “What is the best experimental design to use?” is not easily answered. Statistical
as well as nonstatistical factors must be considered. The discussion of specific designs in
Chapters 4 and 8 to 15 should go a long way toward demystifying the selection of an
appropriate analysis of variance design.
People are by nature inquisitive. We ask questions, develop hunches, and sometimes put our
hunches to the test. Over the years, a formalized procedure for testing hunches has evolved—
the scientific method. The procedure involves (1) observing nature, (2) asking questions, (3)
formulating hypotheses, (4) conducting experiments, and (5) developing theories and laws.
The first step in evaluating a scientific hypothesis is to express the hypothesis in the form of a
statistical hypothesis. You learned in Chapter 1 that a statistical hypothesis is a statement
about one or more parameters of a population or the functional form of a population. For
example, “µ ≤ 115” is a statistical hypothesis; it states that the population mean is less than or
equal to 115. Another statistical hypothesis can be formulated that states that the mean is
greater than 115—that is, μ > 115. These hypotheses, μ ≤ 115 and μ > 115, are mutually
exclusive and exhaustive; if one is true, the other must be false. They are examples,
respectively, of the null hypothesis, denoted by H0, and the alternative hypothesis, denoted
by H1. The null hypothesis is the one whose tenability is actually tested. If on the basis of this
test the null hypothesis is rejected, then only the alternative hypothesis remains tenable.
According to convention, the alternative hypothesis is formulated so that it corresponds to the
researcher's scientific hunch. The process of choosing between the null and alternative
hypotheses is called hypothesis testing.
Logic plays a key role in evaluating a scientific hypothesis. This evaluation involves a chain of
deductive and inductive logic that begins and ends with the scientific hypothesis. The chain is
diagrammed in Figure 2.5-1. First, by means of deductive logic, the scientific hypothesis and its
negation are expressed as two mutually exclusive and exhaustive statistical hypotheses that
make predictions about one or more population parameters or the functional form of a
population. These predictions, called the null and alternative hypotheses, are made about the
population mean, median, variance, and so on. The next step in the chain is to obtain a random
sample from the population and estimate the population parameters of interest. A statistical test
is then used to decide whether the sample data are consistent with the null hypothesis. If the
data are not consistent with the null hypothesis, the null hypothesis is rejected; otherwise, it is
not rejected.
A statistical test involves (1) a test statistic, (2) a set of hypothesis-testing conventions, and (3)
a decision rule that leads to an inductive inference about the probable truth or falsity of the
scientific hypothesis, which is the final link in the chain shown in Figure 2.5-1. If errors occur in
the deductive or inductive links in the chain of logic, the statistical hypothesis that is tested may
have little or no bearing on the original scientific hypothesis, or the inference about the scientific
hypothesis may be incorrect. Consider the scientific hypothesis that cigarette smoking is
associated with high blood pressure. If this hypothesis is true, then a measure of central
tendency such as mean blood pressure should be higher for the population of smokers than for
nonsmokers. The statistical hypotheses are
H0: µ1 – µ2 ≤ 0
H1: µ1 – µ2 > 0
Figure 2.5-1 ▪ The evaluation of a scientific hypothesis begins with a deductive inference
and ends with an inductive inference concerning the probable truth or falsity of the
scientific hypothesis.
where µ1 and µ2 denote the unknown population means for smokers and nonsmokers,
respectively. The null hypothesis, µ1 – µ2 ≤ 0, states in effect that the mean blood pressure of
smokers is less than or equal to that of nonsmokers. The alternative hypothesis states that the
mean blood pressure of smokers is greater than that of nonsmokers. These hypotheses follow
logically from the original scientific hypothesis. Suppose the researcher formulates the
statistical hypotheses in terms of population variances, for example,
H0: σ1² ≤ σ2²
H1: σ1² > σ2²
where σ1² and σ2² denote the population variances of smokers and nonsmokers, respectively. A
statistical test of this null hypothesis, which states that the variance of blood pressure for the
population of smokers is less than or equal to the variance for nonsmokers, would have little
bearing on the original scientific hypothesis. However, it would be relevant if the researcher was
interested in determining whether the two populations differ in dispersion.
The reader should not infer that for any scientific hypothesis there is only one suitable null
hypothesis. A null hypothesis that states that the correlation between the number of cigarettes
smoked and blood pressure is equal to zero bears more directly on the scientific hypothesis
than the one involving population means. If cigarette smoking is associated with high blood
pressure, then there should be a positive correlation between cigarette consumption and blood
pressure. The statistical hypotheses are
H0: ρ ≤ 0
H1: ρ > 0
where ρ denotes the population correlation coefficient for cigarette consumption and blood
pressure. So we see that both creativity and deductive skill are required to formulate relevant
statistical hypotheses.
At this point in the review, I need to examine three concepts that play key roles in statistical
inference: sampling distributions, the central limit theorem, and test statistics.
Sampling distribution. Inferential statistics are concerned with reasoning from a sample to the
population—from the particular to the general. Such reasoning is based on a knowledge of the
sample-to-sample variability of a statistic—that is, on its sampling behavior. Before data have
been collected, I can speak of a sample statistic such as the mean Ȳ in terms of probability. Its value is
yet to be determined and will depend on which observations happen to be randomly selected
from the population. Thus, at this stage of an investigation, a sample statistic is a random
variable,8 because it is computed from observations obtained by random sampling. Like any
random variable, a sample statistic has a probability distribution that gives the probability
associated with each value of the statistic over all possible samples of the same size that could
be drawn from the population. The distribution of a statistic is called a sampling distribution
to distinguish it from the probability distribution for, say, an observation. In the discussion that
follows, I focus on the sampling distribution of the mean.
Central limit theorem. The characteristics of the sampling distribution of the mean are
succinctly stated in the central limit theorem, one of the most important theorems in statistics.
In one form, the theorem states that if random samples are obtained from a population with
mean μ and finite standard deviation σ, then as the sample size n increases, the distribution of
Ȳ approaches a normal distribution with mean μ and standard deviation (standard error) equal
to σ/√n. Probably the most important point is that regardless of the shape of the sampled
population, the means of sufficiently large samples will be nearly normally distributed. Just how
large is sufficiently large? The answer depends on the shape of the sampled population; the
more a population departs from the normal shape, the larger n must be. For most populations
encountered in the behavioral sciences and education, a sample size of 50 to 100 is sufficient
to produce a nearly normal sampling distribution of Ȳ. The tendency for the sampling
distribution of a statistic to approach the normal distribution as n increases helps to explain why
the normal distribution is so important in statistics.
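The theorem is easy to illustrate by simulation. The sketch below (a Python illustration using NumPy, not part of the text's own tooling) draws repeated samples from a strongly skewed population and checks that the sample means behave as the theorem predicts.

```python
import numpy as np

# Sampling distribution of the mean for a decidedly non-normal population:
# the exponential distribution (strongly right-skewed) with mu = 1, sigma = 1.
rng = np.random.default_rng(0)
n, reps = 100, 10_000

# Draw 10,000 samples of size n = 100 and record each sample mean.
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# The central limit theorem predicts a mean of mu = 1 and a standard
# deviation (standard error) of sigma / sqrt(n) = 1 / 10 = 0.10.
print(round(means.mean(), 2))
print(round(means.std(ddof=1), 2))
```

Despite the skewed parent population, a histogram of `means` would look nearly normal, with mean close to 1 and standard error close to 0.10.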
Test statistics. It is important to distinguish between sample statistics and test statistics. The
former are used to describe characteristics of samples or to estimate population parameters;
the latter are used to test hypotheses about population parameters. An example of a test
statistic is the t statistic:
t = (Ȳ – µ0) / (σ̂/√n)
Like all statistics, t has a sampling distribution. If the null hypothesis is true and Y is
approximately normally distributed or n is large, t is distributed as the t distribution. The
distribution is unimodal and symmetrical about a mean of zero. The variance of the t distribution
depends on the sample size or, more specifically, degrees of freedom. The term degrees of
freedom, denoted by df or ν (Greek letter nu and pronounced “new”), refers to the number of
scores whose values are free to vary, as I show in Section 3.2. For now I simply note that the
degrees of freedom for the t statistic described above are equal to ν = n − 1, and the variance of
the t distribution is equal to ν/(ν − 2). The t distribution is actually a family of distributions whose
shapes depend on the number of degrees of freedom. A comparison of three members of the t
family and the standard normal distribution is shown in Figure 2.5-2.
Figure 2.5-2 ▪ Graph of the t distribution for 4, 12, and ∞ degrees of freedom. The t
distribution for ν = ∞ is identical to the normal distribution.
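The variance formula ν/(ν − 2) and the approach to normality can be checked numerically; a brief Python sketch (SciPy assumed, as an illustration rather than the text's method) follows.

```python
from scipy import stats

# The t distribution's variance is v/(v - 2) for v > 2 degrees of freedom.
for v in (4, 12, 60):
    print(v, round(stats.t.var(v), 3), round(v / (v - 2), 3))

# Its tails shrink toward the standard normal as v grows: compare the
# upper .05 cutoffs for v = 4, v = 60, and the normal (v = infinity).
print(round(stats.t.ppf(0.95, 4), 3),    # 2.132
      round(stats.t.ppf(0.95, 60), 3),   # 1.671
      round(stats.norm.ppf(0.95), 3))    # 1.645
```

For ν = 4, the variance is 4/2 = 2.0; for ν = 12, it is 1.2; by ν = 60, it is nearly the normal distribution's 1.0, and the .05 cutoff 1.671 is close to the normal's 1.645.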
Suppose that I am interested in testing the scientific hypothesis that on the average, college
students who are active in student government at Big Ten universities have higher IQs than
college students who are not involved in student government. The corresponding statistical
hypothesis is µ > µ0, where µ denotes the unknown population mean of students who are
active in student government and µ0 denotes the population mean of college students who are
not involved. Also assume that the mean IQ of noninvolved college students, µ0, is known to
equal 115. The first step in testing the scientific hypothesis is to state the null and alternative
hypotheses:
H0: µ ≤ 115
H1: µ > 115
where µ0 has been replaced by 115, the known population mean IQ of college students who
are not involved in student government. As written, the null hypothesis is inexact because it
states a range of possible values for the population mean—all values less than or equal to 115.
However, one exact value is specified—µ = 115—and that is the value actually tested. If the null
hypothesis µ = 115 can be rejected, then any null hypothesis in which µ is less than 115 also
can be rejected. This follows because if µ = 115 is considered improbable because the sample
mean exceeds 115, any population mean less than 115 is considered even less probable.
The second step in testing a scientific hypothesis is to specify the test statistic. A relatively
small number of test statistics are used to evaluate hypotheses about population parameters.
As you see in Section 3.1, the principal ones are denoted by z, t, χ2 (chi square), and F. A test
statistic is called a z statistic if its sampling distribution is the standard normal distribution, a
test statistic is called a t statistic if its sampling distribution is a t distribution, and so on. The
choice of a test statistic is determined by (1) the hypothesis to be tested, (2) information about
the population that is known, and (3) assumptions about the population that appear to be
tenable. Which test statistic should be used to test the hypothesis μ ≤ 115? Because the
hypothesis concerns the mean of a single population, the population standard deviation is
unknown, and the population is assumed to be approximately normal, the appropriate test
statistic is
t = (Ȳ – µ0) / (σ̂/√n)
The next step in the hypothesis testing process is to choose a sample size. I want the sample
to be large enough but not too large. Later I show that there is a rational way of estimating the
size of a sample that will be large enough to reject a false null hypothesis. For now, I resort to
the time-honored practice of arbitrarily specifying a sample size that I think is large enough—
say, 61. It turns out that this sample size is not large enough. Once the test statistic and
sample size have been specified, the sampling distribution can be specified: It is the t sampling
distribution for n – 1 = 60 degrees of freedom.
When a decision about a population is based on information from a sample, there is always the
risk of making an error. I might decide that μ > 115 when in fact μ ≤ 115. The fourth step in the
hypothesis-testing process is to specify an acceptable risk of making this kind of error—that is,
rejecting the null hypothesis when it is true. According to hypothesis-testing conventions, a
probability of .05 is usually the largest risk a researcher should be willing to take of rejecting a
true null hypothesis—deciding, for example, that μ > 115 when in fact μ ≤ 115. Such a
probability, called a level of significance, is denoted by the Greek letter α. For α = .05 and H1:
μ > 115, the region for rejecting the null hypothesis, called the critical region, is shown in
Figure 2.5-3. The location and size of the critical region are determined, respectively, by the
alternative hypothesis and α. A decision to adopt the .05, .01, or any other level of significance
is based on hypothesis-testing conventions that have evolved since the 1920s. I return to the
problem of selecting a significance level later.
Figure 2.5-3 ▪ Sampling distribution of the t statistic, given that the null hypothesis is
true. The critical region, which corresponds in this example to the upper .05 portion of
the sampling distribution, defines values of t that are improbable if the null hypothesis μ
≤ 115 is true. Hence, if the t statistic falls in the critical region, the null hypothesis is
rejected. The value of t for 61 − 1 = 60 degrees of freedom that cuts off the upper .05
portion of the sampling distribution is called the critical value and is denoted by t.05,60.
This value can be found in the t table in Appendix Table E.3 and is 1.671.
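The table lookup can also be done in software; a minimal Python sketch (SciPy assumed, as an illustration) computes the same critical value.

```python
from scipy import stats

# Critical value that cuts off the upper .05 of the t distribution with
# 60 degrees of freedom: the table lookup t.05,60 done in software.
alpha, df = 0.05, 60
t_crit = stats.t.ppf(1 - alpha, df)
print(round(t_crit, 3))  # 1.671
```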
The final step in testing H0: μ ≤ 115 is to obtain a random sample of size 61 from the population
of students who are active in student government at Big Ten universities, compute the test
statistic, and make a decision. The decision rule is as follows:
Reject the null hypothesis if the test statistic falls in the critical region; otherwise, do
not reject the null hypothesis. If the null hypothesis is rejected, conclude that the
mean IQ of students who are active in student government at Big Ten universities is
higher than that of college students who are not involved; if the null hypothesis is not
rejected, do not draw this conclusion.
The procedures just described for testing H0: μ ≤ 115 can be summarized in five steps and a
decision rule:
1. State the statistical hypotheses: H0: μ ≤ 115; H1: μ > 115.
2. Specify the test statistic: the one-sample t statistic.
3. Specify the sample size, n = 61, and the sampling distribution: the t distribution with ν = n – 1 = 60 degrees of freedom.
4. Specify the level of significance: α = .05.
5. Obtain a random sample of size n, compute t, and make a decision using the following decision rule.
Decision rule:
Reject the null hypothesis if t falls in the upper 5% of the sampling distribution of t;
otherwise, do not reject the null hypothesis. If the null hypothesis is rejected,
conclude that the mean IQ of students who are active in student government at Big
Ten universities is higher than that of college students who are not involved; if the null
hypothesis is not rejected, do not draw this conclusion.
where
t = (Ȳ – µ0) / (σ̂/√n)
is used to test the hypothesis µ ≤ µ0. Recall that Ȳ is the mean of a random sample from the
population of interest, µ0 is the hypothesized value of the population mean, σ̂ estimates the
unknown population standard deviation, n is the size of the sample used to compute Ȳ and σ̂,
and µ is the unknown mean of the population.
Assume that a random sample of 61 students who are active in student government has been
obtained from the population of college students at the Big Ten universities and that the sample
estimates of the population mean and standard deviation are, respectively, Ȳ = 117 and σ̂ = 12.5. The
number 117 is called a point estimate of µ; it is the best guess I can make about the unknown
value of µ. How improbable is a sample mean of 117 if the population mean is really 115?
Would a mean of 117 occur five or fewer times in 100 experiments by chance? Stated another
way, is it reasonable to believe that the population mean is really less than or equal to 115 if I
have obtained a sample mean of 117? To answer this question, I compute the t statistic:
t = (117 – 115) / (12.5/√61) = 2 / 1.60 = 1.25
If the t falls in the upper 5% of the sampling distribution of t, I have reason to believe that µ is
not equal to 115. Such a result would occur five or fewer times in 100 by chance.
According to the t table in Appendix E.3, the value of t that cuts off the upper .05 region of the
sampling distribution for 61 − 1 = 60 degrees of freedom is 1.671. This value of t, called the
critical value, is denoted by t.05, 60, where the subscript .05 refers to the proportion of the
sampling distribution that lies above the critical value and 60 is the degrees of freedom
associated with the denominator of the t statistic. Because t = 1.25 is less than t.05,60 = 1.671,
the observed t falls short of the upper .05 critical region. The critical region is shown in Figure
2.5-3. According to my decision rule, I fail to reject the null hypothesis and therefore conclude
that the sample data do not support the hypothesis that the mean IQ of students who are active
in student government at Big Ten universities is higher than that of college students who are
not active in student government. Two points need to be emphasized. First, I have not proven
that the null hypothesis is true—only that the evidence does not warrant its rejection. Second,
my conclusion has been restricted to the population from which I sampled, namely, college
students at Big Ten universities.
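The whole decision procedure can be reproduced from the summary statistics; a short Python sketch (SciPy assumed, as an illustration rather than the author's method) follows.

```python
import math
from scipy import stats

# One-tailed, one-sample t test of H0: mu <= 115 from summary statistics.
mu0, ybar, s, n = 115, 117, 12.5, 61
t = (ybar - mu0) / (s / math.sqrt(n))   # (117 - 115)/(12.5/sqrt(61)) = 1.25
t_crit = stats.t.ppf(0.95, n - 1)       # critical value t.05,60 = 1.671

# t falls short of the critical region, so H0 is not rejected.
print(round(t, 2), round(t_crit, 3), t >= t_crit)
```

Because 1.25 < 1.671, the final comparison is `False`: the null hypothesis is not rejected, matching the conclusion in the text.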
Three explanations can be advanced to account for my failure to reject the null hypothesis: (1)
The null hypothesis is true and shouldn't be rejected; (2) the null hypothesis is false, but the t
test lacked sufficient power to reject the null hypothesis; or (3) the null hypothesis is false, but
the particular sample belied this fact—I obtained an unrepresentative sample from the
population of students who are active in student government. In the following sections, I
examine the second explanation for not rejecting the null hypothesis and discuss a number of
concepts that round out my review of hypothesis testing.
A statistical test for which the critical region is in either the upper or the lower tail of the
sampling distribution is called a one-tailed test. If the critical region is in both the upper and
lower tails of the sampling distribution, the statistical test is called a two-tailed test.
A one-tailed test is used whenever the researcher makes a directional prediction about the
phenomenon of interest—for example, that the mean IQ of students who are active in student
government is higher than that of noninvolved students. The statistical hypotheses associated
with this scientific hypothesis are
H0: µ ≤ µ0
H1: µ > µ0
These hypotheses are called directional hypotheses or one-sided hypotheses. The region
for rejecting the null hypothesis is shown in Figure 2.5-3. If the scientific hypothesis stated that
students who are active in student government have lower IQs than noninvolved students, the
following statistical hypotheses would be appropriate:
H0: µ ≥ µ0
H1: µ < µ0
The region for rejecting this null hypothesis is shown in Figure 2.5-4(a). To reject the null
hypothesis, an observed t statistic would have to be less than or equal to the critical value
–t.05,60 = −1.671.
Figure 2.5-4 ▪ (a) Critical region of the t statistic for a one-tailed test; H0: μ ≥ μ0; H1: μ <
μ0; α = .05. (b) Critical regions for a two-tailed test; H0: μ = μ0; H1: μ ≠ μ0; α = .025 + .025
= .05.
Often researchers do not have sufficient information to make a directional prediction about a
population parameter; they simply believe the parameter is not equal to the value specified by
the null hypothesis. This situation calls for a two-tailed test. The statistical hypotheses for a
two-tailed test have the following form:
H0: µ = µ0
H1: µ ≠ µ0
These hypotheses are called nondirectional hypotheses or two-sided hypotheses. For two-
sided hypotheses, the regions for rejecting the null hypothesis are in both the upper and the
lower tails of the sampling distribution. The two-tailed critical regions are shown in Figure 2.5-
4(b).
In summary, a one-sided or directional hypothesis is called for when the researcher's original
hunch is expressed in such terms as “more than,” “less than,” “increased,” or “decreased.”
Such a hunch indicates that the researcher has quite a bit of knowledge about the research
area. This knowledge could come from previous research, a pilot study, or perhaps a theory. If
the researcher is interested only in determining whether the independent variable affects the
dependent variable without specifying the direction of the effect, a two-sided or nondirectional
hypothesis should be used. Generally, significance tests in the behavioral sciences are two-
tailed because most researchers lack the information necessary to formulate directional
predictions.
How does the choice of a one- or two-tailed test affect the probability of rejecting a false null
hypothesis? A researcher is more likely to reject a false null hypothesis with a one-tailed test
than with a two-tailed test if the critical region has been placed in the correct tail. A one-tailed
test places all of the α area, say .05, in one tail of the sampling distribution. A two-tailed test
divides the α = .05 area between the two tails with .025 in one tail and .025 in the other tail. To
illustrate, assume that α = .05 and the following hypotheses for a two-tailed test have been
advanced:
H0: µ = µ0
H1: µ ≠ µ0
If the t statistic falls in either the upper or lower .025 region of the sampling distribution, the
result is said to be significant at the .05 level of significance because .025 + .025 = .05. The
values of t for 60 degrees of freedom that cut off the upper and lower .05/2 = .025 regions are
t.05/2, 60 = 2.000 and −t.05/2, 60 = −2.000, respectively. An observed t statistic is significant at
the .05 level if its value is greater than or equal to 2.000 or less than or equal to −2.000 or, more
simply, if its absolute value, |t|, is greater than or equal to 2.000.
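The two cutoffs quoted above can be verified numerically; a brief Python sketch (SciPy assumed, as an illustration):

```python
from scipy import stats

# For alpha = .05 and 60 degrees of freedom, a one-tailed test puts all
# of alpha in one tail; a two-tailed test puts alpha/2 = .025 in each.
df, alpha = 60, 0.05
one_tailed = stats.t.ppf(1 - alpha, df)      # 1.671
two_tailed = stats.t.ppf(1 - alpha / 2, df)  # 2.000
print(round(one_tailed, 3), round(two_tailed, 3))
```

The larger two-tailed cutoff (2.000 versus 1.671) is exactly why a correctly directed one-tailed test rejects a false null hypothesis more readily.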
Now consider the hypotheses for a one-tailed test where the researcher believes that the
population mean is less than μ0:
H0: μ ≥ μ0
H1: μ < μ0
Again, let α = .05. If the t statistic falls in the lower tail of the sampling distribution—that is, if t is
less than or equal to −1.671—the result is said to be significant at the .05 level of significance.
The critical regions and critical values for the one- and two-tailed tests are shown in Figures
2.5-4(a) and (b), respectively. From an inspection of these figures, it should be apparent that
the difference necessary to reach the critical region for a two-tailed test is larger than that
required for a one-tailed test. Consequently, a researcher is less likely to reject a false null
hypothesis with a two-tailed test than with a one-tailed test.
The term power refers to the probability of rejecting a false null hypothesis. A one-tailed test is
more powerful than a two-tailed test if the researcher's hunch about the true difference µ – µ0
is correct—that is, if the alternative hypothesis places the critical region in the correct tail of the
sampling distribution. If the directional hunch is incorrect, the rejection region will be in the
wrong tail, and the researcher will most certainly fail to reject the null hypothesis even though it
is false. A researcher is rewarded for making a correct directional prediction and penalized for
making an incorrect directional prediction. In the absence of sufficient information for using a
one-tailed test, the researcher should play it safe and use a two-tailed test.
When the null hypothesis is tested, a researcher's decision will be either correct or incorrect. An
incorrect decision can be made in two ways. The researcher can reject the null hypothesis
when it is true; this is called a Type I error. Alternatively, the researcher can fail to reject the
null hypothesis when it is false; this is called a Type II error. Likewise, a correct decision can be
made in two ways. If the null hypothesis is true and the researcher does not reject it, a correct
acceptance has been made; if the null hypothesis is false and the researcher rejects it, a
correct rejection has been made. The two kinds of correct decisions and the two kinds of
errors are illustrated in Table 2.5-1.
The probability of making a Type I error is determined by the researcher when the level of
significance, α, is specified. If α is specified as .05, the probability of making a Type I error is
.05. The level of significance also determines the probability of a correct acceptance of a true
null hypothesis because this probability is equal to 1 – α.
The probability of making a Type II error is denoted by β. The probability of making a correct
rejection, the power, is denoted by 1 – β. These two probabilities are determined by a number
of variables: (1) the level of significance adopted, α; (2) the size of the sample, n; (3) the size of
the population standard deviation, σ; and (4) the magnitude of the difference between µ and
µ0. The two probabilities also are affected by the researcher's decision to use a one- or two-
tailed test. To compute the probability of making a Type II error and power, it is necessary either
to know μ, the true population mean, or to specify a value of μ that is sufficiently different from
μ0 to be of practical value. The latter approach is usually necessary because in any practical
hypothesis-testing situation, μ is unknown. Also, the population standard deviation is rarely
known. Sample data can be used to estimate this parameter.
Hypothesis testing involves a number of conventions. As you have seen, one convention is to
set the probability of a Type I error equal to or less than .05. Another convention is to design an
experiment so that the probability of a Type II error is equal to or less than .20. If β = .20, the
power of the test is 1 – β = .80. A power of .80 is considered by many researchers to be the
minimum acceptable power. If the probability of rejecting a false null hypothesis is less than
.80, many researchers would choose not to perform the experiment or would redesign the
experiment so that its power is greater than or equal to .80. As you will see, there are ways to
increase power. Before examining these, I compute the power of the test of the hypothesis that
the mean IQ of college students who are active in student government is higher than that of
students who are not involved. The statistical hypotheses are H0: μ ≤ 115 and H1: μ > 115.
To compute the power of the t test in the student government example, it is necessary to know
μ, the value of the population mean, or specify a value of μ that is sufficiently different from μ0
to be worth detecting. I'll denote the value of the population mean that I am interested in
detecting by μ′. Suppose that this value is μ′ = 117.5. I am saying in effect that any IQ
difference less than | μ′ – μ0 | = | 117.5 − 115 | = 2.5 IQ points is too small to be of practical
significance. Recall that for this example, σ̂ = 12.5, n = 61, and t.05,60 = 1.671. To compute an
estimate of power, I need one more bit of information—the value of Ȳ that cuts off the upper .05
region of the null hypothesis sampling distribution. I'll denote this mean by Ȳcrit. I can compute
Ȳcrit by rearranging the terms in the t formula as follows:
Ȳcrit = µ0 + t.05,60(σ̂/√n) = 115 + 1.671(12.5/√61) = 117.67
Thus, a mean of 117.67 cuts off the upper .05 region of the null hypothesis sampling
distribution. In Figure 2.5-5, falls on the boundary between the reject and nonreject
regions.
Estimates of the sizes of the regions corresponding to a Type II error and a correct rejection
(labeled β and 1 – β in Figure 2.5-5) can be obtained by computing a t statistic for the difference
Ȳcrit – µ′. The t statistic is
t = (Ȳcrit – µ′) / (σ̂/√n) = (117.67 – 117.50) / (12.5/√61) = 0.11
Figure 2.5-5 ▪ Regions that correspond to making a Type I error, α, and a Type II error, β, if
µ′ = 117.50. The mean that cuts off the upper .05 region of the sampling distribution
under H0 is denoted by Ȳcrit and is equal to 117.67. The value of β is given by the area of
the H1 sampling distribution that lies below Ȳcrit.
The size of the area above t = 0.11, the 1 – β region, can be obtained with the aid of a statistical
calculator or computer. A simple way to find this area is to use Microsoft Excel's TDIST function.
To access this function, select “Insert” in Excel's Menu bar and then the menu command
“Function.” You then select the TDIST function from the list of functions. After you access the
TDIST function,
TDIST(x, deg_freedom, tails)
you replace “x” with the absolute value of the t statistic, “deg_freedom” with the degrees of
freedom for the t statistic, and “tails” with 1 because the area always lies in only one of the
distribution tails. The size of the 1 – β area for the one-tailed t = 0.11 with 60 degrees of
freedom is obtained from
TDIST(0.11, 60, 1)
and is equal to .46. Thus, if the mean IQ of students who are active in student government is μ′
= 117.5, the probability of making a correct rejection (power) is 1 – β = .46. The probability of
making a Type II error (β) is 1 – (1 – β) = 1 – .46 = .54. Figure 2.5-5 shows the regions
corresponding to these two probabilities. A power of .46 is considerably less than .80, the
minimum acceptable power according to convention. Table 2.5-2 summarizes the possible
decision outcomes when μ = 115 and when μ′ = 117.5. In this example, the size of the region
corresponding to making a correct decision is larger when the null hypothesis is true (1 – α =
.95) than when the null hypothesis is false (1 – β = .46). It also is apparent that the size of the
region corresponding to making a Type I error (α = .05) is much smaller than the probability of
making a Type II error (β = .54). In most research situations, the researcher follows the
convention of setting α = .05 or α = .01. This convention of choosing a small numerical value for
α is based on the notion that making a Type I error is bad and should be avoided.
Unfortunately, as the probability of a Type I error is made smaller and smaller, the probability of
a Type II error increases and vice versa. This can be seen from an examination of Figure 2.5-5.
If the vertical line cutting off the upper α region is moved to the right or to the left, the β region
designated in the lower distribution is made, respectively, larger or smaller.
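The power computation above can be reproduced end to end. The Python sketch below (SciPy assumed) follows the same central-t approximation as the text's TDIST computation, rather than the exact noncentral-t calculation; the name `y_crit` for the boundary mean is this sketch's own label.

```python
import math
from scipy import stats

# Power of the one-tailed t test of H0: mu <= 115 when the mean worth
# detecting is mu' = 117.5 (values from the student government example).
mu0, mu_alt, s, n, alpha = 115, 117.5, 12.5, 61, 0.05
se = s / math.sqrt(n)                    # estimated standard error, ~1.60
t_crit = stats.t.ppf(1 - alpha, n - 1)   # critical value t.05,60 = 1.671

# Sample mean that falls on the boundary of the critical region (~117.67).
y_crit = mu0 + t_crit * se

# t statistic for the difference between the boundary mean and mu',
# evaluated in the H1 sampling distribution (central-t approximation,
# matching the text's TDIST computation).
t_beta = (y_crit - mu_alt) / se          # ~0.11
power = stats.t.sf(t_beta, n - 1)        # area above t_beta, ~.46
beta = 1 - power                         # Type II error probability, ~.54
print(round(y_crit, 2), round(t_beta, 2), round(power, 2), round(beta, 2))
```

The printed values reproduce the text's figures: a boundary mean of 117.67, t = 0.11, power of about .46, and β of about .54.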
As you have seen, four factors determine the power of a test: (1) the level of significance, α; (2)
the size of the sample, n; (3) the estimate of the population standard deviation, σ̂; and (4) the
magnitude of the difference between µ′ and µ0. The power of a test can be increased by
making an appropriate change in any of these factors. For example, power can be increased by
adopting a larger value for α, say .10 or .20; increasing the sample size; refining the
experimental design or measuring procedure so as to decrease the population standard
deviation estimate; and increasing the difference between µ′ and µ0 that is considered worth
detecting. Often the simplest way to increase the power of a statistical test is to increase the
sample size.
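The effect of sample size on power can be seen by repeating the computation for several values of n; a Python sketch (SciPy assumed; the helper `power` and its central-t approximation are this illustration's own):

```python
import math
from scipy import stats

# Power of the one-tailed t test as a function of sample size, holding
# alpha = .05, the standard deviation estimate 12.5, and the difference
# worth detecting (117.5 - 115 = 2.5 IQ points) fixed.
def power(n, mu0=115.0, mu_alt=117.5, s=12.5, alpha=0.05):
    se = s / math.sqrt(n)
    y_crit = mu0 + stats.t.ppf(1 - alpha, n - 1) * se
    return stats.t.sf((y_crit - mu_alt) / se, n - 1)

for n in (61, 100, 200, 300):
    print(n, round(power(n), 2))
```

Under this approximation, power rises from about .46 at n = 61 to beyond the conventional .80 minimum somewhere between n = 100 and n = 200.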
Hypothesis testing involves a number of conventions. I hope that this discussion has dispelled
the magical aura that surrounds the .05 level of significance; its use in hypothesis testing is
simply a convention. A level of significance is the probability of committing an error in rejecting
a true null hypothesis. It says nothing about the importance or practical significance of a
result.9
Effect Magnitude
Researchers want to answer two basic questions from their research: (1) Is an observed
treatment effect real, or should it be attributed to chance? and (2) If the effect is real, how large
is it? The first question concerning whether chance is a viable explanation for an observed
treatment effect is usually addressed with a null hypothesis significance test. A significance test
tells the researcher the probability of obtaining the effect or a more extreme effect if the null
hypothesis is true. The test does not address the question of how large the effect is. This
question is usually addressed with descriptive statistics and measures of effect magnitude. The
most widely used measures of effect magnitude fall into one of two categories: measures of
effect size and measures of strength of association. I describe Cohen's and Hedges's measures
of effect size here and defer a discussion of measures of strength of association to Section 4.4.
In 1969, Cohen introduced the first effect-size measure that was explicitly labeled as such. His
measure, denoted by d, expresses the size of the absolute difference µ – µ0 in units of the
population standard deviation:
d = |µ – µ0| / σ
Cohen recognized that the size of the difference µ – µ0 is influenced by the scale of
measurement of the means. Cohen divided the difference between the means by σ to rescale
the difference in units of the amount of variability in the data. What made Cohen's contribution
unique is that he provided guidelines for interpreting the magnitude of d.
According to Cohen (1992), a medium effect of 0.5 is visible to the naked eye of a careful
observer. Several surveys have found that 0.5 approximates the average size of observed
treatment effects in various fields. A small effect of 0.2 is noticeably smaller than medium but
not so small as to be trivial. A large effect of 0.8 is the same distance above medium as small is
below it. These operational definitions turned his measure of effect size into a much more
useful statistic. A sample estimator of Cohen's d is obtained by replacing µ with Ȳ and σ with σ̂.
For experiments with two sample means, Larry Hedges proposed a modification of Cohen's d
as follows:
g = (Ȳ1 − Ȳ2) / σ̂Pooled,
where
σ̂Pooled = √{[(n1 − 1)σ̂1² + (n2 − 1)σ̂2²] / (n1 + n2 − 2)}.
From Cohen's rule of thumb, a small effect for the student IQ data is one for which |µ − µ0| =
|117.5 − 115| = 2.5, because
d = 2.5/12.5 = 0.2.
Similarly, medium and large effects correspond to |121.25 − 115| = 6.25 and |125 − 115| = 10,
respectively, because
6.25/12.5 = 0.5 and 10/12.5 = 0.8.
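The rule-of-thumb computations above can be sketched in a few lines of Python. The helper names are mine; σ = 12.5 is the population standard deviation implied by the chapter's IQ benchmarks (2.5/12.5 = 0.2, and so on), and the pooled standard deviation follows Hedges's formula.

```python
import math

def cohens_d(mean, mu0, sigma):
    """Cohen's d for one sample: |mean - mu0| / sigma."""
    return abs(mean - mu0) / sigma

def hedges_g(m1, m2, s1, s2, n1, n2):
    """Hedges's modification of d for two sample means,
    using the pooled standard deviation."""
    pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2)
                       / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# The small/medium/large benchmarks with mu0 = 115 and sigma = 12.5:
for mean, label in [(117.5, "small"), (121.25, "medium"), (125.0, "large")]:
    print(label, cohens_d(mean, 115, 12.5))
```

With equal sample standard deviations, the pooled estimate reduces to that common value, so `hedges_g(10, 8, 4, 4, 20, 20)` is simply 2/4 = 0.5.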
Reporting p Values
Most research reports and computer printouts contain a statistic called a probability value, or
simply p value. A p value is the probability of obtaining a value of the test statistic that is equal
to or more extreme than the one observed, given that the null hypothesis is true. Usually p
values are obtained with the aid of a statistical calculator or computer statistical program. p
values for a variety of sampling distributions also can be obtained with Microsoft's Excel
program. To illustrate, I use the Excel TDIST function
described earlier to obtain the p value of the t statistic for the students who are active in student
government. After the TDIST function has been accessed, I replace x with the value of t, which
is 1.25, degrees of freedom with 60, and tails with 1 as follows:
TDIST(1.25, 60, 1) = .108.
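Readers without Excel can approximate the same one-tailed p value by integrating the t density numerically. This is a sketch, not part of the text; the function names are mine, and Simpson's rule over a finite upper bound stands in for the exact tail integral.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_sf(t, df, upper=60.0, n=20000):
    """One-tailed p value P(T > t), via composite Simpson's rule on [t, upper].
    The density beyond `upper` is negligible for the df used here."""
    h = (upper - t) / n
    s = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return s * h / 3

p = t_sf(1.25, 60)   # corresponds to TDIST(1.25, 60, 1)
print(round(p, 3))
```

Doubling the result gives the two-tailed p value, matching TDIST with tails = 2.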
In reporting the results of hypothesis tests in the text portion of publications, the Publication
Manual of the American Psychological Association (American Psychological Association, 2010)
recommends reporting in order the test statistic that was used, the degrees of freedom (in
parentheses) associated with the test statistic, the value of the test statistic, the exact p value to
two or three decimal places, and effect size. For example, in describing the results of the
college student experiment, I could report that “the difference between the means of students
who are active in student government and those who are not active, 117 − 115 = 2, was not
statistically significant, t(60) = 1.25, p = .108, d = 0.16.” The Publication Manual says to “report
p values less than .001 as p < .001” (American Psychological Association, 2010, p. 114).
Earlier, I formulated a hypothesis-testing decision rule in terms of the test statistic and the
critical region: Reject the null hypothesis if the test statistic falls in the critical region; otherwise,
do not reject the null hypothesis. A decision rule also can be formulated in terms of a p value
and a level of significance. The rule is as follows:
Decision rule:
Reject the null hypothesis if the p value is less than or equal to the preselected level
of significance, that is, reject the null hypothesis if p ≤ α; otherwise, do not reject the
null hypothesis.
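The decision rule can be expressed as a one-line function; a minimal sketch (the function name is mine):

```python
def decide(p_value, alpha=0.05):
    """Decision rule: reject H0 if p <= alpha; otherwise do not reject."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

print(decide(0.108))   # p value from the student government example
print(decide(0.03))
```

Note that the rule rejects when p exactly equals α, because the rule is p ≤ α rather than p < α.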
The inclusion of a p value in a research report provides useful information because it enables a
reader to discern those significance levels for which the null hypothesis could have been
rejected. The p values provided in computer printouts are usually appropriate for two-sided
hypotheses. If your null hypothesis is directional, the two-tailed p value given in the computer
printout should be divided by 2. Of course, the p value for a one-sided hypothesis is only
meaningful if the data are consistent with the alternative hypothesis. Before leaving the subject
of p values, I want to emphasize that a p value is related to statistical significance; it says
nothing about practical significance.
It has become apparent to many researchers that null hypothesis significance testing has some
shortcomings.10 A null hypothesis significance test
addresses the question, Is chance a likely explanation for the results that have been obtained?
The test does not address the question, Are the results important or useful? There are other
criticisms. For example, null hypothesis significance testing and scientific inference address
different questions. In scientific inference, what you want to know is the conditional probability
that the null hypothesis (H0) is true, given that you have obtained a set of data (D); that is,
Prob(H0|D). What null hypothesis significance testing tells you is the conditional probability of
obtaining these data or more extreme data if the null hypothesis is true, Prob(D|H0). Obtaining
data for which Prob(D|H0) is low does not imply that Prob(H0|D) also is low.
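A small numeric illustration makes the point; all of the probabilities below are assumed for illustration and do not come from the text.

```python
# Hypothetical illustration: a low Prob(D|H0) does not force a low Prob(H0|D).
# Assumed numbers: prior Prob(H0) = 0.9, Prob(D|H0) = 0.05, Prob(D|H1) = 0.10.
p_h0 = 0.9
p_d_given_h0 = 0.05
p_d_given_h1 = 0.10

# Bayes' theorem: Prob(H0|D) = Prob(D|H0)Prob(H0) / Prob(D)
p_d = p_d_given_h0 * p_h0 + p_d_given_h1 * (1 - p_h0)
p_h0_given_d = p_d_given_h0 * p_h0 / p_d
print(round(p_h0_given_d, 3))  # ≈ .818
```

Here Prob(D|H0) = .05 would be declared significant at the .05 level, yet under these assumed priors the null hypothesis remains more likely true than false given the data.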
A third criticism of null hypothesis significance testing is that it is a trivial exercise. John Tukey
(1991) said, “All we know about the world teaches us that the effects of A and B are always
different—in some decimal place. Thus asking ‘Are the effects different?’ is foolish” (p. 100).
Hence, because all null hypotheses are false, Type I errors cannot occur and statistically
significant results are ensured if large enough samples are used. Bruce Thompson (1998)
captured the essence of this view when he wrote, “Statistical testing becomes a tautological
search for enough subjects to achieve statistical significance. If we fail to reject, it is only
because we've been too lazy to drag in enough subjects” (p. 799). In the real world, all null
hypotheses are false. Hence, a decision to reject simply means that the research methodology
had adequate power to detect a true state of affairs, which may or may not be a large effect or
even a useful effect.
The list of criticisms goes on. I mention one more. By adopting a fixed significance level such as
α = .05, a researcher turns a continuum of uncertainty about a true state of affairs into a
dichotomous reject/do-not-reject decision. Researchers ordinarily react to a p value of .06 with
disappointment and even dismay but not to p values of .05 or smaller. Rosnow and Rosenthal's
(1989) comment is pertinent: “Surely, God loves the .06 nearly as much as the .05.” Many
psychologists believe that an emphasis on null hypothesis significance tests and p values
distracts researchers from the main business of science—understanding and interpreting the
outcomes of research. An alternative approach to statistical inference using confidence intervals
is described next.
Confidence Intervals
A confidence interval is a segment or interval on the real number line that has a high
probability of including a population parameter. Confidence intervals can be either one- or two-
sided. A one-sided confidence interval is constructed when the researcher has made a
directional prediction about the population mean; otherwise, a two-sided interval is constructed.
The computation of a one-sided confidence interval is illustrated for the mean IQ of college
students who are active in student government at Big Ten universities. Recall that the null
hypothesis is
H0: µ ≤ 115.
Also, Ȳ = 117, σ̂ = 12.5, n = 61, and α = .05. A one-sided 95% confidence interval for the
population mean is
Ȳ − t.05,60(σ̂/√n) < µ
117 − 1.671(12.5/√61) < µ
114.33 < µ.
The number 114.33 is the lower limit of the open confidence interval12 and is denoted by LL. I
can be fairly confident that the population mean is greater than 114.33. The degree of my
confidence is represented by the confidence coefficient 100(1 – .05)% = 95%, where α = .05
is the level of significance. It helps to visualize the confidence interval as a segment on the
number line as follows:
The confidence interval indicates values of the parameter μ that are consistent with the
observed sample statistic. It also contains a range of values for which the null hypothesis is
nonrejectable at the .05 level of significance. To put it another way, the confidence interval can
be used to test all one-sided hypotheses of interest, not just H0: μ ≤ 115. For example, I know
that H0: μ ≤ 113 and H0: μ ≤ 112 would be rejected, but not H0: μ ≤ 115 or H0: μ ≤ 116. These
decisions follow because 113 and 112 are not included in the confidence interval, whereas 115
and 116 are included.
Suppose that I had proposed a two-sided null hypothesis, H0: µ = 115, for the students who
are active in student government. A two-sided 100(1 – .05)% = 95% confidence interval for the
population mean is
Ȳ − t.05/2,60(σ̂/√n) < µ < Ȳ + t.05/2,60(σ̂/√n)
117 − 2.000(12.5/√61) < µ < 117 + 2.000(12.5/√61)
113.80 < µ < 120.20,
where the lower and upper confidence limits are, respectively, LL = 113.80 and UL = 120.20. I
can be 95% confident that the open interval 113.80 to 120.20 contains the population mean.
The Publication Manual of the American Psychological Association (American Psychological
Association, 2010) says to express the confidence interval in the text portion of a publication as
95% CI[113.80, 120.20].
I could increase my confidence that the interval includes the population mean by replacing
t.05/2,60 with t.01/2,60. The resulting interval,
117 − 2.660(12.5/√61) < µ < 117 + 2.660(12.5/√61)
112.74 < µ < 121.26,
is a 100(1 – .01)% = 99% confidence interval. Notice that as my confidence that I have captured
µ increases, so does the size of the interval. This is illustrated in the following figures.
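The interval arithmetic can be sketched as follows. The function names are mine; Ȳ = 117 and n = 61 come from the example, σ̂ = 12.5 is the value the chapter's numbers imply, and the critical values t(.05,60) = 1.671, t(.025,60) = 2.000, and t(.005,60) = 2.660 are read from a standard t table.

```python
import math

def one_sided_ll(mean, s, n, t_crit):
    """Lower limit of a one-sided confidence interval: mean - t * s/sqrt(n)."""
    return mean - t_crit * s / math.sqrt(n)

def two_sided_ci(mean, s, n, t_crit):
    """Two-sided confidence interval: mean -/+ t * s/sqrt(n)."""
    half = t_crit * s / math.sqrt(n)
    return mean - half, mean + half

print(round(one_sided_ll(117, 12.5, 61, 1.671), 2))               # 95% one-sided LL
print([round(x, 2) for x in two_sided_ci(117, 12.5, 61, 2.000)])  # 95% two-sided
print([round(x, 2) for x in two_sided_ci(117, 12.5, 61, 2.660)])  # 99% two-sided
```

Running the sketch reproduces LL = 114.33 for the one-sided interval and [113.80, 120.20] for the two-sided 95% interval, and shows how the 99% interval widens.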
1.Terms to remember:
a.t test for independent samples (2.2)
b.completely randomized design (2.2)
c. experimental design model equation (2.2)
d.linear model (2.2)
e.repeated measures (2.2)
f. subject matching (2.2)
g.t test for dependent samples (2.2)
h.block (2.2)
i. randomized block design (2.2)
j. Latin square design (2.2)
k. building block design (2.2)
l. completely randomized factorial design (2.2)
m.treatment combination (2.2)
n.completely crossed treatments (2.2)
o.randomized design (2.3)
p.scientific hypothesis (2.5)
q.statistical inference (2.5)
r. statistical hypothesis (2.5)
s. null hypothesis (2.5)
t. alternative hypothesis (2.5)
u.hypothesis testing (2.5)
v. statistical test (2.5)
w.random variable (2.5)
x. sampling distribution (2.5)
y. central limit theorem (2.5)
z. test statistic (2.5)
aa. degrees of freedom (2.5)
ab. z statistic (2.5)
ac. t statistic (2.5)
ad. level of significance (2.5)
ae. critical region (2.5)
af. decision rule (2.5)
ag. point estimate (2.5)
ah. critical value (2.5)
ai. one- and two-tailed tests (2.5)
aj. directional prediction (2.5)
letters (a1 is preparing an outline of a letter prior to dictating it, a2 is making a list of the
major points to be covered prior to dictating a letter, and a3 is silently dictating a letter
prior to dictating it). Treatment level a1 was paired with b1 and c1, b2 and c3, and b3
and c2; a2 was paired with b1 and c2, and so on. The dependent variable was the
average time taken to dictate five letters following 2 weeks of practice with the assigned
practice procedure.
*c.The effects of isolation at 90, 120, 150, and 180 days of age on subsequent combative
behavior of Mongolian gerbils (Meriones unguiculatus) were investigated. Eighty gerbils
were randomly assigned to one of the four isolation conditions with 20 in each condition.
The number of combative encounters was recorded when the gerbils were 2 years old.
It was hypothesized that the number of combative encounters would increase with
earlier isolation.
d.Scores on the Conforming-Compulsive scale of the Millon Clinical Multiaxial Inventory
are known to be positively correlated with the dependent variable in Exercise 2(a).
Suppose that this scale was used to form blocks of three boys who had similar
Conforming-Compulsive scores and that the three boys in each block were randomly
assigned to the three treatment conditions.
e.Dreams of a random sample of 50 English-Canadian and 50 French-Canadian students
were analyzed in terms of the proportion of achievement-oriented elements such as
achievement imagery, success, and failure. The students were matched on the basis of
reported frequency of dreaming. It was hypothesized that the French-Canadian
students’ dreams would contain a higher proportion of achievement-oriented elements.
f. Pictures of human faces posing six distinct emotions (treatment A) were obtained. The
faces and their mirror reversals were split down the midline, and left-side and right-side
composites were constructed. This resulted in 12 pictures. Six hundred introductory
psychology students were randomly assigned to one of the 12 groups with 50 in each
group. Each student rated one of the 12 pictures on a 7-point scale in terms of the
intensity of the emotion expressed. It was hypothesized that the left-side composites
would express the most intense emotion.
g.Ninety Sprague-Dawley rats performed a simple operant barpress response and were
given partial reinforcement, partial reinforcement followed by continuous reinforcement,
or partial reinforcement preceded by continuous reinforcement. The dependent variable
was rate of responding following the training condition. The rats were randomly
assigned to the experimental conditions; there were 30 rats in each condition.
3.[2.2] In your own words, describe what is meant by the terms (a) grand mean, (b) treatment
effect, and (c) error effect.
*4.[2.2]
*a.Under what conditions is the sum of the squared error effects for the randomized block
and Latin square designs less than the sum for the completely randomized design?
b.Discuss the relative merits of completely randomized, randomized block, and Latin
square designs.
*5.[2.2] List the treatment combinations for the following completely randomized factorial
designs.
*a.CRF-24
*b.CRF-222
c. CRF-33
d.CRF-42
e.CRF-322
*6.[2.2] Construct block diagrams similar to those in Figures 2.2-1 through 2.2-7 for the
following designs.
*a.t test for independent samples, n = 10
*b.CR-3 design, n = 10
*c.CRF-32 design, n = 3
d.CR-5 design, n = 6
e.t test for dependent samples, n = 7
f. RB-4 design, n = 6
g.CRF-222 design, n = 3
h.LS-3 design, n = 3
*7.[2.2] Construct block diagrams similar to those in Figures 2.2-1 through 2.2-7 for the
designs in Exercise 2a–c.
8.[2.2] Construct block diagrams similar to those in Figures 2.2-1 through 2.2-7 for the
designs in Exercise 2d–g.
9.[2.2] The following data on running time (in seconds) in a straight-alley maze were obtained
in a CRF-33 design.
a1 = small b1 = 10
a2 = medium b2 = 15
a3 = large b3 = 20
and ν = n1 + n2 − 2.
*b.State the decision rule.
*c.Draw the sampling distribution associated with the null hypothesis and indicate the
region or regions that lead to rejection and nonrejection of the null hypothesis.
*d.Suppose that the researcher has obtained the following statistics: , and
. Compute a t statistic and use Microsoft's Excel TDIST function to determine
the p value of the statistic. Write a sentence that summarizes the results of the
research.
*e.Compute and interpret the effect size where .
*f. Use the formula
to compute a two-sided 95% confidence interval for μ1 – μ2. Interpret the confidence
interval. What does the confidence interval tell you about the tenability of the null
hypothesis μ1 – μ2 = 0?
*g.If the researcher believed that the minimum difference between the population means
that was worth detecting was , estimate the power of the research methodology.
g.If the researcher believed that the minimum population mean that was worth detecting
was μ′ = 52.5, estimate the power of the research methodology.
h.Make a table like Table 2.5-2 that summarizes the sizes of the regions of the sampling
distributions associated with the four possible decision outcomes.
*16.
[2.5] Indicate the type of error or correct decision for each of the following.
*a.A true null hypothesis was rejected.
*b.The researcher failed to reject a false null hypothesis.
c. The null hypothesis is false and the researcher rejected it.
d.The researcher did not reject a true null hypothesis.
e.A false null hypothesis was rejected.
f. The researcher rejected the null hypothesis when he or she should have failed to reject
it.
17.
List the ways that a researcher can increase the power of an experiment. What are their
relative merits?
18.
[2.5] The effect of playing video racing games or neutral games on cognitions associated
with risk taking was investigated. The games were played on a Sony PlayStation. Sales
rankings in computer magazines were used to select the most popular games in each
category. Forty-seven men at Ludwig-Maximilians University, Munich, Germany, were
randomly assigned to the two types of games. The dependent variable was a paper-and-
pencil measure of risk-related cognitions. The means, standard deviations, and sample
sizes for the men who played the video racing games and the neutral games were,
respectively, , and n1 = 24 and , and n2 = 23. (Suggested
by Fletcher, P., & Kubitzki, J. Virtual driving and risk taking: Do racing games increase risk-
taking cognitions, affect, and behaviors? Journal of Experimental Psychology: Applied.)
a.List the steps that you would use to test the hypothesis that playing video racing games
increases risk-related cognitions. Let α = .05.
b.State the decision rule.
c. Draw the sampling distribution associated with the null hypothesis and indicate the
region or regions that lead to rejection or nonrejection of the null hypothesis.
d.Compute a t statistic and use Microsoft's Excel TDIST function to determine the p value
of the statistic. Write a sentence that summarizes the results of the research.
e.Compute and interpret the effect size where .
f. Use the formula
to compute a one-sided 95% confidence interval for µ1 – µ2. Interpret the confidence
interval. What does the confidence interval tell you about the tenability of the null
hypothesis µ1 – µ2 ≤ 0?
g.If the researcher believed that the minimum difference between the population means
that was worth detecting was , estimate the power of the research
1A sample mean for the jth treatment level is obtained by summing over the i = 1, … , n
subjects in the jth treatment level and dividing by n—that is, Ȳ.j = (Y1j + Y2j + ⋯ + Ynj)/n.
Notice that the i subscript in Yij has been replaced by a dot in Ȳ.j. The dot indicates that
summation was performed over the i subscript.
3Power refers to the probability of rejecting a false null hypothesis (see Section 2.5).
4For now, I ignore the possibility that cigarette consumption interacts with type of therapy.
Section 8.3 describes an experimental design model equation that includes a treatment-block
interaction term (απ)ji.
5The distinction between a treatment and a nuisance variable is in the mind of the researcher.
A nuisance variable is included in a design to improve the efficiency and power of the design; a
treatment is included because it is related to the scientific hypothesis that a researcher wants to
test. This distinction has important implications for the statistical analysis, as you will learn.
6As discussed in Chapter 5, the assignment of the same number of subjects to each treatment
level of a completely randomized design is desirable but not necessary.
*This section provides an elementary review of statistical inference. It assumes a prior exposure
to both descriptive and inferential statistics. Readers who have a good grasp of statistical
inference can omit this section. Those who want a more in-depth review can consult Statistics:
An Introduction (Kirk, 2008).
7For example, a completely randomized design has been called a one-way classification
design, single-factor experiment, randomized group design, simple randomized design, and
single variable experiment.
10Cumming (2012), Kline (2004), and Nickerson (2000) provide in-depth discussions of these
shortcomings.
11See Kirk (2008, pp. 294–295) for the derivation of the confidence interval.
12An interval in which the endpoint is not included is called an open interval.
http://dx.doi.org/10.4135/9781483384733.n2
One of the distinguishing characteristics of the scientific method is formulating and testing
hypotheses about population parameters. A test of a statistical hypothesis requires a test
statistic, a knowledge of the sampling distribution of the test statistic, and a decision rule. In
Section 2.5, I discussed the formulation of decision rules and the sampling distribution of the
one-sample t test statistic, t = (Ȳ − µ0)/(σ̂/√n). Here I discuss several other important test
statistics and sampling distributions.
The most widely used sampling distributions in the behavioral sciences are the binomial,
normal, t, chi-square, and F distributions. The first three distributions are often used to draw
inferences about the central tendency of populations. The last two distributions, chi-square and
F, also are used to draw inferences about variability. Because the analysis of variability is so
important in experimental design, I review the essential features of the chi-square and F
distributions and then examine the relationships among these two distributions and the t
distribution. I also discuss the assumptions of analysis of variance and the consequences of
violating these assumptions. This discussion assumes a knowledge of the elementary rules of
summation and the expectation of random variables. These rules are reviewed in Appendixes A
and B, respectively.
Chi-Square Distribution
In 1876, F. R. Helmert derived the chi-square distribution, but it was Karl Pearson who in 1900
first used the distribution to test hypotheses. For purposes of illustration, assume that there is a
population of scores, Y, that is normally distributed with population mean E(Y) = μ, where E(Y)
denotes the expected value of the random variable Y. The expected value of a random variable
is the long-run average of the variable over an indefinite number of samples (see Appendix B).
The variance of the distribution of Y is E[Y – E(Y)]² = σ² (see Appendix B). Suppose that a
random sample of size 1 is drawn from this normally distributed population. The score Y can be
expressed as a standard normal random variable with µ = 0 and σ = 1 by the transformation
z = (Y − µ)/σ.
The square of this standard normal random variable,
z² = (Y − µ)²/σ²,     (3.1-1)
is called a chi-square random variable with one degree of freedom and is denoted by χ²(1). The
sampling distribution of χ²(1) is a chi-square distribution with one degree of freedom. The
subscript (1) denotes the number of degrees of freedom.
What does the distribution of χ²(1) look like? Because χ²(1) is a squared quantity, it can range over
only nonnegative real numbers from zero to positive infinity, whereas Y and z can range over all
real numbers. The distribution of χ²(1) is very positively skewed because approximately 68% of
the sampling distribution lies between zero and 1. The remaining 32% of the distribution lies
between 1 and positive infinity. The shape of this distribution is represented by the df = 1
distribution in Figure 3.1-1. Notice that the formula of a chi-square random variable with one
degree of freedom, χ²(1) = (Y − µ)²/σ², contains the population parameters µ and σ².
If two random samples, each containing one score denoted by Y1 for sample 1 and Y2 for
sample 2, are drawn from the same normally distributed population, they can be expressed as
z1 = (Y1 − µ)/σ and z2 = (Y2 − µ)/σ.
Because the scores are obtained by random sampling, they are independent, as are the
squared standard normal random variables z1² and z2². The sum z1² + z2² is a chi-square
random variable with two degrees of freedom: χ²(2) = z1² + z2². This means that chi-square
random variables add. More generally, for n independent scores,
χ²(n) = z1² + z2² + ⋯ + zn² = Σ(Yi − µ)²/σ².
This chi-square random variable has n degrees of freedom. The form of a chi-square
distribution depends on the number of degrees of freedom, denoted by df or v. The chi-square
distributions for 1, 2, 6, and 10 degrees of freedom are shown in Figure 3.1-1. A chi-square
distribution is positively skewed when v is small, but as v increases, the shape of the
distribution approaches the normal distribution with mean and variance, respectively,
E[χ²(v)] = v and Var[χ²(v)] = 2v.
For v > 2, the mode is equal to v − 2. The distribution of χ²(v) depends only on the number of
degrees of freedom.
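These properties can be checked with a small Monte Carlo sketch (not from the text): a chi-square random variable with v degrees of freedom is simulated as the sum of v squared standard normal draws, and the sample mean and variance of many such draws should land near v and 2v.

```python
import random

random.seed(1)  # fixed seed so the simulation is reproducible

def chi_square_draw(df):
    """One draw from chi-square(df): the sum of df squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

df = 6
draws = [chi_square_draw(df) for _ in range(50000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)
print(round(mean, 2), round(var, 2))  # should be close to df and 2 * df
```

With 50,000 draws, the sampling error of the mean is about √(2·6/50000) ≈ 0.016, so the estimates land very close to 6 and 12.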
In most research situations, the value of the sample mean Ȳ is known or can be determined,
but the value of the population mean µ is unknown. As I have just shown, the sampling
distribution of Σ(Yi − µ)²/σ² is a chi-square distribution with n degrees of freedom, but this
statistic is of limited usefulness because the population mean µ is rarely known. A more
practical question concerns the sampling distribution of Σ(Yi − Ȳ)²/σ², where the sample
mean is substituted for the population mean. What is the sampling distribution of this statistic?
I now show that for random samples of n observations from a normally distributed population,
the following identity1 holds, where Ȳ has been subtracted from and added to Yi − µ:
Σ(Yi − µ)²/σ² = Σ[(Yi − Ȳ) + (Ȳ − µ)]²/σ².
Distributing the summation operator over the three terms on the right gives
Σ(Yi − µ)²/σ² = [Σ(Yi − Ȳ)² + 2(Ȳ − µ)Σ(Yi − Ȳ) + n(Ȳ − µ)²]/σ².     (3.1-4)
As defined earlier, the term on the left is a chi-square random variable with n degrees of
freedom, χ²(n). The last term on the right of equation (3.1-4) can be rewritten as
n(Ȳ − µ)²/σ² = (Ȳ − µ)²/(σ²/n).
In this form, it is apparent that (Ȳ − µ)²/(σ²/n) is a chi-square random variable with one degree
of freedom. This follows because (Ȳ − µ)²/(σ²/n) has the form described in equation (3.1-
1):
z² = (Ȳ − µ)²/σ²Ȳ,
where Ȳ is a normally distributed random variable with mean equal to µ and variance equal to
σ²Ȳ = σ²/n. Dropping the middle term of equation (3.1-4), which is zero because Σ(Yi − Ȳ) = 0,
gives
Σ(Yi − µ)²/σ² = Σ(Yi − Ȳ)²/σ² + (Ȳ − µ)²/(σ²/n).     (3.1-5)
If the two terms on the right of equation (3.1-5) are independent, it follows from the addition
property of chi-square that Σ(Yi − Ȳ)²/σ² must have a chi-square distribution with degrees of
freedom equal to n − 1. Hence, equation (3.1-5) can be written as
χ²(n) = χ²(n−1) + χ²(1), where χ²(n−1) = Σ(Yi − Ȳ)²/σ² = (n − 1)σ̂²/σ².     (3.1-6)
An easily followed proof of the independence of the terms on the right of equation (3.1-5) is
given by Lindquist (1953, pp. 31–35). In the process of showing that Σ(Yi − Ȳ)²/σ² is a chi-
square random variable with n − 1 degrees of freedom, I have made the following assumptions:
The chi-square statistic with n − 1 degrees of freedom can be used to test the null hypothesis
that a population variance is equal to some specified value denoted by σ²0. The null and
alternative hypotheses are
H0: σ² = σ²0 and H1: σ² ≠ σ²0.
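The test statistic is (n − 1)σ̂²/σ²0, the chi-square quantity from equation (3.1-6) evaluated under the null hypothesis. A minimal sketch with hypothetical numbers (the function name and the data values are mine):

```python
def chi_square_stat(sample_var, sigma0_sq, n):
    """Chi-square test statistic for a variance: (n - 1) * sample_var / sigma0_sq,
    referred to a chi-square distribution with n - 1 degrees of freedom."""
    return (n - 1) * sample_var / sigma0_sq

# Hypothetical example: n = 25 scores with sample variance 180,
# testing H0: sigma^2 = 100.
stat = chi_square_stat(180, 100, 25)
print(stat)  # compare with chi-square critical values for 24 df
```

The computed value, 24 × 1.8 = 43.2, would then be compared with the chi-square critical value for 24 degrees of freedom.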
F Distribution
The experimental designs described in this book involve hypotheses about two population
variances rather than a single variance. An F statistic and F distribution are used to test a
hypothesis about two population variances. I described the chi-square random variable in some
detail because an F random variable is defined as the ratio of two independent chi-square
variables, each divided by its degrees of freedom—that is,
F = [χ²(v1)/v1] / [χ²(v2)/v2].
It follows that the assumptions of the chi-square statistic also are assumptions of the F statistic.
The F statistic has an additional assumption: The chi-square random variables in the numerator
and denominator are independent. The distribution of the F statistic was determined by R. A.
Fisher (1924) and given the label F in his honor by G. W. Snedecor (1934).
To see the relationship between F and χ², imagine that there are two normally distributed
populations whose variances are equal, σ²1 = σ²2 = σ². You need not assume that the population
means are equal. Suppose that random samples of size n1 and n2 are drawn from the
populations. The samples can be used to compute unbiased estimators, σ̂²1 and σ̂²2, of the
population variances with, respectively, v1 = n1 − 1 and v2 = n2 − 1 degrees of freedom. In
equation (3.1-6), I showed that (n − 1)σ̂²/σ² = χ²(v), where v = n − 1. With a little algebra, σ̂²1 and σ̂²2
can be expressed as
σ̂²1 = σ²χ²(v1)/v1 and σ̂²2 = σ²χ²(v2)/v2.
The most common use of an F statistic is in testing hypotheses regarding the equality of three
or more population means. I discuss the rationale underlying the use of an F statistic to test
hypotheses of the form
H0: µ1 = µ2 = ⋯ = µp
in Section 3.4. The distribution of F depends on only two parameters, v1 and v2. The
distribution of F for v1 = 9 and v2 = 15 is shown in Figure 3.1-2. Appendix Table E.4 provides
values of F that cut off the upper α portion of an F distribution. In analysis of variance, only the
upper portion of the distribution is required because in practice, F is taken as the ratio of σ̂²1
to σ̂²2. If values of F that cut off the lower 1 – α portion are needed, they can be computed
from
F(1−α; v1, v2) = 1 / F(α; v2, v1).
That is, an F value in the lower 1 – α region of the F distribution can be found by computing the
reciprocal of the F value in the upper α region, with the degrees of freedom for the numerator
and denominator reversed. As Figure 3.1-2 shows, the distribution of F for v1 = 9 and v2 = 15 is
nonsymmetrical. The distribution approaches the normal distribution for very large values of v1
and v2. Because F is the ratio of two independent chi-square variables, each divided by its
degrees of freedom, it ranges over nonnegative real numbers from zero to positive infinity. The
mean and variance of an F distribution are, respectively,
E(F) = v2/(v2 − 2) for v2 > 2 and
Var(F) = [2v2²(v1 + v2 − 2)] / [v1(v2 − 2)²(v2 − 4)] for v2 > 4.
The mean is always greater than 1. It can be shown that the mode is always less than 1.
Hence, the F distribution is positively skewed.
Student's t Distribution
The sampling distribution of the t statistic was derived by William Sealey Gosset, who wrote
under the pseudonym “Student.” Gosset is credited with starting a trend toward the
development of exact statistical tests, that is, tests that do not require a knowledge of certain
population parameters. Gosset's (1908) derivation of the sampling distribution of t marked the
beginning of the modern era in mathematical statistics.
Consider a random sample of scores Yi from a normally distributed population with mean µ and
variance σ². The mean Ȳ of this random sample can be expressed as a standard normal random
variable:
z = (Ȳ − µ)/(σ/√n).     (3.1-8)
z is normally distributed with mean equal to zero and variance equal to 1. The one-sample t
statistic is defined as
t = z / √[χ²(v)/v],
where z and χ²(v) are independent. In words, a t statistic is the ratio of a standard normal random
variable to the square root of an independent chi-square random variable divided by its degrees
of freedom.
The one-sample t statistic can be expressed in a more familiar form by replacing z with
(Ȳ − µ)/(σ/√n) (equation 3.1-8) and χ²(v)/v with σ̂²/σ² (equation 3.1-6) as follows:
t = [(Ȳ − µ)/(σ/√n)] / √(σ̂²/σ²) = (Ȳ − µ)/(σ̂/√n).
Notice that the computation of t, unlike z, does not require a knowledge of the population
parameter σ.
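The familiar form is easy to compute from summary statistics. A sketch using the student government example's numbers (Ȳ = 117, µ0 = 115, n = 61, with σ̂ = 12.5 as the chapter's figures imply; the function name is mine):

```python
import math

def one_sample_t(mean, mu0, s, n):
    """One-sample t statistic: (mean - mu0) / (s / sqrt(n))."""
    return (mean - mu0) / (s / math.sqrt(n))

t = one_sample_t(117, 115, 12.5, 61)
print(round(t, 2))  # reproduces the t(60) = 1.25 reported in the text
```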
The t statistic also is related to F, as I now show. Consider the square of the random variable t
with v equal to n − 1 degrees of freedom:
t² = z² / [χ²(v)/v] = [χ²(1)/1] / [χ²(v)/v].
The numerator of this ratio is a chi-square random variable with one degree of freedom divided
by its one degree of freedom, so the ratio
is the definition of an F random variable given earlier. Thus, t² with v degrees of freedom is
identical to an F statistic with 1 and v2 degrees of freedom, where v = v2:
t²(v) = F(1, v2).
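The identity can be checked numerically for data: with two groups, the square of the pooled-variance t statistic equals the one-way analysis of variance F with 1 and n1 + n2 − 2 degrees of freedom. A sketch with made-up data (function name mine):

```python
import math

def two_group_t_and_f(y1, y2):
    """Two-sample t (pooled variance) and the one-way ANOVA F for the same
    two groups; t**2 should equal F with 1 and n1 + n2 - 2 df."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    ss1 = sum((y - m1) ** 2 for y in y1)
    ss2 = sum((y - m2) ** 2 for y in y2)
    pooled = (ss1 + ss2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    grand = (sum(y1) + sum(y2)) / (n1 + n2)
    ssbg = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    sswg = ss1 + ss2
    f = (ssbg / 1) / (sswg / (n1 + n2 - 2))
    return t, f

t, f = two_group_t_and_f([3, 5, 4, 6], [7, 8, 6, 9])
print(t * t, f)  # the two values agree
```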
The t statistic can assume values from negative to positive infinity. The distribution of t depends
only on the parameter v. Percentage points of Student's t distribution are given in Appendix
Table E.3. For large values of n, the t distribution approaches the normal distribution. For n >
30, the t distribution can, for most practical purposes, be regarded as normally distributed.
The F distribution is the theoretical model against which an F statistic is evaluated. As I showed
in Section 2.5, the alternative statistical hypothesis and the level of significance jointly specify
the region for rejecting a null hypothesis. Hypothesis-testing procedures that use the F
distribution are based on assumptions that are necessary for the mathematical justification of
the procedure. The following paragraphs briefly summarize those assumptions that have been
described in connection with the F and χ2 distributions. In Section 3.5, I show that even though
certain assumptions associated with the F distribution are violated, the distribution may still
provide an adequate approximation for purposes of hypothesis testing.
Two sets of assumptions are involved in using analysis of variance to test hypotheses: those
associated with the derivation of the F sampling distribution as just described and those
associated with the experimental design model equation. The latter assumptions are discussed
in detail in Section 3.3. The F assumptions apply to all designs; those associated with the
experimental design model equation vary from one design to another. I show in Section 3.5 that
there is considerable overlap between the two sets of assumptions.
The experimental design model equations for several of the simpler designs were described in
Section 2.2. You learned that each score in an experiment is a composite that reflects the
effects of all the sources of variability that affect the score. A score may reflect the effects of (1)
the independent variable, (2) individual characteristics of the subject or experimental unit, (3)
chance fluctuations in the subject's performance, and (4) any other nuisance variables such as
environmental conditions that have not been controlled.
The model equation for a design purports to represent all the sources of variability that affect
individual scores. Similarly, the total variance of all np s c o r e s i n a n e x p e r i m e n t ,
, is a composite that reflects all the sources of variability in the experiment.
To test hypotheses about some sources of variability, it is necessary to partition the total
variance into its component parts. The partition of the total variance is different for each
experimental design. Actually, the partitioning is done with the total sum of squares, Σj Σi (Yij − Ȳ..)², rather than with the total variance.
In the following section, I partition the total sum of squares, SSTO, for a completely randomized
design into two parts: sum of squares between groups, SSBG, and sum of squares within
groups, SSWG.
In Section 2.2, I described the experimental design model equation for a completely randomized design:
Yij = μ + αj + εi(j)     (i = 1, …, n; j = 1, …, p)     (3.2-1)
The values of the parameters μ, αj, and εi(j) in equation (3.2-1) are unknown, but they can be estimated from sample data as follows: μ is estimated by the grand mean, Ȳ.., αj by Ȳ.j − Ȳ.., and εi(j) by Yij − Ȳ.j.
The meanings of the parameters in the model equation are discussed in Section 2.2. An
examination of Table 3.2-1 should clarify the meanings of Yij, Ȳ.j (the jth treatment mean), and Ȳ.. (the grand mean).
Table 3.2-1 ▪ Layout for CR-p Design (entry is Yij, i = 1,…, n; j = 1,…, p)
I now show how to partition the total sum of squares into the sum of squares between groups and the sum of squares within groups. The squared deviation of a score from the grand mean can be written as
(Yij − Ȳ..)² = [(Ȳ.j − Ȳ..) + (Yij − Ȳ.j)]²
This equation is for a single score, but I can treat each of the np scores in the same way and sum the resulting equations to obtain
Σj Σi (Yij − Ȳ..)² = Σj Σi [(Ȳ.j − Ȳ..) + (Yij − Ȳ.j)]²     (3.2-2)
The term on the left is the total sum of squares. According to rule A.5 in Appendix A, if addition or subtraction is the only operation to be performed before summation, the summation can be distributed. Thus,
Σj Σi (Yij − Ȳ..)² = n Σj (Ȳ.j − Ȳ..)² + 2 Σj (Ȳ.j − Ȳ..) Σi (Yij − Ȳ.j) + Σj Σi (Yij − Ȳ.j)²     (3.2-3)
The specific rules I used to arrive at the first two terms on the right of equation (3.2-3) are as
follows. In the first term on the right,
Σj Σi (Ȳ.j − Ȳ..)² = n Σj (Ȳ.j − Ȳ..)²
Rules A.1 and A.2 apply because (Ȳ.j − Ȳ..) is a variable with respect to summing over j but a constant with respect to summing over i; note that j is the only subscript. For the middle term,
2 Σj Σi (Ȳ.j − Ȳ..)(Yij − Ȳ.j) = 2 Σj (Ȳ.j − Ȳ..) Σi (Yij − Ȳ.j)
Rules A.3 and A.6 apply because 2 is a constant and (Ȳ.j − Ȳ..) involves only the outside index of summation, j. On reflection it should be apparent that
Σi (Yij − Ȳ.j) = 0
for each of the j = 1, …, p treatment levels. Hence, the middle term on the right side of equation (3.2-3) is equal to zero. This leads to the desired partition of the total sum of squares:
Σj Σi (Yij − Ȳ..)² = n Σj (Ȳ.j − Ȳ..)² + Σj Σi (Yij − Ȳ.j)²     (3.2-4)
where the first term on the right is SSBG and the second is SSWG; that is, SSTO = SSBG + SSWG.
This partition of the total sum of squares is appropriate for a completely randomized design. A
different partition is required for other designs. In each case, however, the starting point for the
derivation is the experimental design model equation.
Now that I have partitioned SSTO into SSBG and SSWG, I need to show what this has to do
with testing the hypothesis that μ1 = μ2 = … = μp or the equivalent hypothesis that α1 = α2 =
… = αp = 0. Before doing this, I describe more convenient formulas for computing SSTO,
SSBG, and SSWG and discuss the concepts of degrees of freedom, mean squares, and
expected values of mean squares.
The partition of the total sum of squares in equation (3.2-4) provided deviation formulas that are
not convenient for computing sums of squares. Raw score formulas that are more convenient to
use are as follows:
SSTO = Σj Σi Y²ij − (Σj Σi Yij)²/np
SSBG = Σj (Σi Yij)²/n − (Σj Σi Yij)²/np
SSWG = Σj Σi Y²ij − Σj (Σi Yij)²/n
The letter p denotes the number of treatment levels; n denotes the number of observations in each treatment level. When the njs are not equal, n is replaced by nj in the formulas and np by N = n1 + n2 + … + np, the total number of observations.
Writing out the computational formulas is tedious. In subsequent chapters, I use abbreviated symbols to denote the terms in the formulas, for example,
[Y] = (Σj Σi Yij)²/np     [A] = Σj (Σi Yij)²/n     [AS] = Σj Σi Y²ij
You can think of [AS] as reflecting variation due to treatment A and subjects. [Y] is a correction for the grand mean, and [A] reflects variation due to treatment A. It is apparent from the
for the grand mean, and [A] reflects variation due to treatment A. It is apparent from the
formulas that only three terms are needed to compute SSTO, SSBG, and SSWG. The use of
the computational formulas and abbreviated symbols is illustrated in Chapter 4 (see Table 4.3-
1). Computing the sums of squares using the computational formulas is tedious. Fortunately,
statistical packages are available for performing the computations.3 After you have used the
computational formulas a few times to gain an understanding of the ANOVA procedures, let a
computer do the work.
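As a concrete illustration, the bracketed terms and the three sums of squares can be computed with a short script. The scores below are hypothetical, not those of Table 4.3-1:

```python
# Raw-score (computational) formulas for a CR-p design, written with the
# abbreviated symbols [AS], [Y], and [A]. The scores are hypothetical.
groups = [
    [3, 5, 4, 6],   # treatment level a1
    [7, 8, 6, 9],   # treatment level a2
    [2, 3, 4, 3],   # treatment level a3
]
p = len(groups)                 # number of treatment levels
n = len(groups[0])              # observations per level (equal ns assumed)

AS = sum(y * y for g in groups for y in g)       # [AS]: sum of squared scores
Y = sum(sum(g) for g in groups) ** 2 / (n * p)   # [Y]: correction for the grand mean
A = sum(sum(g) ** 2 / n for g in groups)         # [A]: treatment-level term

SSTO = AS - Y     # total sum of squares
SSBG = A - Y      # between-groups sum of squares
SSWG = AS - A     # within-groups sum of squares
assert abs(SSTO - (SSBG + SSWG)) < 1e-9          # the partition must hold
```

Only the three bracketed terms are needed; the final assertion verifies the partition SSTO = SSBG + SSWG.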
Degrees of Freedom
The term degrees of freedom refers to the number of observations whose values are free to
vary. The degrees of freedom of SSBG, SSWG, and SSTO are developed next.
Consider the degrees of freedom for SSBG, and suppose that the grand mean, Ȳ.., of p = 3 treatment means is 4 and that two of the means are Ȳ.1 = 5 and Ȳ.2 = 1. The third mean then must equal 6 because (5 + 1 + 6)/3 = 4. Given the value of the grand mean, I am free to assign any values to two of the three treatment means, but after I do so, the third value is determined. Hence, the number of degrees of freedom for SSBG is p − 1, one less than the number of treatment means.
The degrees of freedom associated with SSWG is p(n − 1). To see why this is true, consider SSWG = Σj Σi (Yij − Ȳ.j)² and let p = 3 and n = 8. Each of the j = 1, …, 3 treatment levels contains n = 8 scores. The eight scores in the jth treatment level are related to the jth treatment mean, Ȳ.j, by the equation
(Y1j + Y2j + … + Y8j)/8 = Ȳ.j
Seven of the scores can take any value, but the eighth score is determined because the sum of the scores divided by 8 must equal Ȳ.j. Hence, there are n − 1 = 8 − 1 = 7
degrees of freedom associated with the jth treatment level. If n1 = n2 = … = np, each of the p
treatment levels has n − 1 degrees of freedom. Thus, there are p(n − 1) degrees of freedom
associated with SSWG. If the njs are not equal, the degrees of freedom for SSWG is (n1 − 1) +
(n2 − 1) + … + (np − 1) = N – p, where N is the total number of scores.
The same line of reasoning can be used to show that when n1 = n2 = … = np, the total sum of squares has np − 1 degrees of freedom. For example, let p = 3 and n = 8 so that there are np = 24 scores with grand mean Ȳ..; hence, np − 1 = 23 of the scores can take any value, but the 24th score must be assigned so that the mean of the scores equals the grand mean. If the njs are not equal, the number of degrees of freedom is n1 + n2 + … + np − 1 = N − 1.
Mean Squares
A mean square is another name for a variance. Mean squares are obtained by dividing a sum of
squares by its degrees of freedom. When the ns are equal, the mean squares are MSBG = SSBG/(p − 1) and MSWG = SSWG/[p(n − 1)].
Some readers want “just the facts;” other readers want to know where the facts come from. To
accommodate the two kinds of readers, I occasionally provide alternative presentations of the
same topic. Here I provide “just the facts” regarding the expectations of the between- and
within-groups mean squares. I show where the facts come from in Section 3.8, which is optional.
In Section 3.2, I partitioned the total sum of squares for a completely randomized design into
the within- and between-groups sums of squares. An examination of the formulas SSBG = n Σj (Ȳ.j − Ȳ..)² and SSWG = Σj Σi (Yij − Ȳ.j)² provides an intuitive understanding of the sources of variation that they represent. The within-
groups sum of squares is composed of variation due to differences among subjects who receive
the same treatment level, chance fluctuations in a subject's behavior, and any other nuisance
variables that have not been controlled. Some variation among the scores of subjects in the
same treatment level can be expected because of individual differences. This source of
variation represents chance variation because subjects are randomly assigned to the treatment
levels. The between-groups sum of squares is composed of variation due to the sources just
described and, in addition, the effects of the treatment levels, if such effects are present. These
points can be expressed more succinctly in terms of the expectations of the between- and
within-groups mean squares. In the following sections, I describe the expectations of the mean
squares for two models: the fixed-effects model and the random-effects model.
The partition of the total sum of squares in Section 3.2 involved no assumptions about
populations or sampling procedures. However, assumptions are required if one wants to draw
inferences about population parameters from sample data. Next I describe the expectations and
assumptions for the fixed-effects model.
The fixed-effects model, also called model I, is appropriate for experiments in which all of the
treatment levels about which inferences are to be drawn are included in the experiment. If the
experiment were replicated, the same treatment levels would be included in the replication.
Under these conditions, conclusions from the experiment apply to only the p treatment levels in
the experiment.
For the fixed-effects model, the expected value of the between-groups mean square is
E(MSBG) = σ²ε + n Σj α²j/(p − 1)
In words, E(MSBG) is equal to the error variance (σ²ε), the variance of the error effects, plus a function of the treatment effects, αj = μj − μ. If the p population means are equal (μ1 = μ2 = … = μp), all of the treatment effects are equal to zero (α1 = α2 = … = αp = 0) and E(MSBG) = σ²ε.
To summarize, when the null hypothesis is false, the expected values of MSBG and MSWG are E(MSBG) = σ²ε + n Σj α²j/(p − 1) > σ²ε and E(MSWG) = σ²ε.
The fixed-effects model applies when the experiment contains all of the treatment levels about
which a researcher wants to draw conclusions. The assumptions of the model are as follows:
1.The model equation Yij = μ + αj + εi(j) contains all of the sources of variation that affect Yij.
2.The experiment contains all of the treatment levels, αjs, of interest.
3.The error effect, εi(j), is (a) independent of other εi(j)s and (b) normally distributed within each treatment population, with (c) mean equal to zero and (d) variance equal to σ²ε.
The random-effects model, also called model II, is appropriate for experiments in which a
researcher wants to draw conclusions about more treatment levels than can be included in the
experiment. For this case, a researcher can obtain a random sample of p treatment levels from
the population of P levels of interest, where p < P. If the researcher were to replicate the
experiment, a second random sample of p treatment levels from the population would be
obtained. Under these conditions, conclusions from the experiment generalize to the P levels in
the population.
For the random-effects model, the expected value of the between-groups mean square is
E(MSBG) = σ²ε + nσ²α
In words, E(MSBG) is equal to the error variance (σ²ε) plus n times the variance of the P population treatment effects, σ²α, where αj = μj − μ. If the P population means are equal (μ1 = μ2 = … = μP), all of the P treatment effects are equal to zero (α1 = α2 = … = αP = 0) and the variance of the treatment effects is equal to zero. Hence, the expected value of the between-groups mean square is E(MSBG) = σ²ε.
To summarize, when the null hypothesis is false, the expected values of MSBG and MSWG are E(MSBG) = σ²ε + nσ²α > σ²ε and E(MSWG) = σ²ε.
The random-effects model applies when a random sample of p treatment levels has been
obtained from a larger population of P levels. Assumptions 1 and 3 of the fixed-effects model
also are assumptions of the random effects model. Assumption 2 regarding the treatment levels
is different.
2.The experiment contains a random sample of the treatment levels of interest. αj is a random
variable (random treatment effect) that is (a) independent of other αjs and (b) normally
distributed, with (c) mean equal to zero and (d) variance equal to σ²α.
The expected values of the mean squares for the fixed- and random-effects models are similar
for a completely randomized design. This is not true for more complex experimental designs. A
knowledge of the expected values for complex designs is particularly important because this
information determines which mean squares should be used in testing various null hypotheses.
I discuss this point at length in Section 9.9.
Fixed-Effects Model
I now examine the rationale for using the statistic F = MSBG/MSWG to test the null hypothesis.
The null and alternative hypotheses for the fixed-effects model can be stated in terms of p population means as H0: μ1 = μ2 = … = μp and H1: μj ≠ μj′ for some j and j′.
It may seem paradoxical to use the ratio of two variances, MSBG/MSWG, to test a hypothesis
about means or treatment effects, but this makes sense when you consider the expected
values of MSBG and MSWG. In Section 3.3, you learned that E(MSBG) = σ²ε + n Σj α²j/(p − 1) and E(MSWG) = σ²ε.
In words, the between-groups mean square estimates the population error variance plus a
function of squared treatment effects, if such effects exist. The within-groups mean square
estimates only the population error variance. An F statistic that is appreciably greater than 1
suggests that n Σj α²j/(p − 1) is greater than zero, which means that at least one of the treatment effects, αj, is not equal to zero. An F statistic that is close to 1 suggests that MSBG and MSWG are both estimators of the same population error variance, σ²ε. If the null hypothesis is true and if the assumptions for the fixed-effects model described in Section 3.3 are tenable, the statistic F = MSBG/MSWG follows an F distribution with p − 1 and p(n − 1) degrees of freedom.
How large should an obtained F be before a researcher decides that the null hypothesis is
probably false? According to convention, an F statistic that falls in the upper 5% of the
sampling distribution of F is considered to be sufficient evidence for rejecting the null
hypothesis. Values of F that cut off the upper .05 region of the F distribution for various degrees
of freedom are listed in Appendix Table E.4.
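The test can be sketched end to end for a small, hypothetical data set; the critical value 4.26 is the tabled F value for 2 and 9 degrees of freedom at the .05 level:

```python
# Fixed-effects F test for a completely randomized design with p = 3
# treatment levels and n = 4 hypothetical scores per level.
from statistics import mean

groups = [[3, 5, 4, 6], [7, 8, 6, 9], [2, 3, 4, 3]]
p, n = len(groups), len(groups[0])
grand = mean(y for g in groups for y in g)

SSBG = n * sum((mean(g) - grand) ** 2 for g in groups)       # between groups
SSWG = sum((y - mean(g)) ** 2 for g in groups for y in g)    # within groups

MSBG = SSBG / (p - 1)            # df1 = p - 1 = 2
MSWG = SSWG / (p * (n - 1))      # df2 = p(n - 1) = 9
F = MSBG / MSWG

F_crit = 4.26                    # F(.05; 2, 9) from a standard F table
reject_null = F >= F_crit        # True here: some treatment effect is nonzero
```

Because the obtained F exceeds the tabled value, the null hypothesis of equal population means would be rejected for these data.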
Random-Effects Model
The null and alternative hypotheses for the random-effects model can be stated in terms of P population means as H0: μ1 = μ2 = … = μP and H1: μj ≠ μj′ for some j and j′,
where P is the number of treatment levels in the population from which the p levels in the
experiment are a random sample. The hypotheses also can be stated in terms of the variance of the population treatment effects: H0: σ²α = 0 and H1: σ²α > 0. It follows from the expected values of the mean squares that an F statistic that is appreciably greater than 1 suggests that σ²α > 0. An F statistic that is close to 1 suggests that MSBG and MSWG are both estimates of the same population variance, σ²ε, and that σ²α = 0. If the null hypothesis is rejected, a researcher can conclude that two or more means in the population from which the p means in the experiment are a random sample are not equal.
I show in Sections 3.1 and 3.3 that the analysis of variance for a completely randomized design
involves two sets of assumptions: those associated with the F random variable and those
associated with the experimental design model. The former assumptions apply in testing
hypotheses for any ANOVA design; the latter assumptions vary from one design to the next.
The two sets of assumptions are as follows:
F Assumptions
A1. The populations are normally distributed.
A2. The subjects or experimental units are random samples from the populations, or the subjects have been randomly assigned to the treatment levels.
A3. The numerator and denominator of the F statistic are independent.
A4. The numerator and denominator of the F statistic estimate the same population error variance, σ²ε.
Model Assumptions
B1. The model equation Yij = μ + αj + εi(j) (where i = 1, …, n; j = 1, …, p) contains all of the
sources of variation that affect Yij.
B2. The experiment contains all of the treatment levels, αjs, of interest.
B3. The error effect, εi(j), is (a) independent of all other εi(j)s and (b) normally distributed
within each treatment population, with (c) mean equal to zero and (d) variance equal to σ²ε.
The assumptions for a random-effects model are the same as those for the fixed-effects model
with one exception. The model assumption B2 is replaced by
C2. The experiment contains a random sample of the treatment levels of interest. αj is
a random variable (random treatment effect) that is (a) independent of all other αjs
and (b) normally distributed, with (c) mean equal to zero and (d) variance equal to σ²α.
I now discuss these assumptions in detail and the consequences of violating them. At the
outset, I should state that for real data, some of the assumptions will always be violated. For
example, the underlying populations from which samples are drawn are never exactly normally
distributed with equal variances. The important question then is not whether the assumptions
are violated but whether minor violations seriously affect the probability of Type I and II errors.
When one or more assumptions are violated, Cochran and Cox (1957, p. 91) observed that a
test performed at the α = .05 level, for example, may actually be made at the .04 or .09 level.
Also, a loss of power can occur when assumptions are not fulfilled. A statistic is said to be
robust if violations of its assumptions have little effect on the sampling distribution of the
statistic and thus on Type I and Type II errors. Unfortunately, the F statistic is not as robust to
the violation of certain assumptions as was once thought.
Assumption B1 states that an observation is the sum of three components: the average value of
the scores, μ; plus effects attributable to the jth treatment level, αj; plus all effects not
attributable to the jth treatment level, εi(j).
Assumption That the Experiment Contains All of the Treatment Levels of Interest (B2)
This assumption identifies the model as a fixed-effects model and distinguishes it from the
random-effects model. For the fixed-effects model, inferences are restricted to the population of
p treatment levels represented in the experiment. As I discussed in Section 3.3, the expectation
of MSBG is different for the fixed- and random-effects models, as is the nature of the null
hypothesis that is tested.
Assumption That the Experiment Contains a Random Sample of the Treatment Levels of Interest (C2)
This assumption identifies the model as a random-effects model. If the p treatment levels in the
experiment are not a random sample from a population of P levels where p < P, conclusions
from the experiment cannot be generalized to the population of P treatment levels.
The internal and external validity of an experiment depend on random sampling or random assignment of experimental units to the treatment levels (assumption A2).
The F statistic is not robust with respect to violation of the assumption of the independence of
the error effects (assumption B3a). The assumption is likely to be violated, for example, if two or
more observations were obtained on each subject. Random sampling and assignment help to
obtain independent error effects, εi(j)s. Nonindependence of errors seriously affects both the
probability of Type I and Type II errors. The independence assumption can be visually checked
by plotting standardized residuals (standardized error effect),
against the order in which the corresponding observations were collected and against any
spatial arrangement of the corresponding experimental units. The procedure is illustrated in
Section 4.2. Some experimental designs permit nonindependence of observations but still
require the error effects to be independent; I return to this topic in Section 8.4.
Assumptions of Normally Distributed Errors (B3b) and Populations (A1) and Independence of MSBG and
MSWG (A3)
The assumption that the εi(j)s within each population are normally distributed (assumption B3b)
is equivalent to the assumption that the Yijs in each of the populations are normally distributed
(assumption A1). This follows because according to the linear model equation Yij = μ + αj + εi(j),
the εi(j)s are the only source of variation within a particular treatment population.
The F statistic is quite robust to violation of the normality assumption (Lix, Keselman, &
Keselman, 1996). This is especially true when the populations are symmetrical, but not normal,
and the sample sizes are equal and are greater than 12 (Clinch & Keselman, 1982; Tan, 1982).
Even skewed populations have very little effect on the Type I and II error rates of the F test for
the fixed-effects model. Platykurtic (flat) and leptokurtic (peaked) populations also have little
effect on the Type I error rate, but they can have an appreciable effect on the power when the
sample njs are small. In general, a researcher need not be concerned about moderate
departures from normality provided that the populations are homogeneous in form—for
example, all positively skewed and slightly leptokurtic. It is usually possible by examining a
frequency distribution of Yij, the estimated error effects Yij − Ȳ.j, or the standardized error effects, zi(j), to detect cases in which the analysis of variance leads to gross errors in interpreting the outcome of an experiment.
Further support for the robustness of the F statistic with respect to nonnormality comes from a
different kind of investigation. Lunney (1970) studied the effect of using a dichotomous
dependent variable, the ultimate in nonnormality, on the Type I error rate of the F test. The
actual Type I error rates were found to be quite close to the nominal levels when the sample njs
were equal. This result does not hold for unequal njs. Similar results were obtained by Hsu and
Feldt (1969), who investigated four dependent scale lengths: 2, 3, 4, and 5 points. They
reported that the 5-point scale with sample njs as small as 11 gave excellent control of the Type
I error rate in the presence of moderate heterogeneity of variance, platykurtosis, and skewness. It is interesting that the 3-point scale was superior to the 4-point scale. With larger
njs, even the 2-point scale resulted in adequate control of the Type I error.
It can be shown that the numerator and denominator of the F statistic are independent (assumption A3) if the populations are normally distributed or approximately so (Lindquist, 1953, pp. 31–35).
When normality and other assumptions are tenable, ANOVA provides the most powerful test of
the omnibus null hypothesis. However, when the normality assumption is grossly violated, the
powers of the alternative procedures that I describe in Section 3.7 are greater.
Assumption B3c states that the expectation of the error effects within each treatment population is equal to zero. This follows from the way an error effect is defined: εi(j) = Yij − μ − αj = Yij − μj, and hence E[εi(j)] = E(Yij) − μj = μj − μj = 0.
Assumption B3d states that the variances of the p populations are homogeneous. The assumption can be written as H0: σ²1 = σ²2 = … = σ²p. The alternative hypothesis is H1: σ²j ≠ σ²j′ for some j and j′. The most common pattern of
heterogeneous variances is one in which the sample variances increase as the sample means
increase. This situation is easily detected by plotting standardized error effects on the Y-axis
and the corresponding treatment means on the X-axis. G. E. P. Box (1954a) reported that the
ANOVA F test is robust with respect to violation of the homogeneity of variance assumption if (1)
there are an equal number of observations in each of the treatment levels, (2) the populations
are normal, and (3) the ratio of the largest variance to the smallest variance does not exceed 3.
Given these restrictions and research showing that the ratio of the largest to smallest sample
variance often exceeds 3, it seems prudent to question the reputed robustness of ANOVA with
respect to heterogeneous variances (Wilcox, 2009). Indeed, numerous investigators have
shown that even when the sample sizes are equal, the ANOVA F test is not robust for the
degree of variance heterogeneity often encountered in behavioral and educational research
(Harwell, Rubinstein, Hayes, & Olds, 1992; Rogan & Keselman, 1977; Tomarken & Serlin,
1986). Rogan and Keselman (1977) found that the larger the degree of variance heterogeneity,
the greater the discrepancy between the actual probability of a Type I error and the nominal
level. The discrepancy tended to decrease but not disappear as the sample size increased.
They concluded that even moderate violations of the homogeneity assumption can have
marked effects on the Type I error when the sample sizes are unequal. According to G. E. P.
Box (1953, 1954a), the nature of the bias for this latter case may be either positive or negative.
The actual probability of a Type I error will exceed the nominal level when the smaller samples
are drawn from the more heterogeneous populations; the reverse is true when the smaller
samples are drawn from the more homogeneous populations. In the face of this evidence, it is
clear that researchers should not ignore violations of the homogeneity of variance assumption.
Fortunately, robust alternatives to the ANOVA F test statistic can be used when heterogeneous
population variances are suspected (Harwell et al., 1992; Wilcox, 2009).
A number of statistics are available for testing the hypothesis σ²1 = σ²2 = … = σ²p. The statistics
can be classified as either robust or not robust to nonnormality. Two easy-to-use tests that fall
in the latter category are those of Hartley (1940, 1950) and Cochran (1941). Both tests require
equal or approximately equal sample ns. Hartley's test statistic is
Fmax = s²largest/s²smallest
where s²largest and s²smallest denote the largest and smallest of the p sample variances. The degrees of freedom are equal to p and n − 1, where p is the number of variances and n is
the number of observations within each treatment level. Critical values for the distribution of
Fmax are given in Appendix Table E.8. The hypothesis of homogeneity of variance is rejected if
Fmax is greater than or equal to the tabled value, Fmaxα; p, n − 1. If the ns for the treatment
levels differ only slightly, the largest n can be used for determining the degrees of freedom for
this test. Use of the largest n leads to a slight positive bias in the test—that is, a tendency to reject the hypothesis of homogeneity more frequently than it should be rejected.
Cochran's test statistic is
C = s²largest/Σ s²j
where s²largest is the largest of the p sample variances and Σ s²j is the sum of all of the sample variances. The degrees of freedom for this test are equal to p and n − 1 as defined for the Fmax
test. Critical values, denoted by Cα; p, n − 1, for the distribution of C are given in Appendix
Table E.9. The C test uses more of the data and tends to be a bit more powerful than the Fmax
test. Both tests are useful as quick preliminary screening tools.
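Both screening statistics are simple ratios of sample variances and can be sketched as follows; the scores are hypothetical:

```python
# Hartley's Fmax and Cochran's C for p = 3 treatment levels with equal ns.
from statistics import variance

groups = [[3, 5, 4, 6], [7, 8, 6, 9], [2, 3, 4, 10]]
variances = [variance(g) for g in groups]    # unbiased sample variances, df = n - 1

F_max = max(variances) / min(variances)      # Hartley's statistic
C = max(variances) / sum(variances)          # Cochran's statistic
# Refer each statistic to its critical value with p and n - 1 degrees of
# freedom (Appendix Tables E.8 and E.9); reject homogeneity if it is larger.
```

The third group's outlying score of 10 inflates its variance, which drives both statistics toward rejection.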
The Brown-Forsythe (Brown & Forsythe, 1974b) test has been found to be robust to typical
departures from normality.4 The test is relatively easy to compute. For each treatment level: (1)
compute the sample median, Mdn.j; (2) subtract this sample median from each observation, Yij,
in the corresponding treatment level; and (3) denote the absolute value of this difference by Zij.
That is, Zij = |Yij − Mdn.j|.
The Brown-Forsythe test consists of performing a completely randomized ANOVA on the Zij
scores. If F ≥ F.25; p − 1, p(n − 1), reject the hypothesis of homogeneity of variances. The computational
procedure is illustrated in Table 3.5-1. The .25 level of significance was adopted in Table 3.5-1
to increase the power of the test—a common procedure when performing a preliminary test on
model assumptions. Because F < F.25; 3, 28, there is no reason for believing that the four
populations from which the samples were obtained have heterogeneous variances.
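The three steps, followed by an ordinary one-way ANOVA on the transformed scores, can be sketched as follows; the scores are hypothetical rather than those of Table 3.5-1:

```python
# Brown-Forsythe test: ANOVA on absolute deviations from group medians.
from statistics import mean, median

groups = [[3, 5, 4, 6], [7, 8, 6, 9], [2, 3, 4, 10]]
p, n = len(groups), len(groups[0])

# Steps 1-3: Z_ij = |Y_ij - Mdn_j|
Z = [[abs(y - median(g)) for y in g] for g in groups]
grand = mean(z for zg in Z for z in zg)

SSBG = n * sum((mean(zg) - grand) ** 2 for zg in Z)
SSWG = sum((z - mean(zg)) ** 2 for zg in Z for z in zg)
F = (SSBG / (p - 1)) / (SSWG / (p * (n - 1)))
# Compare F with the liberal critical value F(.25; p - 1, p(n - 1)); an F
# below it gives no reason to suspect heterogeneous variances.
```

Because the test is just an ANOVA on the Zij scores, any ANOVA routine can be reused for it unchanged.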
Table 3.5-1 ▪ Computational Procedures for the Brown-Forsythe Test for Equal Variances
A test proposed by O'Brien (1981) also is robust to violations of normality. O'Brien's test, like the
Brown-Forsythe test, uses a transformation of the original observations. The transformation is
given by
As a computational check, the mean of the transformed scores for each of the treatment levels should equal the variance of the original observations in the treatment level. An ANOVA F test is used to analyze the transformed dependent variable as in the Brown-Forsythe test. If the F statistic equals or exceeds the critical value, reject the
hypothesis of homogeneity of variances. The procedure generalizes to complex ANOVA
designs.
Assumption That the Numerator and Denominator of the F Statistic Estimate the Same Population Error
Variance (A4)
The assumption that MSBG and MSWG both estimate σ²ε corresponds to the ANOVA null
hypothesis. The hypothesis is advanced in the hope that it can be rejected.
3.6 Transformations
Research by Budescu and Appelbaum (1981) and Games and Lucas (1966) has raised serious
questions about the desirability of transforming observations to achieve either homogeneity of
variance or normality. The exchanges between Levine and Dunlap (1982, 1983) and Games
(1983, 1984) pinpoint many of the controversial issues surrounding the use of transformations.
Extreme scores are undesirable for substantive reasons. An unusually long reaction time, for
example, may result from a lapse of attention and have nothing to do with the experimental
condition that was administered. Extreme scores also are undesirable for statistical reasons
because they increase variability and decrease power. Fortunately, most transformations
reduce the influence of extreme scores.
Obtaining additivity of effects is a desirable use of transformations. The need for additivity
arises in designs such as a fixed-effects randomized block design where a residual mean
square (MSRES) is used to estimate the population error variance. If the block and treatment
effects in a randomized block design are not additive, the expected value of the residual mean
square is σ²ε plus a component due to the nonadditivity of the block and treatment effects, rather than σ²ε alone.
If the null hypothesis is true, then following the discussion in Section 3.4, the numerator and denominator of F = MSA/MSRES should provide independent estimates of the same population error variance, σ²ε. It is apparent from an examination of the expected values of MSA and MSRES that this can occur only if the block × treatment interaction is zero—that is, if the block and treatment effects are additive. It is often possible to
find a transformation that will achieve additivity. I return to this point in Chapter 8.
The decision to use a particular scale for measuring a dependent variable in the behavioral
sciences is often arbitrary; alternative choices may be equally interpretable. If this is true,
homogeneity of variance, normality, additivity, or minimization of the effects of extreme scores
may be achieved by transforming or changing the measurement scale. In general, a
transformation can be used whenever the means of the treatment levels and the variances of
the error effects are proportional and the shapes of the error-effect distributions are
homogeneous. The transformations to be described are not effective under the following
conditions:
1.The means of the treatment levels are approximately equal, but the variances of the error
effects are heterogeneous.
2.The means of the treatment levels vary independently of the error-effect variances.
3.The error-effect variances are homogeneous, but the treatment distributions are
heterogeneous in shape.
Procedures for determining which transformation is appropriate for a set of data have been
described by G. E. P. Box, Hunter, and Hunter (2005, p. 329) and Emerson and Stoto (1983).
One procedure is to follow general rules concerning situations in which a given transformation
is often successful. I follow this approach in describing the various transformations. I describe
an alternative procedure for selecting a transformation in the subsection titled Selecting a
Transformation.
Square-Root Transformation
For certain types of distributions such as a Poisson distribution, treatment means and variances
are proportional. A Poisson distribution often occurs when the dependent variable is a
frequency count of events having a small probability of occurring and there are many
opportunities for the event to occur—for example, the number of errors at each choice point in a
relatively simple multiple T-maze. Often a square-root transformation can be used for such
data. Let Y denote the original score and Y′ the transformed score. A square-root
transformation is given by Y′ = √Y.
Logarithmic Transformation
If treatment means and standard deviations are proportional, a logarithmic transformation may
be appropriate. A transformed score, Y′, is given by Y′ = log Y or Y′ = log(Y + 1). The formula on the right is used when some scores are zero or very small. Logarithmic
transformations have been found to be useful for very positively skewed distributions. Such
distributions can occur when the dependent variable is a measure of reaction time. A
logarithmic transformation typically has a double benefit: reducing the influence of extreme
scores and making the scores more closely resemble a normal distribution.
Reciprocal Transformation
A reciprocal transformation is given by Y′ = 1/Y or Y′ = 1/(Y + 1); the second formula should be used if any scores are zero. A reciprocal transformation can be
useful when the dependent variable is reaction time.
Angular Transformation
An angular, or arcsine, transformation may be useful when the data are percentages or proportions. The transformation is given by
Y′ = 2 arcsin √Y
where Y is a proportion. It is not necessary to solve for Y′ in the preceding formula; values of Y′ for Y from .001 to .999 are given in Appendix Table E.11. The transformed values in Table E.11 are in
radians. It is recommended that 1/2n or 1/4n be substituted for Y = 0 and that 1 − 1/2n or 1 −
1/4n be substituted for Y = 1, where n is the number of observations on which each proportion
is based. An angular transformation can be useful when the means and variances are
proportional and the distribution has a binomial form. This condition can occur when the
number of trials is fixed and Y is the probability of a correct response that varies from one
treatment level to another.
Selecting a Transformation
I have described situations where particular transformations have been found to be useful. An
alternative approach to selecting a transformation is to apply each of the transformations to the
largest and smallest score in each treatment level. The range within each treatment level is
then determined and the ratio of the largest to the smallest range is computed. The
transformation that produces the smallest ratio is selected as the most appropriate one. This
procedure is illustrated in Table 3.6-2. On the basis of this procedure, a square-root transformation is selected for the data in Table 3.6-1.
Table 3.6-2 ▪ Transformations Applied to Largest and Smallest Scores in Table 3.6-1
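The range-ratio procedure can be sketched with a short script; the extreme scores below are hypothetical (not those of Table 3.6-1), chosen so that the square root wins, as it does in the text's example:

```python
# Selecting a transformation: apply each candidate to the smallest and
# largest score in every treatment level, compute the range per level,
# and pick the transformation with the smallest max/min range ratio.
import math

# Hypothetical (smallest, largest) scores for each of p = 3 levels
extremes = [(1, 4), (4, 9), (9, 16)]

candidates = {
    "sqrt": math.sqrt,
    "log": math.log10,
    "reciprocal": lambda y: 1 / y,
}

ratios = {}
for name, f in candidates.items():
    ranges = [abs(f(hi) - f(lo)) for lo, hi in extremes]
    ratios[name] = max(ranges) / min(ranges)

best = min(ratios, key=ratios.get)   # smallest ratio wins
```

For these extremes the square root makes all three ranges equal (ratio 1), so it would be selected.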
Once an appropriate transformation has been selected and the data analyzed on the new
measurement scale, all inferences regarding treatment effects must be made with respect to
the new scale. In many behavioral research situations, inferences based on log Y or , for
example, are just as meaningful as are inferences based on untransformed scores.
3.7 Other Procedures for Dealing with Nonnormality, Unequal Variances, and Outliers
Transformations are not the only procedure that researchers can use to deal with
heterogeneous variances, nonnormality, and extreme scores. Here I examine three other
procedures that may be useful when the assumptions of analysis of variance are not tenable.
Other promising procedures are described by Dean, Majumdar, and Li (2011) and Toothaker
(2005).
Trimming
Wilcox (2005) has called attention to the benefits of trimming in which a certain percentage of
scores in the lower and upper tails of each distribution are removed before computing sample
means. Trimming is recommended when the distributions contain outliers or are relatively flat
with an unusual number of observations in the tails—that is, heavy-tailed distributions. Outliers
affect sample means and increase sample variances. Heavy-tailed symmetrical distributions do
not affect sample means, but they do increase sample variances. Both forms of nonnormality
reduce power. What percent of observations should be removed? Lee and Fung (1985)
recommended trimming 15% of observations. Rocke, Downs, and Rocke (1982) have found
20% and even 25% trimming to be useful.
where
Suppose that I decide to trim p = .20 of the scores—that is, remove the np = k = (10) (.20) = 2
largest and the 2 smallest scores. If k is not an integer, adjust p so that np is an integer. The
trimmed distribution is
A related procedure, Winsorizing, is used to compute the sample variance. The procedure
replaces each trimmed score in the lower tail of a distribution with a copy of the smallest
untrimmed score. An analogous procedure is performed on trimmed scores in the upper tail.
The 20% Winsorized distribution of
is
Trimming and Winsorizing have removed the impact of the two extreme scores in the original
skewed distribution. N. H. Anderson (2001, pp. 352–356) illustrates trimming and Winsorizing
with a completely randomized design. Trimmed F tests provide excellent control of Type I errors
and provide greater power with skewed and heavy-tailed distributions (N. H. Anderson, 2001;
Lee & Fung, 1985).
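A minimal sketch of trimming and Winsorizing as described above; following the numerical example, k = np scores are removed from each tail:

```python
def trim_and_winsorize(scores, p):
    """Return (trimmed, winsorized) versions of scores.

    k = n*p scores are removed from each tail to form the trimmed sample.
    The Winsorized sample replaces each trimmed score with a copy of the
    nearest untrimmed score, as described in the text.
    """
    s = sorted(scores)
    n = len(s)
    k = round(n * p)              # text: adjust p so that n*p is an integer
    trimmed = s[k:n - k]
    winsorized = [trimmed[0]] * k + trimmed + [trimmed[-1]] * k
    return trimmed, winsorized
```

For a skewed sample of n = 10 scores such as [1, 2, 3, 4, 5, 6, 7, 8, 9, 40] with p = .20, the two extreme scores in each tail are removed, and the outlying 40 no longer influences the trimmed mean or the Winsorized variance.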
Rank Transformation
A rank-transform ANOVA F test is a procedure in which an ANOVA F test is applied to the ranks
of data instead of to the data themselves. The procedure was popularized by Conover and
Iman (1981) and can be used with a variety of parametric statistics. To use the procedure with a
completely randomized ANOVA design, begin by ranking all of the scores from 1 to N. A rank of
1 is assigned to the smallest score and a rank of N to the largest score. In the case of ties, the
median rank is assigned. The usual ANOVA analysis is then applied to the ranks. The
procedure converts the parametric ANOVA F test into a distribution-free test. Distribution-free
procedures do not require certain assumptions about the form of the underlying population
distributions. The test is similar to other procedures that analyze ranks—for example, the
Kruskal-Wallis H test for p ≥ 2 treatment levels and the Wilcoxon rank sum test (or the
equivalent Mann-Whitney U test) for p = 2 treatment levels.
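The procedure can be sketched as follows: pool the scores, rank them from 1 to N with midranks for ties, and apply the usual one-way ANOVA F computation to the ranks. The helper names are illustrative:

```python
def midranks(values):
    """Rank all scores from 1 to N; tied scores share the midrank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                 # extend over a run of tied scores
        mid = (i + j) / 2 + 1      # ranks are 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = mid
        i = j + 1
    return ranks

def rank_transform_F(groups):
    """One-way ANOVA F statistic computed on the ranks of the pooled scores."""
    pooled = [y for g in groups for y in g]
    r = midranks(pooled)
    ranked, start = [], 0
    for g in groups:               # split the ranks back into their groups
        ranked.append(r[start:start + len(g)])
        start += len(g)
    N, p = len(pooled), len(groups)
    grand = sum(r) / N
    ssbg = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in ranked)
    sswg = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in ranked)
    return (ssbg / (p - 1)) / (sswg / (N - p))
```

The returned statistic is then referred to the usual F sampling distribution with p − 1 and N − p degrees of freedom.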
The completely randomized rank-transform ANOVA F test does not require that the treatment
populations have identical shapes. However, a test of the equality of population means rests on
the assumption that the distribution shapes are identical. If this assumption is not tenable, the
null hypothesis can be rejected because of differences in the shapes of the distributions, not
their means. If the populations are not normal, but the shapes are identical, the rank-transform
F test often has more power than the conventional ANOVA F test. Furthermore, the rank-
transform F test is only slightly less powerful when the populations are normal.
Hora and Iman (1988) have extended the rank-transform F procedure to a randomized block
design. Kepner and Robinson (1988) have extended the procedure to repeated-measures
designs. More recently, Wobbrock, Findlater, Gergle, and Higgins (2011) extended the
procedure to a completely randomized factorial design. Further research is needed to apply the
procedure to other ANOVA designs.
Generalized Linear Models
Earlier I described a variety of transformations that have been used in an attempt to achieve both
normality and homogeneity of variance. For many distributions, however, there is no single
transformation that will achieve both goals. Generalized linear models, introduced by Nelder
and Wedderburn (1972), represent a different approach to achieving these goals. A generalized
linear model can be thought of as a flexible generalization of a regression model. I describe
regression models in Section 7.5. A thorough description of generalized linear models is
beyond the scope of this book. The interested reader can refer to Nelder and Wedderburn
(1972), Dobson and Barnett (2008), and C. J. Anderson (2009). Here I simply describe several
features of generalized linear models.
To use a generalized linear model, a researcher specifies a distribution for the response
variable and a link function. The response variable can have any distribution that is a member
of the exponential family. This family includes the normal, Poisson, binomial, exponential, and
gamma distributions. Hence, the model is applicable to a wide variety of response variables.
Furthermore, the variances of the response variables do not have to be equal. An appropriate
link function is then selected to connect the response mean to the linear predictors of the
regression model. The advantage of the generalized linear model is that it can achieve both
normality and homogeneity of variance—a task that is difficult to achieve with a transformation.
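Generalized linear models are typically fit by iteratively reweighted least squares (IRLS). The sketch below, which is not from the text, fits a Poisson response with a log link and a single predictor; a real analysis would use a dedicated routine such as R's glm() or a comparable library function:

```python
import math

def poisson_glm_irls(x, y, iters=50):
    """Fit y ~ Poisson(exp(b0 + b1*x)) by iteratively reweighted least
    squares, the standard fitting algorithm for generalized linear models.
    A pure-Python sketch for one predictor with a log link.
    """
    b0 = math.log(sum(y) / len(y))   # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iters):
        eta = [b0 + b1 * xi for xi in x]           # linear predictor
        mu = [math.exp(e) for e in eta]            # inverse link
        # Working response and weights for the weighted least-squares step
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        w = mu
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        # Solve the 2x2 weighted normal equations for (b0, b1)
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1
```

For counts y = [1, 2, 4, 8] observed at x = [0, 1, 2, 3], the fitted slope converges to ln 2, because the counts double with each unit increase in x. No transformation of y is needed; the link function, not the data, carries the nonlinearity.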
In Section 3.3, I gave the expected values of MSBG and MSWG for a fixed-effects model as
Here I show how the expected values are derived. The material that follows assumes a
familiarity with the rules of expectation in Appendix B.
The grand mean, μ, is a constant for all observations in the p populations. Its expected value is
αj is the treatment effect for the jth population. αj is a constant fixed effect for all the
observations in population j, but αj may vary for the j = 1, …, p treatment populations. The sum
of the p treatment effects is zero, , because . Because αj is a
constant for population j, its expected value is
The only source of variation among the observations in population j is that due to the error
effect, . It is assumed that the distribution of εi(j) for each of the p populations is
normal with mean equal to zero and variance equal to . It also is assumed that the εi(j)s are
independent, both within each treatment level and across the treatment levels. Based on these
assumptions, the expected value of εi(j) and functions of εi(j) are as follows:
The computational formula for the between-groups sum of squares is given in Section 3.2 as
The expected value of the between-groups mean square is obtained by first determining the
expected value of each term on the right side of equation (3.8-1) and then combining the two
expectations. The model equation for a completely randomized design is given in equation (3.2-
1) as
As shown earlier, . Thus, the expectation of the first term on the right side of equation
(3.8-1) is
Hence, the expectation of the second term on the right side of equation (3.8-1) is
The expected value of the between-groups sum of squares is obtained by combining the
expected values in equations (3.8-2) and (3.8-3) as follows:
Dividing E(SSBG) by its degrees of freedom, p − 1, gives the expected value of the between-
groups mean square
The computational formula for the within-groups sum of squares is given in Section 3.2 as
Earlier I derived the expected value of the second term on the right of equation (3.8-4). It is
The degrees of freedom for the within-groups sum of squares is p(n − 1). Dividing E(SSWG) by
p(n − 1) gives the expected value of the within-groups mean square:
The computational formula for the total sum of squares is given in Section 3.2 as
I have already obtained the expected values of the two terms in the SSTO formula. Replacing
the terms by their expected values, I obtain
The degrees of freedom for the total sum of squares is np − 1. Dividing E(SSTO) by np − 1
gives the expected value of the total mean square
If one word could characterize the procedure for determining the expected values of MSBG,
MSWG, and MSTO, it would be tedious. An alternative, simpler procedure that works for some
experimental designs is described in Section 9.9.
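Although the algebra is tedious, the two expectations are easy to check numerically. The sketch below, which is not part of the text's derivation, simulates a balanced fixed-effects CR-p model (with μ = 0 for simplicity) and compares the long-run averages of MSBG and MSWG against the derived values, E(MSWG) = σ²ε and E(MSBG) = σ²ε + nΣα²j/(p − 1):

```python
import random

def expected_ms_check(alphas, sigma, n, reps=2000, seed=1):
    """Monte Carlo check of E(MSBG) and E(MSWG) for a balanced
    fixed-effects CR-p model, Y = alpha_j + error (mu = 0 here)."""
    rng = random.Random(seed)
    p = len(alphas)
    tot_bg = tot_wg = 0.0
    for _ in range(reps):
        groups = [[a + rng.gauss(0, sigma) for _ in range(n)] for a in alphas]
        means = [sum(g) / n for g in groups]
        grand = sum(means) / p          # equal n, so this is the grand mean
        ssbg = n * sum((m - grand) ** 2 for m in means)
        sswg = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
        tot_bg += ssbg / (p - 1)        # MSBG
        tot_wg += sswg / (p * (n - 1))  # MSWG
    return tot_bg / reps, tot_wg / reps
```

With αj = (−1, 0, 1), σε = 2, and n = 50, the derived expectations are E(MSWG) = 4 and E(MSBG) = 4 + 50(2)/2 = 54, and the simulated averages land close to both.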
Sometimes a researcher wants to draw conclusions about more treatment levels than can be
included in the experiment. This requires obtaining a random sample of p treatment levels from
the population of P levels, where p is less than P. The results of the experiment can then be
generalized to the P levels in the population. The random-effects experimental design model
equation for a completely randomized design is
where (1) αj is a random variable (random treatment effect) that is normally and independently
distributed with mean equal to zero and variance equal to
The expectation of the μ and εi(j) terms is identical to those for the fixed-effects model.
Following the procedures used for the fixed-effects model, I can show that
1.Terms to remember:
a.chi-square random variable (3.1)
b.F random variable (3.1)
c. one-sample t statistic (3.1)
d.fixed-effects model (model I) (3.3)
e.error variance (3.3)
f. random-effects model (model II) (3.3)
g.robust (3.5)
h.transformation (3.6)
2.[3.1] Whose name is most closely associated with the derivation of each of the following
sampling distributions?
a.Chi square
b.F
c. t
*3.[3.1] For each of the following chi-square variables, determine the mean and variance.
*a.
*b.
c.
d.
*4.[3.1] Use the summation rules in Appendix A to prove that .
*5.[3.1] Show why the following are true and cite the relevant summation rule(s).
*a.
b.
c.
6.[3.1] List the assumptions associated with the use of the chi-square sampling distribution.
*7.[3.1] Compute the value of F that cuts off the lower 1 – α region of the sampling distribution
for the following.
*a.F1 −.05; 3, 20
*b.F1 − .01; 5, 15
c. F1 − .05; 3, 10
d.F1 − .01; 10, 30
*8.[3.1] For each of the following F random variables, determine the mean and variance.
*a.F6, 30
*b.F10, 50
c. F5, 20
d.F3, 20
*9.[3.1] For each of the following random variables, determine the mean and variance.
*a.t20
*b.t60
c. t100
d.t30
10.
[3.2] Explain why the experimental design model equation is the starting point for
partitioning the total sum of squares.
*11.
[3.2] The partition of SSTO into SSBG and SSWG and the derivation of convenient
computational formulas require the use of six summation rules (A.1, A.2, …, A.6). These
rules are described in Appendix A. For each of the following, where i = 1, …, n and j = 1, …, p:
*b.
*c.
*d.
e.
f.
g.
h.
i.
j.
*12.
[3.2] Show that the deviation formula for SSTO is equal to the raw score formula, that is,
13.
[3.2] Show that the deviation formula for SSBG is equal to the raw score formula, that is,
14.
[3.2] Show that the deviation formula for SSWG is equal to the raw score formula, that is,
*15.
[3.2] Determine the degrees of freedom for SSTO, SSBG, and SSWG for the following CR-
p designs.
*a.CR-3 with n = 10
*b.CR-4 with n1 = 3, n2 = 4, n3 = 4, n4 = 5
*c.CR-6 with n = 8
d.CR-5 with n = 6
e.CR-3 with n1 = 6, n2 = 6, n3 = 5
f. CR-3 with n1 = 7, n2 = 10, n3 = 8
g.CR-2 with n = 20
*16.
[3.3] For each of the following experiments or investigations, indicate whether a fixed-effects
or random-effects model is appropriate.
*a.Random samples of White and Black employed mothers were queried concerning the
number and nature of the physical symptoms that would lead them not to send a child
to school.
*b.Twenty male Wistar rats were subjected to 22 hours of food deprivation.
Tetrahydrocannabinol was injected intraperitoneally at five randomly selected times
following the deprivation period. Activity level was recorded following each injection.
c. The effects of completing one, two, three, or four courses in the social sciences on
liberalizing political attitudes was investigated using a random sample of 200 college
students.
d.The amounts of social interaction initiated by 4- to 6-year-old children classified as
severely, moderately, mildly, or nonhandicapped were investigated.
e.Eight randomly selected doses of testosterone propionate were administered to 80
immature male domestic ducks of the Rouen breed. The dependent variable was the
number of different patterns of sexual behavior that were exhibited following
administration of the drug.
*17.
[3.8] Compare the scope of the null hypothesis for the fixed- and random-effects models for
a CR-p design.
*18.
[3.4] Explain why a test of is a test of the hypothesis that α1 = α2 = … = αP = 0.
19.
[3.4] Explain the rationale for using the F statistic, which is the ratio of two independent
variances, to test the hypothesis of equality of means.
*20.
[3.5] Qualify the statement, “The F test is quite robust with respect to violation of the
normality assumption.”
21.
[3.5] Discuss the statement, “The F test is not appropriate for dichotomous data because
such data depart radically from the normal distribution.”
22.
[3.5] What purposes do random sampling and random assignment serve?
*23.
[3.5] Discuss the statement, “The F test is robust with respect to moderate violation of the
homogeneity of variance assumption.”
24.
[3.5] Classify the tests of homogeneity of variance in terms of those that are (a) robust or
not robust to nonnormality and (b) appropriate for equal or unequal njs.
25.
[3.5] When the population variances are not homogeneous and the sample sizes are
unequal, how is the actual probability of a Type I error affected under the following
conditions?
a.The smaller samples are drawn from the more heterogeneous populations.
b.The smaller samples are drawn from the more homogeneous populations.
*26.
[3.5] Three programs for improving the reading skills of sixth-graders were evaluated.
Twenty-seven boys were randomly assigned to the three programs with the restriction that
nine boys were assigned to each program. Treatment level a1 was the Delacato program,
a2 was the Stanford program, and a3 was a control condition. The Iowa Test of Reading
Skills was administered at the beginning and end of the school year. The dependent
variable was the increase from the pretest to the posttest. The following data were obtained:
a1 a2 a3
20 15 12
18 20 15
18 13 18
23 12 20
22 16 18
17 17 17
15 21 10
13 15 24
21 13 16
*a.Compute the statistic for testing the hypothesis that the population variances are
homogeneous using (i) the Fmax test, (ii) Cochran's C test, and (iii) the Brown-Forsythe
test; let α = .05.
*b.What is your decision regarding the null hypothesis of homogeneity of variance? Do the
three tests agree?
27.
[3.5] The experiment in Exercise 26 was repeated with a random sample of 27 boys in the
fifth grade. The following data were obtained:
a1 a2 a3
17 10 16
13 4 7
16 9 6
13 5 18
9 15 19
10 5 2
15 12 11
11 16 20
15 9 5
Test the hypothesis that the population variances are homogeneous. Follow the instructions
in Exercise 26.
*28.
[3.6] Use the first approach described in the text to determine which, if any, transformation
is appropriate for the data in Exercise 27.
*29.
[3.6] Use the procedure illustrated in Table 3.6-2 to determine which, if any, transformation
is appropriate for the data in Exercise 27.
30.
[3.6] The legibility of four versions of a dial was investigated. Twenty pilots and navigators
with over 2000 hours of flying time were randomly assigned to one of four groups, and each
group read one version of the dials. Pictures of the dials were shown on a computer screen
for 600 milliseconds. Each subject read 100 settings of a dial. The dependent variable was
the number of reading errors. The following data were obtained:
a.Use the first approach described in the text to determine which, if any, transformation is
appropriate for these data.
b.Use the procedure illustrated in Table 3.6-2 to determine which, if any, transformation is
appropriate for these data.
*31.
[3.6] Give rules of thumb for deciding which transformation is appropriate for a set of data.
32.
[3.6] The effects of lesions in the parafascicular nucleus of 25 Norway rats (Rattus
norvegicus) on running time in a straight-alley maze were investigated. The rats were
randomly assigned to one of five groups, subject to the restriction that each group
contained five rats. The extent of the lesions differed for each group. The following data are
running times in seconds for the five groups:
a.Use the first approach described in the text to determine which, if any, transformation is
appropriate for these data.
b.Use the procedure illustrated in Table 3.6-2 to determine which, if any, transformation is
appropriate for these data.
*33.
[3.6] If data are not suitable for ANOVA and an appropriate transformation cannot be found,
what recourse does a researcher have?
*34.
[3.8] Assume a fixed-effects model for a CR-p design. Give the expected value of each of
the following and indicate the relevant rule(s) (B.1, B.2, …, B.11) from Appendix B.
*a.
*b.
*c.
*d.
*e.
f.
g.
h.
i.
j.
k.
l.
*35.
[3.8] Do Exercise 34 assuming a random-effects model.
36.
[3.8] Derive the expected values of MSBG and MSWG in a CR-p design assuming a
random-effects model.
1An identity is a statement of equality that is true for all values of the variables that have
meaning.
3Examples showing how to use the SPSS and SAS statistical packages to analyze each
ANOVA design are available on my Web page.
4SPSS provides Levene's test of the assumption of homogeneity of population variances. The
test is less robust to nonnormality than the Brown-Forsythe test.
*The reader who is interested in “just the facts” can omit this section without loss of continuity.
http://dx.doi.org/10.4135/9781483384733.n3
Chapters 1 to 3 introduced some basic concepts and statistical tools that are used in
experimental design. In this and the following chapters, those designs that appear to have the
greatest usefulness to researchers in the behavioral sciences, health sciences, and education
are examined in detail.
One of the simplest experimental designs from the standpoint of data analysis and assignment
of subjects or experimental units to treatment levels is the completely randomized design. The
design is denoted by the letters CR-p, where CR stands for completely randomized and p is the
number of levels of the treatment. The layout for a completely randomized design with four
treatment levels is shown in Figure 4.1-1.
Figure 4.1-1 ▪ Layout for a completely randomized design (CR-4 design) with p = 4
treatment levels denoted by a1, a2, a3, and a4. The subjects are randomly assigned to the
treatment levels. The n = 5 subjects in Group1 receive treatment level a1, those in
Group2 receive treatment level a2, and so on. The dependent-variable means for the
subjects who receive treatment levels a1, a2, a3, and a4 are denoted by , and ,
respectively.
A CR-p design is appropriate for experiments that meet, in addition to the general assumptions
of analysis of variance summarized in Section 3.5, the following two conditions:
1.One treatment with p ≥ 2 treatment levels. The levels of the treatment can differ either
quantitatively or qualitatively. When the experiment contains only two treatment levels, the
design is indistinguishable from the t test for independent-samples design that is described
in Section 2.2.
2.Random assignment of experimental units to the treatment levels, with each experimental
unit designated to receive only one level. The number of experimental units in each
treatment level need not be equal, although this is desirable. According to Section 3.5, the
F statistic is more robust to violation of some assumptions when the sample ns are equal.
I describe the model equation for a completely randomized design in Section 2.2 and the
assumptions for the model in Sections 3.3 and 3.5. Here I elaborate on the assumptions for the
fixed-effects model.
αj is the treatment effect for population j and is equal to μj − μ, the deviation of the jth
population mean from the grand mean.
εi(j) is the error effect associated with Yij and is equal to Yij − μj. The error effect represents
effects unique to subject i, effects attributable to chance fluctuations in subject i's behavior,
and any other effects that have not been controlled—in other words, all effects not
attributable to treatment level aj.
2.The experiment contains all of the treatment levels, αjs, of interest. As a result, the
treatment effects sum to zero, .
3.The error effect, εi(j), is normally and independently distributed within each treatment
population with mean equal to zero and variance equal to . This assumption is often
The fixed-effects model is the most commonly used model for a CR-p design. The random-
effects model in which the p treatment levels are randomly sampled from a population of P
levels (p < P) is discussed in Section 4.6.
In Section 4.1, I described the assumptions of the error effects: The εi(j)s should be normally
distributed, have equal variances, and be mutually independent. To determine whether the
assumptions are tenable, it is helpful to examine a plot of standardized residuals, denoted by
zi(j). A residual (error effect) is given by . Standardization is achieved by dividing the
residuals by their standard deviation. For a completely randomized design, the standard
deviation of the residuals is √(SSWG/N), where N = n1 +…+ np. The computation of
SSWG is illustrated in Section 4.3. A standardized residual for subject i in treatment level j is
given by
If the assumptions of the model are tenable, the standardized residuals should be normally
and independently distributed with mean equal to 0 and variance equal to 1; zi(j) is NID(0, 1).
Hence, to check on the model assumption, one looks for deviations from patterns that would be
expected of independent observations from a standard normal distribution.
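A sketch of the computation, assuming the residuals are standardized by √(SSWG/N) with N = n1 + … + np:

```python
import math

def standardized_residuals(groups):
    """Residuals e = Y - (group mean), each divided by the standard
    deviation of the residuals, taken here to be sqrt(SSWG / N)."""
    means = [sum(g) / len(g) for g in groups]
    resid = [[y - m for y in g] for g, m in zip(groups, means)]
    N = sum(len(g) for g in groups)
    sswg = sum(e * e for g in resid for e in g)
    s = math.sqrt(sswg / N)
    return [[e / s for e in g] for g in resid]
```

By construction the standardized residuals sum to zero and their mean square is exactly 1, so departures from the NID(0, 1) pattern, such as a group with many |z| values beyond 2, point to a violated assumption.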
Residuals and standardized residuals for the data in Table 4.2-1 are shown in Table 4.2-2. In
Figure 4.2-1(a), the standardized residuals in Table 4.2-2 are displayed in the form of frequency
distributions. If the model assumptions are tenable, approximately 68.3% of the standardized
residuals should be between −1 and 1, approximately 95.4% between −2 and 2, and
approximately 99.7% between −3 and 3. Based on the residual plots, there is no reason to
doubt the tenability of the normality and homogeneity of variance assumptions. Other
procedures for testing the hypothesis of homogeneity of the population variances are described
in Section 3.5.
Table 4.2-2 ▪ Residuals and Standardized Residuals for the Data in Table 4.2-1
Figure 4.2-1(b) displays a different kind of information. Here, the residuals are plotted against
the order in which the hand-steadiness measurements were collected. If the independence
assumption is tenable, the standardized residuals should be randomly distributed around zero
with no discernible pattern. Nonindependence is indicated if the zi(j)s show a consistent
downward or upward trend or they have the shape of a megaphone. The independence
assumption appears to be satisfied for treatment levels a1 through a3. However, the
standardized residuals in treatment level a4 increase as a function of the order in which the
measurements were collected—strong evidence that the independence assumption is violated.
A researcher would certainly want to review the data collection procedures for this treatment
level.
Outliers
Occasionally one encounters data with one or more observations that deviate markedly from
other observations in the sample. Such observations are called outliers. In a standardized
residual plot, they are observations for which |zi(j)| > 2.5. An examination of Figure 4.2-1(a)
reveals that there are no outliers. Box plots also are useful for detecting outliers and treatment
populations that are not symmetrical. Box plots are discussed in most introductory statistics
books.
When outliers occur, they call for detective work. A researcher must decide whether the
residuals merely represent extreme manifestations of the random variability inherent in data or
are the result of deviations from prescribed experimental procedures, recording errors,
equipment malfunctions, and so on. If they reflect the random variability inherent in data, they
should be retained and processed in the same manner as the other observations. If some
physical explanation for the outlier can be found, a researcher may (1) replace the observation
with new data, (2) correct the observation if records permit, or (3) reject the observation and
Winsorize. Winsorization is described in Section 3.6. After performing an exploratory data
analysis and deciding that the assumptions of the model are tenable, the next step is a
confirmatory data analysis.
The statistical hypotheses for the hand-steadiness data in Table 4.2-1 are
The level of significance adopted is α = .05. Procedures for computing the sums of squares
used in testing the null hypothesis are illustrated in Table 4.3-1. The AS Summary Table is so
named because variation among the 32 scores reflects the effects of the treatment A and the
subjects, denoted by S for subjects. The computational scheme in parts (ii) and (iii) of the table
uses the abbreviated symbols [AS], [A], and [Y] that were introduced in Section 3.2. This
abbreviated notation simplifies the presentation of the computational formulas.
An ANOVA table summarizing the results of the analysis is shown in Table 4.3-2. The mean
square (MS) in each row is obtained by dividing the sum of squares (SS) by the degrees of
freedom (df) in its row. Recall from Section 3.3 that an MS is an estimator of a population
variance and is given by
The F statistic is obtained by dividing the mean square in the first row by the mean square in
the second row. This is indicated symbolically by . According to Appendix Table E.4, the
value of F that cuts off the upper .05 region of the sampling distribution for 3 and 28 degrees of
freedom is F.05; 3, 28 = 2.95. Because the obtained F = 5.56 exceeds the table value, F > F.05; 3, 28, the null hypothesis is rejected.
It is customary to include in an ANOVA table the p value associated with the F statistic and a
measure of effect magnitude. The p value for the F statistic was obtained from Microsoft's Excel
FDIST function,
=FDIST(x, deg_freedom1, deg_freedom2)
To illustrate, I replaced x with 5.56 (the value of the F statistic), deg_freedom1 with 3, and
deg_freedom2 with 28 as follows:
=FDIST(5.56, 3, 28)
Excel returned the p value of .004. The effect magnitude statistic in Table 4.3-2 is
discussed in Section 4.4. A decision to reject or not reject the null hypothesis should be based
on the researcher's preselected level of significance, .05 in the example. The inclusion of the p
value permits readers to, in effect, set their own level of significance.
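Readers without Excel can approximate the same tail probability by simulating the F sampling distribution directly, using the fact that an F variate is a ratio of two independent chi-squares, each divided by its degrees of freedom. This sketch is not part of the text's procedure:

```python
import random

def f_tail_prob(f_obs, v1, v2, sims=100_000, seed=7):
    """Monte Carlo estimate of P(F(v1, v2) > f_obs), the p value that
    the text reads from Excel's FDIST function.  A chi-square with k
    degrees of freedom is simulated as a Gamma(k/2, 2) variate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        num = rng.gammavariate(v1 / 2, 2) / v1
        den = rng.gammavariate(v2 / 2, 2) / v2
        if num / den > f_obs:
            hits += 1
    return hits / sims
```

For f_obs = 5.56 with 3 and 28 degrees of freedom, the simulated tail probability lands close to the .004 that Excel returns.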
The Publication Manual of the American Psychological Association (2010, p. 141) states that a
tabular presentation can minimize the need for lengthy textual descriptions.
After the omnibus null hypothesis1 is rejected, the next step in the analysis is to decide which
population means differ. Multiple comparison procedures are used for this purpose and are
described in Chapter 5.
Strength of Association
The most widely used measures of strength of association in analysis of variance are omega
squared, ω2, introduced by William Hays (1994, p. 408) for fixed treatment effects and the
intraclass correlation, ρI, for random treatment effects. For a completely randomized design,
both measures are defined as
where is the variance of the treatment effects and is the variance of the error effects. ω2
and ρI indicate the proportion of the population variance in the dependent variable that is
accounted for by specifying the treatment-level classification, and thus they are identical in
general meaning. Both ω2 and ρI are measures of strength of association for a qualitative or
quantitative independent variable and a quantitative dependent variable.
The parameters in equation (4.4-1), the variance of the treatment effects and the variance of
the error effects, are generally unknown, but they can be estimated from sample data. In
Section 3.3, you learned that
for the random-effects model. It follows that unbiased estimators of and are given by
for the random-effects model. If the estimators for and are substituted in equation (4.4-1),
the following formulas for and can be obtained with the aid of a little algebra:
Thus, the four levels of sleep deprivation account for 30% of the variance in the hand-
steadiness scores. Not only is the association statistically significant, as is evident from the
significant F statistic in Table 4.3-2, but also the association is quite strong.
Based on Cohen's (1988, pp. 284–288) classic work, the following guidelines are suggested for
interpreting strength of association:
When a sample omega squared is negative, the best estimate of the population value is 0.
Sedlmeier and Gigerenzer (1989) and Cooper and Findley (1982) reported that the typical
strength of association in the journals that they examined was around .06—a medium
association.
Omega squared and the intraclass correlation also can be computed from a knowledge of the F
statistic, sample size in each treatment level, and number of treatment levels. The alternative
formula for and the value of for the hand-steadiness data are
where F, n, and p are obtained from Table 4.3-2. If treatment A represents random effects, the
intraclass correlation can be computed from
These formulas for and can be used to assess the practical significance of published
research where only the F statistic and degrees of freedom are provided.
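The F-based computations can be sketched as follows. The formulas used, (p − 1)(F − 1)/[(p − 1)(F − 1) + np] for omega squared and (F − 1)/(F − 1 + n) for the intraclass correlation, are the standard estimators; they are assumed here to match the text's unreproduced equations because they recover the .30 reported for the hand-steadiness data:

```python
def omega_squared(F, n, p):
    """Estimate omega squared from the F statistic for a CR-p design
    with n scores per treatment level (fixed effects)."""
    return (p - 1) * (F - 1) / ((p - 1) * (F - 1) + n * p)

def intraclass_correlation(F, n):
    """Estimate the intraclass correlation from the F statistic
    (random effects), i.e., (MSBG - MSWG)/(MSBG + (n - 1)MSWG)."""
    return (F - 1) / (F - 1 + n)
```

With F = 5.56, n = 8, and p = 4 from Table 4.3-2, omega_squared returns approximately .30, the proportion of variance reported for the sleep deprivation data.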
The formulas for given earlier assume that the sample ns are equal. If the sample ns are not
too different, Vaughan and Corballis (1969) have suggested the following formula for
approximating omega squared:
Omega squared and the intraclass correlation, like the F statistic, are omnibus (overall)
statistics. Researchers generally are not as interested in this omnibus statistic as they are in
knowing how much of the variance in the dependent variable is accounted for by the difference
between selected treatment levels, say, the means for treatment levels a1 and a2. One-degree-
of-freedom omega-squared measures that address this kind of question are discussed in
Section 6.5.
In interpreting omega squared, it is important to remember that the treatment levels are
selected a priori rather than by random sampling as is the case for the intraclass correlation.
The presence of a truncated range or the selection of extreme values of a quantitative
independent variable can markedly affect the value of . Omega squared applies to the
treatment levels in the experiment; any generalization to levels not included in the experiment is
a leap of faith. Note also that and are computed from the ratio of unbiased estimators;
hence, they are biased estimators of the corresponding population parameters. In general, the
ratio of two unbiased estimators is not itself an unbiased estimator. Carroll and Nordholm
(1975) have shown that the degree of bias in is slight.
Other statistics, such as R2, the coefficient of multiple determination (also called eta squared,
η2), also are used to measure the strength of association between the independent and
dependent variables. R2 indicates the sample proportion of variance in the dependent variable
that is accounted for by the independent variable. An adjustment due to Wherry (1931) can be
applied to R2 to obtain a better estimate of the corresponding population proportion.
Effect Size
where
Cohen (1988, pp. 284–288) suggested the following guidelines for interpreting the f measure
of effect size:
Based on Cohen's guidelines, the treatment effects for the sleep deprivation experiment are
classified as large effects. The same conclusion was reached using omega squared. In fact, the two indexes
are related as follows:
For a discussion of the merits of measures of strength of association and effect size, the reader
is referred to Cumming (2012), Henson (2006), Huberty (2002), and Kline (2004).
Measures of strength of association estimate the proportion of the variance in the dependent
variable that is accounted for by the independent variable; Cohen's f and similar measures
estimate the relative size of treatment effects. Both kinds of measures provide important
information that is not contained in a test of significance. When the results of significance tests
are reported, researchers should always include a measure of effect magnitude.
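The correspondence between omega squared and Cohen's f can be sketched in code; the conversion f = √[ω2/(1 − ω2)] reproduces the pairing of the conventional guideline values (.010, .059, .138 for ω2; .10, .25, .40 for f). This is an illustration, not code from the text:

```python
import math

def f_from_omega(omega_sq):
    """Convert a population omega squared to Cohen's effect size f."""
    return math.sqrt(omega_sq / (1.0 - omega_sq))

def omega_from_f(f):
    """Convert Cohen's f back to omega squared."""
    return f * f / (1.0 + f * f)

for w in (0.010, 0.059, 0.138):        # small, medium, large associations
    print(round(f_from_omega(w), 2))   # approximately .10, .25, .40
```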
The parameter λ is a measure of the degree to which the null hypothesis is false. The value of λ
is determined by the size of the sum of the squared treatment effects relative to the error
variance, σ2ε. Tang (1938) prepared charts that simplify the calculation of power. Tang's
charts, which are reproduced in Appendix Table E.12, are based on a function of the
noncentrality parameter. To use the charts, the parameter ϕ (Greek phi), where ϕ = √(λ/p),
is entered in the appropriate chart for v1 = p − 1 and v2 = p(n − 1) degrees of freedom and a
significance level of either .05 or .01.
The calculation of power is illustrated for the data summarized in Table 4.3-2. In practice, the
population parameters are unknown and must be replaced by sample estimates. An estimate
of ϕ, denoted by ϕ̂, is obtained by substituting these estimates in the formula for ϕ.
with v1 = p − 1 = 3 and v2 = p(n − 1) = 4(8 − 1) = 28. Appendix Table E.12 contains eight power
charts: a chart for v1 = 1, …, 8. Each chart contains power curves for α = .05 and α = .01. Use
the .05 curves because .05 is the level of significance adopted in the sleep deprivation
experiment. The value of ϕ̂ is located along the α = .05 baseline in the v1 = 3 chart.
Extend an imaginary vertical line above ϕ̂ until it intersects a point just to the right of the v2
= 30 curve; the chart does not contain a v2 = 28 curve. If you read across to the vertical axis,
the power of the ANOVA F test is found to be approximately .83, which just exceeds the
minimum acceptable power of .80.
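In place of chart reading, the power of the ANOVA F test can be computed directly from the noncentral F distribution. The sketch below uses SciPy with a hypothetical noncentrality parameter λ (recall ϕ = √(λ/p)); it is an illustration, not the book's procedure:

```python
from scipy.stats import f, ncf

def anova_power(lam, v1, v2, alpha=0.05):
    """Power of the ANOVA F test: the probability that a noncentral F
    variate with noncentrality lam exceeds the central F critical value."""
    f_crit = f.ppf(1.0 - alpha, v1, v2)
    return ncf.sf(f_crit, v1, v2, lam)

# Hypothetical lambda for v1 = 3, v2 = 28 (p = 4 groups, n = 8)
print(anova_power(12.0, 3, 28))
```

Free calculators such as G*Power 3, mentioned below, perform the same computation.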
Cohen (1988, pp. 289–354) provides more extensive tables for determining power than those in
Appendix E.12. His tables contain values for v1 = 1 through 6, 8, 10, 12, 15, and 24 and α =
.10, .05, and .01. To use his tables, a researcher computes Cohen's effect size, f. This effect
size is related to ϕ by ϕ = f√n. Cohen's tables and those in Appendix E.12 are appropriate for fixed
effects. Montgomery (2009, pp. 625–628) gives tables for calculating power for random effects.
A plethora of free easy-to-use power and sample size calculators can be found on the Internet.
One of my favorites is G*Power 3.
Choosing a sample size is a bewildering task for many researchers. Researchers want to use
enough subjects to detect meaningful effects, but they don't want to use too many subjects
and squander research resources. Three approaches to estimating sample size are illustrated.
The procedures differ in terms of the information that a researcher must provide and in their
simplicity. The first approach requires the most information. A researcher must specify the (1)
level of significance, α; (2) power, 1 – β; (3) size of the population error variance, σ2ε; and (4)
the sum of the squared population treatment effects, Σα2j. Researchers rarely know all of these
values. However, there are ways to circumvent this problem. One way is to estimate σ2ε and Σα2j
from a pilot study. Alternatively, estimates of σ2ε and Σα2j may be obtained from research
reported in the literature.
For the purpose of illustration, suppose that the hand-steadiness data in Table 4.2-1 were
obtained in a pilot study to estimate sample size; let α = .05 and 1 – β = .80. This choice of
values for α and 1 – β is based on the widely accepted conventions that the probability of making
a Type I error should be less than or equal to .05 and the minimum acceptable power should
be greater than or equal to .80. With these conventions and the pilot-study information from
Table 4.3-2, a researcher can use trial and error to estimate the required sample size. The
process consists of inserting trial sample-size values, denoted by n, in the formula for ϕ̂
and determining from Tang's charts whether a power of .80 has been achieved. I begin the trial-
and-error process with n = 7, which yields insufficient power. The next trial value, n = 8,
with v1 = 3 and v2 = 4(8 − 1) = 28, gives a power of .83. Thus, if a researcher uses np = (8)(4) =
32 subjects, the power is approximately .83.
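The trial-and-error search is easy to automate. This sketch increases n until the computed power reaches the target; the pilot estimates in the usage line are made up, not the Table 4.3-2 values:

```python
from scipy.stats import f, ncf

def n_for_power(sum_alpha_sq, sigma_sq, p, alpha=0.05, target=0.80):
    """Smallest n per group for which the ANOVA F test attains `target`
    power, given pilot estimates of the sum of squared treatment effects
    and the error variance; the noncentrality parameter is
    lam = n * sum_alpha_sq / sigma_sq."""
    n = 2
    while True:
        lam = n * sum_alpha_sq / sigma_sq
        v1, v2 = p - 1, p * (n - 1)
        power = ncf.sf(f.ppf(1.0 - alpha, v1, v2), v1, v2, lam)
        if power >= target:
            return n, power
        n += 1

# Hypothetical pilot estimates: sum of squared effects 1.2, error variance 1.0
n, power = n_for_power(1.2, 1.0, 4)
print(n, round(power, 2))
```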
If accurate estimates of σ2ε and Σα2j are not available from a pilot study or previous
research, the procedure just described for calculating n cannot be used. However, there is an
alternative approach that does not require this information. The approach does require a
general idea about the size of the difference between the largest and smallest population
means that would be useful to detect relative to the size of σε. To use this approach, the
difference between the largest and smallest population means that a researcher wants to
detect is specified as some multiple, denoted by d, of the population standard deviation; that is,
μmax – μmin = dσε. An examination of Figure 4.5-1 should help to clarify the meaning of d. For
example, the difference between μmax and μmin that a researcher wants to detect might be
one and one-half times as large as σε, d = 1.5, or three fourths as large as
σε, d = 0.75. This approach to estimating sample size requires the specification of d but not
of σ2ε and Σα2j separately.
Figure 4.5-1 ▪ Each treatment mean is represented by a square. The mean of the p means,
the grand mean, is denoted by μ. Two of the treatment effects, those associated with μmin
and μmax, are not equal to zero. The remaining p − 2 treatment effects are equal to zero.
When there are more than two means in an experiment, many configurations of means will
produce the same value of μmax – μmin = dσε. It can be shown that the sum of the squared
treatment effects, Σα2j, is minimal when two of the means, μmin and μmax, are not equal
and the remaining p − 2 means are equal to the grand mean. This configuration of means is
illustrated in Figure 4.5-1. It should be apparent from the figure that the treatment effect for
μmin is equal to −dσε/2. Similarly, the treatment effect for μmax is equal to dσε/2.
Because power increases with an increase in Σα2j, it follows that a choice of values for the
αjs other than these will always lead to greater power. Hence, if the sample size necessary to
achieve a given power is computed for these treatment effects, a researcher can be certain that
any other configuration for which the maximum difference between means is equal to dσε will
yield a power greater than that specified. The ϕ formula for estimating sample size is obtained
by substituting these minimal treatment effects in the formula for λ, which gives ϕ = d√[n′/(2p)].
Assume that an experiment contains four treatment levels and I am interested in detecting
differences among means such that μmax – μmin is equal to 1.5σε. In this example, d = 1.5, α
= .05, 1 – β = .80, and v1 = p − 1 = 3. Various trial sample-size values, n′, can be tried in the
formula for ϕ until the desired power is obtained. I begin the trial-and-error process with n′ = 8,
which yields insufficient power. Successive trials lead to n′ = 11,
where v1 = p − 1 = 3 and v2 = p(n′ − 1) = 4(11 − 1) = 40. I get a power of .81. Thus, to detect a
difference between the largest and smallest means that is 1.5 times as large as σε, I should
use np = (11)(4) = 44 subjects. The advantage of this approach to estimating sample size is that
only d must be specified; estimates of σ2ε and Σα2j are not required.
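This least-favorable-configuration approach translates directly into code: with μmax − μmin = dσε and the remaining means at the grand mean, the sum of squared effects is 2(dσε/2)2 = d2σ2ε/2, so the noncentrality parameter is λ = nd2/2. The sketch below (an illustration, not the book's procedure) searches for the smallest n:

```python
from scipy.stats import f, ncf

def n_for_range(d, p, alpha=0.05, target=0.80):
    """Smallest n per group to detect mu_max - mu_min = d * sigma_e in
    the least favorable configuration, where lam = n * d**2 / 2."""
    n = 2
    while True:
        lam = n * d * d / 2.0
        v1, v2 = p - 1, p * (n - 1)
        if ncf.sf(f.ppf(1.0 - alpha, v1, v2), v1, v2, lam) >= target:
            return n
        n += 1

print(n_for_range(1.5, 4))  # close to the n = 11 found with Tang's charts
```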
The third approach to estimating sample size can be used when a researcher knows nothing
about the population parameters. It relies on Cohen's guidelines for interpreting the strength of
association. In Section 4.4, Cohen's guidelines for interpreting ω2 are described. Recall that
ω2 = .010 is a small association, ω2 = .059 is a medium association, and ω2 = .138 is a large
association.
Suppose that a researcher is interested in determining the sample size necessary to detect a
large association, ω2 = .138, for a completely randomized design with p = 4 treatment levels.
Assume that the researcher has followed the convention of setting α = .05 and 1 – β = .80. The
sample size can be determined from Appendix Table E.13 for v1 = 4 − 1 = 3 and v2, where v1
and v2 denote the degrees of freedom for a completely randomized design.2 The value of n is
obtained from the column headed by ω2 = .138 and the row labeled 1 – β = .80. According to
Table E.13, the sample size n is 18. The experiment requires
np = (18)(4) = 72 subjects.
The effect-size index, f, developed by Cohen (1988) also can be used to determine the required
sample size. Cohen suggested the following guidelines for interpreting f: f = .10 is a small
effect, f = .25 is a medium effect, and f = .40 is a large effect.
Suppose that a researcher is interested in determining the sample size necessary to detect a
large effect size, f = .40, for a completely randomized design with p = 4 treatment levels.
Assume that α = .05 and 1 – β = .80. The required sample size can be determined from
Appendix Table E.13 for v1 = 4 − 1 = 3 and v2, where v1 and v2 denote the degrees of
freedom for a completely randomized design. The value of n is obtained from the column
headed by f* = f = .400 and the row labeled 1 – β = .80. According to Table E.13, the sample size n
is 18. The experiment requires np = (18)(4) = 72 subjects.
Appendix Table E.13 can be used to estimate the sample size if α = .05, 1 – β = .70, .80, or .90,
and the design contains two to four treatment levels. If these conditions are not satisfied, Tang's
charts in Appendix Table E.12 can be used to estimate n. The charts are entered with ϕ,
computed from trial values of n′.
Suppose that a researcher plans to use a completely randomized design and wants to detect a
large strength of association, ω2 = .138, for an experiment with p = 5 treatment levels. Assume
that α = .05 and 1 – β = .80. Various n′s can be tried in the formula for ϕ until the desired power
is obtained. I begin with n′ = 13; successive trials show that n′ = 16 is required.
There is a tendency among researchers to underestimate the sample size required to obtain
practical significance. In the last example, np = (16)(5) = 80 subjects are required to detect a
large association. Medium and small associations require, respectively, (39)(5) = 195 subjects
and (240)(5) = 1200 subjects.
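The guideline-based approach can likewise be computed directly rather than read from a table: with Cohen's f, the noncentrality parameter is λ = f2np (f2 times the total sample size). This sketch is an illustration under those assumptions, not a reproduction of Table E.13:

```python
from scipy.stats import f, ncf

def n_from_f(effect_f, p, alpha=0.05, target=0.80):
    """Smallest n per group for Cohen's effect size f, where the
    noncentrality parameter is lam = effect_f**2 * n * p."""
    n = 2
    while True:
        lam = effect_f ** 2 * n * p
        v1, v2 = p - 1, p * (n - 1)
        if ncf.sf(f.ppf(1.0 - alpha, v1, v2), v1, v2, lam) >= target:
            return n
        n += 1

# Large, medium, and small effects for p = 5 treatment levels
for eff in (0.40, 0.25, 0.10):
    print(eff, n_from_f(eff, 5))
```

The required n grows rapidly as the effect size shrinks, which is the point of the preceding paragraph.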
Three approaches to estimating sample size have been described. The use of ω2 or f combined
with Cohen's guidelines for interpreting values of ω2 and f requires the least amount of
information and is the simplest. Cohen's guidelines are offered as a useful starting point.
An estimate of the sample size necessary to detect effects that are practically significant should
always be made before an experiment is performed. A researcher may find, for example, that
the contemplated sample size is wastefully large, in which case the sample size can be
reduced. On the other hand, a researcher may find that the contemplated sample size is too
small and gives less than a 60% chance of detecting treatment effects considered of practical
significance. In this case, a researcher may (1) attempt to secure enough subjects to obtain a
power of .80, (2) decide not to conduct the experiment, or (3) attempt to modify the experiment
so as to reduce the required number of subjects. The modification could involve selecting a
less stringent level of significance, settling for lower power, increasing the size of treatment
effects that are of interest, or redesigning the experiment to obtain a more precise estimate of
treatment effects and a smaller error term.
The experimental design model equation for a completely randomized design is given in
Section 4.1 as Yi(j) = μ + αj + εi(j).
There I assumed that the treatment effects are fixed effects, μ is a constant, and εi(j) is
normally and independently distributed with mean 0 and variance σ2ε.
A comparison of the expected values of the mean squares for the two models is given in Table
4.6-1. The derivation of E(MS) is given in Section 3.8. For both models, a test of the null
hypothesis αj = 0 for all j (model I) or σ2α = 0 (model II) is given by the statistic F =
MSBG/MSWG. From Table 4.6-1, E(MSBG) = σ2ε + f(α) and E(MSWG) = σ2ε,
where f() denotes a function of the effects in parentheses. If any treatment effects exist, the
numerator of the F statistic should be larger than the denominator. This F statistic adheres to a
basic principle that is shared by all ANOVA F statistics: The expected value of the numerator
should always contain one more term than the expected value of the denominator. For the
completely randomized design, the F statistic helps a researcher decide which model equation
underlies observations in the population.3 If the null hypothesis is rejected, the second
equation is adopted; if not, the first equation remains tenable.
As you have seen, the fixed- and random-effects models are identical except for the
assumptions about the nature of the treatment effects. This difference is important because it
determines the nature of the conclusions that can be drawn from an experiment. For the fixed-
effects model, conclusions are restricted to the p treatment levels in the experiment. For the
random-effects model, conclusions apply to the P treatment populations from which the p
treatment levels were randomly sampled.
1.The effects of differences among subjects are controlled by random assignment of the
subjects to treatment levels. For this to be effective, subjects should be relatively
homogeneous or a large number of subjects should be used.
2.When many treatment levels are included in the experiment, the required sample size may
be prohibitive.
1.Terms to remember:
a.confirmatory data analysis (4.2)
b.exploratory data analysis (4.2)
c. standardized residual (4.2)
d.outlier (4.2)
e.omega squared (4.4)
f. intraclass correlation (4.4)
g.coefficient of multiple determination (4.4)
h.central F distribution (4.5)
i. noncentral F distribution (4.5)
j. noncentrality parameter (4.5)
k. model I (4.6)
l. model II (4.6)
*2.Two approaches to learning problem solving strategies—more specifically, generating
alternative solutions—were investigated. Thirty sixth-graders were randomly assigned to
one of the two approaches and a control condition. Treatment level a1, referred to as the
training condition, involved participating in five sessions per week during 3 consecutive
weeks. Students assigned to this condition observed a videotape introduction for 10
minutes, practiced the skill for 15 minutes, observed peer models via videotape for 15
minutes, and watched a videotaped review for 10 minutes. Treatment level a2, a film and
discussion condition, was conducted concurrently with the training condition and for the
same amount of time. Films related to generating alternative solutions were shown followed
by group discussions. The students in the control condition, treatment level a3, did not
receive any form of training. At the conclusion of the experiment, five problem situations
were presented and the students were instructed to write down as many solutions to each
one as they could. The dependent variable was the number of solutions proposed, summed
across the five problems. The following data were obtained. (Experiment suggested by
Poitras-Martin, D., & Steve, G. L. Psychological education: A skills-oriented approach.
Journal of Counseling Psychology.)
a1 a2 a3
11 11 7
12 14 18
19 10 16
13 9 11
17 12 9
15 13 10
17 10 13
14 8 14
13 14 12
16 11 12
*a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
*b.[4.3] Test the null hypothesis μ1 = μ2 = μ3; let α = .05. Construct an ANOVA table and
make a decision about the null hypothesis.
*c.[4.4] Compute and interpret and for these data.
*d.[4.5] Calculate the power of the test in part (b).
*e.[4.5] Use the results of part (b) as a pilot study and determine the number of subjects
required to achieve a power of approximately .80.
*f.[4.5] Determine the number of subjects required to achieve a power of .80, where the
largest difference among means is 1.10σε.
*g.[4.5] Determine the number of subjects required to detect a medium association with
power equal to .80.
h.Prepare a “results and discussion section” appropriate for the Journal of Counseling
Psychology.
a1 a2
10 15
6 8
12 10
9 7
8 5
17 4
15 9
11 11
14 9
11 12
*a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
*b.[4.3] Use ANOVA to test the hypothesis μ1 = μ2; let α = .05. Construct an ANOVA table
and make a decision about the null hypothesis.
*c.[4.4] Compute and interpret and for these data.
*d.[4.5] Calculate the power of the test in part (b).
*e.[4.5] Use the results of part (b) as a pilot study and determine the number of subjects
required to achieve a power of approximately .80.
*f.[4.5] Determine the number of subjects required to detect a large association; let 1 – β =
.80.
*g.[4.5] Determine the number of subjects required to achieve a power of .80, where the
largest difference among means is 1.15σε.
h.Prepare a “results and discussion section” appropriate for the Journal of Experimental
Psychology: Human Learning and Memory.
information group, treatment level a2, read a booklet that covered the same information but
did not contain the self-testing feature. Subjects in the passive information group, treatment
level a3, read a booklet about the historical development of hypnosis but with no
information about how to experience hypnosis. Subjects in the control group, treatment
level a4, were given several magazines and told to browse through them in a relaxed
manner. Following this phase of the experiment, subjects took the Stanford Hypnotic
Susceptibility Scale, Form C. The dependent variable was the subject's score on this scale.
The following data were obtained. (Experiment suggested by Diamond, Michael Jay,
Steadman, Clarence, Harada, D., & Rosenthal, J. The use of direct instructions to modify
hypnotic performance: The effects of programmed learning procedures. Journal of
Abnormal Psychology.)
a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
b.[4.3] Test the hypothesis μ1 = μ2 = μ3 = μ4; let α = .05. Construct an ANOVA table and
make a decision about the null hypothesis.
c. [4.4] Compute and interpret
and for these data.
d.[4.5] Calculate the power of the test in part (b).
e.[4.5] Use the results of part (b) as a pilot study and determine the number of subjects
required to achieve a power of approximately .80.
f. [4.5] Determine the number of subjects required to detect a medium association; let 1 –
β = .80.
g.[4.5] Determine the number of subjects required to achieve a power of .80, where the
largest difference among means is 0.95σε.
h.Prepare a “results and discussion section” for the Journal of Abnormal Psychology.
5.An experiment was designed to evaluate the effects of different levels of training on
children's ability to acquire the concept of an equilateral triangle. Fifty 3-year-old children
were recruited from daycare facilities and randomly assigned to one of five groups, with 10
children in each group. Each group contained an equal number of boys and girls. Children
in treatment level a1 (visual condition) were shown 36 blocks, one at a time, and instructed
to look at them but not to touch them. Children in treatment level a2 (visual plus motor
condition) looked at the blocks and were permitted to play with them. They also were asked
to perform specific tactile-kinesthetic exercises, such as tracing the perimeter of the blocks
with their index finger. Children in treatment level a3 (visual plus verbal condition) looked at
the blocks and were told to notice differences in their shape, color, size, and thickness.
Children in treatment level a4 (visual plus motor plus verbal condition) used a combination
of visual, motor, and verbal means of stimulus predifferentiation. Children in treatment level
a5 (control condition) engaged in unrelated play activity. All training was done individually.
The day after training, the children were shown a “target” block for 5 seconds and then
asked to identify the block in a group of seven blocks. This task was repeated six times
using different target blocks. The dependent variable was the number of target blocks
correctly identified. The following data were obtained. (Experiment suggested by Nelson, G.
K. Concomitant effects of visual, motor, and verbal experiences in young children's concept
development. Journal of Educational Psychology.)
a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
b.[4.3] Test the null hypothesis μ1 = μ2 = … = μ5; let α = .05. Construct an ANOVA table
and make a decision about the null hypothesis.
c. [4.4] Compute and interpret and for these data.
d.[4.5] Calculate the power of the test in part (b).
e.[4.5] Use the results of part (b) as a pilot study and determine the number of subjects
required to achieve a power of approximately .80.
f. [4.5] Use Appendix Table E.12 to determine the number of subjects required to detect a
medium association; let 1 – β = .80.
g.[4.5] Determine the number of subjects required to achieve a power of .80, where the
largest difference among means is 0.95σε.
h.Prepare a “results and discussion section” for the Journal of Educational Psychology.
*8.[4.4] For the following designs, estimate the number of subjects required to achieve a
power of .80, where the largest difference among means is equal to .
9.Section 4.2 described an experiment concerning the effects of sleep deprivation on hand-
steadiness. Assume that a second sleep deprivation experiment was performed in which the
dependent variable was simple reaction time to the onset of a light. The following data (in
hundredths of a second) were obtained.
*a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
*b.[4.3] Test the null hypothesis μ1 = μ2 = μ3 = μ4; let α = .05. Construct an ANOVA table
10.
The effects of viewing “mug shots” on accuracy of eyewitness identification were
investigated. Twenty-four subjects observed a videotape of six men who they were later
asked to identify in a recognition test. The subjects were randomly assigned to one of four
groups. Subjects in group a4 searched through a sequence of 75 mug shots to identify the
suspects, those in group a3 searched through 50 mug shots, and those in group a2
searched through 25 mug shots. Subjects in a1 spent an equivalent amount of time looking
for articles about crime in Time magazine. Following this, the subjects were shown pictures
that included the suspects and asked to identify them. The dependent variable is the
number of suspects identified. The following data were obtained.
a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
b.[4.3] Test the null hypothesis μ1 = μ2 = … = μ4; let α = .05. Construct an ANOVA table
and make a decision about the null hypothesis.
c. [4.4] Compute and interpret and for these data.
*11.
[4.6] How do model I and model II differ for a CR-p design?
1The omnibus null hypothesis states that all of the population means are equal.
2I am indebted to Barbara Mobley Foster, who developed the sample-size tables from which
Table E.13 was taken.
http://dx.doi.org/10.4135/9781483384733.n4
The most common use of analysis of variance is in testing the hypothesis that p ≥ 3 population
means are equal. If the omnibus hypothesis of equality of means is rejected, a researcher is
still faced with the problem of deciding which of the means are not equal. Thus, an omnibus F
test is merely one step in analyzing a set of data. A significant F test indicates that something
has happened in an experiment that has a small probability of happening by chance. In this
chapter, I describe a variety of procedures for pinpointing what has happened. Specifically, I
examine a number of test statistics for deciding which population means are not equal. But
first, I need to introduce some important concepts.
A contrast or comparison among means is a difference among the means, with appropriate
algebraic signs. I use the symbols ψi and ψ̂i to denote, respectively, the ith contrast among
population means and a sample estimate of the ith contrast. For example, ψi = μj − μj′ is a
pairwise contrast between two population means.
Some contrasts involve the average of two means versus a third mean. Such
contrasts could be used, for example, to compare the average of two experimental groups with
a control group.
More formally, a contrast or comparison among means is a linear combination of means that
have known weights or coefficients. The coefficients are denoted by cj and satisfy two
conditions: (1) at least one coefficient is not equal to zero (cj ≠ 0 for some j), and (2) the
coefficients sum to zero (Σcj = 0). The linear combinations ψi = c1μ1 + c2μ2 + … + cpμp and
ψ̂i = c1Ȳ1 + c2Ȳ2 + … + cpȲp
are, respectively, population and sample contrasts if cj ≠ 0 for some j and Σcj = 0. The
contrasts in equations (5.1-1) can be expressed as linear combinations of sample means by the
appropriate choice of coefficients:
Notice that for each contrast, cj ≠ 0 for some j and Σcj = 0. For convenience in comparing
the magnitudes of different contrasts, the coefficients of each contrast can be chosen so that
the sum of their absolute values is equal to 2; that is, Σ|cj| = 2,
where |cj| indicates the absolute value of cj and is equal to the positive member of cj and −cj.
All six of the preceding contrasts satisfy Σ|cj| = 2. For example, the coefficients 1, −1, 0 give
|1| + |−1| + |0| = 2, and the coefficients 1/2, 1/2, −1 give |1/2| + |1/2| + |−1| = 2.
When all of the coefficients of a contrast except two are equal to zero, the contrast is called a
pairwise comparison; otherwise, the contrast is a nonpairwise comparison. The number of
pairwise comparisons that exist for p means is equal to p(p − 1)/2. For example, the three
pairwise contrasts in equations (5.1-2) exhaust the 3(3 − 1)/2 = 3 pairwise comparisons among
three means. The situation is quite different for nonpairwise comparisons—the number is
infinite. Consider the following examples in which an average of two means is compared with a
third mean:
The coefficients 2/3 and 1/3 in contrast 7, for example, indicate that one mean is weighted twice
as much as the other when the two means are averaged. The pattern of coefficients determines
the question that the contrast addresses.
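The count p(p − 1)/2 of pairwise comparisons is easy to verify by enumeration (a small illustration, not from the text):

```python
from itertools import combinations

def pairwise_count(p):
    """Number of pairwise comparisons among p means."""
    return p * (p - 1) // 2

# Enumerate the pairs for p = 3, ..., 6 and check them against the formula
for p in range(3, 7):
    pairs = list(combinations(range(1, p + 1), 2))
    assert len(pairs) == pairwise_count(p)
    print(p, pairwise_count(p))  # 3 3, 4 6, 5 10, 6 15
```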
Orthogonal Contrasts
An infinite number of contrasts can be constructed for p ≥ 3 means. Each of these contrasts
can be expressed as a linear combination of p − 1 contrasts. For example, the contrast
Thus, contrast 2 provides no information that cannot be obtained from contrasts 1 and 4.
Contrasts 2 and 3 are redundant because they can be expressed as linear combinations of
contrasts 1 and 4.
Contrasts that provide nonredundant information are called orthogonal contrasts. There is a
simple rule for determining whether two contrasts are orthogonal. Let ψ̂i and ψ̂i′ denote the ith
and i′th contrasts and cij and ci′j their respective coefficients, where j = 1, …, p. The two
contrasts are orthogonal if the sum of the products of their corresponding coefficients is zero,
Σcijci′j = 0, when the njs are equal.
Consider the contrasts with coefficients 1, −1, 0 and 1/2, 1/2, −1, and assume that the njs are
equal. These two contrasts are orthogonal because the sum of the products of their coefficients
is zero: (1)(1/2) + (−1)(1/2) + (0)(−1) = 0.
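With equal njs, the orthogonality rule reduces to a zero dot product of the coefficient vectors; a minimal check (an illustration, not from the text):

```python
def are_orthogonal(c1, c2, tol=1e-12):
    """True if two contrasts with equal njs are orthogonal, i.e., the sum
    of the products of corresponding coefficients is zero."""
    return abs(sum(a * b for a, b in zip(c1, c2))) < tol

print(are_orthogonal((1, -1, 0), (0.5, 0.5, -1)))  # True: orthogonal pair
print(are_orthogonal((1, -1, 0), (1, 0, -1)))      # False: sum of products is 1
```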
However, not all pairs of contrasts are orthogonal; for many pairs, the sum of the products of
the coefficients is not zero. The two contrasts in the preceding example are one of the infinite
number of sets of orthogonal contrasts among three means. Three other sets of orthogonal
contrasts can be constructed by choosing different coefficients that satisfy the orthogonality
condition.
I have now identified four of the infinite number of sets of orthogonal contrasts among three
means. Consider the first set again. The reader may wonder whether it is possible to find
another contrast that is orthogonal to both contrasts in the set. The answer is no. The maximum
number of
orthogonal contrasts in any set is equal to p − 1. For my example, that number is 3 − 1 = 2. To
summarize, for p ≥ 3 means, there are an infinite number of sets of orthogonal contrasts, but
each set contains only p − 1 orthogonal contrasts.
It can be shown that any orthogonal set of p − 1 contrasts provides a basis for constructing all
other contrasts that involve p means; that is, all contrasts can be expressed as linear
combinations of the contrasts in an orthogonal set. For example, the two orthogonal contrasts
identified earlier can be used to construct the redundant contrasts by forming appropriate
linear combinations.
As I have shown, there are always p − 1 nonredundant questions that can be answered from
the data in an experiment. However, a researcher may not be interested in all of the p − 1
questions. For example, in an experiment with three means, a researcher may want to test the
hypothesis for only one of the p − 1 contrasts in an orthogonal set.
The analysis of variance (ANOVA) provides a test of the omnibus null hypothesis that μ1 = μ2 =
… = μp. This test is equivalent to a simultaneous test of the hypothesis that all possible
contrasts among the p means are equal to zero. It is no accident that the between-groups
degrees of freedom in a completely randomized ANOVA design is equal to p − 1, which is also
the number of orthogonal contrasts that can be constructed from p means.
In planning an experiment, a researcher usually has in mind a specific set of hypotheses that
the experiment is designed to test. Tests that involve these hypotheses are called a priori or
planned tests. This situation can be contrasted with another in which the researcher believes
that the treatment affects the dependent variable, and the experiment is designed to accept or
reject this notion. If the F test of the omnibus null hypothesis is significant, the researcher
knows that at least one contrast among the population means is not equal to zero. Interest then
turns to determining which contrast or contrasts among the population means is not equal to
zero. Tests that are used for data snooping—that is, for identifying population contrasts that
are not equal to zero following a significant omnibus test—are called a posteriori, unplanned,
or post hoc tests.
Frequently, an experiment involves both a priori and a posteriori tests. After all of the a priori
tests have been performed, the researcher may want to test hypotheses suggested by an
inspection of the data. The collection of data is often time-consuming and costly. Hence, it is
important to extract all information contained in the data. This objective can be accomplished
by the judicious use of both a priori and a posteriori tests.
Exploratory versus confirmatory data analysis. A posteriori tests are often used in
exploratory data analysis; a priori tests are usually used in confirmatory data analysis. Although
both approaches to data analysis have long been used in research, the terms assumed more
specialized meanings in the 1970s as a result of John Tukey's work on exploratory techniques
and Karl Jöreskog's work on confirmatory techniques. Exploratory data analysis is concerned
with identifying patterns and features of data and revealing these features. Exploratory
techniques are typically used in the preliminary stages of a research program when the
researcher does not have sufficient information to make precise predictions or formulate
testable models. An important characteristic of the exploratory approach is flexibility in probing
the data and responding to patterns that are uncovered in successive stages of the analysis.
Confirmatory data analysis is used after the researcher has accumulated enough information
to make predictions or formulate models. The confirmatory approach stresses the evaluation of
evidence as compared with the exploratory approach, which stresses the flexible search for
evidence. Data that have been collected for a confirmatory analysis should always be subjected
to an exploratory analysis. As mentioned earlier, it is important to extract all information
contained in the data.
Thus far, I have discussed two issues that are particularly important in selecting a multiple
comparison procedure: (1) Are the contrasts orthogonal or nonorthogonal? and (2) Are the
contrasts a priori, a posteriori, or a combination of the two? In the following section, I discuss
another important factor: the conceptual unit for a Type I error.
When an experiment involves one contrast, the probability of making a Type I error corresponds
to the significance level that is assigned to the contrast. This value, denoted here by α′, is
usually either .05 or .01. When the experiment involves two or more contrasts, the situation is
more complex. If each of C independent contrasts is tested at the α′ level of significance, the
probability of making one or more Type I errors is given by equation (5.1-3), 1 – (1 – α′)C,
which is approximately equal to C × α′ for small values of α′. The rationale underlying equation
(5.1-3) is as follows. If a contrast is tested at the α′ level of significance, the probability of not
making a Type I error for that contrast is 1 – α′. If C independent contrasts are each tested at
the α′ level of significance, the probability of not making a Type I error for the first, second, …,
and Cth contrast is, according to the multiplication rule for independent events, the product of
the respective probabilities:

(1 − α′)(1 − α′) ⋯ (1 − α′) = (1 − α′)^C

The expression (1 − α′)^C is the probability of not making a Type I error for C independent
contrasts. The probability of making one or more Type I errors is

Probability of one or more Type I errors = 1 − (probability of not making a Type I error for C
independent contrasts) = 1 − (1 − α′)^C
As the number of independent tests increases, so does the probability of obtaining spuriously
significant results. For example, if α′ = .05 and a researcher tests, say, 3, 5, or 10 independent
contrasts, the probability of one or more Type I errors is, respectively,

1 − (1 − .05)^3 = .14,  1 − (1 − .05)^5 = .23,  and  1 − (1 − .05)^10 = .40

For nonindependent tests, the probability of making one or more Type I errors cannot be
computed exactly, but it is at most C × α′ (the Bonferroni upper bound).
It is apparent that if enough contrasts are tested, each at the α′ level of significance, a
researcher will probably reject one or more null hypotheses even though they are all true. An
alternative research strategy is to control the Type I error at α for the collection or family of
contrasts that are tested. I discuss this strategy next.
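The inflation of the Type I error probability described above is easy to verify numerically; this short sketch assumes only the formula 1 − (1 − α′)^C for C independent tests:

```python
# Probability of one or more Type I errors among C independent contrasts,
# each tested at the alpha' = .05 level of significance.
alpha = 0.05
for C in (1, 3, 5, 10):
    p_any_error = 1 - (1 - alpha) ** C
    print(f"C = {C:2d}: {p_any_error:.3f}")
```

With α′ = .05, the probability climbs from .05 for one contrast to about .40 for ten independent contrasts.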
A family of contrasts consists of those contrasts that are related in terms of their content and
intended use. For example, contrasts that involve a control group and two experimental groups
are a family. John Tukey (1953) described three kinds of Type I error rates for such contrasts:
per-contrast error rate, familywise error rate, and per-family error rate. These error rates are
denoted by, respectively, αPC, αFW, and αPF. Suppose that many experiments, each involving
a family of contrasts, are performed and we are able to count the number of erroneous
conclusions. The three Type I error rates can be defined as follows:
The per-contrast error rate is the probability that any one of the contrasts will be incorrectly
declared significant. Testing each contrast at the α′ level of significance allows the error rate for
the family of contrasts to increase as the number of tests increases. An alternative research
strategy is to adopt the family of contrasts as the conceptual unit for making a Type I error. If
this strategy is adopted, a researcher can choose to control the familywise error rate or the per-
family error rate. The familywise error rate is the probability of making one or more erroneous
statements per family. The per-family error rate is the long-run average number of erroneous
statements made per family. This error rate is not a probability but rather the expected number
of errors per family of contrasts.
An example will help clarify the definitions of the three error rates. Suppose that 1000
replications of an experiment are performed and that for each experiment, 10 contrasts are
tested—10,000 tests in all. Also suppose that of the 10,000 tests, 90 tests are incorrectly
declared significant, and these 90 incorrect decisions are distributed among 70 of the
experiments. The three error rates are as follows:

per-contrast error rate = 90/10,000 = .009
familywise error rate = 70/1000 = .07
per-family error rate = 90/1000 = .09
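The three rates can be computed directly from the tally in this hypothetical example:

```python
# Hypothetical tally from the text: 1000 replications of an experiment,
# 10 contrasts tested in each, 90 of the 10,000 tests incorrectly declared
# significant, with the 90 errors spread over 70 of the experiments.
n_experiments = 1000
n_contrasts_each = 10
n_false_rejections = 90
n_experiments_with_errors = 70

per_contrast = n_false_rejections / (n_experiments * n_contrasts_each)
familywise = n_experiments_with_errors / n_experiments
per_family = n_false_rejections / n_experiments
print(per_contrast, familywise, per_family)  # 0.009 0.07 0.09
```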
The three Type I error rates become more and more divergent as the number of contrasts in an
experiment increases; the three error rates are the same when the experiment involves only one
contrast. For C ≥ 2 independent tests, the relationship among the three error rates is

αPC ≤ αFW ≤ αPF

If a researcher tests five mutually independent contrasts, each at, say, α′ = .05, the error rates
are

αPC = .05,  αFW = 1 − (1 − .05)^5 = .23,  and  αPF = 5(.05) = .25
For small values of α′, the per-family and familywise error rates are numerically almost identical.
For example, if a researcher tests five mutually independent contrasts, each at α′ = .01, the
per-family error rate is .05:

αPF = C × α′ = 5(.01) = .05
If a researcher controls the per-family error rate for C ≥ 2 tests at αPF, the familywise error rate
cannot exceed αPF.
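For mutually independent contrasts, the ordering of the three error rates can be checked numerically; this sketch assumes the standard formulas αFW = 1 − (1 − α′)^C and αPF = Cα′:

```python
def error_rates(alpha, C):
    """Per-contrast, familywise, and per-family error rates for C
    mutually independent contrasts, each tested at level alpha."""
    a_pc = alpha                     # per-contrast rate
    a_fw = 1 - (1 - alpha) ** C      # familywise rate
    a_pf = C * alpha                 # per-family rate (an expected count)
    return a_pc, a_fw, a_pf

for a in (0.05, 0.01):
    a_pc, a_fw, a_pf = error_rates(a, 5)
    print(f"alpha' = {a}: PC = {a_pc}, FW = {a_fw:.3f}, PF = {a_pf:.2f}")
```

At α′ = .01 the familywise (.049) and per-family (.05) rates are nearly identical, as the text observes.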
When a completely randomized analysis of variance design is used to test the omnibus null
hypothesis

H0: μ1 = μ2 = ⋯ = μp,

the p treatment levels are the conceptual unit for a Type I error. If a test of the omnibus null
hypothesis is significant, interest usually shifts to determining which contrasts among the
treatment means are significant. It is customary to assign the same error rate to the family of
contrasts as was assigned to the omnibus null hypothesis. This principle generalizes to
multitreatment ANOVA designs. A factorial design with two treatments involves three tests:
treatment A, treatment B, and the A × B interaction. If, say, a test of treatment A is significant at
the α = .05 level of significance, it is customary to assign the same α = .05 error rate to the
family of contrasts associated with treatment A.
For multitreatment ANOVA designs, another conceptual unit for a Type I error can be identified:
the experiment. If the familywise Type I error is .05 for treatment A, .05 for treatment B, and .05
for the A × B interaction, the experimentwise error rate, αEW, is equal to

αEW = 1 − (1 − .05)^3 = .14
The merits of making the contrast or some larger unit, such as the family or experiment, the
conceptual unit for the error rate were extensively debated in the early 1960s. If an experiment
involves only one contrast, there is no debate; the error rates for the contrast, family, and
experiment are the same. The question only arises when the family or experiment involves two
or more contrasts. As you will see, the answer to the question, “What is the correct conceptual
unit for a Type I error rate?” depends on the nature of the contrasts of interest.
If orthogonal contrasts have been planned in advance, contemporary practice favors adopting
the contrast as the conceptual unit for a Type I error. Earlier you saw that testing a priori,
orthogonal contrasts is equivalent to partitioning the data so that each test involves
nonredundant pieces of information. Such contrasts are chosen in advance because they
address particular research questions of interest. Furthermore, the number of such research
questions cannot exceed the number of nonredundant questions, p − 1, that can be answered
from a set of data. By comparison, nonorthogonal contrasts involve redundant information; the
outcome of one test is not independent of those for other tests. Here contemporary practice
favors adopting a larger unit such as the family of contrasts as the conceptual unit for a Type I
error.
A strong case can be made for these practices. Consider the following two a priori orthogonal
contrasts: ψ1 = μ1 – μ2 and ψ2 = (μ1 + μ2)/2 – μ3. A researcher could choose to conduct a
single experiment to test hypotheses about the two contrasts. Alternatively, a researcher could
choose to conduct two separate experiments: The first experiment could test the hypothesis μ1
– μ2 = 0 and the second the hypothesis (μ1 + μ2)/2 – μ3 = 0. The outcome of the first
experiment would provide no information about the probable outcome of the second
experiment. This research situation can be contrasted with a second situation in which a
researcher is interested in the three pairwise, nonorthogonal contrasts among three means: ψ1
= μ1 – μ2, ψ2 = μ1 – μ3, and ψ3 = μ2 – μ3. Again, the researcher could conduct a single
experiment or separate experiments. If the researcher chose to conduct separate experiments
to test the three null hypotheses, the reader might anticipate that it would be necessary to
conduct three separate experiments. Actually, only two separate experiments are necessary
because the outcome of testing H0: μ1 – μ2 = 0 and H0: μ1 – μ3 = 0 could be used to predict
the outcome of the third experiment. This follows because, as I showed earlier, ψ3 = ψ2 − ψ1;
that is, μ2 − μ3 = (μ1 − μ3) − (μ1 − μ2).
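The redundancy among the three pairwise contrasts can be verified with arbitrary hypothetical means, since (μ1 − μ3) − (μ1 − μ2) = μ2 − μ3 identically:

```python
# Hypothetical means; any values demonstrate the identity.
mu1, mu2, mu3 = 10.0, 7.0, 4.0
psi1 = mu1 - mu2   # first pairwise contrast
psi2 = mu1 - mu3   # second pairwise contrast
psi3 = mu2 - mu3   # third pairwise contrast
# The third contrast carries no new information:
assert psi3 == psi2 - psi1
```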
Contemporary practice treats experiments involving orthogonal contrasts differently from those
involving nonorthogonal contrasts. In the first example involving a priori orthogonal contrasts, it
is customary to control the per-contrast Type I error rate. In the second example involving
nonorthogonal contrasts, contemporary practice favors controlling the familywise or per-family
Type I error rate.
The practice of treating orthogonal contrasts differently from nonorthogonal contrasts extends
to the analysis of variance. Consider a two-treatment factorial ANOVA design with equal sample
sizes in which the researcher has advanced a priori hypotheses about treatments A and B and
the A × B interaction. The two treatments and the interaction represent three orthogonal
families of contrasts. The usual practice in analysis of variance is to control the familywise Type
I error rather than the experimentwise error.
In choosing a multiple comparison procedure, it is important to consider the nature of the null
hypothesis that is to be tested. A null hypothesis can be complete, which means that all
population means are equal, or partial, which means that only a subset of the means is equal.
Hayter (1986) recommended that if a researcher wants to control, say, the familywise error rate,
a multiple comparison procedure should be chosen that controls the maximum familywise error
rate attainable under any complete or partial null hypothesis. Not all multiple comparison
procedures meet this requirement. One example is the LSD (least significant difference)
multiple comparison procedure proposed by Fisher (1935a), which consists of two steps. In the
first step, the omnibus null hypothesis is tested with an analysis of variance F test with αFW =
α′, where α′ is equal to, say, .05. If the F test is not significant, the omnibus null hypothesis
is not rejected, and no more tests are performed. If the omnibus null hypothesis is rejected,
Student's t statistic is used to test each pairwise contrast with αPC = α′, where α′ = .05.
Fisher's procedure controls the familywise Type I error rate when the complete null hypothesis
is true. However, if the experiment has more than three treatment levels and the complete null
hypothesis is rejected, the familywise error rate exceeds α′ (Hayter, 1986). Therefore, Fisher's
procedure is not recommended when an experiment has more than three treatment levels
because it fails to control the maximum familywise error rate attainable under any complete or
partial null hypothesis at a preselected level of significance.
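Fisher's two-step LSD procedure can be sketched as follows. The data are hypothetical, and scipy's two-sample t test is used in place of the textbook's pooled-MSerror t statistic, so this is an illustrative simplification rather than the exact procedure:

```python
from itertools import combinations
from scipy import stats

# Hypothetical data: three treatment groups.
groups = [
    [22, 25, 19, 24, 23],
    [28, 31, 27, 30, 29],
    [21, 20, 23, 22, 24],
]
alpha = 0.05

# Step 1: omnibus ANOVA F test at alpha_FW = alpha'.
F, p_omnibus = stats.f_oneway(*groups)

# Step 2: only if the omnibus hypothesis is rejected, test each pairwise
# contrast with a t statistic at alpha_PC = alpha'.
if p_omnibus < alpha:
    for (i, gi), (j, gj) in combinations(enumerate(groups, 1), 2):
        t, p = stats.ttest_ind(gi, gj)
        print(f"group {i} vs group {j}: t = {t:.2f}, p = {p:.4f}")
```

As the text notes, with more than three treatment levels this procedure no longer controls the maximum familywise error rate.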
Earlier, I discussed the merits of making the contrast or some larger unit, such as the family or
experiment, the conceptual unit for the error rate. A similar issue arises in connection with
power. The power of a multiple comparison procedure is the probability of rejecting a false null
hypothesis. Other things equal, a researcher wants to use a procedure that both controls the
Type I error rate at an acceptable level and provides maximum power. That is easier said than
done because there are a number of ways of defining power. One conception of power is
overall power—the probability of rejecting a false complete null hypothesis. This is the power
associated with the F test in analysis of variance. Another conception of power, introduced by
Einot and Gabriel (1975), is P-subset power. P-subset power focuses on detecting the
heterogeneity of means from a subset of a particular size—say, two means or per-pair power,
three means or per-triplet power, and so on. Per-pair power, for example, is often expressed as
the average probability of detecting true differences among all pairs of means.
In 1978, Ramsey introduced two more conceptions of power: any-pair power and all-pairs
power. Any-pair power is the probability of detecting at least one true difference among all
pairs of means. All-pairs power is the probability of detecting all true differences among all
pairs of means. There is some debate as to which of the four conceptions of power is more
appropriate. Consequently, when researchers investigate the relative power of multiple
comparison procedures, it is customary to report data for each kind of power. The different
conceptions of power yield different power numbers because any-pair power focuses on the
largest mean difference, all-pairs power focuses on the smallest mean difference, and per-pair
power is an average that is appropriate for only those two-mean differences that are equal to
the average. As would be expected, the any-pair power of multiple comparison procedures is
higher than the all-pairs power; the per-pair power falls between that for any-pair power and all-
pairs power.
Most multiple comparison procedures use one of the following test statistics:
where p is the number of means and s is the number of means in a subset of the means. The
labels t, q, and F are a convenient way to identify the statistics and their sampling distributions:
Student's t distribution, the Studentized range distribution, and the F distribution, respectively.
Critical values for these distributions are given in Appendix E. The numerator of the t and q
statistics is always a contrast, which is a kind of range; the denominator is either the standard
error of a contrast or the standard error of a mean. The numerator of an F statistic is
computed from all of the means included in a set of means. Both the numerator and the
denominator of an F statistic are variances.
For pairwise contrasts with equal sample sizes, the three statistics are related as follows:

q = t√2  and  F = t² = q²/2
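Assuming the standard relations q = t√2 and F = t² for a pairwise contrast with equal n, a quick numerical check with hypothetical means and an illustrative error mean square:

```python
import math

# Hypothetical pairwise comparison: equal group sizes n, pooled error
# mean square ms_error (values are illustrative).
mean1, mean2 = 48.0, 42.7
ms_error, n = 29.0322, 9

t = (mean1 - mean2) / math.sqrt(ms_error * (1/n + 1/n))  # Student's t
q = (mean1 - mean2) / math.sqrt(ms_error / n)            # Studentized range
F = t ** 2                                               # ANOVA-type F

assert abs(q - t * math.sqrt(2)) < 1e-12
assert abs(F - t * t) < 1e-12
```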
In general, the F statistic tends to be more powerful than the q statistic, but as you will see, it
requires much more computation. Differences among the t, q, and F statistics are examined in
more detail in the following section. Computational examples for the three statistics are given in
Sections 5.2 to 5.6.
In a 1990 literature survey, I identified more than 30 multiple comparison procedures used by
researchers (Kirk, 1990). And the list continues to grow. It is convenient to classify multiple
comparison procedures as either single-step or multiple-step procedures. A single-step
procedure uses one critical value to test hypotheses about contrasts.2 Any test statistic that
exceeds or equals the critical value is declared significant, and the associated null hypothesis
is rejected. A variety of single-step procedures are described in Sections 5.2 to 5.7. A multiple-
step procedure uses two or more critical values to test hypotheses. There are three types of
multiple-step procedures: two-step, step-down, and step-up procedures. Fisher's two-step
multiple comparison procedure was described earlier in this section. The general features of
step-down and step-up procedures are described next.
If a q statistic is used, the step-down procedure begins by testing the contrast involving the
smallest and largest means—that is, a contrast in which the means are separated by r = p
steps, in this example, five steps. If the null hypothesis μ1 – μ5 = 0 is rejected, then
hypotheses μ1 – μ4 = 0 and μ2 – μ5 = 0 are tested. These hypotheses involve means
separated by r = 4 steps. If these hypotheses are rejected, all hypotheses involving means
separated by r = 3 steps are tested and finally all means separated by r = 2 steps. The critical
value that a q statistic must exceed is a function of the number of steps that separate the
means. The critical value is largest for contrasts whose means are separated by five steps and
smallest for contrasts whose means are separated by two steps. If the null hypothesis for a
contrast is not rejected, by implication the null hypotheses for all contrasts encompassed by
the nonrejected contrast are not rejected. This testing strategy ensures coherence. For
example, if a test of the hypothesis μ1 – μ3 = 0 is not rejected, then tests of μ1 – μ2 = 0 and μ2
– μ3 = 0 are not rejected by implication.
When an F statistic is used with a step-down procedure, the first test is the same as an ANOVA
F test of the omnibus null hypothesis—that is, a test of μ1 = μ2 = μ3 = μ4 = μ5. If this
hypothesis is rejected, the F statistic is used to test the homogeneity of all subsets of s = p − 1
= 4 means—that is, a test of (1) μ1 = μ2 = μ3 = μ4; (2) μ1 = μ2 = μ3 = μ5; (3) μ1 = μ2 = μ4 =
μ5; (4) μ1 = μ3 = μ4 = μ5; and (5) μ2 = μ3 = μ4 = μ5. Next, the homogeneity of all subsets of
three means is tested, excluding those subsets declared homogeneous by implication. Finally,
the homogeneity of all subsets of two means is tested, again excluding those subsets declared
homogeneous by implication.
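The bookkeeping for the step-down F procedure with p = 5 means can be sketched by enumerating the subsets to be tested at each stage (ignoring, for simplicity, the subsets that would be skipped as homogeneous by implication):

```python
from itertools import combinations

p = 5
means = range(1, p + 1)  # labels for the five means

# Stage r tests the homogeneity of every subset of r means,
# from r = p down to r = 2.
for size in range(p, 1, -1):
    subsets = list(combinations(means, size))
    print(f"subsets of {size} means: {len(subsets)}")
```

This yields 1, 5, 10, and 10 subsets for sizes 5 through 2, which is why the F version requires many more tests than the q version.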
It should be evident that more tests are required when an F statistic is used than when a q
statistic is used. For example, to reject the hypothesis μ1 – μ2 = 0, the q procedure needs to
test only the contrasts that span the two means, whereas the F procedure must reject the
homogeneity of every subset of means that contains μ1 and μ2.
Step-up procedure. A third type of multiple-step procedure is the step-up procedure. Once
the p means have been ordered from smallest to largest, hypotheses involving adjacent means
are tested. If a null hypothesis for one of these contrasts is rejected, then by implication all null
hypotheses that contain the rejected contrast also are rejected. For example, suppose that the
null hypothesis μ3 – μ4 = 0 is rejected but μ1 – μ2 = 0, μ2 – μ3 = 0, and μ4 – μ5 = 0 are not
rejected. The explicit rejection of μ3 – μ4 = 0 results in the implicit rejection of μ1 – μ5 = 0, μ1 –
μ4 = 0, μ2 – μ5 = 0, μ2 – μ4 = 0, and μ3 – μ5 = 0. Because the hypotheses μ1 – μ2 = 0 and μ2
– μ3 = 0, for example, involving adjacent means are not rejected, it is necessary to explicitly test
the contrast μ1 – μ3 = 0, which is separated by three steps.
Step-down procedures are widely used in the behavioral sciences, health sciences, and
education. Step-up procedures are used less often. Both kinds of procedures tend to be more
powerful than single-step procedures. However, step-down and step-up procedures suffer from
several shortcomings: (1) In general, they cannot be used to construct confidence intervals; (2)
with a few exceptions, they cannot be used to test directional hypotheses; and (3) they tend to
require more computation than single-step procedures.
From a review of the literature in the behavioral sciences, health sciences, and education, I
have identified five hypothesis-testing situations that occur with some degree of regularity:
testing hypotheses about
Contrasts in the first category are a priori and orthogonal; those in the other four categories are
nonorthogonal. As discussed earlier, for contrasts in the first hypothesis-testing situation, the
usual practice is to adopt the individual contrast as the conceptual unit for a Type I error. For
the other four hypothesis-testing situations, it is customary to adopt the family of contrasts as
the conceptual unit for a Type I error.
Statisticians have developed a variety of test statistics that can be used to control the Type I
error rate in these five situations. Table 5.1-1 summarizes the test statistics that I recommend
for each situation. The procedures in the upper part of the table assume normality of the
population distributions, random sampling or random assignment, and homogeneity of
population variances. Tukey's test, the REGW FQ test, and the REGW Q test also require
equal-sized samples. If the assumption of homogeneity of population variances is not tenable
or the requirement of equal-sized samples is not met, the multiple comparison procedures in
the lower part of Table 5.1-1 can be used. As you will see, the power of the recommended
procedures differs markedly. In general, test statistics that were designed for testing a select,
limited number of a priori contrasts are more powerful than those designed to test all pairwise
comparisons or all possible contrasts. Hence, when possible, it is to a researcher's advantage
to specify in advance either orthogonal contrasts or a limited number of contrasts. The problem
facing a researcher is to choose the test statistic that provides both the desired kind of Type I
error protection and maximum power. The following sections describe the recommended test
statistics for each of the five research situations.
Table 5.1-1 ▪ Multiple Comparison Procedures That Are Recommended for Five Common
Research Situations
Student's t statistic is a single-step procedure that can be used to test null hypotheses of the
form

H0: ψi = 0,  i = 1, …, p − 1,

where the p − 1 contrasts are a priori and mutually orthogonal. It is not necessary to test the
omnibus null hypothesis with an ANOVA F statistic prior to testing the individual contrasts. An
omnibus test answers the general question, “Are there any differences among the population
means?” If a specific set of orthogonal contrasts has been advanced, a researcher is not
interested in this general question. Rather, the researcher is interested in answering a limited
number—p − 1 or fewer—of specific questions. As I discussed in Section 5.1, current practice
favors testing each of the p − 1 contrasts at αPC = α′, that is, controlling the per-contrast error
rate.
where the denominator is the standard error of the ith contrast and MSerror is a pooled estimator of the
population error variance. For data that fit a completely randomized ANOVA design, a within-
groups mean square (MSWG) is used to estimate the population error variance and is given by
with N − p degrees of freedom, where N denotes the total number of observations. If the sample sizes are equal, the formula for MSWG is
with p(n − 1) degrees of freedom. A two-sided null hypothesis is rejected if the absolute value of
t exceeds or equals the critical value, tα/2,v, obtained from Student's t distribution in Appendix
Table E.3, where α represents the per-contrast error rate and v is the degrees of freedom
associated with the denominator of the t statistic. A one-sided null hypothesis is rejected if the
absolute value of t exceeds or equals the critical value, tα, v, and the t statistic is in the
predicted tail of the t sampling distribution.
The use of Student's t statistic to test hypotheses about a priori orthogonal contrasts is
illustrated for an experiment in which 45 subjects have been randomly assigned to five
qualitative treatment levels, with 9 subjects in each level. Suppose that the five treatment
means are
Assume that the treatment populations are approximately normally distributed and the
variances are homogeneous. The design of this experiment corresponds to a completely
randomized ANOVA design; hence, MSWG is the appropriate estimator of the common
population error variance. Assume that the estimate of the population error variance is MSWG =
29.0322 with degrees of freedom equal to p(n − 1) = 5(9 − 1) = 40. The researcher plans to test
the four hypotheses listed in Table 5.2-1. The .05 level of significance is adopted for each test.
The reader can verify that the contrasts in Table 5.2-1 are mutually orthogonal.
The critical value, t.05/2,40, required to reject the null hypotheses is 2.021 according to
Student's t distribution in Appendix Table E.3. Because the absolute value of the t statistic for
contrasts ψ1, ψ2, and ψ4 exceeds the critical value, the associated null hypotheses can be
rejected.
The four t tests use the same error mean square (29.0322) in the denominator. As a result, the
tests of significance are not statistically independent, even though the contrasts are statistically
independent. Research by Norton and Bulgren, as cited by Games (1971), indicates that when
the degrees of freedom for MSerror are moderately large, say 40, multiple t tests can, for all
practical purposes, be regarded as independent.
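The computations behind Table 5.2-1 can be sketched as follows. The five means and the four orthogonal coefficient sets below are hypothetical stand-ins (the book's values are not reproduced here); MSWG = 29.0322, n = 9, and v = 40 follow the running example:

```python
import math
from scipy import stats

means = [36.7, 48.7, 43.4, 47.2, 40.3]   # hypothetical treatment means
coefficient_sets = [                      # mutually orthogonal contrasts
    [1, -1,  0,  0,  0],
    [0,  0,  1, -1,  0],
    [1,  1, -1, -1,  0],
    [1,  1,  1,  1, -4],
]
ms_wg, n, v = 29.0322, 9, 40
t_crit = stats.t.ppf(1 - 0.05 / 2, v)     # two-tailed critical value, 2.021

for c in coefficient_sets:
    psi_hat = sum(ci * m for ci, m in zip(c, means))
    se = math.sqrt(ms_wg * sum(ci ** 2 for ci in c) / n)
    t = psi_hat / se
    print(f"psi_hat = {psi_hat:7.2f}, t = {t:6.2f}, "
          f"reject H0: {abs(t) >= t_crit}")
```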
interval analogs. Next, I describe a confidence interval analog of Student's multiple t statistic.
where
A researcher can be 95% confident that the interval [0.17, 10.43] contains the population
contrast. This confidence interval can be pictured as the darkened portion of a real number
line.
Confidence intervals permit a researcher to reach the same kind of decision as tests of null
hypotheses. Because the interval [0.17, 10.43] does not include zero, a researcher knows that
the null hypothesis H0: μ2 – μ3 = 0 can be rejected. If the confidence interval includes zero, the
null hypothesis cannot be rejected. Confidence interval procedures permit a researcher to
consider the tenability of all possible null hypotheses, not just the hypothesis that a contrast is
equal to zero. For example, the null hypothesis H0: μ2 – μ3 = 12 would be rejected but not H0:
μ2 – μ3 = 8. Also, the size of the confidence interval provides information about the error
variation associated with an estimate and, hence, the strength of the inference. The preference
in the Publication Manual of the American Psychological Association (American Psychological
Association, 2010) for reporting the outcome of confidence intervals rather than hypothesis
tests is understandable. Both procedures involve the same assumptions, but confidence
intervals provide more information about the data.
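The interval [0.17, 10.43] can be reconstructed from the running example's quantities; the contrast estimate ψ̂ = 5.30 is inferred from the interval's midpoint and is therefore an assumption:

```python
import math

psi_hat = 5.30                 # assumed estimate of mu2 - mu3 (see lead-in)
ms_wg, n, t_crit = 29.0322, 9, 2.021

se = math.sqrt(ms_wg * (1 / n + 1 / n))          # standard error, 2.54
lower = psi_hat - t_crit * se
upper = psi_hat + t_crit * se
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")     # -> [0.17, 10.43]
```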
The assumptions associated with using Student's t statistic to test a hypothesis or construct a
confidence interval are (1) the observations are drawn from normally distributed populations; (2)
the observations are random samples from the populations, or the experimental units are
randomly assigned to the treatment levels; and (3) the variances of the populations are equal.
Effects of nonnormality. The effects of sampling from nonnormal populations on the F test in
analysis of variance are discussed in Section 3.5. Research indicates that if the treatment
populations have the same shape—for example, all positively skewed or all leptokurtic—and
the sample sizes are fairly large, then the actual probability of making a Type I error is fairly
close to the nominal or specified probability. Much less is known about the effects of
nonnormality on Student's t statistic and on other multiple comparison procedures. Boneau's
(1960) research on the t statistic and H. J. Keselman and Rogan's (1978) research on a variety
of multiple comparison procedures suggest that the results obtained for ANOVA generalize to
these procedures. In other words, when sample sizes are large, the t statistic and other
multiple comparison procedures appear to be robust with respect to nonnormality. This
conclusion is consistent with that of Ramseyer and Tcheng (1973). However, the research of
Micceri (1989) and Hill and Dixon (1982) on the prevalence of extreme nonnormality in the
behavioral sciences, medical sciences, and education is reason for concern.
values of the contrast coefficients are unequal—for example, and 1—the Type I error rate
is also likely to be affected by heterogeneous population variances. Under these conditions,
one of the robust procedures described next can be used.
If the assumption of the equality of population variances is not tenable, the pooled estimator in
the denominator of the t statistic can be replaced with individual variance estimators. The
resulting statistic, denoted by t′, is

t′ = ψ̂ / √(Σj c²j s²j/nj)

where the sum is taken over the p treatment levels.
The earliest attempts to determine the sampling distribution of t′ were made by Behrens (1929)
and enlarged upon by Fisher (1935a). There is no exact solution for this problem. A number of
approximate solutions have been proposed: (1) Cochran (1964), (2) Satterthwaite (1946), and
(3) Welch (1938, 1947). In general, there is close agreement among these approximate
solutions; accordingly, only the approximations of Cochran and Welch are described.
Cochran's procedure uses the t′ statistic defined in equation (5.2-1). The two-tailed critical value
of t′ is given by

t′α/2 = (wj tα/2,vj + wj′ tα/2,vj′)/(wj + wj′), with weights wj = s²j/nj and wj′ = s²j′/nj′,
where tα/2,vj and tα/2,vj′ are the critical values of Student's t distribution at the α level of
significance for vj = nj − 1 and vj′ = nj′ − 1 degrees of freedom, respectively. The critical value for
Cochran's t′ is always between the ordinary t values for vj and vj′ degrees of freedom. For a
one-tailed test, values of tα,vj and tα,vj′ are used. If nj = nj′, then t′ = t, and the conventional t
value with nj − 1 degrees of freedom can be used. The t′ test is conservative because the
critical value for t′ tends to be slightly too large.
Welch's (1938, 1947, 1949) procedure also uses the t′ statistic defined in equation (5.2-1). An
excellent approximation to the critical value of t′ can be obtained from Student's t distribution
with modified degrees of freedom, v′.
Wang (1971) reported that when the sample ns are greater than five, Welch's approximate
solution controls the Type I error fairly close to α for a wide range of population variances.
Similar results were reported by Scheffé (1970). Kohr and Games (1977) reported that Welch's
procedure provides reasonable protection against Type I errors when the variances are
heterogeneous and the sample sizes or the absolute values of the coefficients of a contrast are
unequal.
In summary, when the assumption of the homogeneity of population variances is not tenable,
the t′ statistic with Welch's modified degrees of freedom is recommended for testing
hypotheses about p − 1 a priori orthogonal contrasts. Welch's modified degrees of freedom also
can be used with other test statistics; I return to this point later.
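A sketch of the t′ statistic with Welch's modified degrees of freedom, using hypothetical summary statistics; the Welch-Satterthwaite formula below is the standard one for a two-mean contrast:

```python
import math
from scipy import stats

# Hypothetical summary statistics for two groups with unequal variances.
m1, var1, n1 = 48.0, 52.0, 9
m2, var2, n2 = 42.0, 12.0, 9

t_prime = (m1 - m2) / math.sqrt(var1 / n1 + var2 / n2)

# Welch's modified degrees of freedom:
v_prime = (var1 / n1 + var2 / n2) ** 2 / (
    (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
)
p_value = 2 * stats.t.sf(abs(t_prime), v_prime)
print(f"t' = {t_prime:.2f}, v' = {v_prime:.1f}, p = {p_value:.4f}")
```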
The purpose of many experiments is to compare each of p − 1 treatment means with a control
group mean. Dunnett (1955) developed a single-step, multiple comparison procedure for this
purpose—that is, for testing p − 1 null hypotheses that have the following form:

H0: μ1 − μj = 0,  j = 2, …, p,

where μ1 denotes the control group mean. More specifically, Dunnett's procedure is applicable
to any set of p − 1 a priori nonorthogonal contrasts for which the p − 1 correlations between the
contrasts are equal to 0.5. A correlation of 0.5 occurs, for example, when each of p − 1
experimental group means is compared with a control group mean and the sample sizes are
equal. To illustrate, consider the contrasts in Table 5.3-1, where Ȳ1 is the control group mean
and the other means are experimental group means. The correlation between any two of the
contrasts can be computed from their coefficients.
Table 5.3-1 ▪ p − 1 Contrasts With a Control Group Mean, [Data are from Section 5.2,
where MSWG = 29.0322, p = 5, n = 9, and v = p(n − 1) = 5(9 − 1) = 40.]
Suppose that each sample n is equal to 9. The correlation between contrasts ψ̂1 and ψ̂2 is
0.5. It can be shown that the correlation between each contrast and the other three contrasts
also is 0.5.
Dunnett's procedure uses Student's t statistic with equal sample ns. The test statistic is
denoted by tDN.
A two-sided null hypothesis is rejected if the absolute value of the tDN statistic exceeds or
equals the critical value obtained from Appendix Table E.7, where α represents the
familywise error rate; p is the number of treatment means, including the control group mean;
and v is the degrees of freedom associated with the denominator of the tDN statistic. A one-
sided null hypothesis is rejected if the absolute value of tDN exceeds or equals the one-tailed
critical value and the tDN statistic is in the predicted tail of the tDN sampling distribution. Dunnett's procedure
controls the probability of falsely rejecting one or more null hypotheses—the familywise error
rate. It is not necessary to test the omnibus null hypothesis using an ANOVA F test prior to
using the tDN statistic. Indeed, such a test would be pointless.
Instead of computing p − 1 test statistics, it is often more convenient to test the p − 1 null
hypotheses by comparing each contrast with a critical difference—a value that the absolute
value of a contrast must equal or exceed to be statistically significant. Earlier, I showed that a
tDN statistic is significant if its absolute value equals or exceeds the tabled critical value. It
follows that the absolute value of any contrast that exceeds or equals the product of the critical
value and the standard error of the contrast is statistically significant; this product is the critical
difference for Dunnett's procedure. The use of a critical difference to test hypotheses is
illustrated for the data in Table
5.3-1, where Ȳ1 is the control group mean. The critical difference that the absolute value of the
contrasts in Table 5.3-1 must exceed or equal for a two-tailed test at the αFW = .05 level of
significance is the product of the tabled critical value and the standard error of a contrast.
Because the absolute values of several of the contrasts in Table 5.3-1 exceed the critical
difference, the associated null hypotheses can be rejected.
Dunnett's procedure also can be used to establish p − 1 simultaneous 100(1 – α)% confidence
intervals involving the control group mean. A confidence interval is given by
where
The following assumptions are associated with using Dunnett's tDN statistic to test a
hypothesis or construct a confidence interval: (1) The observations are drawn from normally
distributed populations; (2) the observations are random samples from the populations, or the
experimental units are randomly assigned to the treatment levels; (3) the p − 1 correlations
between contrasts are equal to 0.5; and (4) the variances of the populations are equal. Dunnett
(1964) has described modifications of his procedure that can be used when the variance of the
control group population is not equal to the variance of the p − 1 treatment groups. Hochberg
and Tamhane (1987, pp. 140–144) provide tables that can be used when the correlation
between two contrasts is not equal to 0.5—a situation that arises when the sample ns are not
equal.
In designing an experiment, a researcher usually has a specific set of C hypotheses that the
experiment is designed to test. Before the research begins, not only is the number of
hypotheses known but also which hypotheses are to be tested. Often the associated contrasts
are not orthogonal as in comparing a control group mean with p − 1 experimental group means
or making selected pairwise comparisons among p means. The procedures described in this
section can be used to test hypotheses for C a priori nonorthogonal contrasts among p means.
A number of multiple comparison procedures have been developed for this purpose. Three are
described here: the popular Dunn test, the slightly more powerful Dunn-Šidák test, and the
less widely used but more powerful Holm test. The three tests are presented in order of
increasing power.
Fisher described two multiple comparison procedures in his classic experimental design text
(Fisher, 1935a, Section 24). One of the procedures bears his name—the Fisher LSD test—and
is described in Section 5.1. The originator of the second procedure is unknown. Because Dunn
(1961) examined the properties of the second procedure and prepared tables that facilitate its
use, the procedure is referred to as Dunn's multiple comparison procedure. Some writers
use the designation Bonferroni procedure because the procedure is based on the Bonferroni
or Boole inequality.
Dunn's procedure controls the long-run average number of erroneous statements made per
family—the per-family error rate. This is accomplished by dividing αPF into C ≥ 2 parts: αPF/C
= α′. If each of the C contrasts is tested at the α′ level of significance, the error rate for the collection of C contrasts is Cα′ = αPF. For example, if a researcher wants to test C = 4
contrasts and wants the per-family error rate to be .05, each contrast can be tested at α′ =.05/4
= .0125 level of significance. By testing each contrast at α′ = .0125, the per-family error rate is
αPF = .0125 + .0125 + .0125 + .0125 = .05.
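The additive splitting just described can be sketched in a few lines of Python. This is an illustrative sketch only; the function name dunn_alphas is not from the text.

```python
# Illustrative sketch of Dunn's (Bonferroni) additive splitting of a
# per-family error rate alpha_PF equally among C a priori contrasts.
# The function name is hypothetical, not from Kirk's text.

def dunn_alphas(alpha_pf, C):
    """Return the per-contrast significance level alpha' = alpha_PF / C, C times."""
    return [alpha_pf / C] * C

alphas = dunn_alphas(0.05, 4)
print(alphas)       # each of the C = 4 contrasts is tested at .0125
print(sum(alphas))  # the per-family rate adds back to .05
```

Any list of C levels that sums to αPF would serve equally well, which is the flexibility exploited later when αPF is allocated unequally among contrasts.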
Dunn developed the procedure using Student's t statistic and sampling distribution. However,
the procedure can be used with other test statistics and sampling distributions, which helps to
account for its popularity. When Student's t statistic and sampling distribution are used with
Dunn's procedure, the statistic is denoted by tD:

tD = ψ̂i / sqrt(MSerror Σ cj²/nj)

For pairwise contrasts with equal sample sizes, the statistic simplifies to

tD = (ȳj − ȳj′) / sqrt(2MSerror/n)
A two-sided null hypothesis is rejected if the absolute value of the tD statistic exceeds or equals
the critical value tDα/2; C, v obtained from Appendix Table E.14, where α is the per-family error
rate, C is the number of contrasts, and v is the degrees of freedom associated with the
denominator of the tD statistic. A one-sided null hypothesis is rejected if the absolute value of
tD exceeds or equals tDα; C, v and the tD statistic is in the predicted tail of the tD sampling
distribution. Dunn's procedure controls the per-family error rate at αPF; hence, the familywise
error rate is less than αPF. It is not necessary to test the omnibus null hypothesis using an
ANOVA F test prior to using the tD statistic.
Suppose that a researcher is interested in testing hypotheses for the four nonorthogonal
contrasts in Table 5.4-1. The tD test statistics are
Table 5.4-1 ▪ Coefficients for C a Priori Nonorthogonal Contrasts [Data are from Section
5.2, where MSWG = 29.0322, p = 5, n = 9, and v = p(n − 1) = 5(9 − 1) = 40.]
The critical value, tD.05/2; 4, 40, required to reject the two-sided null hypotheses is, according
to Appendix Table E.14, equal to 2.616. Because the absolute values of the tD statistics for contrasts ψ̂1, ψ̂2, and ψ̂3 exceed the critical value, the associated null hypotheses can be rejected.
Appendix Table E.14 contains critical values for one- and two-tailed tests. Microsoft's Excel TINV function can be used to obtain critical values for one- and two-tailed tests for any per-family significance level. To obtain a critical value, access the TINV function in Excel,

TINV(probability, deg_freedom)

and replace "probability" with the value of αPF/C for a two-tailed test and with (2αPF)/C for a one-tailed test and "deg_freedom" with the degrees of freedom for MSerror. For example, if one-sided null hypotheses had been proposed for the contrasts in Table 5.4-1, the required value of tD.05; 4, 40 would be given by

TINV(2(.05)/4, 40) = TINV(.025, 40)
Dunn's procedure can be used to establish simultaneous 100(1 − α)% confidence intervals. A confidence interval is given by

ψ̂i − ψ̂D ≤ ψi ≤ ψ̂i + ψ̂D

where ψ̂D = tDα/2; C, v sqrt(MSerror Σ cj²/nj).
The popularity of Dunn's multiple comparison procedure can be attributed to three factors: (1)
The procedure provides a simple way to control the per-family and, hence, the familywise Type
I error; (2) the concept of dividing αPF among C a priori contrasts is a simple one that can be
used with any test statistic; and (3) the procedure is flexible, as the following example
illustrates. If a researcher considers the consequences of making a Type I error to be equally
serious for all C contrasts, it is reasonable to divide αPF equally among the contrasts. If,
however, the consequences of making a Type I error are not equally serious for all C contrasts,
αPF can be allocated unequally among the contrasts in a manner that reflects the researcher's
a priori concern for Type I and Type II errors. Consider, for example, an experiment involving
four contrasts in which the .05 level of significance has been adopted. Instead of testing each
contrast at α′ = .05/4 = .0125, the researcher could allocate αPF as follows: α′1 = .02, α′2 = .01, α′3 = .01, and α′4 = .01. The per-family error rate is αPF = .02 + .01 + .01 + .01 =
.05, which is the same per-family error rate that would be obtained if αPF were divided equally
among the four tests.
As I have shown, Dunn's procedure has a number of desirable properties. The Dunn-Šidák
procedure described next shares most of these properties and is slightly more powerful; hence,
it is preferred over Dunn's procedure.
Dunn's procedure provides an upper bound to the familywise Type I error rate. For small values
of αPF, the approximation of the exact familywise Type I error is excellent. However, an even
better approximation is provided by a multiplicative inequality proved by Šidák (1967). He showed that the familywise error rate for C nonindependent tests satisfies αFW ≤ 1 − (1 − α′)^C, which is always less than or equal to the additive (Bonferroni) bound Cα′. To control the familywise error rate, each contrast can be tested at the α″ = 1 − (1 − αFW)^(1/C) level of significance. For example, suppose a researcher plans to test five nonorthogonal contrasts and wants the familywise Type I error rate to be less than or equal to .05. Use of the additive and multiplicative inequalities results in testing each contrast at, respectively,

α′ = .05/5 = .0100 and α″ = 1 − (1 − .05)^(1/5) = .0102
Because the Dunn-Šidák procedure is slightly more powerful, it is recommended over Dunn's
procedure.
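The two per-contrast levels are easy to compare numerically. The sketch below is illustrative; the function names are not from the text.

```python
# Sketch comparing the additive (Bonferroni) and multiplicative (Sidak)
# per-contrast significance levels. Function names are illustrative.

def bonferroni_alpha(alpha_fw, C):
    """Additive inequality: alpha' = alpha_FW / C."""
    return alpha_fw / C

def sidak_alpha(alpha_fw, C):
    """Multiplicative inequality: alpha'' = 1 - (1 - alpha_FW)**(1/C)."""
    return 1 - (1 - alpha_fw) ** (1 / C)

C = 5
print(round(bonferroni_alpha(0.05, C), 4))  # 0.01
print(round(sidak_alpha(0.05, C), 4))       # 0.0102, slightly larger
```

Because α″ is always at least as large as α′, each Dunn-Šidák test is performed at a slightly less stringent level, which is the source of its small power advantage.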
When Student's t statistic and sampling distribution are used with the Dunn-Šidák procedure,
the statistic is denoted by tDS:

tDS = ψ̂i / sqrt(MSerror Σ cj²/nj)
A two-sided null hypothesis is rejected if the absolute value of the tDS statistic exceeds or
equals the critical value tDSα/2; C, v obtained from Appendix Table E.15, where α denotes the
familywise error rate, C is the number of contrasts, and v is the degrees of freedom associated
with the denominator of the tDS statistic. A one-sided null hypothesis is rejected if the absolute
value of tDS exceeds or equals tDSα; C, v and the tDS statistic is in the predicted tail of the tDS
sampling distribution.
For the four contrasts in Table 5.4-1, the values of the Dunn-Šidák test statistic are the same as those computed earlier for Dunn's tD statistic.
The critical value, tDS.05/2; 4, 40, required to reject two-sided null hypotheses for these
contrasts is 2.608, according to Appendix Table E.15. This critical value is slightly smaller than
that for Dunn's procedure, which is tD.05/2; 4, 40 = 2.616. Thus, the Dunn-Šidák procedure is
slightly more powerful. Both procedures result in rejecting the null hypotheses for contrasts ψ̂1, ψ̂2, and ψ̂3.
The computation of confidence intervals for the Dunn-Šidák procedure follows that for Dunn's
procedure. The term ψ̂D in Dunn's confidence interval is replaced with ψ̂DS, where

ψ̂DS = tDSα/2; C, v sqrt(MSerror Σ cj²/nj)

Use tDSα/2; C, v for a two-sided confidence interval and tDSα; C, v for a one-sided confidence interval.
Appendix Table E.15 gives critical values for αFW = .20, .10, .05, and .01. Microsoft's Excel TINV function can be used to obtain critical values for any familywise significance level.3 To obtain critical values, access the TINV function in Excel,

TINV(probability, deg_freedom)

and replace "probability" with the value of 1 − (1 − αFW)^(1/C) for a two-tailed test and with 2[1 − (1 − αFW)^(1/C)] for a one-tailed test and "deg_freedom" with the degrees of freedom for MSerror. For example, if one-sided null hypotheses had been proposed for the contrasts in Table 5.4-1, the required value of tDS.05; 4, 40 would be given by

TINV(2[1 − (1 − .05)^(1/4)], 40) = TINV(.0255, 40)
The assumptions associated with using the Dunn and the Dunn-Šidák procedures are the
same as those described earlier for Student's t statistic. Comments about the effects of
nonnormality and heterogeneous variances on Student's t statistic also apply to the Dunn and
Dunn-Šidák statistics. Martin, Toothaker, and Nixon (1989) evaluated 19 multiple comparison
procedures and concluded that both the Dunn and Dunn-Šidák procedures provide excellent
Type I error rate protection when the assumptions of normality and homogeneity of population
variances are not tenable. If a researcher is concerned about the heterogeneity of population
variances, the t′ statistic with Welch's modified degrees of freedom, discussed in Section 5.2,
can be used with the Dunn and Dunn-Šidák procedures.
Holm (1979) proposed a modification that converts Dunn's (Bonferroni) single-step procedure
into a more powerful step-down procedure. The modification is quite simple. It consists of
ranking the absolute value of C test statistics from the largest to the smallest and then testing
the largest test statistic at the αFW/C level of significance, the next largest test statistic at αFW/(C − 1), the next largest at αFW/(C − 2), …, and the smallest test statistic at αFW/1 = αFW. The testing
procedure terminates when a nonsignificant test statistic is encountered. If the sample sizes are
not equal, the test statistics should be ranked on the basis of the p values of the test statistics.
Holm showed that the procedure controls the familywise Type I error rate at less than αFW.
Holm's procedure is more powerful than Dunn's procedure because it uses a less stringent
level of significance for the second through the Cth tests. Recall that Dunn's test is a single-
step procedure that uses the same α′ = αPF/C level of significance for all tests.
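The step-down logic can be expressed compactly in Python. The sketch below operates on p values (as recommended above when sample sizes are unequal) and uses the additive divisors αFW/C, αFW/(C − 1), …; the function name is illustrative.

```python
# Minimal sketch of Holm's step-down procedure operating on p values,
# using the additive (Bonferroni) divisors alpha_FW/C, alpha_FW/(C-1), ...
# The function name is hypothetical, not from Kirk's text.

def holm(p_values, alpha_fw=0.05):
    """Return reject/retain decisions, in the original order of the contrasts."""
    C = len(p_values)
    order = sorted(range(C), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * C
    for step, i in enumerate(order):
        if p_values[i] <= alpha_fw / (C - step):  # alpha/C, alpha/(C-1), ...
            reject[i] = True
        else:
            break  # testing stops at the first nonsignificant result
    return reject

print(holm([0.001, 0.012, 0.04, 0.30]))  # [True, True, False, False]
```

Note that the third p value (.04) would have been significant at the unadjusted .05 level but fails its Holm criterion of .05/2 = .025, so it and every larger p value are retained.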
Holm suggested that a slightly more powerful version of the test could be obtained by using the
multiplicative inequality instead of the additive inequality. If the multiplicative inequality is used,
the largest test statistic (or smallest p value) is tested at the 1 − (1 − αFW)^(1/C) level of significance, the next largest at 1 − (1 − αFW)^(1/(C − 1)), the next largest at 1 − (1 − αFW)^(1/(C − 2)), and the smallest test statistic at αFW. Holm's procedure is a general one that can be used with t, q,
and F statistics. When Student's t statistic and sampling distribution are used with Holm's
procedure, the statistic is denoted by tH:

tH = ψ̂i / sqrt(MSerror Σ cj²/nj)

A two-sided null hypothesis is rejected if the absolute value of the tH statistic exceeds or equals the critical value tHα/2; Ci, v for Ci ≥ 2 or tα/2, v for Ci = 1, where α denotes the familywise error rate; Ci is equal
to C for the largest test statistic, C − 1 for the next largest test statistic, …, and 1 for the
smallest test statistic; and v is the degrees of freedom associated with the denominator of the
tH statistic. A one-sided null hypothesis is rejected if the absolute value of tH exceeds or equals tHα; Ci, v for Ci ≥ 2 or tα, v for Ci = 1 and the tH statistic is in the predicted tail of the sampling
distribution.
Holm's procedure is illustrated for the four contrasts in Table 5.4-1. The first step in testing the
null hypothesis for these contrasts is to rank the absolute values of the test statistics from
largest to smallest. The statistics and critical values for the four contrasts are as follows:
The asterisks identify significant test statistics. The null hypotheses for contrasts ψ̂1, ψ̂2, and ψ̂3 can be rejected. The critical values required to reject the null hypotheses for the Dunn and Dunn-Šidák procedures are, respectively, tD.05/2; 4, 40 = 2.616 and tDS.05/2; 4, 40 = 2.608.
In this example, all three procedures lead to the same decisions. However, Holm's procedure is
clearly more powerful than the Dunn and Dunn-Šidák procedures. Because all three
procedures control the Type I error rate at or less than αFW, a researcher is advised to use the
most powerful procedure, which is Holm's procedure. Holm's procedure shares a disadvantage
of most multiple-step procedures: It cannot be used to construct confidence intervals.
The assumptions associated with using Holm's procedure are the same as those described
earlier for Student's t statistic. Comments about the effects of nonnormality and heterogeneous
variances on Student's t statistic also apply to Holm's procedure. If a researcher is concerned
about heterogeneity of population variances, the t′ statistic with Welch's modified degrees of
freedom, discussed in Section 5.2, can be used with Holm's procedure.
A number of modifications of Holm's procedure have been proposed (Holland & Copenhaver,
1987; Shaffer, 1986). These modifications result in a slight increase in power but at the cost of
increased complexity. Tables for implementing Shaffer's modifications of Holm's procedure are
provided by Seaman and Serlin (1989). A. Y. Gordon and Salzman (2008) provide a thorough
examination of the merits of Holm's procedure relative to other step-down procedures.
A variety of procedures have been recommended for testing hypotheses about all pairwise
contrasts. Probably the most widely used procedure is the HSD (honestly significant difference)
test developed by Tukey (1953). This single-step procedure controls the familywise Type I error
rate for the collection of all a posteriori pairwise contrasts. Tukey's HSD test is based on the
sampling distribution of the Studentized range, which, like the t distribution, was derived by
William Sealey Gosset. The letter q is used to denote the Studentized range distribution.
Tukey's HSD statistic, qT, is the ratio of a contrast to the standard error of a mean:

qT = (ȳj − ȳj′) / sqrt(MSerror/n)
A two-sided null hypothesis is rejected if the absolute value of qT exceeds or equals the critical
value qα; p, v obtained from Appendix Table E.6, where α denotes the familywise error rate, p is
the number of means in the family, and v is the degrees of freedom associated with the
denominator of the qT statistic. Note that the critical value for Tukey's test, unlike that for the
Dunn, Dunn-Šidák, and Holm tests, does not depend on the number of contrasts actually
tested but on p, the number of means. Tukey's procedure, like all a posteriori procedures, is
appropriate for testing only two-sided null hypotheses. The procedure has another limitation: It
requires equal sample ns. If the sample ns are not equal, the Tukey-Kramer procedure,
described later, can be used.
Tukey's statistic can be used to test the omnibus null hypothesis, μ1 = μ2 = … = μp, by
comparing the largest sample mean with the smallest sample mean. If Tukey's statistic exceeds
or equals qα; p, v for this contrast, the omnibus null hypothesis is rejected. Alternatively, the
omnibus null hypothesis can be tested using the ANOVA F statistic. The omnibus qT test is
usually slightly less powerful than the F test, although there are some configurations of means
for which the qT test is more powerful. For example, a qT test is more likely to reject the
omnibus null hypothesis if p − 2 of the population means are equal and located halfway
between the smallest and largest means. On the other hand, an F test is more likely to reject
the omnibus null hypothesis if half of the means are equal to the largest mean and the other
half are equal to the smallest mean. Both the F and qT tests control the familywise error rate at
or less than αFW; hence, the more powerful test should be used.
When a researcher wants to test all pairwise contrasts among p means and the sample ns are
equal, it is usually more convenient to compute the one critical difference that each contrast
must exceed or equal than it is to compute p(p − 1)/2 test statistics. The critical difference, ψ̂T, that a pairwise contrast must exceed or equal is given by

ψ̂T = qα; p, v sqrt(MSerror/n)
Suppose a researcher wants to test all pairwise contrasts for the data in Table 5.5-1. For these data, the critical difference is ψ̂T = q.05; 5, 40 sqrt(29.0322/9) = 4.04(1.796) = 7.26.
Table 5.5-1 ▪ Absolute Values of All Pairwise Contrasts Among Means [Data are from
Section 5.2, where MSWG = 29.0322, p = 5, n = 9, and v = p(n − 1) = 5(9 − 1) = 40. The
means in the table are ordered from the smallest to the largest so that the absolute value
of the largest contrast, 12.0, appears in the upper right corner of the table.]
It is often convenient to construct a table like Table 5.5-1 that gives the absolute value of all
pairwise contrasts. Any contrast that exceeds or equals the critical difference is declared
significant. It is apparent from Table 5.5-1 that three contrasts exceed the critical difference;
hence, the null hypotheses for these contrasts can be rejected.
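For readers working in software rather than from Appendix Table E.6, the critical difference can be reproduced with SciPy's Studentized range distribution. This is a sketch that assumes SciPy (1.7 or later, where scipy.stats.studentized_range was introduced) is installed.

```python
# Sketch: Tukey's critical difference for the Table 5.5-1 data
# (MSWG = 29.0322, p = 5, n = 9, v = 40), computed with SciPy's
# studentized range distribution instead of Appendix Table E.6.
import math
from scipy.stats import studentized_range

MSWG, p, n, v = 29.0322, 5, 9, 40
q_crit = studentized_range.ppf(1 - 0.05, p, v)  # tabled value is 4.04
psi_T = q_crit * math.sqrt(MSWG / n)            # Tukey critical difference
print(round(q_crit, 2), round(psi_T, 2))
```

Any pairwise difference in Table 5.5-1 whose absolute value equals or exceeds psi_T is declared significant.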
Tukey's procedure can be used to establish 100(1 – α)% simultaneous confidence intervals for
all pairwise population contrasts. The confidence interval is given by

ȳj − ȳj′ − ψ̂T ≤ μj − μj′ ≤ ȳj − ȳj′ + ψ̂T
Earlier I noted that Tukey's procedure requires equal-size samples. In addition, the procedure
assumes that (1) the observations are drawn from normally distributed populations; (2) the
observations are random samples from the populations, or the experimental units are randomly
assigned to the treatment levels; and (3) the variances of the populations are equal. The next
two sections describe procedures for testing all pairwise contrasts that do not require equal
sample sizes or the assumption that the population variances are equal.
Researchers in the social sciences and education frequently want to test all pairwise contrasts
among means. Many of the most popular multiple comparison procedures used for this
purpose, such as Tukey's HSD test, require equal sample sizes. Unfortunately, researchers live
in an imperfect world in which unequal sample sizes are the rule rather than the exception. This
dilemma has sparked a search for alternative procedures that can be used when sample sizes
are unequal. Most of the research has focused on finding alternatives to Tukey's HSD test.
Suggested alternatives include Gabriel's (1978) test, Genizi and Hochberg's (1978) test,
Hochberg's (1974) GT2 test, Hunter's (1976) H test, Spjøtvoll and Stoline's (1973) T′ test, and
the Tukey-Kramer test (Kramer, 1956; Tukey, 1953). The results of numerous studies of the
various alternatives are clear cut: The preferred procedure is the Tukey-Kramer test. This
procedure controls the Type I error at less than αFW and has the highest power of the
procedures investigated (Dunnett, 1980a; Hayter, 1984; Stoline, 1981).
Tukey-Kramer test. The Tukey-Kramer test was independently proposed by Tukey (1953) and
Kramer (1956) for the case in which the sample ns are unequal and the basic assumptions of
normality, homogeneity of variances, and so on are tenable. The test statistic, denoted by qTK, is

qTK = (ȳj − ȳj′) / sqrt((MSerror/2)(1/nj + 1/nj′))
A two-sided null hypothesis is rejected if the absolute value of qTK exceeds or equals the
critical value qα; p, v obtained from the Studentized range distribution in Appendix Table E.6,
where α denotes the familywise error rate, p is the number of means in the family, and v is the
degrees of freedom associated with the denominator of the qTK statistic.
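The Tukey-Kramer statistic for a single pair of means can be sketched as follows; the function name is illustrative, not from the text.

```python
# Sketch of the Tukey-Kramer q statistic for two means with unequal ns.
# Function name is hypothetical.
import math

def q_tukey_kramer(mean_j, mean_k, ms_error, n_j, n_k):
    """q_TK = (mean_j - mean_k) / sqrt((MSerror/2)(1/n_j + 1/n_k))."""
    se = math.sqrt((ms_error / 2) * (1 / n_j + 1 / n_k))
    return (mean_j - mean_k) / se
```

With equal ns, the denominator reduces to sqrt(MSerror/n), so the statistic coincides with Tukey's qT.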
A variety of procedures have been proposed for testing hypotheses about all pairwise contrasts
among p means when the population variances are heterogeneous. The leading contenders
are Dunnett's (1980b) T3 and C tests and the Games-Howell GH test (Games & Howell, 1976).
All three tests can be used when the sample sizes are unequal. However, if the population
variances are homogeneous, the Tukey-Kramer procedure is recommended because of its
superior power.
Dunnett's T3 test. The test statistic for Dunnett's T3 procedure, denoted by mT3, is

mT3 = (ȳj − ȳj′) / sqrt(sj²/nj + sj′²/nj′)
A two-sided null hypothesis is rejected if the absolute value of mT3 exceeds or equals the
critical value mα; C,v′ obtained from the Studentized maximum modulus distribution in
Appendix Table E.16, where α denotes the familywise error rate, C = p(p − 1)/2, and v′ denotes
the use of Welch's modified degrees of freedom, discussed in Section 5.2:

v′ = (sj²/nj + sj′²/nj′)² / [(sj²/nj)²/(nj − 1) + (sj′²/nj′)²/(nj′ − 1)]
Dunnett's C test. The test statistic for Dunnett's C procedure, denoted by qC, is

qC = (ȳj − ȳj′) / sqrt((sj²/nj + sj′²/nj′)/2)

A two-sided null hypothesis is rejected if the absolute value of qC exceeds or equals the critical value

[qα; p, vj(sj²/nj) + qα; p, vj′(sj′²/nj′)] / (sj²/nj + sj′²/nj′)

where qα; p, vj is obtained from the Studentized range distribution, α denotes the familywise error rate, p is the number of means in the family, and vj is equal to nj − 1. This critical value is based on Cochran's (1964) approximate solution to the Behrens-Fisher problem discussed in Section 5.2.
Games-Howell test. The test statistic for the Games-Howell procedure, denoted by qGH, is

qGH = (ȳj − ȳj′) / sqrt((sj²/nj + sj′²/nj′)/2)
A two-sided null hypothesis is rejected if the absolute value of qGH exceeds or equals the
critical value qα; p, v′ obtained from the Studentized range distribution in Appendix Table E.6,
where α denotes the familywise error rate, p is the number of means in the family, and v′
denotes the use of Welch's modified degrees of freedom. The formula for v′ is the same as that
given earlier for Dunnett's T3 procedure.
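The Games-Howell computation for one pairwise contrast (statistic, Welch degrees of freedom, and critical value) can be sketched as follows. The sketch assumes SciPy is available; the function name is illustrative.

```python
# Sketch of the Games-Howell test for one pairwise contrast: the q
# statistic, Welch's modified degrees of freedom, and the critical value
# from the Studentized range distribution. Assumes SciPy >= 1.7.
import math
from scipy.stats import studentized_range

def games_howell(mean_j, mean_k, var_j, var_k, n_j, n_k, p, alpha=0.05):
    a, b = var_j / n_j, var_k / n_k
    q = (mean_j - mean_k) / math.sqrt((a + b) / 2)
    # Welch's modified degrees of freedom for this pair of groups
    v_welch = (a + b) ** 2 / (a ** 2 / (n_j - 1) + b ** 2 / (n_k - 1))
    q_crit = studentized_range.ppf(1 - alpha, p, v_welch)
    return q, v_welch, q_crit
```

When the two variances and sample sizes happen to be equal, v_welch reduces to nj + nk − 2, the pooled degrees of freedom for the pair.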
The relative merits of these multiple comparison procedures and others that have been
recommended for the case of heterogeneous variances and unequal sample sizes have been
investigated by a number of researchers (Dunnett, 1980b; Games, Keselman, & Rogan, 1981;
H. J. Keselman & Rogan, 1978; Tamhane, 1979). The results of these investigations can be
summarized as follows:
Fortunately, the use of these procedures, particularly the Games-Howell procedure, does not
lead to a substantial loss of power relative to procedures that assume equal variances.
Fisher-Hayter Test
Hayter (1986) proposed a modification of Fisher's LSD test that can be used to test hypotheses
about all pairwise contrasts. This two-step procedure, which assumes equal variances, controls
the familywise Type I error at αFW when the sample ns are equal or when the sample ns are
unequal and the number of means is p = 3. When the sample ns are unequal and p > 3, the
Type I error rate cannot exceed αFW. The procedure, which is called the Fisher-Hayter test,
has two steps. In the first step, the omnibus null hypothesis is tested at the α″ = αFW
significance level using either an F or a q statistic. The critical values for F and q are denoted by, respectively, Fα; p−1, v and qα; p, v, where qα; p, v is obtained from Appendix Table E.6 and v denotes the degrees of freedom for MSerror. If this test is not significant, the omnibus null
hypothesis is not rejected and no more tests are performed. If the omnibus null hypothesis is
rejected, each of the pairwise contrasts is tested at the α″ = αFW significance level using

qFH = (ȳj − ȳj′) / sqrt((MSerror/2)(1/nj + 1/nj′))

A two-sided null hypothesis is rejected if the absolute value of qFH exceeds or equals the critical value qα; p−1, v obtained from the Studentized range distribution in Appendix Table E.6,
and α denotes the familywise Type I error. Note that the table is entered for p − 1 means
instead of p means. The Fisher-Hayter procedure shares a disadvantage of most multiple-step
procedures: It cannot be used to construct confidence intervals. The assumptions associated
with the procedure are the same as those described earlier for Student's t statistic (see Section
5.2).
When a researcher wants to test all pairwise contrasts among p means and the sample ns are
equal, it is usually more convenient to compute the one critical difference that each contrast
must exceed or equal than it is to compute p(p − 1)/2 test statistics. If the omnibus null
hypothesis is rejected, the critical difference, ψ̂FH, that a pairwise contrast must exceed or equal is given by

ψ̂FH = qα; p−1, v sqrt(MSerror/n) = q.05; 4, 40 sqrt(29.0322/9) = 3.79(1.796) = 6.81
With this critical difference, four contrasts in Table 5.5-1 can be rejected—one more than was
rejected using Tukey's critical difference. This result is not surprising. Seaman, Levin, and
Serlin (1991) compared 23 multiple comparison procedures in terms of familywise Type I error
protection and power. They concluded that the Fisher-Hayter test was just slightly less
powerful than the most powerful procedures—the REGW and Peritz tests—and represented an
excellent trade-off between power and ease of application.
It is apparent that when all pairwise contrasts are tested, the Fisher-Hayter procedure is more powerful than the other procedures. However, the Dunn and Dunn-Šidák procedures become
more powerful relative to the Fisher-Hayter procedure as the number of comparisons among
the p means is reduced. For example, if a researcher had planned to make only 4 instead of all
10 pairwise comparisons, the critical difference for the Dunn-Šidák procedure would have been

ψ̂DS = tDS.05/2; 4, 40 sqrt(2(29.0322)/9) = 2.608(2.540) = 6.62

which is less than the 6.81 required for the Fisher-Hayter procedure.
Holm's procedure also is more powerful than the Fisher-Hayter procedure if only four pairwise
contrasts are tested. The critical differences for four contrasts are as follows:
The point of this discussion is that for a priori contrasts, a researcher should carefully consider
which multiple comparison procedure provides the desired Type I error protection and
maximizes power.
T. A. Ryan (1959, 1960) proposed a step-down multiple comparison procedure for testing
hypotheses for all pairwise contrasts. The procedure can be used with either F or q statistics.
His procedure, which is more powerful than the Tukey and Fisher-Hayter procedures, uses
adjusted significance levels denoted by αr. To use Ryan's procedure, the means are ordered from the smallest to the largest mean. A contrast involving the smallest and largest means is said to be separated by r = p steps (the number of means). This contrast is tested at the αr = rαFW/p level of significance, where r = p. If and only if the contrast is significant, the two contrasts involving means separated by r = p − 1 steps are tested at the αr = (p − 1)αFW/p level of significance, and so on. Consider an example with p = 5 means; let αFW = .05. Means separated by r = 5 steps are tested at the α5 = 5(.05)/5 = .05 level of significance, means separated by r = 4 steps are tested at the α4 = 4(.05)/5 = .04 level of significance, …, and means separated by r = 2 steps are tested at the α2 = 2(.05)/5 = .02 level of significance. If the null
hypothesis for a contrast is not rejected, by implication the null hypotheses for all contrasts
encompassed by the nonrejected contrast are not rejected.
The advantage of Ryan's procedure is that it controls the familywise Type I error at less than αFW and has greater power than procedures that use a uniform level of significance, such as Tukey's procedure.
Einot and Gabriel (1975) used Ryan's idea of adjusted significance levels but replaced the Bonferroni additive inequality with the multiplicative inequality; means separated by r steps are tested at the αr = 1 − (1 − αFW)^(r/p) level of significance. The procedure has undergone a number of modifications and has appeared under a number of different names.4 To give the major
contributors—Ryan, Einot, Gabriel, and Welsch—their just due, the procedure is referred to as
the REGW procedure. The designations REGW F and REGW Q are used to distinguish
between the F and q versions of the test.
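The multiplicative adjusted levels used by the REGW procedures are simple to compute. The sketch below shows the raw Einot-Gabriel levels for p = 5 and αFW = .05; note that the full REGW procedure additionally sets the levels for r = p and r = p − 1 equal to αFW (Welsch's refinement), a detail omitted here for brevity. The function name is illustrative.

```python
# Sketch of the Einot-Gabriel multiplicative adjusted significance levels:
# alpha_r = 1 - (1 - alpha_FW)**(r/p) for means separated by r steps.
# Function name is hypothetical.

def alpha_r(alpha_fw, r, p):
    return 1 - (1 - alpha_fw) ** (r / p)

# For p = 5 means and alpha_FW = .05:
levels = {r: round(alpha_r(0.05, r, 5), 4) for r in range(2, 6)}
print(levels)  # {2: 0.0203, 3: 0.0303, 4: 0.0402, 5: 0.05}
```

The value α3 = .0303 is the level that appears later in the q.0303; 3, 40 critical value.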
Shaffer (1979) proposed yet another improvement on Ryan's idea of adjusted significance
levels that can be used with any step-down q procedure. The improvement consists of first
performing an omnibus ANOVA F test on the p means. If the F test is not significant, the testing
sequence terminates. If the F test is significant, means separated by r = p steps are tested
using the q critical value appropriate for means separated by r = p − 1 steps. Subsequent tests
are performed with the usual q critical values. This procedure is referred to as the REGW FQ
procedure.
The REGW FQ procedure is illustrated in Table 5.5-2. The procedure begins with an F test of
the omnibus null hypothesis. The F test statistic is
Table 5.5-2 ▪ Computational Procedures for the REGW FQ Test [Data are from Table 5.5-1,
where MSWG = 29.0322, p = 5, n = 9, and v = p(n − 1) = 5(9 − 1) = 40. To simplify the
presentation, the means have been relabeled so that ȳ1 denotes the smallest mean and ȳ5 denotes the largest mean.]
A two-sided null hypothesis is rejected if (1) the absolute value of qREGW exceeds or equals the critical value qαr; r, v obtained from the Studentized range distribution in Appendix Table E.6 and (2) the means in the hypothesis are not encompassed by a nonrejected hypothesis at an earlier stage in the testing procedure. The critical value for means separated by r = 5 steps is q.05; 4, 40 = 3.79, where α5 = 1 − (1 − .05)^(5/5) = .05 (Einot and Gabriel's contribution), r = 5 − 1 = 4 (Shaffer's contribution), and v = 40. The remaining critical values are as follows:
The REGW F procedure is illustrated in Table 5.5-3. The critical values and decisions
(significant, not significant, not significant by implication) are given in the fourth column of the
table. The REGW F procedure resulted in rejecting three hypotheses for pairwise contrasts: μ1
= μ4, μ1 = μ5, and μ2 = μ5. All other hypotheses for pairwise contrasts are not rejected by
implication because they are contained in sets of means—{μ1 μ2 μ3}, {μ2 μ3 μ4}, and {μ3 μ4
μ5}—that were not significant in earlier tests.
Table 5.5-3 ▪ Computational Procedures for the REGW F Test [Data are from Table 5.5-1,
where MSWG = 29.0322, p = 5, n = 9, and v = p(n − 1) = 5(9 − 1) = 40. The means have
been relabeled so that ȳ1 denotes the smallest mean and ȳ5 denotes the largest mean.]
The REGW F, FQ, and Q procedures require critical values of q and F that are not available in the Studentized range and F tables. Values of q can be obtained by linear interpolation using the natural log of α. For example, the following information from Appendix Table E.6 can be used to obtain the approximate critical value for q.0303; 3, 40:

q.05; 3, 40 = 3.44 and q.01; 3, 40 = 4.37

It is apparent that the critical value for q.0303; 3, 40 is between 3.44 and 4.37. More precisely, the critical value is

q.0303; 3, 40 ≈ 3.44 + [(ln .05 − ln .0303)/(ln .05 − ln .01)](4.37 − 3.44) = 3.44 + .31(.93) = 3.73
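The interpolation on the natural log of α can be carried out as follows; the function name is illustrative, and the tabled values 3.44 and 4.37 are those cited for 3 means and v = 40.

```python
# Sketch of log-linear interpolation for Studentized range critical values
# that fall between tabled significance levels. Function name is hypothetical.
import math

def q_interp(alpha, alpha_lo, q_lo, alpha_hi, q_hi):
    """Interpolate linearly on the natural log of alpha."""
    frac = (math.log(alpha_lo) - math.log(alpha)) / (math.log(alpha_lo) - math.log(alpha_hi))
    return q_lo + frac * (q_hi - q_lo)

# Appendix Table E.6 values for 3 means and v = 40: q.05 = 3.44, q.01 = 4.37
print(round(q_interp(0.0303, 0.05, 3.44, 0.01, 4.37), 2))  # 3.73
```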
Microsoft's Excel FINV function can be used to obtain critical values for F. For example, the critical value for F.0303; 2, 40 is obtained from the FINV function

FINV(.0303, 2, 40)
Monte Carlo studies (Einot & Gabriel, 1975; Ramsey, 1978, 1981; Seaman et al., 1991;
Seaman, Levin, Serlin, & Franke, 1990) indicate that multiple comparison procedures that use
an F statistic or an omnibus F statistic followed by a q statistic tend to be slightly more powerful
than those that use an omnibus q statistic. These conclusions, however, are affected by the
type of power and the pattern of means and sample sizes that were investigated. Unfortunately,
the F statistic requires a considerable amount of computation, as can be seen from the
example in Table 5.5-3.
The assumptions associated with the REGW F test are the same as those for the ANOVA F test
discussed in Section 3.5. The assumptions associated with the REGW FQ test are the same as
those for the ANOVA F test and Tukey's HSD test. If the sample sizes are unequal, the Tukey-
Kramer test statistic can be used in place of the qREGW statistic. If the variances are unequal,
one of the test statistics in the section on Procedures for Heterogeneous Variances can be
used.
Scheffé's S Test
The fifth common research situation mentioned in Table 5.1-1 involves testing contrasts
suggested by an inspection of the data when that inspection identifies one or more interesting
nonpairwise contrasts. The procedure of choice for this situation is Scheffé's (1953) S test. The
S test controls the familywise Type I error rate for the infinite number of contrasts that can be
performed among p ≥ 3 means. Scheffé's test is much less powerful than Tukey's HSD test, for
example, and is recommended only when some nonpairwise contrasts are of interest. Scheffé's
procedure uses the F sampling distribution and, like ANOVA, is robust with respect to
nonnormality. The procedure also can be used when the sample sizes are unequal. Scheffé's
test statistic, denoted by FS, is

FS = ψ̂² / (MSerror Σ cj²/nj)

where ψ̂ = c1ȳ1 + c2ȳ2 + ⋯ + cpȳp. A two-sided null hypothesis is rejected if FS exceeds or
equals the critical value (p − 1)Fα; v1, v2, where Fα; v1, v2 is obtained from the F distribution in
Appendix Table E.4, v1 = p − 1, α denotes the familywise Type I error rate, and v2 is the
degrees of freedom associated with MSerror.
Scheffé's procedure is always congruent with the omnibus ANOVA F test. If the omnibus F test
is significant, at least one contrast among the means is significant according to Scheffé's test
and vice versa. Scheffé's procedure is one of the most flexible data-snooping procedures
available. But this flexibility comes at a price—low power. Hence, the procedure should be used
only when the hypotheses of interest include a nonpairwise contrast.
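A minimal sketch of the S test for a single contrast ψ̂ = Σ cj ȳj, assuming equal variances. The tabled F value is supplied by the user (for example, from Appendix Table E.4), and the means, sample sizes, and MSerror in the example line are hypothetical.

```python
def scheffe_fs(means, ns, contrast, ms_error):
    """Scheffe's statistic: FS = psi_hat**2 / (MS_error * sum(c_j**2 / n_j))."""
    psi_hat = sum(c * m for c, m in zip(contrast, means))
    return psi_hat ** 2 / (ms_error * sum(c ** 2 / n for c, n in zip(contrast, ns)))

def scheffe_critical(p, f_alpha):
    """Critical value (p - 1) * F(alpha; p - 1, v2); f_alpha comes from an F table."""
    return (p - 1) * f_alpha

# Hypothetical example: three groups of 10, nonpairwise contrast (1, 1, -2)
fs = scheffe_fs([10.0, 12.0, 20.0], [10, 10, 10], (1, 1, -2), 4.0)
significant = fs >= scheffe_critical(3, 3.35)  # 3.35 ~ F.05; 2, 27 from a table
```

A contrast is declared significant only when FS reaches the inflated (p − 1)Fα critical value, which is what buys the familywise protection over all possible contrasts.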
Brown-Forsythe Test
If a researcher is interested in testing all contrasts, including nonpairwise contrasts that appear
interesting from an inspection of the data, and if the population variances are heterogeneous,
then the Brown-Forsythe (Brown & Forsythe, 1974a) procedure can be used. The procedure,
which is a modification of Scheffé's procedure, uses the F sampling distribution and Welch's
modified degrees of freedom. The test statistic, denoted by FBF, is

FBF = ψ̂² / (c1²s1²/n1 + c2²s2²/n2 + ⋯ + cp²sp²/np)

A two-sided null hypothesis is rejected if FBF exceeds or equals the critical value (p − 1)Fα; v1, v̂2,
where Fα; v1, v̂2 is obtained from the F distribution in Appendix Table E.4, α denotes the
familywise error rate, v1 = p − 1, and v̂2 denotes Welch's modified degrees of freedom:

v̂2 = (Σ cj²sj²/nj)² / Σ [(cj²sj²/nj)² / (nj − 1)]
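The statistic and Welch's modified degrees of freedom can be sketched as follows; this is a sketch of the standard Brown-Forsythe contrast formulas, FBF = ψ̂²/Σ(cj²sj²/nj) with v̂2 = (Σ wj)²/Σ[wj²/(nj − 1)] where wj = cj²sj²/nj, and the input values in the example are hypothetical.

```python
def brown_forsythe(means, variances, ns, contrast):
    """Return (F_BF, Welch df) for one contrast when variances are unequal."""
    psi_hat = sum(c * m for c, m in zip(contrast, means))
    w = [c ** 2 * s2 / n for c, s2, n in zip(contrast, variances, ns)]
    f_bf = psi_hat ** 2 / sum(w)
    df = sum(w) ** 2 / sum(wj ** 2 / (n - 1) for wj, n in zip(w, ns))
    return f_bf, df

# Hypothetical two-group example
f_bf, df = brown_forsythe([0.0, 4.0], [8.0, 8.0], [5, 5], (1, -1))
```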
Table 5.1-2, discussed earlier, contains multiple comparison recommendations for the five
common hypothesis-testing situations that occur in the behavioral sciences, health sciences,
and education. The 17 recommended procedures control the per-contrast, familywise, or per-
family Type I error rate for any complete or partial null hypothesis. In addition, each of the
recommended procedures has one or more other virtues such as excellent power, ease of
computation and interpretation, availability of confidence intervals, and robustness. Missing
from the list are two nonrecommended tests: the Newman-Keuls test (Keuls, 1952; Newman,
1939) and Duncan's (1955) test. Both of these step-down procedures are used to test all
pairwise contrasts among p means. They are competitors to the Tukey HSD, Fisher-Hayter, and
REGW procedures described in Section 5.5. Researchers like the Newman-Keuls and Duncan
procedures because of their excellent power. However, the Newman-Keuls procedure is not
recommended because it fails to control the familywise Type I error rate when the family
contains more than three means; Duncan's procedure fails to control the familywise Type I error
rate when the family contains more than two means. Because of this serious shortcoming, I will
say no more about these tests.
Peritz's Test
In 1970, Peritz introduced a step-down procedure that is a blend of the REGW and Newman-
Keuls procedures. Peritz's procedure can be used with an F or a q statistic or a combination of
an omnibus F statistic followed by a q statistic. The procedure controls the familywise Type I
error and has been shown to have the highest per-pair power of all multiple comparison
procedures investigated (Einot & Gabriel, 1975) and is among the highest in all-pairs power
(Martin et al., 1989; Ramsey, 1981; Seaman et al., 1991). Seaman et al. (1990) have described
two modified Peritz procedures that achieve a slight gain in power when p > 4. Unfortunately,
the Peritz procedure and the two modifications are complex and are best performed with the aid
of a computer. The interested reader can consult Begun and Gabriel (1981), Hochberg and
Tamhane (1987), Kirk (1994), Ramsey (1981), or Toothaker (1991).
Throughout this chapter, I have emphasized the importance of controlling the familywise error
rate for nonorthogonal contrasts. Testing a large number of contrasts can result in very low
power for individual tests. Benjamini and Hochberg (1995) proposed controlling the false
discovery rate (FDR) instead of the familywise error rate. The FDR is the expected proportion
of contrasts falsely declared significant. Their idea was to make certain that the proportion of
false discoveries relative to the total number of discoveries is kept small, say, no more than 5%.
The false discovery rate can be defined as

αFDR = E(number of contrasts falsely declared significant / total number of contrasts declared significant)

where the ratio is taken to be 0 when no contrasts are declared significant.
By controlling αFDR, a researcher is less likely to make Type I errors than with procedures that
control the per-contrast error rate. And controlling αFDR instead of αFW provides more power
to detect contrasts that should be declared significant. When all null hypotheses are true, αFDR
= αFW; when at least one null hypothesis is false, controlling αFDR at, say, .05 means that
αFW will exceed .05. It follows that controlling the false discovery rate is not appropriate for all
research situations. Control of αFDR has been recommended for exploratory research and
when the number of contrasts is extremely large (H. J. Keselman, Cribbie, & Holland, 1999).
Research on the merits of controlling the false discovery rate and on procedures for controlling
the rate is in its infancy. The interested reader can consult Hemmelmann, Horn, Süsse,
Vollandt, and Weiss (2005); Horn and Dunnett (2004); Korn, Troendle, McShane, and Simon
(2004); Sarkar (2002); and Somerville and Hemmelmann (2008).
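As a concrete illustration of FDR control, the Benjamini-Hochberg step-up rule can be sketched in a few lines; the p values in the example are hypothetical.

```python
def benjamini_hochberg(p_values, alpha_fdr=0.05):
    """Benjamini-Hochberg step-up rule: reject the hypotheses with the k
    smallest p values, where k is the largest rank with p_(k) <= (k/m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha_fdr:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

print(benjamini_hochberg([0.001, 0.009, 0.040, 0.200]))  # [True, True, False, False]
```

Note that the third p value (.040) is below the nominal .05 but is not rejected, because it exceeds its rank-adjusted threshold (3/4)(.05) = .0375.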
Twenty-two multiple comparison procedures have been described in this chapter. The
procedures and their salient characteristics are summarized in Table 5.8-1. The relative merits
of various multiple comparison procedures have engendered much debate among statisticians
in recent years. Each of the procedures in Table 5.8-1 has been recommended by one or more
statisticians. The problem facing a researcher is to select the test statistic that provides the
desired kind of protection against Type I errors and at the same time provides maximum power.
The characteristics of the most frequently recommended procedures have been described in
some detail along with pertinent references so that researchers can make informed choices.
1.Terms to remember:
a.contrast (comparison) (5.1)
b.pairwise comparison (5.1)
c. nonpairwise comparison (5.1)
d.orthogonal contrast (5.1)
8.[5.1] Construct three sets of orthogonal contrasts among five means. Each set should
contain four contrasts.
*9.[5.2] The religious dogmatism of members of four church denominations in a large
midwestern city was investigated. A random sample of 30 members from each
denomination took a paper-and-pencil test of dogmatism. The sample means were
, and ; MSWG = 120 and v2 = 4(30 − 1) = 116.
a.Assuming that the omnibus null hypothesis is rejected, proceed to test all pairwise
comparisons. Construct a table like Table 5.5-1; let αFW = .05.
b.Construct 100(1 − .05)% confidence intervals for all pairwise comparisons.
c. Use the Holm test to evaluate all pairwise comparisons.
d.Use the Fisher-Hayter test to evaluate all pairwise comparisons by comparing with
.
e.Use the REGW Q test to evaluate all pairwise comparisons.
f. Rank the procedures in terms of apparent power.
*21.
[5.6] Exercise 10 described an experiment to evaluate the effectiveness of three approaches
to drug education in junior high school. Assume that the omnibus null hypothesis was
rejected at the .05 level of significance.
*a.Use Scheffé's procedure to test the following null hypotheses:
*b.Suppose that the sample variances for this problem are , and
. Use the Brown-Forsythe procedure to test the null hypotheses.
22.
[5.6] The effects of simulator training involving synergistic 6-degrees-of-freedom platform
motion on the acquisition of basic approach and landing skills of 63 undergraduate pilot
trainees were investigated. The trainees were randomly divided into three groups. Those in
group a1 received 10 sorties with platform motion in the Advanced Simulator for Pilot
Training. Those in group a2 also received 10 sorties but without motion. Trainees in group
a3, the control group, received the standard syllabus of preflight and flightline instructions.
The dependent variable was instructor-pilot ratings of trainee performance in a T-37 aircraft.
The sample means were , and ; MSWG = 39.94 and v2 = 3(21 −
1) = 60. Assume that the omnibus null hypothesis was rejected at the .05 level.
a.Use Scheffé's procedure to test the following null hypotheses:
b.Suppose that the sample variances for this problem are , and
. Use the Brown-Forsythe procedure to test the null hypotheses.
1If the means are normally and independently distributed with mean equal to μj and variance
3Procedures for obtaining familywise critical values and p values using S-Plus, SAS, and SPSS
are described by Kirk and Hetzer (2006, pp. 149–153).
4The following names have been used: (1) modified Ryan's test (Jaccard, Becker, & Wood,
1984), (2) REGWF and REGWQ tests (SAS Institute, Inc., 1985), (3) revised Ryan's procedure
(Ramsey, 1978), (4) Ryan's procedure (Einot & Gabriel, 1975; Ramsey, 1981), and (5) Tukey-
Welsch procedure (Hochberg & Tamhane, 1987, p. 69; Lehmann & Shaffer, 1979).
http://dx.doi.org/10.4135/9781483384733.n5
Trend Analysis
The levels of a treatment can represent either a qualitative variable or a quantitative variable.
The levels of a qualitative treatment differ in type or kind, for example, different kinds of
therapy, methods of instruction, and the four taste qualities: sweet, sour, salt, and bitter. The
levels of a quantitative treatment differ in amount. The independent variable in the sleep
deprivation experiment described in Chapter 4 is an example of a quantitative treatment. The
four levels of sleep deprivation—12, 18, 24, and 30 hours—can be ordered along a continuum,
that is, described in terms of more or less of the variable.
In Chapter 5, I describe procedures for testing the null hypothesis that a contrast for population
means is equal to zero. The procedures are applicable to both qualitative and quantitative
treatments. If the treatment represents a quantitative variable, researchers can use trend
analysis to answer additional interesting questions about the means. I describe these questions
next.
1.Is there a trend in the population means? This question can be rephrased as, Are the
population means for the dependent variable influenced by changes in the independent
variable?
2.Is the trend of the dependent variable population means linear or nonlinear?
3.If the trend of the dependent variable population means is nonlinear, what higher-degree
polynomial equation is required to provide a satisfactory fit for the means?
4.Are the patterns of the predicted and observed means the same? That is, does a particular
equation based on the data provide a satisfactory fit for the means?
5.Is the trend of the population means for one treatment the same for different levels of a
second treatment?
This last question is not applicable to a completely randomized design because this design has
only one treatment. I return to question 5 in Chapter 8, where I describe factorial designs.
Readers who are familiar with regression analysis may recognize similarities between the
questions that are addressed by trend analysis and regression analysis. There is a close
connection between the two procedures: They are both used to determine the nature of the
relationship between a quantitative independent variable and a quantitative dependent variable.
A decision to use one or the other procedures is influenced primarily by the nature of the
independent variable. Regression analysis is preferred when the independent variable is
continuous with many distinct values. Suppose a researcher wants to know whether IQ scores
can be used to predict whether six-year-old children are ready to learn to read. To answer the
question, the researcher administers IQ tests to 50 six-year-olds and follows their progress in
learning to read. Regression analysis is the statistical procedure of choice because the IQ
variable has a wide range of distinct values, and few children have the same IQs.
Trend analysis is used when subjects are assigned to a relatively small number of discrete
quantitative treatment levels. For example, it is the preferred statistical procedure for examining
the relationship between sleep deprivation—12, 18, 24, and 30 hours—and hand-steadiness.
Unlike the IQ example, the sleep deprivation variable has only four values. In the IQ study, the
researcher could categorize the continuous variable by assigning IQ scores to, say, 6 to 10
class intervals and then perform a trend analysis. Categorization of a continuous independent
variable with many distinct values inevitably results in some loss of information. Hence,
categorization in order to do a trend analysis is not recommended.
Trend analysis is especially well suited to the kinds of treatments that are used in analysis of
variance (ANOVA) designs. The treatment in such designs is often a quantitative variable, and
the subjects are randomly assigned to a small number of treatment levels.
An answer to whether there is a trend in the population means is provided by an F test of the
omnibus null hypothesis H0: μ1 = μ2 = … = μp. Rejection of this null hypothesis indicates that
there is a trend: The population means differ at two or more levels of the independent variable.
If there is no trend, the means have the appearance shown by the squares in Figure 6.1-1. The
filled and open circles in the figure illustrate, respectively, linear and nonlinear trends.
Figure 6.1-1 ▪ The squares illustrate the absence of a trend; filled and open circles
illustrate, respectively, linear and nonlinear trends.
If the F test of the omnibus null hypothesis indicates that there is a trend in the population
means for a quantitative variable, a researcher may want to determine the nature of the trend.
In the broadest terms, a trend is either linear or nonlinear; the means form either a straight line
with a nonzero slope or a curved line. Procedures for describing the nature of the trend are
presented in the following sections. There I illustrate procedures for testing hypotheses about
various trend components to find the simplest equation that provides a satisfactory fit for a
trend.
The trend of means for a quantitative treatment can be described by a variety of equations. A
perfect fit for means can be obtained with a polynomial equation that contains one less trend
component than there are treatment levels. Thus, if a treatment has p levels, the polynomial
equation can have up to p – 1 trend components. The question of whether a polynomial
equation is the most appropriate way to fit a set of means is outside the scope of this book. An
introduction to curve-fitting procedures can be found in Cohen, Cohen, West, and Aiken (2003)
and Kutner, Nachtsheim, Neter, and Li (2005). Polynomial equations are introduced here
because they provide a simple way to fit the trend of means.
A polynomial is an algebraic expression that contains more than one term.1 For example, a
polynomial of degree p − 1 is

μj = β0 + β1Xj + β2Xj² + ⋯ + βp−1Xj^(p − 1)

where
β0 is a constant.
The degree of a polynomial is the largest exponent that occurs for an Xj term with a nonzero
coefficient. An equation of the form μj = β0 + β1Xj, where β1 ≠ 0, is a first-degree or linear
equation because the exponent of Xj is 1. The equation μj = β0 + β1Xj + β2Xj² is a second-degree
or quadratic equation, μj = β0 + β1Xj + β2Xj² + β3Xj³ is a third-degree or cubic equation, and so
on. The linear component describes a straight line like that in Figure 6.2-1(a). Note that this line
does not change direction. The line for the quadratic relationship in Figure 6.2-1(b) changes
direction once, and the line for the cubic relationship in Figure 6.2-1(c) changes direction two
times. In general, for p means, the line for the p − 1th component changes direction p − 2
times.
Figure 6.2-1 ▪ (a) Means represented by the circles follow a linear trend. (b) Means follow
a quadratic trend. (c) Means follow a cubic trend.
Suppose that I want to fit a polynomial equation to the means in Figure 6.2-1(c). The following
polynomial equation of degree 3 provides a perfect fit for the means:
where μ̂j denotes the predicted mean for the jth treatment. A disadvantage of this equation is
that the three trend components are not statistically independent. Instead of finding the values
of β1, β2, and β3 in equation (6.2-1), it is more useful to find the values of the orthogonal trend
components in equation (6.2-2). Equation (6.2-2), with linear, quadratic, and cubic trend
components, has an advantage over equation (6.2-1): it enables researchers to independently
test a null hypothesis for each of the three trend components.
Orthogonal polynomial coefficients, like the orthogonal contrast coefficients for means in
Section 5.1, satisfy two conditions: The coefficients for the ith trend sum to zero, Σj cij = 0, and
the sum of the products of the coefficients for the ith and i′th trends is equal to zero,
Σj cij ci′j = 0. Consider the linear, quadratic, and cubic coefficients shown in Table 6.2-1. Each
set of coefficients satisfies the two conditions:
Table 6.2-1 ▪ Coefficients for Linear, Quadratic, and Cubic Trend Components
and
The derivation of orthogonal polynomial coefficients, which are functions of the values of the
treatment levels (Xj), is illustrated in Appendix C. Orthogonal polynomial coefficients for p equal
to 3 through 10 are given in Appendix Table E.10 for the case in which the levels of the
independent variable are separated by equal intervals and the sample ns are equal. If the two
conditions are not satisfied, the coefficients must be derived following the procedure in
Appendix C. I assume for the remainder of this chapter that the two conditions are met.
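The two conditions are easy to verify numerically. The coefficients below are the standard tabled values for p = 4 equally spaced levels with equal ns, the case covered by Appendix Table E.10:

```python
# Linear, quadratic, and cubic orthogonal polynomial coefficients for p = 4
linear = [-3, -1, 1, 3]
quadratic = [1, -1, -1, 1]
cubic = [-1, 3, -3, 1]

def sums_to_zero(coeffs):
    """First condition: the coefficients for a trend sum to zero."""
    return sum(coeffs) == 0

def mutually_orthogonal(c1, c2):
    """Second condition: the cross products of two trends sum to zero."""
    return sum(a * b for a, b in zip(c1, c2)) == 0

print(all(sums_to_zero(c) for c in (linear, quadratic, cubic)))  # True
print(mutually_orthogonal(linear, quadratic),
      mutually_orthogonal(linear, cubic),
      mutually_orthogonal(quadratic, cubic))  # True True True
```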
where the linear, quadratic, and cubic trend contrasts are given by, respectively, ψ̂lin, ψ̂quad,
and ψ̂cub. Each contrast measures one component of the trend: linear, quadratic, or cubic.
This can be shown for the data in Figure 6.2-1. The sample means in
Figure 6.2-1(a) are , and . For Figure 6.2-1(a), suppose that I use
a third-degree polynomial equation to obtain . The equation is
where
Note that equation (6.2-3) yields , which is the mean for X1. Because the trend in Figure
6.2-1(a) is clearly linear, only the linear component is needed to describe the trend; the
quadratic and cubic components are both equal to zero. However, if I computed for the
quadratic trend in Figure 6.2-1(b), the quadratic component would not equal zero, , but
the linear and cubic components would both equal zero.
Each trend sum of squares has one degree of freedom. In the following section, I show that
SSBG can be partitioned into p − 1 orthogonal trend sums of squares, that is,

SSBG = SSlin + SSquad + ⋯ + SSp−1
This means that the best prediction equation for a set of population means is of no higher
degree than p − 1. In the interest of parsimony, researchers like to use an equation of degree
less than p − 1 when possible. In the following sections, I show how to determine the simplest
polynomial equation that provides a satisfactory fit for population means.
A basic question that confronts a researcher is deciding how many trend components are
required to adequately describe the relationship between the dependent and independent
variables. As a first approximation, a researcher can determine whether a linear equation
provides a satisfactory description of the trend. If it does not, other trend components can be
added to the equation.
A graph of the hand-steadiness means from Table 4.2-1 of Chapter 4 is shown in Figure 6.2-2.
For the moment, assume that the researcher has an a priori hypothesis about the linear
contrast for these data:
Let α = .05. Procedures for testing the null hypothesis are shown in Table 6.2-2. Because F =
15.32 > F.05; 1, 28, the researcher concludes that the linear component of the trend is
significant.
Table 6.2-2 ▪ Computational Procedures for Testing the Significance of the Linear
Contrast and Departure From Linearity
Omega squared is used to estimate the proportion of the population variance in the dependent
variable that is accounted for by the linear contrast. According to the following computation,
the linear contrast accounts for 31% of the variance. Thus, the linear contrast is statistically
significant and accounts for a large proportion of the variance. Clearly, the linear component is
necessary for describing the relationship between the independent and dependent variables.
But other components also may be necessary because the sample means do not all fall on a
straight line, as shown in Figure 6.2-2. The test for departure from linearity that is described
next addresses whether additional trend components are necessary to describe the
relationship.
A test for departure from linearity tests the hypothesis that one or more of the remaining p − 2 =
2 nonlinear trend contrasts makes a significant contribution to describing the relationship
between the levels of sleep deprivation and hand-steadiness. The statistical hypotheses are
Computational procedures for testing the hypothesis that the trend does not depart from
linearity are illustrated in Table 6.2-2. Because F = 0.67 < F.05; 2, 28, the researcher has no
reason for believing that the trend departs from linearity. Variability of the sample means around
the best-fitting straight line is assumed to represent error variability.
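For equal group sizes, the two tests in Table 6.2-2 reduce to a few lines. These are the standard formulas (SSψ = nψ̂²/Σcj², F = SSψ/MSerror, and F for departure from linearity = [(SSBG − SSlin)/(p − 2)]/MSerror); the means, n, MSerror, and SSBG in the example are hypothetical stand-ins, not the Table 6.2-2 values.

```python
def trend_contrast(means, n, coeffs, ms_error):
    """Return (SS, F) for one trend contrast with n subjects per group."""
    psi_hat = sum(c * m for c, m in zip(coeffs, means))
    ss = n * psi_hat ** 2 / sum(c ** 2 for c in coeffs)
    return ss, ss / ms_error

def departure_from_linearity(ss_bg, ss_lin, p, ms_error):
    """F for the pooled p - 2 nonlinear trend components."""
    return ((ss_bg - ss_lin) / (p - 2)) / ms_error

# Hypothetical data: four groups of n = 8, linear coefficients for p = 4
ss_lin, f_lin = trend_contrast([3.0, 4.0, 5.0, 6.0], 8, [-3, -1, 1, 3], 2.5)
f_dep = departure_from_linearity(44.0, ss_lin, 4, 2.5)
```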
If the test for departure from linearity is significant, a researcher then wants to determine which
higher-degree trend component or components contribute to the trend. These tests are a
posteriori: A decision to perform the tests is made after examining the data. For the hand-
steadiness data, the error rate for the collection of the quadratic and cubic trend components
should be the same as that assigned to the test for departure from linearity, which is α = .05.
The Dunn and the Dunn-Šidák procedures described in Chapter 5 provide a simple way to
control the per experiment or experimentwise error rate for the collection of the C = 2 tests. The
critical values for the Dunn and the Dunn-Šidák procedures can be obtained using Microsoft's
Excel FINV function.
A researcher may believe that some or all of the trend components contribute to a trend. In this
case, each trend contrast for which a priori hypotheses have been advanced should be tested
at the α level of significance. Recall from Section 5.1 that the conceptual unit for a Type I error
for a priori orthogonal contrasts is the individual contrast.
Suppose that the researcher advanced a priori hypotheses about the linear, quadratic, and
cubic trend components for the hand-steadiness data in Table 4.2-1. The statistical hypotheses
are
The linear-contrast hypothesis was tested in Table 6.2-2 and found to be significant.
Procedures for testing the quadratic and cubic contrast hypotheses are illustrated in Table 6.3-
1. It is apparent from Table 6.3-1 that neither the quadratic nor the cubic contrast is significant.
Notice that the p − 1 = 3 degrees of freedom for SSBG is partitioned into three trend sums of
squares:
Table 6.3-1 ▪ Computational Procedures for Testing the Significance of the Quadratic and
Cubic Contrasts
If the null hypotheses for the quadratic and cubic contrasts had been rejected, a researcher
would use omega squared to estimate the proportion of the population variance in the
dependent variable that is accounted for by these contrasts. The computations are as follows:
When a sample omega squared is negative, as is the case for , the best estimate of the
population value is 0.
In reporting one degree-of-freedom trend measures, one should always describe the nature or
direction of the trend. For the hand-steadiness experiment in Chapter 4, one could say that as
sleep deprivation increased from 12 to 30 hours, the mean number of times that the stylus
touched the side of the half-inch hole during a 2-minute interval increased from 3 to 6.
A summary of the trend tests that have been performed is shown in Table 6.3-2. The table
illustrates the results of using two approaches to testing hypotheses about trends. A researcher
should decide in advance which of the two approaches is appropriate and use that approach.
For example, if a researcher has an a priori hypothesis about the linear contrast and wants to
know whether it provides a satisfactory fit for the trend, he or she should perform the tests in
rows 1a and 1b of Table 6.3-2. If the test for departure from linearity in row 1b is significant, the
researcher will want to perform the tests in rows 1c and 1d to determine which of these higher-
degree trend components is required to obtain a satisfactory fit for the trend. In this example,
the test for departure from linearity was not significant, so the tests in rows 1c and 1d should
not be performed. Note that the level of significance for the collection of tests in rows 1c and 1d
should equal α = .05, the level of significance adopted for the test for departure from linearity.
Alternatively, if a researcher has a priori hypotheses about the linear, quadratic, and cubic
trends, he or she would perform the tests in rows 1a, 1c, and 1d. Each hypothesis should be
tested at the .05 level of significance. In this case, it would be pointless to test the departure
from linearity mean square in row 1b.
In experiments that involve many treatment levels, it is unlikely that tests of trend contrasts
beyond the cubic degree will add materially to a researcher's understanding of the data. The
usual practice, then, is to test trend contrasts for which there are a priori hypotheses and pool
the remaining trend components in a test for higher-order trends. The F statistic is
where
The SS LOWER-ORDER TRENDS is the sum of the SS for each trend component not included
as one of the higher-order components.
As I have shown, trend analysis is a valuable tool for gaining a clearer understanding of data
when the treatment represents a quantitative variable. Researchers should avoid extrapolating
beyond the data. For example, in the sleep deprivation experiment, we know that hand-
steadiness is linearly related to sleep deprivation over the interval of 12 to 30 hours. In all
likelihood, the relationship would become quadratic if the amount of sleep deprivation was
increased to, say, 48 or 60 hours. When the treatment represents a fixed effect, the results
apply only to the treatment conditions included in the experiment.
Assume that a researcher advanced an a priori hypothesis concerning the linear trend for the
hand-steadiness data. On the basis of the results in Table 6.2-2, the researcher concludes that
a polynomial equation of the form provides a satisfactory description of the
relationship between the independent and dependent variables. This equation contains the
linear component that is significant, F(1, 28) = 15.32, p < .001, and that accounts for a large
proportion of the variance in the dependent variable, . The simplest (lowest possible
degree) description of the trend is given by
It is interesting to see how well the predicted means, , correspond to the observed means,
. Computation of the predicted means for each level of the independent variable is shown in
part (ii) of Table 6.3-3. A graph of the predicted and observed means is shown in Figure 6.3-1.
Figure 6.3-1 ▪ Comparison of the predicted trend based on a first-degree equation from
Table 6.3-3 and the observed trend from Table 6.2-2.
The fit appears quite good. An exact fit would be obtained if a third-degree polynomial equation
was used. Obtaining an exact fit is not important because any set of p means can always be
fitted perfectly by a polynomial equation of degree p − 1. The most parsimonious description of
the trend of p means is the simplest equation for which each trend component makes a
significant contribution.
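With orthogonal coefficients, the first-degree predicted means are simply the grand mean plus the linear slope, ψ̂lin/Σcj², times each coefficient. A sketch, using hypothetical means rather than the Table 6.2-2 values:

```python
def linear_predicted_means(means, coeffs):
    """Predicted means from a first-degree fit: grand mean plus the
    linear slope (psi_lin / sum(c**2)) times each orthogonal coefficient."""
    grand = sum(means) / len(means)
    psi_lin = sum(c * m for c, m in zip(coeffs, means))
    slope = psi_lin / sum(c * c for c in coeffs)
    return [grand + slope * c for c in coeffs]

pred = linear_predicted_means([3.0, 4.5, 4.5, 6.0], [-3, -1, 1, 3])
```

Because the coefficients sum to zero, the predicted means automatically preserve the grand mean of the observed means, so no constant adjustment is needed.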
A goodness-of-fit test can be used to determine whether the difference between a set of
observed and predicted means is statistically significant. For example, a theory might predict a
quadratic relationship between the independent and dependent variables. A test of goodness
of fit helps a researcher decide whether a theory's predictions provide a good fit to the observed
means. A goodness-of-fit test can be significant for two reasons. First, the predicted and
observed means differ by some constant amount even though the pattern or shape of the two
sets of means is the same. Alternatively, the test can be significant because the pattern of the
predicted and observed means is different. Usually, a researcher is interested only in the latter
situation, that is, in determining if there is a departure from pattern. If the pattern of the
predicted and observed means is the same, it is a simple matter to add or subtract a constant
from each predicted mean to make the grand mean equal to that of the observed means.
For purposes of illustration, I apply the test for departure from pattern to the hand-steadiness
data. In this case, we know that the predicted means in Figure 6.3-1 provide a
satisfactory fit to the observed means. Furthermore, it is not necessary to add or subtract a
constant from the predicted means because the grand mean of the predicted means is equal to
the grand mean of the observed means. Computational procedures for an F test for departure
from the pattern are shown in Table 6.3-4. Because F = 0.67 < F.05; 2, 28, there is no reason for
believing that the pattern of the predicted means differs from that for the observed means. The
F DEP. FROM LIN test described earlier is a special case of the more general F DEP. FROM
PATTERN test. When a set of means has been fitted by a linear equation, the two F statistics
yield identical values.
In Section 4.4, I showed how to estimate the population strength of association between the
independent and dependent variables using and . Here I show how to compute the
Pearson product-moment correlation, denoted by r, and the correlation ratio or eta,
denoted by η̂, using ANOVA sums of squares. Computation of r uses the sum of squares for the
linear contrast, SSψ̂lin, described earlier. Descriptive measures of the linear correlation and the
curvilinear correlation are given by, respectively,

r = ±√(SSψ̂lin / SSTO)  and  η̂ = √(SSBG / SSTO)
Sums of squares for the hand-steadiness data are given in Table 6.3-2: , SSTO =
110.875, and SSBG = 41.375. The linear correlation between sleep deprivation and hand-
steadiness is
The sign of r is plus because an increase in sleep deprivation is associated with an increase in
the number of times that subjects touch the side of the hole with the stylus. The curvilinear
correlation is
If the variables are linearly related, | r | is equal to η̂. As the sample relationship departs from
linearity, η̂ increases relative to | r |.
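Both descriptive measures come straight from the sums of squares: r² = SSψlin/SSTO and η̂² = SSBG/SSTO. In the sketch below, SSTO = 110.875 and SSBG = 41.375 are the values stated in the text; SSψlin is left as a user-supplied input because its value is not legible here.

```python
import math

def linear_r(ss_lin, ss_to):
    """Unsigned linear correlation; attach the sign from the direction of the trend."""
    return math.sqrt(ss_lin / ss_to)

def eta_hat(ss_bg, ss_to):
    """Correlation ratio (eta) from ANOVA sums of squares."""
    return math.sqrt(ss_bg / ss_to)

print(round(eta_hat(41.375, 110.875), 2))  # 0.61
```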
In Section 6.3, I showed that SSBG can be partitioned into p − 1 orthogonal trend contrasts:

SSBG = SSlin + SSquad + ⋯ + SSp−1
The ideas developed in Sections 6.2 and 6.3 also can be applied to contrasts among means.
Suppose that a researcher is interested in the six pairwise contrasts among the four hand-
steadiness means in Table 6.2-2. The omnibus null hypothesis is rejected. However, simply
rejecting the hypothesis that μ1 = μ2 = μ3 = μ4 is not very informative. Knowing which pairwise
contrasts are significant and the amount of the variance that they account for helps a
researcher better understand the data. Using the Fisher-Hayter multiple comparison test, I
show here that two of the six contrasts are significant, contrasts 3 and 5:
The researcher can use omega squared to determine how much of the hand-steadiness
variance is accounted by the two significant contrasts. The sums of squares for contrasts 3 and
5 are, respectively,
The two significant contrasts account for 30% + 20% = 50% of the hand-steadiness variance.
The omega squareds for contrasts 3 and 5 are examples of one degree-of-freedom strength of
association measures that are recommended in the Publication Manual of the American
Psychological Association (American Psychological Association, 2010, p. 34). It can be shown
that when the mean contrasts are mutually orthogonal, the sum of the associated p − 1 omega
squareds is equal to the omega squared for the omnibus (overall) null hypothesis.
1.Terms to remember:
a.qualitative treatment (6.1)
b.quantitative treatment (6.1)
c. trend analysis (6.1)
d.polynomial (6.2)
e.orthogonal polynomial coefficients (6.2)
f. one degree-of-freedom measures (6.3)
g.higher-order trends (6.3)
h.SS LOWER-ORDER TRENDS (6.3)
i. goodness-of-fit test (6.3)
j. departure from pattern (6.3)
k. Pearson product-moment correlation (6.4)
l. correlation ratio (eta) (6.4)
*2. The effects of morphine on responses to restraint stress were investigated using male
Wistar rats. Thirty-two rats were randomly assigned to one of four conditions: a1 = 0 mg/kg
of morphine (control group), a2 = 2 mg/kg of morphine, a3 = 4 mg/kg of morphine, and a4 =
8 mg/kg of morphine. The rats were handled for 5 days prior to the start of the experiment
and singly housed in plastic cages. The rats were treated with daily injections of morphine
or saline for 10 days. Seven days after the last injection, they were subjected to 30 minutes
of restraint in Plexiglas tubes (7.5-cm diameter × 21.5-cm length). Immediately after the
restraint, the rats were placed in a plastic arena (40-cm length × 26-cm width × 16-cm
height). The roof of the arena was constructed from stainless steel bars. The arena was
illuminated by a 40-watt red light. The rats were tested for length of social interaction
(sniffing, grooming, following, crawling under, crawling over, and passive contact) with an
untreated, unfamiliar weight-matched male rat. A camera in the experimental room recorded
the rats’ behavior. The following data were obtained. (Experiment suggested by Blatchford,
K. E., Diamond, K., Westbrook, R. F., & McNally, G. P. Increased vulnerability to stress
following opiate exposures: Behavioral and autonomic correlates. Behavioral Neuroscience.)
*a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
*b.[6.2] Use an ANOVA F test to determine whether there is a trend in the data; let α = .05.
*c.[6.2, 6.3] Assume that a priori hypotheses about the p − 1 trend contrasts have been
advanced. (i) Test the null hypothesis for each of the trend contrasts at α = .05. (ii) What
proportion of the population variance in the dependent variable is accounted for by each
of the trend contrasts? (iii) Write the simplest polynomial equation necessary to
adequately describe the trend. (iv) Make a figure that compares the observed and the
predicted trends. (v) Determine whether the patterns of the predicted and the observed
means are different.
*d.[5.3, 6.5] Use Dunnett's multiple comparison statistic to determine which population
treatment means differ from the control group mean. Compute omega squared for the
contrasts that are significant.
*e.[6.4] Compute r and .
f. Prepare a “results and discussion section” appropriate for Behavioral Neuroscience.
*a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
*b.[6.2] Use an ANOVA F test to determine whether there is a trend in the data; let α = .05.
*c.[6.2, 6.3] Assume that a priori hypotheses about the p − 1 trend contrasts have been
advanced. (i) Test the null hypothesis for each of the trend contrasts at α = .05. (ii) What
proportion of the population variance in the dependent variable is accounted for by each
of the trend contrasts? (iii) Write the simplest polynomial equation necessary to
adequately describe the trend. (iv) Make a figure that compares the observed and the
predicted trends. (v) Determine whether the patterns of the predicted and the observed
means are different.
*a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
b.[6.2] Use an ANOVA F test to determine whether there is a trend in the data; let α = .05.
c. [6.2, 6.3] Assume that an a priori hypothesis about the linear trend contrast has been
advanced. (i) Test the null hypothesis for the linear trend; let α = .05. Perform a test for
departure from linearity; let α = .05. (ii) What proportion of the population variance in the
dependent variable is accounted for by the linear trend contrasts? (iii) Write the simplest
polynomial equation necessary to adequately describe the trend. (iv) Make a figure that
compares the observed and the predicted trends. (v) Determine whether the patterns of
the predicted and the observed means are different.
d.[5.3, 6.5] Use Dunnett's multiple comparison statistic to determine which population
means differ from the mean of the 12-hour sleep deprivation condition. Compute omega
squared for the contrasts that are significant.
e.[6.4] Compute r and .
5.From around one year of age, children in Western cultures spend considerable time
engaged in picture-book reading with their parents. A reenactment procedure was used to
determine whether 18-, 24-, and 30-month-old children could learn how to perform a novel
sequence of actions from picture-book interaction. Eight children in each age category were
recruited from newspaper birth announcements. Following a brief warm-up, a child sat
comfortably on a floor pillow while the experimenter read a six-page picture-book. The
experimenter drew the child's attention to each picture by pointing to it as she read the
following story.
Sandy has found some toys. Sandy has found a ball, a jar, and a stick. She can
use these things to make a rattle. Sandy is pushing the ball into the jar. Sandy is
picking up the stick and putting it in the jar. Sandy is shaking the stick to make a
noise. Shake, shake. Wow, Sandy made a rattle. Good job Sandy.
During the reenactment test, the experimenter placed the three objects in front of the child
and said, “Show me how you can use these things to make a rattle.” The experimenter
identified seven target actions necessary to make the rattle. Step 1 is looking at the three
objects, step 2 is picking up the ball, and step 3 is inserting the ball in the jar. The final step
is shaking the stick to make a noise. The dependent variable is the number of steps that the
child completed. The following data were obtained. (Experiment suggested by Simcock, G.,
& DeLoache, J. Get the picture? The effects of iconicity on toddlers’ reenactment from
picture books. Developmental Psychology.)
a1 a2 a3
18 months old  24 months old  30 months old
4 4 2
1 4 6
6 7 6
2 2 3
2 3 7
0 6 5
3 5 4
3 5 5
a.[4.2] Perform an exploratory data analysis on these data (see Table 4.2-1 and Figure
4.2-1). Assume that the observations within each treatment level are listed in the order
in which the observations were obtained. Interpret the analysis.
b.[6.2] Use an ANOVA F test to determine whether there is a trend in the data; let α = .05.
c. [6.2, 6.3] Assume that a priori hypotheses about the p − 1 trend contrasts have been
advanced. (i) Test the null hypothesis for each of the trend contrasts at α = .05. (ii) What
proportion of the population variance in the dependent variable is accounted for by each
of the trend contrasts? (iii) Write the simplest polynomial equation necessary to
adequately describe the trend. (iv) Make a figure that compares the observed and
predicted trends. (v) Determine whether the patterns of the predicted and the observed
means are different.
d.[5.5, 6.5] Use the Fisher-Hayter multiple comparison statistic to determine which
pairwise contrasts among means are statistically significant. Compute omega squared
for the contrasts that are significant.
e.[6.4] Compute r and .
f. Prepare a “results and discussion section” appropriate for Developmental Psychology.
1More precisely, a real polynomial in X is any expression that can be obtained from the real
numbers and X using only the operations of addition, subtraction, and multiplication.
http://dx.doi.org/10.4135/9781483384733.n6
This chapter introduces the general linear model and matrix operations for computing sums of
squares and estimating parameters in analysis of variance and multiple regression. The
classical sum-of-squares approach to the analysis of a completely randomized design
described in Chapters 3 and 4 evolved in the pre-computer era. It continues to be the preferred
approach when computations are performed with calculators. In contrast, a computer and
software for manipulating matrices are required to use the general linear model approach.1
Before describing this approach, I compare the main features of analysis of variance and
multiple regression and then introduce the matrix algebra operations that are used in the
general linear model.
In the typical analysis of variance situation, the researcher also wants to determine the extent to
which variation in the dependent variable is associated with variation in the independent
variables. And, as in regression analysis, the value of each independent variable is selected in
advance so that Y is observed for one or a combination of fixed X values. However, in analysis
of variance, even though the independent variables are quantitative, they are treated as if they
are qualitative. This is a key difference between the two situations. Analysis of variance ignores
the magnitude of differences among the levels of the independent variable; regression analysis
uses this information. When the independent variables are qualitative, there is no fundamental
difference between regression analysis and analysis of variance, and both approaches lead to
the same results.
Earlier I noted that the analysis of variance and multiple regression models are subsumed
under the general linear model. The vector and matrix operations that are required to
understand the general linear model are described next. A more detailed introduction to matrix
algebra is given in Appendix D.
A scalar quantity is a number or a symbol, such as Yij, that stands for a number. In previous
chapters, I used scalar (ordinary) algebra to derive formulas for sums of squares and scalar
arithmetic to compute the sums of squares. The emphasis on scalar quantities is appropriate if
computations are done with a calculator. If a computer is available, simpler, more elegant
formulas can be used to compute sums of squares. Computers are not limited to processing
scalar quantities; they can perform operations with arrays of numbers called vectors and
matrices.
Vector Operations
A vector is an ordered set of real numbers or symbols that stand for numbers. It is customary
to denote vectors by lowercase boldface letters and to enclose the elements of a vector in
brackets. For example, the dependent variable, Y11 = 3, Y21 = 1, Y31 = 4, Y41 = 2, in a
completely randomized design can be represented as a column vector and denoted by y:
Alternatively, the dependent variable can be represented as a row vector and denoted by y′:
The optional numbers below the label for a vector—for example, 1 × 4—indicate, respectively,
the numbers of rows and columns of the vector—that is, its dimension or order. A letter
without a prime denotes a column vector and, with a prime, a row vector. A letter with a prime—
say, y′—also denotes the transpose of y. The transpose of a vector is obtained by writing the
ordered elements of a column vector as a row vector and vice versa. The transpose of a
Two or more vectors can be added and subtracted. The addition of two vectors consists of
adding their corresponding elements—that is, adding the first element of one vector to the first
element of the second vector, and so on. Let x and y denote two n × 1 vectors. Then
The subtraction of two vectors is defined in a similar way. Let x′ and y′ denote two 1 × 2 vectors.
Then
The addition or subtraction of vectors is defined only for vectors that have the same dimensions
—either both n × 1 vectors or both 1 × n vectors. Such vectors are said to be conformable for
addition and subtraction.
A vector can be multiplied by another vector. If the first vector is a 1 × n row vector and the
second vector is a n × 1 column vector, the product of the two vectors, called the inner
product, is a scalar. The inner product is obtained by multiplying the first element of the row
vector by the first element of the column vector, and so on, and then adding the products.
Consider the following example:
The inner product of the transpose of a vector with the vector itself is equal to the sum of the
squared elements of the vector—that is, y′y = Y1² + Y2² + … + Yn². For y′ = [3 1 4 2], y′y = 3² + 1² + 4² + 2² = 30.
A sum vector, 1′ = [1 1 1 … 1], is so named because it can be used to obtain the sum of the
elements in a vector. The inner product of the transpose of a sum vector with another vector, y,
is equal to the sum of the elements of the y vector—that is, 1′y = Y1 + Y2 + … + Yn. Let 1′ = [1 1 1 1] and y′ =
[3 1 4 2]; then 1′y = 3 + 1 + 4 + 2 = 10.
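These vector operations translate directly into array arithmetic. A brief sketch in Python with NumPy, using the y vector above (the second vector x is hypothetical):

```python
import numpy as np

y = np.array([3, 1, 4, 2])   # the vector y' = [3 1 4 2] stored as a 1-D array
x = np.array([1, 2, 0, 4])   # a second 4 x 1 vector (hypothetical values)
one = np.ones(4, dtype=int)  # sum vector 1' = [1 1 1 1]

print(x + y)                 # elementwise addition: [4 3 4 6]
print(one @ y)               # 1'y = 3 + 1 + 4 + 2 = 10
print(y @ y)                 # y'y = 9 + 1 + 16 + 4 = 30
```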
Matrix Operations
A matrix is a rectangular array of real numbers arranged in rows and columns. It is customary
to denote matrices by uppercase boldface letters and to enclose the elements of a matrix in
brackets. For example, the pre- and posttest scores of four subjects can be represented by the
following matrix, where the rows represent subjects 1 to 4 and the columns, the pretest and
posttest scores, respectively:
The transpose of Y, which is denoted by Y′, is obtained by interchanging rows and columns so
that the ith row of the original matrix becomes the ith column, with the order retained from the
first to last row and column:
The addition of two or more matrices of the same order consists of adding their corresponding
elements—that is, adding the ijth element of one matrix to the ijth element of the other matrix
for all ij. For example,
Then
The multiplication of one matrix by another matrix is defined for matrices in which the number of
columns of the left-hand matrix equals the number of rows of the right-hand matrix. That is,
multiplication is defined if
The ijth element of Z is the inner product of the ith row of X and the jth column of Y, where i =
1, …, n and j = 1, …, p. Consider the following matrices:
and denote the product of X and Y by Z. The element in the first row and first column of Z is
the inner product of row 1 of X and column 1 of Y:
where 12 = (2 × 3) + (–1 × 6) + (3 × 4). The element in the first row and second column of Z is
the inner product of row 1 of X and column 2 of Y:
where 21 = (2 × 1) + (–1 × 2) + (3 × 7). The element in the second row and first column of Z is
the inner product of row 2 of X and column 1 of Y:
where 54 = (0 × 3) + (5 × 6) + (6 × 4). Finally, the element in the second row and second
column of Z is the inner product of row 2 of X and column 2 of Y:
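The four inner products can be verified with a matrix routine. A minimal sketch using the X and Y shown above:

```python
import numpy as np

X = np.array([[2, -1, 3],
              [0,  5, 6]])      # 2 x 3
Y = np.array([[3, 1],
              [6, 2],
              [4, 7]])          # 3 x 2

Z = X @ Y                       # defined because cols(X) = rows(Y) = 3
print(Z)                        # [[12 21]
                                #  [54 52]]
```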
Another type of product is called the Kronecker product.2 The Kronecker product of X and Y,
denoted by X ⊗ Y, resembles scalar multiplication of a matrix in that Y is multiplied by each
element of X. More specifically, X ⊗ Y is defined as
It is apparent from this definition that each element of X premultiplies Y. Furthermore, the
Kronecker product is defined for matrices and vectors of any order; the product of an n × m
matrix and a p × q matrix is a matrix of order np × mq. Consider the following examples:
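The Kronecker product and its np × mq order can be sketched numerically; the matrix values here are hypothetical:

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4]])        # n x m = 2 x 2 (hypothetical values)
Y = np.array([[0, 5, 1]])     # p x q = 1 x 3 (hypothetical values)

K = np.kron(X, Y)             # each element of X premultiplies Y
print(K)                      # [[ 0  5  1  0 10  2]
                              #  [ 0 15  3  0 20  4]]
print(K.shape)                # (2, 6): an np x mq = (2)(1) x (2)(3) matrix
```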
Some matrices are frequently used in statistics and as a result have been given names. Some
of these matrices are variations of a square matrix in which the number of rows and the
number of columns are the same. One example is a diagonal matrix, in which the elements on
the main diagonal are scalars, not all equal, and the off-diagonal elements are zeros:
Another example is a scalar matrix. A scalar matrix is a diagonal matrix in which all of the main
diagonal elements are the same:
An identity matrix, denoted by I, is a scalar matrix in which the elements on the main diagonal
are 1s:
A J matrix, in which every element is 1, can be obtained with a computer software package by
computing the outer product of two sum vectors, as in the following example:
This result and another result shown earlier, y′y = ΣYi², let us write the matrix formula for the
total sum of squares: SSTO = y′y − (1/N)y′Jy. Suppose that Y1 = 3, Y2 = 1, Y3 = 4, and Y4 = 2. The total sum of
squares is 30 − (1/4)(10)² = 30 − 25 = 5.
The trace of a square matrix is the sum of the elements on the main diagonal. The trace,
denoted by trace(X) or tr(X), of the matrix
is 3 + 2 = 5.
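The matrix formula for the total sum of squares can be checked numerically. A sketch using the four scores above:

```python
import numpy as np

y = np.array([3.0, 1.0, 4.0, 2.0])
N = len(y)
one = np.ones(N)
J = np.outer(one, one)            # N x N matrix of 1s: the outer product 1 1'

ss_to = y @ y - (y @ J @ y) / N   # SSTO = y'y - (1/N) y'Jy
print(ss_to)                      # 30 - 100/4 = 5.0
```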
The columns (rows) of a matrix can be linearly dependent or linearly independent. Consider the
following matrix:
On close inspection, it is apparent that the third column of X is equal to 2 times the first
column:
Whenever the columns (rows) of a matrix can be expressed as a linear combination of other
columns (rows), the columns (rows) are said to be linearly dependent. If the columns (rows) can
not be so expressed, the columns (rows) are linearly independent. The rank of a matrix,
denoted by r, is the number of linearly independent columns (rows) in the matrix. If the rank of
a matrix is equal to the smaller of the number of rows and columns in the matrix, the matrix is
said to be of full rank; otherwise, the matrix is of less than full rank. The X matrix just described
has three columns, but only two of the columns are linearly independent. Thus, the rank of X is
2, and the matrix is of less than full rank. The importance of the rank of a matrix becomes
apparent in the discussion of matrix inverses that follows.
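The rank computation can be delegated to software. A sketch with a hypothetical matrix whose third column is twice its first (the original X matrix is not reproduced in this excerpt):

```python
import numpy as np

# Hypothetical matrix: the third column equals 2 times the first column,
# so the columns are linearly dependent
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [2.0, 3.0, 4.0]])

# Only two columns are linearly independent: the matrix is square
# but of less than full rank
print(np.linalg.matrix_rank(X))   # 2
```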
In scalar algebra, the inverse of a number is its reciprocal: A number X multiplied by its inverse,
1/X, equals 1. In matrix algebra, the inverse of a matrix Y is another matrix, denoted by Y−1, such that YY−1 = Y−1Y = I,
where I is the identity matrix. The inverse of a matrix is defined only for square matrices and
exists only if the matrix is of full rank. If the inverse exists, the matrix is said to be nonsingular;
otherwise, the matrix is singular. The inverse of diagonal and scalar matrices is easy to
compute: Just replace each diagonal element by its reciprocal. For example, the inverse of a diagonal matrix D is the diagonal matrix D−1 whose nonzero elements are the reciprocals of the diagonal elements of D.
We can verify that D−1 is the inverse of D by computing DD−1 or D−1D. The product must
equal the identity matrix:
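This verification is easy to carry out numerically. A sketch with a hypothetical diagonal matrix:

```python
import numpy as np

D = np.diag([2.0, 4.0, 5.0])        # hypothetical diagonal matrix
D_inv = np.diag([0.5, 0.25, 0.2])   # reciprocals of the diagonal elements

# The product of a matrix and its inverse must equal the identity matrix I
assert np.allclose(D @ D_inv, np.eye(3))
assert np.allclose(D_inv @ D, np.eye(3))
assert np.allclose(np.linalg.inv(D), D_inv)
print("verified: D D^-1 = D^-1 D = I")
```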
The inverse of a nondiagonal square matrix can be obtained with the Gauss matrix inversion
procedure, which is described in Appendix D. Unfortunately, the procedure is tedious. The
computation of matrix inverses is best left to computers.
As you will see, an inverse is used in solving simultaneous linear equations. Before illustrating
the procedure, I summarize several properties of inverses, transposes, and matrix rank. More
complete summaries are given in Appendix D. Let U, V, and W denote matrices that are
conformable for addition or multiplication. Let X, Y, and Z denote square, nonsingular matrices
of the same order. The summary of properties follows:
1.XX−1 = X−1X = I
2.(X−1)−1 = X
3.(XYZ)−1 = Z−1Y−1X−1
4.(X′)−1 = (X−1)′
Consider the scalar equation aB = c. Assuming that a ≠ 0, I multiply both sides of the equation by the inverse of a:
a−1aB = a−1c, which gives B = a−1c. The matrix analog is the equation AB = C. Assuming that A−1 exists, I premultiply both sides of the equation by A−1:
A−1AB = A−1C, which gives B = A−1C. To solve for B in ABC = D,
premultiply both sides by A−1 and postmultiply both sides by C−1, assuming that A−1 and C−1
exist: B = A−1DC−1.
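A numerical sketch of solving AB = C with an inverse; the matrix values are hypothetical:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # hypothetical nonsingular matrix
C = np.array([[5.0],
              [10.0]])

B = np.linalg.inv(A) @ C            # B = A^-1 C
assert np.allclose(A @ B, C)        # check: premultiplying B by A recovers C
# In practice, np.linalg.solve(A, C) computes the same B more stably
assert np.allclose(B, np.linalg.solve(A, C))
print(B.ravel())                    # [1. 3.]
```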
The procedure just described works if the matrices are square. It would not work if, say, A is a
rectangular N × h matrix, because an inverse is defined only for square matrices. There is a way to get around this problem. Consider the equation AB = C, where A is N × h, and assume that N
is not equal to h.
The trick to solving for B is to premultiply both sides of the equation by the transpose of A as
follows: A′AB = A′C. (7.2-1)
This operation makes A′A an h × h square matrix. If A is of full rank, the inverse of A′A is
(A′A)−1. Premultiplying equation (7.2-1) by (A′A)−1 gives B = (A′A)−1A′C.
This concludes an abbreviated introduction to vector and matrix operations. The reader is
encouraged to study the expanded introduction in Appendix D and to consult texts on matrix
algebra such as Searle (2006) and Seber (2008). The concepts introduced in this section are
used to present the general linear model.
In this section, I express the multiple regression model and the analysis of variance model in
matrix notation. The matrix representation, which is called the general linear model, is the same
for both models.
A linear model consists of two parts: a model equation and associated assumptions. The
model equation relates the dependent variable, Y, to the underlying parameters, βs and ε, and
random variables, Xs, in a linear manner. The assumptions specify the nature of the random
components and any restrictions that the parameters must satisfy. Consider the multiple
regression model:
where εi is a random error term with mean equal to zero and variance equal to σ²ε, and εi and
εi′ are uncorrelated for all i and i′, i ≠ i′.
1.The observed value of Yi is the sum of two components: the constant predictor term, β0 +
β1X i1 + β2X i2 + … βh − 1X i,h − 1, plus the random error term, εi. The error term reflects
that portion of Yi that is not accounted for by the independent variables. Because the error
term is a random variable, Yi is also a random variable.
2.The expected value of the error term equals zero, E(εi) = 0; it follows that the expected
value of Yi is equal to the constant predictor term
3.If the independent variables are quantitative, the unknown parameters can be interpreted
as follows. The parameter β0 is the Y intercept. The β1, …, βh − 1s are partial regression
coefficients, or weights applied to the X ijs, to optimally predict Yi. The parameter β1, for
example, indicates the change in the mean response per unit increase in X i 1 when X i2,
X i3, …, X i,h − 1 are held constant. The interpretation of the parameters when the
independent variables are qualitative is discussed in Section 7.5.
4.When there is only one independent variable, the model is a simple regression model;
when there are two or more independent variables, it is a multiple regression model.
5.The error term, εi, is assumed to have constant variance, σ²ε. It follows that
the variance of Yi is
6.The error terms are assumed to be uncorrelated: The value of εi is not related to the value
of εi′ for all i and i′, i ≠ i′. Because the εis are uncorrelated, the Yis also are uncorrelated.
Model (7.3-1) is called a linear model because it is linear in the parameters: All of the
parameters appear to the first power; none is multiplied or divided by another parameter or
appears as an exponent. The model is also linear in the independent variables
because the variables appear only to the first power.
The multiple regression model can be written with vectors and matrices. Equation (7.3-1)
represents the following system of equations:
If the equivalence of equations (7.3-2) and (7.3-3) is not apparent, consider the first row of (7.3-2). It
can be obtained from (7.3-3) by computing the inner product of the first row of X and the
column vector β and adding ε1 to the product—that is,
Equation (7.3-4) is the general linear model equation; it can be used to represent a variety of
models, including the analysis of variance model, as I show next.
Consider the experimental design model equation for a completely randomized design that I
introduced in Section 2.2.
This system can be rewritten using an indicator variable that takes on the integer values zero
and 1. The indicator variable is used to indicate whether a parameter is or is not included in an
equation. If a parameter appears in an equation, its indicator variable is equal to 1; otherwise,
its indicator variable is equal to zero.
The equations in this system, like those in the regression model, are linear in the parameters.
Furthermore, they are linear in the indicator variables. I can write system (7.3-5) in matrix form
as
Note that X is a matrix of indicator variable values. Such a matrix is called a structural matrix.
Note also that the general linear model equations for the multiple regression model (7.3-4) and
the analysis of variance model (7.3-6) are identical. Thus, both models are subsumed under
the general linear model. In the following sections, I show how to solve for the vector of
unknowns, β, in the general linear model. Several solutions are described: Three involve
different coding schemes for the independent variable in a qualitative multiple regression model
(see Section 7.5), and one involves an analysis of variance cell means model (see Section 7.7).
Chapters 3 and 4 explained how to estimate and test hypotheses about the parameters of an
analysis of variance model. In this section, I show how this is done for a multiple regression
model.
There are several methods for estimating the parameters of a linear regression model. The
most frequently used method and the one that is outlined here is called the method of least
squares. The method is concerned with minimizing a function of the error in predicting Yi from
Ŷi = β̂0 + β̂1Xi1 + … + β̂h−1Xi,h−1, where the β̂s are parameter estimators and the Xs are
independent variable values. An error ε̂i, also called a residual, is the difference between the
value of the ith score, Yi, and the value predicted for that score, Ŷi—that is, ε̂i = Yi − Ŷi. The
objective of the method of least squares is to find estimators of the parameters β0, β1, …, βh−1
that minimize the sum of the squared errors, Σε̂i² = Σ(Yi − Ŷi)².
The method of finding numerical values for the β̂s that minimize Σε̂i² uses differential calculus and is
beyond the scope of this text.3 I report here in matrix form the least squares normal equations:4
X′Xβ̂ = X′y. (7.4-2)
To obtain the vector of parameter estimates β̂, premultiply both sides of equation (7.4-2) by the
inverse of X′X, assuming that the inverse exists: β̂ = (X′X)−1X′y.
This matrix equation gives the set of parameter estimates that minimize the sum of the squared
errors, Σε̂i² = (y − Xβ̂)′(y − Xβ̂).
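The normal-equations solution can be sketched and checked against a library least squares routine; the design matrix and parameter values here are simulated, not taken from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
N, h = 20, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, h - 1))])  # full-rank N x h
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.1, size=N)

# Least squares estimates from the normal equations: b = (X'X)^-1 X'y
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)

# The same estimates come from NumPy's least squares routine
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
print(np.round(beta_hat, 2))
```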
Two important properties of the least squares estimators are succinctly stated in the Gauss-
Markov theorem. For models (7.3-1) and (7.3-4), the least squares estimators of β0, β1, …, βh−1
are unbiased and have the minimum variance among all unbiased linear estimators. In deriving
the least squares estimators, it is not necessary to make any assumptions regarding the shape
of the distribution of the error term εi. Later when I test hypotheses about the parameters, it is
necessary to assume that the εis (and hence the Yis) are normally distributed. Then the
assumption in models (7.3-1) and (7.3-4) that the error terms are uncorrelated so that the
covariances are equal to zero becomes the assumption of independence of the error terms.
In the previous section, I described a procedure for obtaining estimates of the parameters β1,
β2, …, βh − 1 of the general linear regression model. The question then arises as to whether
one or more of the parameters differs from zero. In this and subsequent sections, I lay the
foundation for using an F statistic to test the null hypothesis that the parameters β1, β2, …, βh
− 1 are equal to zero versus the alternative that βj ≠ 0 for some j, where j = 1, …, h − 1.
I begin by partitioning the total sum of squares SSTO for a regression model. The partition is
similar to that for the completely randomized design in Section 3.2. There I started with the
identity
and, after squaring both sides and summing over the np scores, obtained
I showed that SSTO reflects the total dispersion of scores around the grand mean and is equal
to SSWG + SSBG.
For the regression model, the analogous identity is Yi − Ȳ = (Yi − Ŷi) + (Ŷi − Ȳ). (7.4-3)
These deviations for a simple regression model are illustrated in Figure 7.4-1. To obtain sums of
squares, I square both sides of equation (7.4-3) and sum over the N scores as follows:
It can be shown that the middle term on the right is equal to zero. Hence, SSTO = SSE + SSR,
where SSE = Σ(Yi − Ŷi)² denotes the error sum of squares and SSR = Σ(Ŷi − Ȳ)² the regression sum of squares. These
formulas are not the most convenient ones to use in computing sums of squares. Before
presenting more usable formulas, I discuss the degrees of freedom for the sums of squares.
The reader can think of degrees of freedom as (1) the number of independent observations for
a source of variation minus the number of independent parameters estimated in computing the
variation or (2) the number of observations whose values are free to vary. Following the first
definition, the total sum of squares has N − 1 degrees of freedom because Σ(Yi − Ȳ)²
involves N independent observations, but one degree of freedom is used up when we estimate
the unknown parameter μ with the sample mean Ȳ. By the same reasoning, the error sum of
squares has N − h degrees of freedom because Σ(Yi − Ŷi)² involves N independent
observations, but the function Ŷi involves estimating h independent parameters, β0, β1, …, βh−1.
Following the second definition, the regression sum of squares has h − 1 degrees of
freedom because Ŷi in Σ(Ŷi − Ȳ)² contains h parameter estimators, β̂0, β̂1, …, β̂h−1, but
the deviations Ŷi − Ȳ are subject to the constraint that Σ(Ŷi − Ȳ) must equal zero and hence
only h − 1 of the parameter estimators are free to vary.
A mean square is obtained by dividing a sum of squares by its degrees of freedom. The mean
squares for the regression model are given by MSR = SSR/(h − 1) and MSE = SSE/(N − h).
I observed earlier that the partition of the total sum of squares in equation (7.4-4) does not
provide formulas that are convenient for computational purposes. The preferred formulas in
matrix notation are as follows:5
where
y is an N × 1 vector of observations.
J is an N × N matrix of 1s.
C′ is a coefficient matrix that determines the parameters of β that are included in the
null hypothesis.
If MSR is substantially greater than MSE, then we would suspect that the null hypothesis is
false. On the other hand, if MSR and MSE are of the same order of magnitude, we would
suspect that MSR estimates only σ²ε and that the null hypothesis should not be rejected. As the
reader has probably anticipated, the statistic for testing the null hypothesis is F = MSR/MSE
with h − 1 and N – h degrees of freedom. I can justify the use of F here by recourse to
Cochran's theorem. For our purposes, the theorem can be stated as follows:
If N observations are obtained from the same normal population with mean μ and
variance σ², and if SSTO with N − 1 degrees of freedom is partitioned into m sums of
squares, SSj, with degrees of freedom df1, df2, …, dfm, then the m terms
SSj/σ², j = 1, …, m, are independent chi-square variables with dfj degrees of freedom if and only if df1 +
df2 + … + dfm = N − 1.
Earlier I partitioned SSTO into SSR and SSE, as well as the N − 1 degrees of freedom for the
total sum of squares into h − 1 and N − h. For model (7.3-1), I observed that E(Yi) = β0 + β1Xi1
+ β2Xi2 + … + βh−1Xi,h−1 and Var(Yi) = σ²ε. According to Cochran's theorem, if the N
observations are obtained from a normal population and the null hypothesis β1 = β2 = … = βh−1 = 0
is true, then MSR/MSE is the ratio of two independent chi-square variables, each divided by its degrees of freedom.
Recall from Section 3.1 that this is the definition of an F random variable. Thus, if the null
hypothesis is true, F = MSR/MSE is distributed as the F distribution with h − 1 and N − h
degrees of freedom.
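The F test just described can be assembled from the matrix formulas for the sums of squares. A minimal sketch (the data are a tiny hypothetical simple-regression example, not the book's):

```python
import numpy as np

def regression_f(X, y):
    """F = MSR/MSE for H0: beta_1 = ... = beta_{h-1} = 0, from matrix formulas."""
    N, h = X.shape
    J = np.ones((N, N))
    beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
    ss_to = y @ y - (y @ J @ y) / N        # SSTO = y'y - (1/N) y'Jy
    ss_e = y @ y - beta_hat @ (X.T @ y)    # SSE  = y'y - b'X'y
    ss_r = ss_to - ss_e                    # SSR  = SSTO - SSE
    return (ss_r / (h - 1)) / (ss_e / (N - h))

# Hypothetical example with h = 2 and N = 4
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 3.0])
print(regression_f(X, y))                  # 18.0, with 1 and 2 df
```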
Dummy Coding
Consider the hand-steadiness data for the sleep deprivation experiment described in Section
4.2. The experiment has four treatment levels. The regression model equation for this
experiment can be written as
where X 0 is always equal to 1 and the h − 1 = 3 qualitative independent variables X i1, X i2, and
X i3 are coded as follows:
The X matrix for the sleep deprivation experiment consists of four columns: The first column
always contains 1s, and the second through fourth columns contain coded values for X i1, X i2,
and X i3. The pattern of 1s and zeros for the X matrix is as follows:
This coding scheme for the indicator variables is called dummy coding.
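A dummy-coded X matrix of this form can be generated mechanically. A short sketch (the function name is mine, and the p = 4, n = 2 layout is chosen for compactness):

```python
import numpy as np

def dummy_design(p: int, n: int) -> np.ndarray:
    """N x h design matrix (h = p) with dummy coding: a leading column of 1s,
    a 1 in column j for subjects in treatment level a_j (j < p), and all
    zeros in the coded columns for subjects in the last level a_p."""
    X = np.zeros((p * n, p))
    X[:, 0] = 1.0                          # X0 is always equal to 1
    for j in range(p - 1):
        X[j * n:(j + 1) * n, j + 1] = 1.0
    return X

print(dummy_design(4, 2))
```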
Table 7.5-1 ▪ Computational Procedures for a CR-4 Design Using a Regression Model
With Dummy Coding
Another coding scheme, called effect coding, is often used for analysis of variance designs
with two or more treatments. The coding scheme is as follows:
This coding scheme looks a bit more complicated, but the resulting N × h X matrix is similar to
that for dummy coding except for the last treatment level ap. For treatment level ap, the
independent variables Xi1, Xi2, …, Xi,h−1 all receive negative 1s instead of zeros. To
illustrate the X matrix for this coding scheme, let p = 4 and the number of subjects in each
treatment level equal 2. The X matrix with effect coding is as follows:
A third coding scheme, called orthogonal coding, is sometimes used. The distinguishing
feature of orthogonal coding is that the columns of the X matrix, denoted by x0, x1, x2, …,
xh−1, are mutually orthogonal. In Sections 6.2 and 6.3, I tested trend contrasts with orthogonal
polynomial coefficients. A regression model could be used to test the trend contrasts by coding
x1 with the linear coefficients, x2 with the quadratic coefficients, and x3 with the cubic
coefficients as follows:
With this coding scheme, the contributions to SSR of the linear, quadratic, and cubic trend
contrasts are given by, respectively,
The coefficients β̂1, β̂2, and β̂3 are obtained from β̂ = (X′X)−1X′y; x1, x2, and x3 are the
respective columns of X, and y is a vector of observations. For example, the sum of squares
for the linear contrast is given by
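The orthogonal-coding computation can be sketched as follows; the trend coefficients are the standard orthogonal polynomial values for p = 4 equally spaced levels, and the scores are hypothetical:

```python
import numpy as np

# Orthogonal (trend) coding for p = 4 equally spaced levels, n = 2 per level.
# The orthogonal polynomial coefficients for p = 4 are linear (-3, -1, 1, 3),
# quadratic (1, -1, -1, 1), and cubic (-1, 3, -3, 1).
n = 2
lin = np.repeat([-3., -1., 1., 3.], n)
quad = np.repeat([1., -1., -1., 1.], n)
cub = np.repeat([-1., 3., -3., 1.], n)
X = np.column_stack([np.ones(4 * n), lin, quad, cub])

y = np.array([3., 5., 4., 6., 7., 8., 2., 3.])   # hypothetical observations

# Because the columns are mutually orthogonal, each trend contributes
# (x_j'y)^2 / (x_j'x_j) to SSR, and the three pieces sum to SSR exactly.
ss = [(xj @ y) ** 2 / (xj @ xj) for xj in (lin, quad, cub)]

beta = np.linalg.solve(X.T @ X, X.T @ y)
SSR = (((X @ beta) - y.mean()) ** 2).sum()
print(sum(ss), SSR)  # equal: the trend sums of squares partition SSR
```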
The various coding schemes—dummy, effect, and orthogonal coding—for X yield different
values for β̂ but the same values for SSR and SSE. Next I show how the general
linear regression model is used to compute sums of squares for an analysis of variance.
Most computer packages perform an analysis of variance using a linear regression model with
qualitative independent variables. Here I illustrate the procedure using dummy coding for the
hand-steadiness data in Table 4.3-1. The statistical hypotheses are
The only restriction on C′ is that its rows must be linearly independent. Procedures for
computing the sums of squares required to test the null hypothesis are illustrated in Table 7.5-
1; the results are summarized in Table 7.5-2. The .05 level of significance is adopted. Because
F > F.05; 3, 28, the null hypothesis is rejected. A comparison of Table 4.3-2 with Table 7.5-2
reveals that SSR = SSBG = 41.375, SSE = SSWG = 69.500, and F = 5.56 for both analyses. In
both cases, the null hypothesis is rejected. One begins to suspect that a test of the null
hypothesis, β1 = β2 = β3 = 0, for the linear regression model has some bearing on the
tenability of the null hypothesis, μ1 = μ2 = μ3 = μ4, for the analysis of variance model. I
examine this relationship next.
Correspondence Between Parameters of the Regression and the Analysis of Variance Models
Dummy coding. In the previous section, you saw that the use of a regression model with
dummy coding led to the same decision as that obtained with an analysis of variance model.
Now I show why this is so. For the sleep deprivation experiment, the expected values of Y for
the analysis of variance and regression models are as follows:
Consider the ith observation in treatment level a4; the respective expectations are
Equating these two expectations for treatment level a4, we find the following correspondence
between the parameters:
For the general case, β0 is equal to μp and βj is equal to μj – μp for j = 1, …, h − 1. If the null
hypothesis, β1 = β2 = β3 = 0, is true, it follows that
This is why the statistic F = MSR/MSE, which is used to test β1 = β2 = β3 = 0 for the regression
model, provides a test of μ1 = μ2 = μ3 = μ4, which is the hypothesis tested by F =
MSBG/MSWG in a completely randomized design.
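This equivalence is easy to verify numerically. The sketch below fits both analyses to the same hypothetical CR-4 scores (n = 2 per level; not the book's data) and compares the two F statistics:

```python
import numpy as np

# Regression F (dummy coding) versus ANOVA F for the same CR-4 data.
p, n = 4, 2
N = p * n
y = np.array([3., 5., 4., 6., 7., 8., 2., 3.])   # hypothetical observations

# Regression route.
X = np.ones((N, p))
for j in range(p - 1):
    X[:, j + 1] = 0
    X[n * j:n * (j + 1), j + 1] = 1
beta = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ beta
SSE = ((y - yhat) ** 2).sum()
SSR = ((yhat - y.mean()) ** 2).sum()
F_reg = (SSR / (p - 1)) / (SSE / (N - p))

# ANOVA route.
groups = y.reshape(p, n)
SSWG = ((groups - groups.mean(axis=1, keepdims=True)) ** 2).sum()
SSBG = n * ((groups.mean(axis=1) - y.mean()) ** 2).sum()
F_anova = (SSBG / (p - 1)) / (SSWG / (N - p))

print(np.isclose(F_reg, F_anova))  # True
```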
Effect coding. The use of effect coding leads to a different correspondence between the
parameters of the regression and analysis of variance models. Consider the ith observation in
treatment level a1; the respective expectations are as follows:
Equating these two expectations for treatment level a1, we find the following correspondence
between the parameters:
It follows that
It is easy to see where effect coding got its name: The regression parameters β1, β2, and β3
are equal to the treatment effect in the analysis of variance model. For the general case, β0 is
equal to μ and βj is equal to αj for j = 1, …, h − 1. If the null hypothesis, β1 = β2 = β3 = 0, is
true, it follows that
Unbiased estimates of the hand-steadiness population means, μjs, can be obtained by using
These equations are based on dummy coding and yield means that are identical to those given
in Table 4.2-1 for the hand-steadiness data. It is a simple matter to construct a p × h matrix W
such that the product of W and β̂ is equal to the vector of treatment means for a completely
randomized design—that is, μ̂ = Wβ̂,
where μ̂ = [μ̂1, μ̂2, …, μ̂p]′. Based on equations (7.5-1a–d), the W matrix for the general case
has the following form:
Note that the pattern of 1s and zeros in W follows that for the X matrix in Table 7.5-1. For the
data in Table 7.5-1, the vector of treatment means is obtained from
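The μ̂ = Wβ̂ computation can be sketched as follows; the scores are hypothetical, so the Table 7.5-1 values are not used here:

```python
import numpy as np

# Sketch of mu_hat = W @ beta_hat with dummy coding (p = 4, n = 2 per level).
p, n = 4, 2
X = np.ones((p * n, p))
for j in range(p - 1):
    X[:, j + 1] = 0
    X[n * j:n * (j + 1), j + 1] = 1
y = np.array([3., 5., 4., 6., 7., 8., 2., 3.])   # hypothetical observations
beta = np.linalg.solve(X.T @ X, X.T @ y)

# W has a leading column of 1s (every mean involves beta_0); row j adds
# beta_j for levels 1..p-1, and the last row is [1, 0, 0, 0] (mu_p = beta_0).
W = np.zeros((p, p))
W[:, 0] = 1
for j in range(p - 1):
    W[j, j + 1] = 1
mu_hat = W @ beta
print(mu_hat)  # [4.  5.  7.5 2.5] -- the observed treatment means
```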
I digress briefly and examine an alternative way of thinking about a test of the null hypothesis
β1 = β2 = … = βh − 1 = 0. The model I have been using to describe Yi is called a full model:
Full model
For the time being, I denote the error sum of squares for this model by SSE(F). We know from
Section 7.4 that this error sum of squares is given by
If the null hypothesis β1 = β2 = … = βh−1 = 0 is true, the model reduces to Yi = β0 + εi; this is called the reduced model. The error sum of squares for this model is denoted by SSE(R) and is given by
A researcher's task is to use sample data to choose between these two models. If the null
hypothesis is false, the full model provides a better fit to the Yis than the reduced model, and
SSE(F) should be less than SSE(R). If the null hypothesis is true, the reduced model describes
the Yis as well as the full model, and SSE(R) and SSE(F) should differ no more than would be
expected by chance. To summarize, a large positive difference between SSE(R) and SSE(F)
suggests that the null hypothesis is false. A negligible difference between SSE(R) and SSE(F)
suggests that the null hypothesis is tenable. The test of β1 = β2 = … = βh − 1 = 0 is carried out
using a function of SSE(R) – SSE(F)—namely,
where df(R) and df(F) are the degrees of freedom associated with the reduced and full models,
respectively.
Let us examine SSE(R) and SSE(F) more closely. The error sum of squares for the full model,
SSE(F), is simply SSE. For the reduced model, it can be shown6 that β̂0 = Ȳ. Hence, the
error sum of squares for the reduced model can be written as
Note that the formula on the right is identical to that for SSTO given in equation (7.4-4). Thus,
SSE(R) = SSTO, and SSE(R) – SSE(F) is equal to SSTO – SSE. Furthermore, according to
equation (7.4-4), SSTO – SSE = SSR. Thus, the F statistic can be written as follows:7
where SSR and MSR denote, respectively, the regression sum of squares and the regression
mean square. If F ≥ Fα; h − 1, N – h, a researcher can conclude that one or more of the
parameters β1, β2, …, βh − 1 are not equal to zero and that the full model provides a better fit
to the Yis than the reduced model.
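The full- versus reduced-model comparison can be sketched in code. The data are hypothetical, and the reduced model is the intercept-only model described above:

```python
import numpy as np

# Full versus reduced model for hypothetical CR-4 data (n = 2 per level).
# Reduced model Y_i = beta_0 + e_i, so SSE(R) turns out to equal SSTO.
p, n = 4, 2
N = p * n
y = np.array([3., 5., 4., 6., 7., 8., 2., 3.])

X_full = np.ones((N, p))                 # dummy-coded full model
for j in range(p - 1):
    X_full[:, j + 1] = 0
    X_full[n * j:n * (j + 1), j + 1] = 1

def sse(X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return ((y - X @ b) ** 2).sum()

SSE_F = sse(X_full)
SSE_R = sse(np.ones((N, 1)))             # intercept-only reduced model
dfR, dfF = N - 1, N - p

# F = [ (SSE(R) - SSE(F)) / (df(R) - df(F)) ] / [ SSE(F) / df(F) ]
F = ((SSE_R - SSE_F) / (dfR - dfF)) / (SSE_F / dfF)
print(SSE_R, SSE_F, F)  # SSE(R) = SSTO, and F = MSR/MSE
```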
Sometimes it is useful to consider models that are somewhere between the reduced model just
described and the full model—that is, models that include some but not all of the independent
variables. I refer to such models as partially reduced models. Consider a regression model
with three independent variables: Xi1, Xi2, and Xi3. Suppose that a researcher wants to
determine whether X3 makes a contribution, over and above that attributable to X1 and X2, to
minimizing the error sum of squares, Σ(Yi − Ŷi)2. The contribution of X3 can be determined by
comparing the error sum of squares for model (7.6-1), which omits Xi3, with that for model
(7.6-2), which contains all three independent variables.
Model (7.6-1) is the partially reduced model; model (7.6-2) is the full model. I could denote the
error sum of squares for the partially reduced model by SSE(R) and that for the full model by
SSE(F). However, it is useful to have a more explicit notation. One such notation lists the
independent variables that are included in a model in parentheses following SSE and MSE.
With this notation, the error sums of squares for models (7.6-1) and (7.6-2) are denoted by,
respectively, SSE(X1X2) and SSE(X1X2X3). The reduction in the error sum of squares due to
fitting the full model over and above that due to fitting the reduced model is denoted by
SSR(X3 | X1X2) and is given by
To determine whether X3 makes a contribution to describing the Yis, I can use the test statistic
with 4 − 3 = 1 and N − 4 degrees of freedom. If F ≥ Fα; 1, N − 4, I can reject the hypothesis that
β3 = 0 and conclude that X3 makes a contribution, over and above that of X1 and X2, to
minimizing the error sum of squares.
The data in Table 7.5-1 are used to illustrate the test of β3 = 0. The reduced-model error sum
of squares, SSE(X1X2), is computed by eliminating the column labeled x3 from the X matrix
and performing the computations using the first column (x0) and the two remaining
independent variables (columns x1 and x2). The error sum of squares for the reduced model is given by
where
The error sum of squares for the full model, SSE(X1X2X3) = SSE = 69.500, is obtained from
Table 7.5-1. The F statistic is
which exceeds the critical value F.05; 1, 28 = 4.20. On the basis of the correspondence
between the parameters of the regression and ANOVA models shown in equation (7.5-1d), I
conclude that β3 = μ3 – μ4 ≠ 0.
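The same partial test can be sketched numerically with hypothetical data; the example drops the x3 column exactly as described above:

```python
import numpy as np

# Partial test of X3 given X1 and X2:
# SSR(X3 | X1 X2) = SSE(X1 X2) - SSE(X1 X2 X3).
# Dummy-coded CR-4 data, n = 2 per level; the y scores are hypothetical.
p, n = 4, 2
N = p * n
y = np.array([3., 5., 4., 6., 7., 8., 2., 3.])
X = np.ones((N, p))
for j in range(p - 1):
    X[:, j + 1] = 0
    X[n * j:n * (j + 1), j + 1] = 1

def sse(M):
    b = np.linalg.lstsq(M, y, rcond=None)[0]
    return ((y - M @ b) ** 2).sum()

SSE_full = sse(X)             # SSE(X1 X2 X3)
SSE_red = sse(X[:, :3])       # SSE(X1 X2): the x3 column is eliminated
SSR_3 = SSE_red - SSE_full    # SSR(X3 | X1 X2), with 1 degree of freedom
F = (SSR_3 / 1) / (SSE_full / (N - p))
print(SSR_3, F)
```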
The F statistic just illustrated is very versatile and can be used to test a variety of hypotheses.
For example, the contribution of independent variables X2 and X3 over and above that of X1 is
given by
A second scheme for denoting sums of squares is called reduction notation; it is ordinarily
used with analysis of variance models. In reduction notation, following Searle (1997), the
regression sum of squares is denoted by R(), where the parentheses contain the parameters
that have been fitted in the model. For example, the reductions in sums of squares for models
(7.6-1) and (7.6-2) are denoted by R(β0 β1 β2) and R(β0 β1 β2 β3), respectively. The additional
reduction in sum of squares due to fitting model (7.6-2) over and above that due to fitting model
(7.6-1) is denoted by
For purposes of comparison, this same reduction in sum of squares using the first notation is
written as
In this and the preceding sections, I used the multiple regression version of the general linear
model to analyze data for a completely randomized design. As you have seen, the general
linear regression model leads to the same results as those obtained with the classical sum-of-
squares approach described in Chapter 4. The reader may wonder why a variety of regression
model approaches have been developed to perform an analysis of variance. Why not solve for
the analysis of variance parameters directly, instead of indirectly in terms of regression
parameters? It turns out that when a computer is used, it is easier to solve for the β parameters
of a regression model than it is to solve for the β parameters of the classical analysis of
variance model. In the next section, I describe an alternative analysis of variance model called
the cell means model. This model is easy to implement on a computer and is conceptually
simpler than the regression and classical analysis of variance models.
Analysis of variance models can be classified as either full-rank models or less than full-rank
models. The distinction is based on the rank of the structural matrix, X. Recall from Section 7.2
that the rank, r, of a matrix is the number of linearly independent columns (rows) in the matrix.
A matrix is of full rank if its rank is equal to the smaller of the number of its rows and columns;
otherwise, the matrix is of less than full rank. Consider the classical analysis of variance model
for a completely randomized design:
On close inspection, it is apparent that the first column of X is equal to the sum of columns 2
through 5. Thus, the columns of X are not linearly independent, and the matrix is of less than
full rank. For such matrices, the inverse (X′X)−1 does not exist, and it is not possible to solve
for the vector of parameter estimates, β̂, as I did for the multiple regression model using
β̂ = (X′X)−1X′y. Model (7.7-1) is called a less than full-rank model.
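The rank deficiency is easy to exhibit numerically. The sketch below builds the overparameterized X for a hypothetical CR-4 design and checks its rank:

```python
import numpy as np

# The overparameterized ANOVA X for a CR-4 design: a column of 1s followed
# by one indicator column per level (p = 4, n = 2). The first column equals
# the sum of the last four, so X is rank deficient.
p, n = 4, 2
X = np.zeros((p * n, p + 1))
X[:, 0] = 1
for j in range(p):
    X[n * j:n * (j + 1), j + 1] = 1

rank = np.linalg.matrix_rank(X)
print(rank)  # 4, not 5 -- so (X'X)^{-1} does not exist
```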
Statisticians have developed a number of techniques for obtaining parameter estimates that
circumvent the rank deficiency of X. The three most widely used techniques8 involve the
following:
An alternative approach to the rank-deficiency problem is to use a full-rank model called the
cell means model. This analysis of variance model is
Notice that the columns of X are linearly independent. Thus, the X matrix is of full rank, and it is
possible to solve for the vector of treatment mean estimates, μ̂, using μ̂ = (X′X)−1X′y. In
addition to being of full rank, the cell means model differs from the classical analysis of
variance model in terms of the parameters of interest. For the cell means model, interest
centers on treatment (cell) means, μ1, μ2, …, μp. For the classical analysis of variance model,
interest centers on treatment effects, α1, α2, …, αp.
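A sketch of the cell means solution with hypothetical scores:

```python
import numpy as np

# The full-rank cell means model for a CR-4 design (p = 4, n = 2; the y
# values are hypothetical). X has exactly one indicator column per treatment
# level, so (X'X)^{-1} exists and the least-squares solution
# mu_hat = (X'X)^{-1} X'y is simply the vector of treatment means.
p, n = 4, 2
y = np.array([3., 5., 4., 6., 7., 8., 2., 3.])
X = np.zeros((p * n, p))
for j in range(p):
    X[n * j:n * (j + 1), j] = 1

mu_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(mu_hat)  # [4.  5.  7.5 2.5] -- the observed cell means
```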
Although I described the classical analysis of variance model first, this is not the order in which
the models evolved historically. According to Urquhart, Weeks, and Henderson (1973), Fisher's
early development of ANOVA was conceptualized by his colleagues in terms of treatment
means. It was not until later that the treatment means were given a linear structure in terms of
the grand mean and treatment effects—that is, μj = μ + αj. The classical analysis of variance
model (7.7-1) uses two parameters, μ + αj, to represent one parameter, μj. Because of this
structure, model (7.7-1) is referred to as an overparameterized model. As I have shown, the
model is not of full rank, and this led to the development of alternative approaches to solving
for the model parameters.
The use of the overparameterized analysis of variance model poses other problems for a
researcher. With the cell means model, in contrast, there is never any ambiguity about the
hypothesis that is tested. This follows because the null hypothesis is expressed in terms of
mean contrasts, and sample estimators of the mean contrasts are incorporated in the
between-groups sum-of-squares formula, as I show next.
I use the sleep deprivation experiment described in Section 4.2 to illustrate the cell means
model approach to the analysis of variance. The experiment has p = 4 treatment levels with n =
8 subjects randomly assigned to each level. Three forms of the general linear model can be
used to determine if sleep deprivation affects hand-steadiness: the overparameterized ANOVA
model, the regression model, and the cell means ANOVA model. The null hypothesis for the
overparameterized ANOVA model is
where μ is the grand mean, μ = (μ1 + μ2 + μ3 + μ4)/4. Alternatively, a regression model with
effect coding can be used to determine if sleep deprivation affects hand-steadiness. The
regression model null hypothesis,
with h − 1 parameters provides a test of hypothesis (7.7-3) because, as I showed in Section 7.5,
The cell means model also can be used to test the hypothesis about sleep deprivation. The null
hypothesis for this model is expressed with p − 1 contrasts of the treatment means. For
example, the null hypothesis that corresponds to hypothesis (7.7-3) is
A hypothesis of mean contrasts for testing hypothesis (7.7-3) can be expressed in a variety of
ways. A second mean contrasts hypothesis is
In order for the mean contrasts hypothesis H′μ = 0 to be testable, the hypothesis matrix, H′,
must be of full row rank. This means that each row of H′ must be linearly independent of every
other row. The maximum number of such rows is p − 1. To summarize, the cell means model
approach to ANOVA tests hypotheses about contrasts of treatment means instead of
hypotheses about treatment effects. Furthermore, the researcher determines which mean
contrasts are tested when the hypothesis matrix is specified.
The first step in analyzing data with the cell means model is to formulate the mean contrasts
hypothesis: H′μ = 0. This step is straightforward for the completely randomized design. For
more complex designs with missing observations and missing cells, the researcher must make
choices about which mean contrasts to test. For designs with several treatments, the
The cell means formula for computing the between-groups sum of squares is
Notice that the coefficient matrix C′ (or, equivalently, H′ for a CR-p design) appears in the
formula. Hence, there is never any ambiguity about the hypothesis that is tested because the
hypothesis matrix is a part of the formula.
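Assuming the standard matrix form of this computation, SSBG = (H′μ̂)′[H′(X′X)−1H]−1(H′μ̂), the procedure can be sketched with hypothetical scores:

```python
import numpy as np

# Cell means computation of SSBG, assuming the standard matrix form
# SSBG = (H'mu_hat)' [H'(X'X)^{-1} H]^{-1} (H'mu_hat). Hypothetical CR-4
# data, n = 2 per level; H' holds p - 1 linearly independent contrasts.
p, n = 4, 2
y = np.array([3., 5., 4., 6., 7., 8., 2., 3.])
X = np.zeros((p * n, p))
for j in range(p):
    X[n * j:n * (j + 1), j] = 1
XtX_inv = np.linalg.inv(X.T @ X)
mu_hat = XtX_inv @ (X.T @ y)

Ht = np.array([[1., -1., 0., 0.],      # mu1 - mu2 = 0
               [0., 1., -1., 0.],      # mu2 - mu3 = 0
               [0., 0., 1., -1.]])     # mu3 - mu4 = 0
v = Ht @ mu_hat
SSBG = v @ np.linalg.solve(Ht @ XtX_inv @ Ht.T, v)

# Classical check: SSBG = n * sum over j of (mean_j - grand mean)^2
print(SSBG, n * ((mu_hat - y.mean()) ** 2).sum())  # identical
```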
I use the hand-steadiness data in Table 7.5-1 to illustrate the computational procedures for the
cell means model. Procedures for computing sums of squares for the hand-steadiness data are
illustrated in Table 7.7-1. The results in Table 7.7-1 are identical to those in Table 4.3-1 where
the classical sum-of-squares approach to ANOVA was used.
Table 7.7-1 ▪ Computational Procedures for CR-4 Design Using a Cell Means Model
7.8 Summary
This chapter describes three forms of the general linear model approach to ANOVA: the
classical overparameterized model, the linear regression model, and the cell means model.
Many computer software packages use the linear regression model to perform an analysis of
variance. Consequently, a familiarity with the approach is helpful in interpreting ANOVA
computer printouts.
As I demonstrated in this chapter, a regression model with qualitative independent variables can be used to estimate the
parameters of the analysis of variance model and to partition the total variation so as to test the
ANOVA null hypothesis. This is done by establishing a correspondence between the p − 1
treatment levels of a completely randomized design and the h − 1 independent variables of a
regression model. Three schemes for coding the independent variables of a regression model
are described: dummy coding, effect coding, and orthogonal coding. Although the solution
β̂ = (X′X)−1X′y depends on the particular coding scheme used, certain linear combinations of
the β̂j are unique and can be used to estimate the ANOVA treatment means.
Analysis of variance models can be classified as either less than full-rank models or full-rank
models. The distinction is based on the rank of the structural matrix. A model is said to be of
full rank if X is of full rank; otherwise, the model is of less than full rank. The solution
β̂ = (X′X)−1X′y for the classical overparameterized model
cannot be obtained in the usual way because the inverse (X′X)−1 does not exist. Although
statisticians have developed ways to get around this problem, the simplest solution is to use
the cell means model,
which is a full-rank model. The solution μ̂ = (X′X)−1X′y for the cell means model can be
obtained because (X′X)−1 always exists. This model has other advantages: (1) It is
conceptually simpler than the regression model and the classical overparameterized analysis of
variance model, (2) it enables a researcher to test hypotheses about any linear combination of
treatment (cell) means, and (3) the formula for computing the between-groups sum of squares
contains the means contrasts matrix, so there is never any ambiguity about the hypothesis that
is tested. The importance of the latter advantage becomes more apparent when I discuss
complex designs with unequal cell frequencies.
1.Terms to remember:
a.scalar quantity (7.2)
b.vector (7.2)
c. dimension or order (7.2)
d.transpose (7.2)
e.scalar multiplication (7.2)
f. inner product (7.2)
g.sum vector (7.2)
h.matrix (7.2)
i. Kronecker product (7.2)
j. square matrix (7.2)
k. diagonal matrix (7.2)
*a.y′
*b.y + b
*c.b′y
*d.1′c
*e.B′
*f.3X′
*g.b′B
*h.X′X
*i.c′S + c′
*j.y′y – b′X′y
*k.c⊗S
l. y + 2b
m.y′y
n.1′y
o.Bc
p.X′y
q.y′Jy
r. b′Jb
s. (X′X)(X′y)
t. b′X′y
u.c′Sc
v. S⊗X
w.y′y – y′Jy(1/3)
x. b′X′y – y′Jy(1/3)
*4.[7.2] (i) Determine the ranks of the following matrices. (ii) Which are of full rank?
*5.[7.3] Comment on the following statement: The linear model for a completely randomized
design is Yij = μ + αj + ε i(j) (i = 1, …, n; j = 1, …, p).
*6.[7.3]
*7.[7.3] Write the following systems of equations using vectors and matrices.
*a.Regression model equations with two independent variables:
*8.[7.3] Write the following systems of equations using vectors and matrices.
*a.CR-2 design
b.CR-3 design
c. [7.2, 7.7] Explain why the X matrices for parts (a) and (b) are of less than full rank.
*12.
[7.5] Assume that the X matrix for a qualitative regression model has been coded for the
following ANOVA designs using (i) dummy coding and (ii) effect coding. Indicate the
correspondence between the parameters of the regression and analysis of variance models.
*a.CR-3 design
*b.CR-5 design
c. CR-4 design
d.CR-6 design
*13.
The effects of an interpersonal skill training program on male psychiatric inpatients were
investigated. The program was designed to increase their performance competence in
initiating conversations, dealing with rejection, and being more assertive and self-disclosing.
Six patients were randomly assigned to one of three groups. Patients in group a1, referred
to as the interpersonal skill training group, received three 1-hour training sessions that
covered 11 problem situations. Training techniques included behavior rehearsal, modeling,
coaching, recorded response playback, and corrective feedback. Patients in group a2, the
pseudo-therapy control group, covered the same 11 problem situations, but instead of
receiving suggestions for specific behavioral changes, they were encouraged to explore
their feelings about the problems and to seek insight into the psychological and historical
reasons for these feelings. Patients in group a3, referred to as the assessment control
group, participated in the posttreatment assessment but received no training. At the
conclusion of the experiment, the patients were instructed to perform eight tasks, including
initiating a conversation with a male stranger (a confederate), asking the stranger to lunch,
and terminating the conversation after 10 minutes. The confederate confronted the patient
with three “critical moments”: not hearing the patient's name when introduced, responding
to the lunch invitation with an excuse, and saying to the patient, “Tell me about yourself.”
The dependent variable was the number of interaction tasks (out of eight) that the patient
was able to complete. The following data were obtained. The number of subjects has been
reduced so that the computations can be done without a computer. (Experiment suggested
by Goldsmith, J. B., & McFall, R. M. Development and evaluation of an interpersonal skill-
training program for psychiatric inpatients. Journal of Abnormal Psychology.)
a1  a2  a3
 8   2   1
 6   3   3
(X′X)−1, X′y, β̂, and β̂′X′y. Use the Gauss matrix inversion technique that is
described in Appendix D to obtain (X′X)−1.
*e.[7.5] Assume that dummy coding has been used; compute MSR, MSE, and SSTO.
Construct an ANOVA table; let α = .05.
*f.[7.5] Determine the correspondence for this CR-3 design between the parameters of the
regression and classical analysis of variance models when dummy coding is used.
*g.[7.5] Compute the vector of treatment means using μ̂ = Wβ̂; assume that dummy
coding is used.
*h.[7.6] (i) Write a reduced model for computing SSE(0)—that is, a model that contains no
independent variables. (ii) Compute the vectors and matrices for this reduced
model (let the subscript 1 identify the reduced model). (iii) Compute SSE(0) and
SSR(X1X2 | 0) = SSE(0) − SSE(X1X2), where SSE(X1X2) is equal to SSE from part 13(e).
(iv) Compare SSR(X1X2 | 0) with SSR from part 13(e).
*i.[5.3] Use Dunnett's statistic to test null hypotheses for contrasts involving the control
group mean. The critical value of Dunnett's statistic, tDN.05/2; 3, 3, is approximately 4.
*j.[7.5] Determine the correspondence for this CR-3 design between the parameters of the
regression and analysis of variance models when effect coding is used.
*k.[7.5] Assume that effect coding is used; compute the following: X′X, (X′X)−1, X′y, β̂,
β̂′X′y, and y′JyN−1. Use the Gauss matrix inversion technique that is described in
Appendix D to obtain (X′X)−1.
14.
The effects of externally imposed deadlines on an individual's task performance and
subsequent interest in the task were investigated. The subjects, eight male undergraduate
college students, were asked to play five sets of enjoyable word games under one of four
sets of instructions. In the no-deadline condition, treatment level a1, subjects played with
the game with no performance requirements or time constraints. In the two deadline
conditions, the subjects were told (1) that they would be allowed to play with the game for
15 minutes, (2) to work as quickly as possible, and (3) that most students could finish the
game within the time period. In the implied-deadline condition, treatment level a2, no further
instructions were given; in the explicit-deadline condition, treatment level a3, the subjects
were told that they were required to finish the game in the allotted time in order for their
data to be useful. In the work-fast condition, treatment level a4, subjects were asked to
work as quickly as possible, but they were given no time limits or information about the
performance of others. Eight subjects were randomly assigned to the four conditions. One
of the dependent variables was the length of time (in minutes) required to complete the
games. The following data were obtained. The number of subjects has been reduced so
that the computations can be done without a computer. (Experiment suggested by Amabile,
T. M., DeJong, W., & Lepper, M. R. Effects of externally imposed deadlines on subsequent
intrinsic motivation. Journal of Personality and Social Psychology.)
j. [7.5] Determine the correspondence for this CR-4 design between the parameters of the
regression and analysis of variance models when effect coding is used.
k. [7.5] Assume that effect coding has been used; compute the following: X′X, (X′X)−1,
X′y, β̂, and β̂′X′y. Use a computer to obtain (X′X)−1. If a computer is not available, use the
Gauss matrix inversion technique that is described in Appendix D.
l. [7.5] (i) Assume that effect coding has been used; compute MSR, MSE, and SSTO. (ii)
If part (e) was done, compare the results of the two analyses.
m.[7.5] Compute the vector of treatment means using μ̂ = Wβ̂; assume that effect coding is
used.
*15.
[7.7] A mean contrasts hypothesis equivalent to H0: α1 = α2 = α3 = α4 = 0 c a n b e
formulated for a cell means model in a variety of ways.
*a.Write four alternative formulations of this null hypothesis using population means.
*b.Write each mean contrasts hypothesis using vectors and matrices.
16.
[7.7]
a.Formulate three mean contrasts hypotheses for a cell means model that are equivalent
to
*17.
[7.7]
18.
[7.7]
*Readers who are interested only in the classical sum-of-squares approach to analysis of
variance can, without loss of continuity, omit this chapter and similarly marked sections.
1Calculators are available that perform matrix operations, including the inversion of a matrix.
4Normal equations are a set of equations derived by the method of least squares to obtain
estimates of the parameters β0, β1, …, βh − 1.
5The derivation of SSE follows from the definition of and equation (7.4-2):
7Some writers express this statistic in terms of the coefficient of multiple determination, R2,
which represents that proportion of the total variability among the Y scores that is accounted for
by the independent variables X 1, X 2, …, X h − 1. One formula for computing R2 is
8Detailed coverage of these methods can be found in Gentle (2010), Searle (2006), and Seber
(2008).
9For a discussion of these problems, see Hocking (1985), Hocking and Speed (1975), Speed
(1969), Timm and Carlson (1975), and Urquhart, Weeks, and Henderson (1973).
http://dx.doi.org/10.4135/9781483384733.n7
This chapter describes two designs: a randomized block design and a generalized randomized
block design. Both designs use a blocking procedure to reduce error variance. In the behavioral
sciences, health sciences, and education, differences among the experimental units can make
a significant contribution to error variance, thereby masking or obscuring the effects of a
treatment. Similarly, administering the levels of a treatment under different environmental
conditions—say, at different times of the day, locations, and seasons of the year—can
contribute to error variance and mask treatment effects. Variation in the dependent variable
attributable to such sources is called nuisance variation. In Section 1.7, I described three
experimental approaches to controlling these undesired sources of variation:
1.Hold the nuisance variable constant, for example, use only 18-year-old women and
administer all of the treatment levels at the same time.
2.Assign the experimental units randomly to the treatment levels so that known and
unsuspected sources of variation among the units are distributed over the entire experiment
and thus do not affect just one or a limited number of treatment levels.
3.Include the nuisance variable as one of the factors in the experiment.
The latter approach uses a blocking procedure to isolate variation attributable to the nuisance
variable so that it does not appear in estimates of treatment and error effects. The procedure
involves forming n blocks of p homogeneous experimental units, where p is the number of
levels of the treatment and the n blocks correspond to the levels of the nuisance variable. The
blocks are formed so that at the beginning of the experiment, the experimental units in each
block are more homogeneous with respect to the nuisance variable than are those in different
blocks.
To illustrate, suppose that I plan to compare the effectiveness of three methods for memorizing
German vocabulary, and I want to isolate the nuisance variable of intelligence or IQ. Suppose
that 30 students are available to participate in the experiment. I could assign the three students
with the highest IQs to block one, the three with the next highest IQs to block two, and so on.
The n = 10 blocks correspond to n arbitrarily determined levels of IQ. The three students in
each block are then randomly assigned to the three memorizing methods, treatment A. This
experiment illustrates an important feature of a randomized block design—the use of a blocking
procedure to isolate a nuisance variable. The design is denoted by the letters RB-p, where p
stands for the number of treatment levels. Alternatively, I could use the completely randomized
design described in Chapter 4 to compare the effectiveness of the three methods for
memorizing German vocabulary. This design is denoted by CR-3. The layouts for the two
designs are shown in Figure 8.1-1.
Figure 8.1-1 ▪ (a) Layout for a randomized block design (RB-3 design) with three treatment
levels (Treat. Level) denoted by a1, a2, and a3. The dependent variable (Dep. Var.) is
denoted by Yij. Each block contains three students whose IQs are similar. The three
students in each block are randomly assigned to the three treatment levels. (b) Layout for
a completely randomized design (CR-3 design). The 30 students are randomly assigned
to the three treatment levels, with the restriction that 10 students are assigned to each
level. The 10 students in treatment level a1 are called Group1, and so on.
Is one design preferable to the other? If the variables of IQ and number of trials required to
memorize German vocabulary are correlated, then the randomized block design will provide a
more powerful test because it removes the effects of IQ from the estimate of the error variance.
To understand this point, examine the partition of the total sum of squares for the two designs
shown in Figure 8.1-2. The completely randomized design partitions the total sum of squares
into two parts: SSBG and SSWG. The randomized block design partitions the total sum of
squares into three parts: SSA, SSBLOCKS, and SSRESIDUAL. The sums of squares that are
used to estimate the error variance in the two designs are indicated by the rectangles with the
thicker lines. It is apparent from Figure 8.1-2 that if SSBLOCKS accounts for an appreciable
portion of the total sum of squares, then the SSRESIDUAL error term will be smaller than the
SSWG error term. In general, if the block variable is correlated with the dependent variable,
then the F statistic,
Figure 8.1-2 ▪ Partition of the total sum of squares and degrees of freedom for a CR-3
design with n = 10 subjects in each treatment level and an RB-3 design with n = 1 0
blocks. The rectangles with the thicker lines indicate the sums of squares that are used
to compute an estimate of experimental error. The estimate of the experimental error for
the RB-3 design will be less than the estimate for the CR-3 design if SSBLOCKS in the
RB-3 design accounts for an appreciable portion of the total sum of squares because
SSRESIDUAL = SSWG – SSBLOCKS.
for the randomized block design is greater than the F statistic for the completely randomized
design:
Hence, the randomized block design is more powerful. However, the randomized block design
will be less powerful than the completely randomized design if SSRESIDUAL is equal to or just
slightly less than SSWG because the latter design has more degrees of freedom to estimate
error variance and requires a smaller critical value to reject the null hypothesis.
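The relation between the two error terms can be verified numerically. The sketch below uses hypothetical data (simulated block and treatment effects, not values from the text) to analyze the same n × p layout both ways and confirm that SSRESIDUAL = SSWG − SSBLOCKS:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 10, 3                               # 10 blocks, 3 treatment levels
block = rng.normal(0, 3, size=(n, 1))      # block (e.g., IQ) effects
treat = np.array([0.0, 1.0, 2.0])          # treatment effects
Y = 20 + block + treat + rng.normal(0, 1, size=(n, p))

grand = Y.mean()
row_m = Y.mean(axis=1, keepdims=True)      # block means
col_m = Y.mean(axis=0)                     # treatment-level means

SSTO  = ((Y - grand) ** 2).sum()
SSA   = n * ((col_m - grand) ** 2).sum()   # = SSBG in the CR analysis
SSBL  = p * ((row_m - grand) ** 2).sum()
SSRES = ((Y - row_m - col_m + grand) ** 2).sum()
SSWG  = SSTO - SSA                         # CR error term ignores blocks

# The RB error term is the CR error term minus the block sum of squares
assert np.isclose(SSRES, SSWG - SSBL)

F_CR = (SSA / (p - 1)) / (SSWG / (n * p - p))
F_RB = (SSA / (p - 1)) / (SSRES / ((n - 1) * (p - 1)))
print(F_CR, F_RB)
```

When the simulated block variance is large relative to the within-block error, as here, F for the randomized block analysis exceeds F for the completely randomized analysis by a wide margin, which is the point of the design.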
A randomized block design is appropriate for experiments that meet, in addition to the general
assumptions of analysis of variance, the following three conditions:
1. One treatment with p ≥ 2 treatment levels and one nuisance variable with n ≥ 2 levels.
2. Formation of n blocks, each containing p homogeneous experimental units. The variability
among units within each block should be less than the variability among units in different
blocks. Alternatively, a block can contain one experimental unit that is observed p times.
3. If each block contains p matched experimental units, the units in each block should be
randomly assigned to the p treatment levels. The design requires n sets of p homogeneous
experimental units—a total of N = np units. If each block consists of one experimental unit
that is observed under each of the p treatment levels and if the nature of the treatment
permits, the order of presenting the p treatment levels should be randomized independently
for each experimental unit. For this case, the design requires n experimental units that are
each observed p times.
Any variable, other than the independent variable, that is correlated with the dependent variable
can be used as a blocking variable. In forming blocks, the object is to assign experimental units
to blocks so that those in a given block are as similar as possible with respect to the dependent
variable, while those in different blocks are less similar. Four procedures are often used to
accomplish this objective:
1. Forming sets of subjects who are similar with respect to a nuisance variable that is
correlated with the dependent variable. This is called subject matching.
2. Observing each subject under all conditions in the experiment—that is, obtaining repeated
measures on the subjects.
3. Obtaining sets of identical twins or littermates and randomly assigning members of a pair or
litter to the conditions in the experiment.
4. Obtaining pairs of subjects who are matched by mutual selection—for example, husband-
and-wife pairs or business partners.
The first procedure, subject matching, was used in the experiment on memorizing German
vocabulary. I assumed, with good reason, that IQ and memorizing ability are positively
correlated. Hence, if each block contains subjects with similar IQs, the subjects also should be
similar with respect to their ability to memorize German vocabulary. The second procedure,
obtaining repeated measures, uses subjects as their own controls. This procedure is
appropriate for treatment levels that have relatively short-duration effects. The nature of the
treatment should be such that the effects of one condition dissipate before the subject is
observed under another condition. Otherwise, subsequent observations reflect the cumulative
effects of the preceding conditions. There is no such restriction, of course, if carryover effects,
such as learning or fatigue, are the principal interest of the researcher. The third procedure,
in which blocks are composed of littermates or identical twins, achieves homogeneity with
respect to genetic characteristics. It is assumed that the behavior of subjects with identical or similar
heredities is more homogeneous than the behavior of subjects with dissimilar heredities. The
fourth procedure involves using subjects who are matched by mutual selection—for example,
husband-and-wife pairs, business partners, or club members. A researcher must always
ascertain that subjects matched in this way are in fact more similar with respect to the
dependent variable than are unmatched subjects. Knowing a husband's political attitudes, for
example, may provide considerable information about his wife's political attitudes, and vice
versa. However, knowing a husband's mechanical aptitude is not likely to provide information
about his wife's mechanical aptitude.
The procedure used to form the blocks does not affect the computational procedures; however,
the procedure does affect the interpretation of the results. The results of an experiment with
repeated measures generalize to a population of subjects who have been exposed to all of the
treatment levels. However, the results of an experiment with matched subjects generalize to a
population of subjects who have been exposed to only one treatment level. Some writers
reserve the designation randomized block design for this latter case. They refer to a design with
repeated measurements in which the order of administering the treatment levels is randomized
independently for each subject as a subjects-by-treatments design. A design with repeated
measurements, in which the order of administering the treatment levels is the same for all
subjects, is referred to as a subject-by-trials design.
When researchers consider potential blocking variables, they often overlook characteristics of
the environmental setting. Blocking with respect to time of day, day of the week, season,
location, and experimental apparatus can significantly decrease the size of the error variance.
Time is a particularly effective blocking variable because it often isolates a number of additional
sources of variability: circadian body cycles, fatigue, changes in weather conditions, and drifts
in calibration of equipment, to mention only a few. Ideas for appropriate blocking variables may
come from one's experience in a research area, the literature, common sense, or intuition.
To summarize, the purpose of blocking is to reduce the error variance and obtain a more
precise estimate of treatment effects, thereby obtaining a more powerful test of a false null
hypothesis. If the variation among the experimental units within blocks is appreciably less than
the variation between the blocks, a randomized block design is more powerful than a
completely randomized design. A measure of the relative efficiency of the two designs is
described in Section 8.7.
A score, Yij, in a randomized block design is a composite that reflects the effects of treatment j,
block i, and all other sources of variation that affect Yij. These other sources of variation are
collectively referred to as error effects. Our expectations about Yij can be expressed more
formally by an experimental design model:

Yij = μ + αj + πi + εij    (i = 1, …, n; j = 1, …, p),

where
Yij is the score in the ith block and jth treatment level.
μ is the grand mean of the population means, μ11, μ12, …, μnp. The grand mean is a
constant for all scores in the experiment.
αj is the treatment effect for population j and is equal to μ.j − μ, the deviation of the
jth population mean from the grand mean. The jth treatment effect is a constant for all
scores in treatment level aj and is subject to the restriction Σⱼ αj = 0.
πi is the block effect for population i and is equal to μi. − μ, the deviation of the
ith population mean from the grand mean. The block effect is a random variable that is
NID(0, σ²π).
εij is the error effect associated with Yij and is equal to Yij − μ.j − μi. + μ. The error
effect is a random variable that is NID(0, σ²ε) and independent of πi.
Model (8.1-1), which is called a mixed model or model III, is probably the most commonly used
model for an RB-p design. It is appropriate for the vocabulary-memorizing experiment described
earlier. There I was interested only in three methods of memorizing German vocabulary; hence,
the levels of treatment A were fixed. If I replicated the experiment, I would use the same three
methods. However, the n blocks corresponded to n arbitrarily determined levels of IQ. If I
replicated the experiment with a new sample of subjects, the IQ levels would undoubtedly be
different. In this case, it is appropriate to regard the blocks as a random variable. Of course,
because the IQ levels were not randomly sampled, inferences about the results of the
experiment are restricted to the hypothetical population of IQ levels from which the sample in
the experiment would have been a random sample if random sampling had been used.
The error effect, εij, in model (8.1-1) is often referred to as a residual effect. This designation is
an apt one. For model (8.1-1), εij is equal to

εij = Yij − μ − αj − πi

In words, εij is that portion of Yij that remains after the grand mean, treatment effect, and block
effect have been subtracted from it. It follows that

Yij − μ = αj + πi + εij
The values of the parameters μ, αj, πi, and εij in model (8.1-1) are unknown, but they can be
estimated from sample data as follows:

μ̂ = Ȳ··,  α̂j = Ȳ·j − Ȳ··,  π̂i = Ȳi· − Ȳ··,  ε̂ij = Yij − Ȳ·j − Ȳi· + Ȳ··
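These estimators can be checked with a short sketch. The data matrix below is hypothetical; any n × p layout of scores works the same way:

```python
import numpy as np

# Hypothetical scores: n = 4 blocks (rows) by p = 3 treatment levels (columns)
Y = np.array([[10., 12., 15.],
              [ 8., 11., 13.],
              [12., 14., 18.],
              [ 9., 10., 14.]])

mu_hat    = Y.mean()                 # grand mean estimates mu
alpha_hat = Y.mean(axis=0) - mu_hat  # column means minus grand mean estimate alpha_j
pi_hat    = Y.mean(axis=1) - mu_hat  # row means minus grand mean estimate pi_i
eps_hat   = Y - mu_hat - alpha_hat[None, :] - pi_hat[:, None]

# The estimates satisfy the model restrictions and reproduce every score
print(alpha_hat.sum(), pi_hat.sum())   # both ≈ 0
assert np.allclose(Y, mu_hat + alpha_hat[None, :] + pi_hat[:, None] + eps_hat)
```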
The partition of the total sum of squares for a randomized block design was illustrated in Figure
8.1-2. We saw that the total sum of squares (SSTO) could be partitioned into three parts: sum
of squares due to treatment A (SSA), sum of squares due to blocks (SSBL), and sum of
squares due to experimental error (SSRES). I now show the basis for this partition. I begin by
rearranging the terms in equation (8.1-2) as follows:

Yij − Ȳ·· = (Ȳ·j − Ȳ··) + (Ȳi· − Ȳ··) + (Yij − Ȳ·j − Ȳi· + Ȳ··)

This equation is for a single score, but I can treat each of the np scores in the same way and
sum the resulting equations to obtain

Σᵢ Σⱼ (Yij − Ȳ··)² = Σᵢ Σⱼ [(Ȳ·j − Ȳ··) + (Ȳi· − Ȳ··) + (Yij − Ȳ·j − Ȳi· + Ȳ··)]²
The quantity on the left is the total sum of squares. It can be shown using elementary algebra
and the summation rules in Appendix A that the term on the right is equal to SSA + SSBL +
SSRES.1
The derivation of the expected values of the mean squares follows the procedures illustrated for
the completely randomized design in Section 3.3. The derivation is not given here because no
new principles are involved and the algebra is tedious.2 The results of the derivation are

E(MSA) = σ²ε + nΣⱼα²j/(p − 1),  E(MSBL) = σ²ε + pσ²π,  E(MSRES) = σ²ε

A test of the null hypothesis is given by F = MSA/MSRES. If the null hypothesis is true and the
assumptions in Section 8.4 are tenable, the F statistic is distributed as the F distribution with p
− 1 and (n − 1)(p − 1) degrees of freedom.
The statistical hypotheses for blocks, where the blocks represent random effects, are

H0: σ²π = 0
H1: σ²π > 0
A test of the block null hypothesis is given by F = MSBL/MSRES. If the null hypothesis is true
and the assumptions in Section 8.4 are tenable, this F statistic is distributed as F with n − 1 and
(n − 1)(p − 1) degrees of freedom. This null hypothesis concerns the population of blocks from
which the n blocks in the experiment are a random sample. Ordinarily, a researcher is not
particularly interested in the outcome of testing this null hypothesis. The block null hypothesis
is expected to be significant because the blocks in an experiment are formed to be dissimilar.
Before examining the fixed- and random-effects models for a randomized block design, I
illustrate procedures for computing mean squares and testing the null hypotheses.
Assume that a researcher wanted to evaluate the readability of four versions of a helicopter
altimeter. Eight helicopter pilots with 500 to 3000 flying hours were available to serve as
subjects. It was anticipated that the amount and type of previous flying experience of the pilots
would affect the number of errors that they make in reading experimental altimeters. To isolate
the nuisance variable of previous flying experience and other idiosyncratic characteristics of the
pilots, an RB-4 design with repeated measures was used in which each pilot served as his own
control. Each pilot made 100 readings under simulated flight conditions with each experimental
altimeter. The order in which the four altimeters were presented was randomized independently
for each pilot.
The appropriate design model for the experiment is a mixed model because the levels of
treatment A (altimeters) are fixed, whereas those for blocks (previous flying experience) are
random. If we replicated the experiment, we would use the same four altimeters, but the pilots
and the amounts of their flying experience would be different. The expected values of the
mean squares for this mixed model are described in Section 8.3.
The level of significance adopted was .05. The data and computational procedures are given in
Table 8.2-1. The results of the analysis are summarized in Table 8.2-2. Because F =
MSA/MSRES is significant, we conclude that the population mean number of reading errors is
not the same for all altimeters. Procedures for determining which population means differ are
described in Section 8.5. The F = MSBL/MSRES also is significant. We can infer that previous
flying experience was an effective blocking variable. A measure of the efficiency of this
randomized block design relative to that of a completely randomized design is described in
Section 8.7.
The computational procedures in Table 8.2-1 can be used if there are no missing observations.
If one or more observations are missing, the cell means approach described in Section 8.8
should be used.
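A minimal sketch of this analysis, using simulated error scores in place of the text's Table 8.2-1 data (which is not reproduced here), might look like the following:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(8)
n, p = 8, 4                             # 8 pilots (blocks), 4 altimeters
pilot = rng.normal(0, 2, size=(n, 1))   # pilot (block) effects
altim = np.array([0., 1., 1., 3.])      # altimeter (treatment) effects
Y = 10 + pilot + altim + rng.normal(0, 1, size=(n, p))

grand = Y.mean()
SSA   = n * ((Y.mean(axis=0) - grand) ** 2).sum()
SSBL  = p * ((Y.mean(axis=1) - grand) ** 2).sum()
SSRES = ((Y - Y.mean(axis=0) - Y.mean(axis=1, keepdims=True) + grand) ** 2).sum()

MSA, MSBL, MSRES = SSA / (p - 1), SSBL / (n - 1), SSRES / ((n - 1) * (p - 1))
FA, FBL = MSA / MSRES, MSBL / MSRES
dfres = (n - 1) * (p - 1)               # (8 - 1)(4 - 1) = 21
print(f"F_A({p - 1}, {dfres}) = {FA:.2f}, p = {f.sf(FA, p - 1, dfres):.4f}")
print(f"F_BL({n - 1}, {dfres}) = {FBL:.2f}, p = {f.sf(FBL, n - 1, dfres):.4f}")
```

The degrees of freedom match those of the altimeter experiment: p − 1 = 3 for treatment A and (n − 1)(p − 1) = 21 for the residual.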
I noted in Section 4.5 that a significant F statistic indicates an association between the
independent and dependent variables. For a completely randomized design, strength of
association measures, omega squared and the intraclass correlation, are defined as
For a randomized block design and other more complex designs, strength of association
The first two measures, ω²Y|A and ρI Y|A, express the variance of the treatment effects relative to
the sum of all effects in the model. The second two measures, called partial omega squared
(ω²Y|A·BL) and partial intraclass correlation (ρI Y|A·BL), ignore the block effects and express the
variance of the treatment effects relative to the sum of only the error effects and the treatment
effects. The dot in the notation Y|A·BL denotes the association between the dependent variable,
Y, and treatment A with the effects of blocks ignored. Similar measures can be defined for
blocks:
For designs in which the total sum of squares is partitioned into three or more sums of squares,
partial omega squared and the partial intraclass correlation are more informative than omega
squared and the intraclass correlation.3 The guidelines introduced in Section 4.5 for
interpreting strength of association:
apply to partial omega squared and partial intraclass correlation. I use the partial measures in
the remainder of the book.
Sample estimators of strength of association for a mixed model in which the levels of treatment
A are fixed and those for blocks are random are
where
The rationale underlying the formula for ω̂²Y|A·BL, for example, can be understood in terms of
the expectations of MSA and MSRES. In Section 8.3, I show that

E(MSA) = σ²ε + nΣⱼα²j/(p − 1)  and  E(MSRES) = σ²ε

Hence, ω̂²Y|A·BL is given by
where
The association between number of reading errors and type of altimeter is statistically
significant and extremely large. The blocks, previous flying experience, also represent a large
association. Hence, previous flying experience was an effective blocking variable.
Sample estimators of strength of association for a fixed effects model in which the levels of
treatment A and blocks are fixed are
where .
Partial omega squared and the partial intraclass correlation also can be computed from F
statistics and a knowledge of n and p. The alternative formulas for treatment A are
where the FA statistic for treatment A and degrees of freedom are obtained from Table 8.2-2.
The alternative formulas for blocks are
These formulas are useful for assessing the practical significance of published research where
only the F statistics and degrees of freedom are provided.
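As a sketch, the alternative formula for treatment A, ω̂² = (p − 1)(FA − 1)/[(p − 1)(FA − 1) + np], can be applied to the altimeter result F(3, 21) = 11.63 reported in Table 8.2-2 (the function name below is illustrative):

```python
def partial_omega_sq(F, df_effect, n, p):
    """Partial omega squared computed from an F statistic in an RB-p design."""
    num = df_effect * (F - 1.0)
    return num / (num + n * p)

# Treatment A in the altimeter experiment: F(3, 21) = 11.63, n = 8, p = 4
print(round(partial_omega_sq(11.63, 3, 8, 4), 2))   # → 0.5
```

The same function serves for blocks by passing FBL with df_effect = n − 1.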
According to Cohen's guidelines, a large treatment effect is one for which f = .40. The values of
f̂ for treatment A and for blocks exceed this criterion by a considerable margin.
In Section 4.5, I noted that an F statistic provides no information about the size of treatment
effects, only whether they are statistically significant. Trivial effects can achieve statistical
significance if the sample is sufficiently large. Measures of effect magnitude, such as partial
omega squared and Cohen's f, provide valuable information and should always be reported
along with significance tests. For
example, the results of the altimeter experiment can be reported as follows: F(3, 21) = 11.63, p
< .001, with partial ω̂² = .50.
In Section 4.6, I described procedures for calculating power and sample size for a completely
randomized design. These procedures generalize with a slight modification to a randomized
block design. The altimeter data in Table 8.2-2 are used to illustrate the procedures. Recall
from the previous section that
I now show how to estimate the number of blocks required to achieve a desired power. You
need to specify the (1) number of treatment levels, p; (2) level of significance, α; (3) desired
power, 1 – β; (4) size of the population error variance, σ²ε; and (5) sum of the squared
population treatment effects, Σⱼ α²j.
Suppose that the data in Table 8.2-2 represent a pilot study from which I want to estimate the
number of blocks that would be required to achieve a power of 1 – β = .80. With the use of trial
and error, I can insert values of n in the formula

φ = √[n′ Σⱼ α̂²j / (p σ̂²ε)]
until a power of .80 is obtained. The notation n′ denotes a trial value of n. I begin the trial-and-
error process with n′ = 5:
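The trial-and-error search can be automated with the noncentral F distribution, taking the noncentrality parameter as λ = n′Σα̂²j/σ̂²ε (so that φ = √(λ/p)). The pilot-study values below are hypothetical, not those of Table 8.2-2:

```python
from scipy.stats import f, ncf

def rb_power(n, p, sum_alpha_sq, sigma_eps_sq, alpha=0.05):
    """Power of the treatment F test in an RB-p design with n blocks,
    using the noncentral F with lambda = n * sum(alpha_j^2) / sigma_eps^2."""
    v1, v2 = p - 1, (n - 1) * (p - 1)
    lam = n * sum_alpha_sq / sigma_eps_sq
    return ncf.sf(f.ppf(1 - alpha, v1, v2), v1, v2, lam)

# Smallest n giving power >= .80 for hypothetical pilot-study estimates
sum_alpha_sq, sigma_eps_sq = 2.0, 4.0
n = 2
while rb_power(n, 4, sum_alpha_sq, sigma_eps_sq) < 0.80:
    n += 1
print(n, round(rb_power(n, 4, sum_alpha_sq, sigma_eps_sq), 3))
```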
Section 4.6 illustrated a simple approach to estimating sample size that requires less
information than the method just described. This approach can be used with a randomized
block design by specifying the (1) number of treatment levels, p; (2) level of significance, α; (3)
power, 1 – β; (4) effect size, fA, that is of interest; and (5) population correlation among the p
treatment levels, ρ. For example, suppose that a researcher wants to determine the number
blocks for an RB-4 design necessary to obtain a significance level of α = .05 and a power of 1 –
β = .80. Suppose, also, that the researcher wants to detect a large effect, fA = .400, and thinks
that the population correlation among the treatment levels is close to .30. The required number
of blocks can be determined from Appendix Table E.13 for v1 = p − 1 = 4 − 1 = 3 and
f* = fA/√(1 − ρ). For a randomized block design, f* in Table E.13 is defined as
fA/√(1 − ρ) = .400/√(1 − .30) = .478. This value is between the columns headed by f* = .400
and f* = .500 in the row labeled 1 – β = .80. By interpolation, the required number of
blocks is 14. This example assumes that the researcher knows the population correlation
coefficient. If, as is usually the case, the researcher doesn't know the value of ρ, Appendix
Table E.13 can be used to estimate n for a range of ρ values—an optimistic value and a
pessimistic value. For example, if ρ is as low as .10, 17 blocks would be required; if ρ is as high
as .50, only 11 blocks would be required. Thus, although a researcher may not know the
precise value of ρ, it is still possible to use Appendix Table E.13 to narrow the choice for the
number of blocks.
Appendix Table E.13 can be used to estimate sample size if α = .05, 1 – β = .70, .80, or .90,
and the randomized block design contains two to four treatment levels. If these conditions are
not satisfied, Tang's charts in Appendix Table E.12 can be used to estimate the sample n. The
charts are entered with φ = f*√n′, where f* = fA/√(1 − ρ), and v2 = (n′ − 1)(p − 1).
Alternatively, Tang's charts can be entered with
in which the levels of treatment A were fixed but those for blocks were random. Furthermore,
treatment A was subject to the restriction Σⱼ αj = 0, πi was NID(0, σ²π), εij was NID(0, σ²ε),
and πi was independent of εij. Although this mixed model is the most common model for a
randomized block design, three other models can be formulated. The key features of these
models are as follows:
1. Fixed-effects model (model I). For this model, treatment A and blocks are both fixed effects
and subject to the restrictions Σⱼ αj = 0 and Σᵢ πi = 0, and εij is NID(0, σ²ε).
2. Random-effects model (model II). For this model, treatment A and blocks are both random
effects. Furthermore, αj is NID(0, σ²α), πi is NID(0, σ²π), εij is NID(0, σ²ε), αj is independent
of πi and εij, and πi is independent of εij.
3. Mixed model (model III, treatment A random). For this model, the levels of A are random
effects, but blocks are fixed effects. Furthermore, αj is NID(0, σ²α), πi is subject to the
restriction Σᵢ πi = 0, εij is NID(0, σ²ε), and αj is independent of εij.
As you will see, the model determines the nature of the hypotheses that are tested and the
expectations of the mean squares. The null hypotheses for the four models are as follows:
The expectations of the mean squares are given in the upper portion of Table 8.3-1 (additive
model). For each of these models, the F statistics are
Table 8.3-1 ▪ Expected Values of Mean Squares for RB-p Design: Additive and
Nonadditive Models
in which the expectation of the numerator contains the same terms as the expectation of the
denominator plus one additional term. The additional term always contains the parameter that
is being tested—for example, αj, σ²α, πi, or σ²π. Although the form of the F statistics is the same
for all four models, the inferences that a researcher can draw are different. For example, for
model I, inferences apply only to the p treatment levels and n blocks in the experiment; for
model II, inferences apply to the populations from which the p treatment levels and n blocks are
random samples.
Equation (8.3-1) is appropriate for an additive model because it does not contain any provision
for the nonadditivity of blocks and treatment levels. In the next section, I describe the model
equation for a nonadditive model.
For model (8.3-1), I assumed that in the population, the scores in each block had the same
trend over the p treatment levels. To put it another way, suppose that I had access to the
population of blocks represented in our experiment, made a graph of the p scores in each
block, and then connected the scores by lines. If the lines are parallel, the blocks and
treatment levels are said to be additive—not to interact. If any two lines are not parallel along
some portion of the lines, the blocks and treatment levels are said to interact. In this case, I
need to include a term in model (8.3-1) that reflects this interaction or nonadditivity. My model
should always reflect all of the effects that are actually present in an experiment. The
nonadditive model equation for a randomized block design is

Yij = μ + αj + πi + (πα)ij + εij    (i = 1, …, n; j = 1, …, p),

where (πα)ij represents the interaction effect for block i and treatment level j.
The presence of block-treatment interaction effects alters the expected values of the mean
squares, as shown in the lower portion of Table 8.3-1 (nonadditive model), and biases some of
the F tests.4 Consider the expected values for model I in the lower portion of Table 8.3-1. It is
not possible to form an F statistic in which the expectation of the numerator contains the same
terms as the expectation of the denominator plus one additional term, the term being tested.
For example, to test the hypothesis αj = 0 for all j, the expected values of the numerator and
denominator of the F statistic should have the form

E(MSA) = σ²ε + nΣⱼα²j/(p − 1)  and  E(MSRES) = σ²ε

However, for the nonadditive model, the expected values of the numerator and denominator are

E(MSA) = σ²ε + nΣⱼα²j/(p − 1)  and  E(MSRES) = σ²ε + ΣᵢΣⱼ(πα)²ij/[(n − 1)(p − 1)]
The presence of the interaction term in the denominator negatively biases the test;
consequently, a researcher will reject too few false null hypotheses. The same problem occurs
with model III when the levels of A are random and blocks are fixed.
If a negatively biased test is significant, a researcher can feel confident that the null hypothesis
is false. However, the interpretation of an insignificant test is ambiguous. An insignificant test
may occur because the test is negatively biased, the test lacks adequate power, or the null
hypothesis is true. For models II and III (A fixed), tests of treatment A are not negatively biased.
However, the tests are less powerful if an interaction term appears in the model equation. To
see why this is so, consider a test of treatment A for model II. Expected values for the additive
and nonadditive models are, respectively,

E(MSA) = σ²ε + nσ²α and E(MSRES) = σ²ε  (additive)
E(MSA) = σ²ε + σ²πα + nσ²α and E(MSRES) = σ²ε + σ²πα  (nonadditive)
Suppose that , and . The values of the F statistics for the additive and
nonadditive models are, respectively,
I have described two reasons why the additive model is preferred over the nonadditive model:
The additive model avoids a negative bias for model I and model III (A random), and it results in
a more powerful test of a false null hypothesis. Before turning to some underlying assumptions
associated with a randomized block design, I describe a test of the hypothesis that block and
treatment effects are additive.
A preliminary test for choosing between the additive and nonadditive model equations—for
example, between
and
was developed by Tukey (1949). The procedure involves partitioning the residual sum of
squares into two components: a one degree of freedom block-treatment interaction component
that represents nonadditivity and an (n − 1)(p − 1) − 1 degree of freedom component that
provides an error term for testing the significance of the nonadditivity component.
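Tukey's partition can be sketched as follows; the data are hypothetical, and the one-degree-of-freedom numerator is the squared projection of the residuals onto the product of the estimated block and treatment effects:

```python
import numpy as np
from scipy.stats import f

def tukey_nonadditivity(Y):
    """Tukey's one-degree-of-freedom test for nonadditivity in an
    n x p randomized block layout; returns (F, p-value)."""
    n, p = Y.shape
    grand = Y.mean()
    a = Y.mean(axis=0) - grand            # estimated treatment effects
    b = Y.mean(axis=1) - grand            # estimated block effects
    SSRES = ((Y - Y.mean(axis=0) - Y.mean(axis=1, keepdims=True) + grand) ** 2).sum()
    SS_nonadd = (b @ Y @ a) ** 2 / ((a ** 2).sum() * (b ** 2).sum())
    df_rem = (n - 1) * (p - 1) - 1
    F = SS_nonadd / ((SSRES - SS_nonadd) / df_rem)
    return F, f.sf(F, 1, df_rem)

rng = np.random.default_rng(3)
Y = 10 + rng.normal(0, 1, (8, 1)) + np.arange(4) + rng.normal(0, 0.5, (8, 4))
F_stat, p_val = tukey_nonadditivity(Y)    # additive data: expect a small F
print(round(F_stat, 2), round(p_val, 3))
```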
To get a better feeling for the meaning of additivity and nonadditivity, consider the altimeter data
shown in Table 8.2-1. If the population values for each of the eight blocks in Table 8.2-1
changed by X amount from a1 to a2, by X′ amount from a2 to a3, and by X″ amount from a3 to
a4, the block-treatment interaction would be zero and the additive model would apply. If we
graphed the data, the lines connecting the scores in each block would be parallel. An
interaction would appear as two or more nonparallel lines along some portion of the lines.
Tukey's test is sensitive to situations in which the scores for each block follow the same general
trend, but the amount of change from a1 to a2, a2 to a3, and so on is not the same for all
blocks.5 This is a limitation of the test because other forms of nonadditivity do occur.
Nevertheless, it can be a useful preliminary test in designs that have one score per cell. In such
designs, there is no within-groups or within-cells error term, and a residual is used as an
estimate of experimental error. If the test for nonadditivity is insignificant, it lends credence to
the hypothesis that MSRES is an estimate of σ²ε rather than σ²ε + σ²πα or
σ²ε + ΣᵢΣⱼ(πα)²ij/[(n − 1)(p − 1)].
Computational procedures for the test of nonadditivity are illustrated in Table 8.3-2. The level of
significance adopted for the test, α = .10, reflects a desire to avoid making a Type II error. The
use of a numerically large α value (.10, .15, .25) increases the power of the test if the null
hypothesis is false. It is evident from Table 8.3-2 that the test for nonadditivity is not significant.
Thus, the additive model appears to underlie the data. This is confirmed by an inspection of
Figure 8.3-1, where the residual for each observation, ε̂ij = Yij − Ȳ·j − Ȳi· + Ȳ··, is plotted against the
estimated value, Ŷij = Ȳ·j + Ȳi· − Ȳ··. The distribution of points in Figure 8.3-1 does not
suggest that there is a relationship between the size of the residuals and the estimated values.
For a nonadditive model, the points often have a funnel shape. For example, as Ŷij increases,
the spread of the ε̂ijs tends to increase.
Sometimes the use of a different criterion measure or one of the transformations described in
Section 3.6 will eliminate or minimize the block-treatment interaction. Considering the
advantages of the additive model, these procedures are worth considering.
Comparison of Assumptions and Error Terms for CR-p and RB-p Designs
In this section, I compare the assumptions underlying randomized block and completely
randomized designs and show why the randomized block design is generally more powerful.
The presentation uses concepts from Appendixes B and D. For readers who do not want to
follow the algebraic derivations, I summarize the most important points at the end of the
section.
where Σⱼ αj = 0 and εi(j) is NID(0, σ²ε). The variance of the Yijs for population aj, denoted by
σ²Yj, is due only to variation among the εi(j)s because μ and αj are both constants for the jth
population.
Consider now the mixed model for a randomized block design described in Section 8.1. There I
assumed that πi is NID(0, σ²π) and εij is NID(0, σ²ε), so that σ²Yj = σ²π + σ²ε. This follows from
rule B.2 in Appendix B concerning the variance of a random variable:

V(Y) = E{[Y − E(Y)]²}

Replacing Yij on the right side of rule B.2 with μ + αj + πi + εij from equation (8.4-1) gives

V(Yij) = V(πi) + V(εij) = σ²π + σ²ε

because μ and αj are constants and πi and εij are independent random variables.
Two scores, Yij and Yij′, in the same block of a randomized block design are not independent
because the term πi is common to both of them. Thus, although I assume that the errors are
independent, I do not assume that the scores within a block are independent. The dependence
of the scores in populations aj and aj′ is reflected in their covariance, denoted by σjj′. I now
evaluate this covariance using rule B.15 in Appendix B. Replacing X on the right side of rule
B.15 with μ + αj + πi + εij and Y with μ + αj′ + πi + εij′ gives

σjj′ = Cov(Yij, Yij′) = σ²π

because the constants drop out and the error effects are independent of each other and of πi.
So far, I have shown that (1) σ²Yj = σ²π + σ²ε, (2) the dependence of Yij and Yij′ is due to the πi
term, and (3) σjj′ = σ²π. Next I use these three facts to show that the randomized block design
error variance, MSRES, is generally smaller than the completely randomized design error
variance, MSWG.
Earlier I showed for a completely randomized design that MSWG estimates the population
variance σ²Y = σ²π + σ²ε, because block variation is not removed from the error term. I now show
for a randomized block design that E(MSRES) = σ²Y(1 − ρ) = σ²ε, where ρ denotes the population
correlation between treatment levels. Assume that the p population variances are homogeneous:

σ²Y1 = σ²Y2 = ⋯ = σ²Yp = σ²Y

Then the correlation between potential observations for any two treatment levels j and j′ is given
by

ρ = σjj′ / σ²Y

From equation (8.4-4), I know that σjj′ = σ²π. Replacing σjj′ in equation (8.4-5) with σ²π gives

ρ = σ²π / (σ²π + σ²ε)
To summarize, the point of this discussion is to show that MSRES for a randomized block
design is generally smaller than MSWG for a completely randomized design. The expected
values of the two error terms are

E(MSWG) = σ²Y  and  E(MSRES) = σ²Y(1 − ρ),

where σ²Y = σ²π + σ²ε. The expected values of the treatment and block mean squares for the
mixed model described in Section 8.1 also can be expressed in terms of σ²Y and ρ:
The homogeneous variances and covariances of the randomized block model define a special
type of covariance matrix. This matrix, denoted by ΣS, has the following form:
where all of the variances on the main diagonal are equal and all of the covariances off the
main diagonal are equal. Such a matrix is said to have compound symmetry and is called a
type S matrix.
Matrices that satisfy this less restrictive condition are called type H matrices, ΣH. All matrices
that have compound symmetry satisfy this condition and thus are a subset of the broader class
of type H matrices. The following population matrix is an example of a type H matrix:
Notice that the variances on the main diagonal are not equal and the covariances off the main
diagonal are not equal. I can show that this is a type H matrix by computing the variances of
difference scores for all p(p − 1)/2 = 6 pairs of treatment levels as follows:
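The same check can be scripted. The matrix below is a hypothetical type H matrix built as σjj′ = aj + aj′ + λδjj′ (not the text's numerical example); every pairwise variance of difference scores then equals 2λ:

```python
import numpy as np

# Hypothetical type H matrix: sigma_jj' = a_j + a_j' + lam * delta_jj'
a, lam = np.array([1., 2., 3., 4.]), 5.0
Sigma = a[:, None] + a[None, :] + lam * np.eye(4)

# Variance of the difference scores for every pair of treatment levels
p = Sigma.shape[0]
var_diff = [Sigma[j, j] + Sigma[k, k] - 2 * Sigma[j, k]
            for j in range(p) for k in range(j + 1, p)]
print(var_diff)   # six identical values (= 2 * lam), so Sigma is type H
```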
Huynh and Feldt (1970) determined the necessary condition for F = MSA/MSRES to be
distributed as F with (p − 1) and (n − 1)(p − 1) degrees of freedom. Rouanet and Lépine (1970)
independently arrived at the same necessary condition but expressed the condition in matrix
notation as
where
C*′ is any (p − 1) × p orthonormal coefficient matrix that is used to express the null
hypothesis for treatment A.
I is an identity matrix.
When condition (8.4-8) is satisfied, Σ is said to be spherical; hence, the necessary condition is
called the sphericity condition. The sphericity condition is really quite simple, although the
matrix notation tends to obscure this simplicity. The type H matrix described earlier for an RB-4
design is used to illustrate the condition.
To begin, I need to reformulate the omnibus null hypothesis for treatment A, H0: μ.1 = μ.2 = μ.3
= μ.4, so that it contains p − 1 orthogonal rows. The reformulation can be done in a variety of
ways. For example, two alternative formulations are
and
Next, the length of each coefficient vector must be made to equal 1. A matrix whose rows are
orthogonal and have unit length is called an orthonormal matrix (see Appendix D.1-8). The
orthonormal form of the coefficient matrices, C′1 and C′2, is
and
Because C*′ΣHC* = 5I for both orthonormal matrices, I conclude that the sphericity condition is
satisfied. The outcome of this
procedure is not affected by the particular orthonormal matrix used to test the omnibus null
hypothesis. I would have reached the same conclusion if I had used the coefficient matrix
because the corresponding product C*′ΣHC* also is equal to 5I. Rouanet and Lépine (1970)
have shown that when C*′ΣC* = λI, the sphericity condition also is satisfied for all subsets of
contrasts, not just the
omnibus null hypothesis. In Chapter 10, I show that the sphericity condition may hold for a
subset of hypotheses of interest (local sphericity), although the omnibus sphericity condition is
not tenable.
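A minimal numerical illustration of the condition, again using a hypothetical type H matrix (the book's own C′ matrices were not reproduced above): orthonormalize p − 1 simple contrasts with a QR decomposition and verify that C′ΣC is a scalar multiple of the identity matrix.

```python
import numpy as np

p, lam = 4, 5.0
alpha = np.array([1.0, 2.0, 3.0, 4.0])                     # hypothetical
Sigma = alpha[:, None] + alpha[None, :] + lam * np.eye(p)  # type H matrix

# Orthonormalize p - 1 simple contrasts: the resulting rows are mutually
# orthogonal, have unit length, and each sums to zero.
H = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])
Q, _ = np.linalg.qr(H.T)
C = Q.T                                # (p - 1) x p orthonormal contrasts

M = C @ Sigma @ C.T                    # sphericity: M should equal lam * I
print(np.allclose(M, lam * np.eye(p - 1)))   # prints True
```

Any orthonormal contrast matrix gives the same conclusion, which mirrors the remark in the text that the outcome does not depend on the particular orthonormal matrix used.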
To summarize, the point of this exposition is that a randomized block design does not require
equal variances and equal covariances. What is required is sphericity or homogeneity of the
variance of population difference scores for all pairs of treatment levels—that is,
The randomized block design is not robust to violation of this assumption. In the following
section, I describe a test to determine if the sphericity condition is not tenable.
A number of statistics have been proposed for testing the hypothesis that the population covariance matrix satisfies the
sphericity condition. Cornell, Young, Seaman, and Kirk (1992) compared the power of eight
tests and recommended the locally best invariant test (S. John, 1971, 1972; Nagao, 1973;
Sugiura, 1972). They found that this test, unlike other tests, had substantial power to detect
departures from sphericity for both small and large samples and provided good control of the
Type I error.6 The test is illustrated in Table 8.4-1 for the altimeter data in Table 8.2-1. The α =
.25 level of significance was adopted for the test because the number of blocks, n = 8, is small.
When the number of blocks is greater than 10, the use of α = .15 is recommended. The use of
a numerically large significance level when testing a model assumption is a common practice
and increases the power of the test. Critical values for the locally best invariant test in Appendix
Table E.18 are denoted by . Because V* = 6.45 is greater than the critical value,
, there is reason to believe that the sphericity condition, , is not
tenable. Thus, the conventional F tests for the altimeter data in Table 8.2-2 are positively biased
(G. E. P. Box, 1954b). A test is positively biased if it tends to reject the null hypothesis more
often than the hypothesis should be rejected. Fortunately, there is a correction for this positive
bias, as I show in the next section.
Table 8.4-1 ▪ Computation of the Locally Best Invariant Test for the Sphericity Condition
[covariance matrix, , was computed for the data in Table 8.2-1.]
When the sphericity condition is violated, G. E. P. Box (1954b) showed that the true distribution
of the F statistic with (p − 1) and (n − 1)(p − 1) degrees of freedom can be approximated by an
F statistic with reduced degrees of freedom. The modified degrees of freedom are (p − 1)ε and
(n − 1)(p − 1)ε, where ε is a number that depends on the degree of departure of the population
covariance matrix from the required form.7 When the sphericity condition is satisfied, the value
of ε is 1; otherwise, ε is less than 1, with a minimum value of 1/(p − 1).
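Box's ε can be computed directly from a covariance matrix through the orthonormal-contrast form ε = [tr(M)]² / [(p − 1) tr(M²)], where M = C′ΣC. The following sketch (the function name is illustrative) implements that form:

```python
import numpy as np

def box_epsilon(Sigma):
    """Box's epsilon from a p x p covariance matrix, via the
    orthonormal-contrast form eps = tr(M)^2 / ((p - 1) * tr(M @ M)),
    with M = C' Sigma C for a (p - 1) x p orthonormal contrast matrix."""
    p = Sigma.shape[0]
    H = np.eye(p)[:-1] - np.eye(p)[1:]    # successive-difference contrasts
    Q, _ = np.linalg.qr(H.T)              # orthonormalize the contrasts
    C = Q.T
    M = C @ Sigma @ C.T
    return np.trace(M) ** 2 / ((p - 1) * np.trace(M @ M))
```

For a spherical matrix (e.g., compound symmetry) the function returns 1; for the non-spherical diag(1, 2, 3, 4) it returns about .88, between the minimum 1/(p − 1) = 1/3 and 1, as the text requires.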
In any practical situation, the value of ε is unknown. However, the work of Collier, Baker,
Mandeville, and Hayes (1967); Huynh and Feldt (1976); and Stoloff (1970) indicates that a
satisfactory estimate of ε can be obtained from the sample covariance matrix. Collier and
colleagues (1967) proposed a maximum likelihood estimator of ε. Their estimator, , which is
defined in Table 8.4-2, is biased, and the extent of the bias increases as approaches 1.0. In
an effort to reduce this bias for large values of , Huynh (1978) and Huynh and Feldt (1976)
recommended using the estimator (see Table 8.4-2) when ≥ .75. Rogan, Keselman, and
Mendoza (1979) have made a similar recommendation. The estimator, which is computed
from the ratio of two unbiased estimators, is also biased. Gary (1981) investigated the Type I
error for the two estimators in the context of a randomized block factorial design for the
following values of : 1.0, .96, .69, and .52. He concluded that provided better Type I error
protection than for all values of ε investigated except 1.0. One reason for the liberalness of
is that it can assume values greater than 1—in which case, it is set equal to 1. It can be shown
that ; the two statistics are equal when . Pending further research, the use of
is recommended.
The computation of is tedious, but fortunately, most computer packages will provide . If the
reader does not have access to such a computer package, the need to compute may be
avoided by using the following testing procedure for treatment A:
Procedures for computing and are illustrated in Table 8.4-2 for the altimeter data in Table
8.2-1. The critical values for treatment A are shown in part (iv) of the table. Note that, as
expected, the conservative critical value, 5.59, is larger than the adjusted critical value, 3.89.
In concluding this section, a comment about the use of alternative procedures such as
multivariate analysis of variance is in order. Multivariate analysis of variance is sometimes
recommended when the sphericity condition is violated. The relative power of the two
procedures depends on the value of ε, as well as on p and n. If sphericity is tenable, the
randomized block design is more powerful. As n increases, the power advantage of the
randomized block design decreases. Algina and Keselman (1997) describe conditions under
which the multivariate analysis of variance is recommended. To summarize, the conventional F
test in a randomized block design provides maximum power when the test's assumptions are
tenable. An adjusted F test appears to be a good choice when the number of blocks is
relatively small, the matrix does not depart markedly from sphericity, and the
populations are approximately normal (J. C. Keselman, Lix, & Keselman, 1996).
Statistics for testing hypotheses about contrasts in a randomized block design have the same
general form as those illustrated in Section 5.4 for a completely randomized design. However,
MSRES instead of MSWG should be used to estimate the population error variance if the
sphericity condition is tenable. If the assumption is not tenable, an MSRESi appropriate for the
ith contrast should be used. Formulas that use MSRES are given first.
An a priori t statistic for the ith contrast among means has the following form:
with v = (n –1)(p − 1) degrees of freedom. To reject the null hypothesis that c1μ1 + c2μ2 +…+
cpμp = 0, the absolute value of t must equal or exceed tα/2, v for a two-tailed test or tα, v for a
one-tailed test, in which case the value of t must be consistent with the alternative hypothesis.
The qFH a posteriori statistic for the Fisher-Hayter test (Hayter, 1986) is
The Fisher-Hayter statistic can be used to test hypotheses for all pairwise contrasts if the
omnibus ANOVA null hypothesis has been rejected. The null hypothesis for a pairwise contrast
is rejected if | qFH | ≥ qα; p−1, v.
with v1 = (p − 1) and v2 = (n − 1)(p − 1) degrees of freedom. The null hypothesis for a contrast
is rejected if FS ≥ (p − 1)Fα;(p − 1), (n − 1)(p − 1).
The sphericity condition is not tenable for the altimeter data in Table 8.2-1; hence, MSRESi
rather than MSRES should be used to test individual contrasts. The procedures are illustrated
using the t and Fisher-Hayter test statistics with MSRESi. Assume that we are interested in the
following contrasts: ψ1 = μ.1 – μ.2, ψ2 = μ.3 – μ.4, and ψ3 = (μ.1 + μ.2)/2 – (μ.3 + μ.4)/2.
Because these contrasts are a priori and orthogonal, the t statistic can be used to test the
corresponding null hypotheses. The computations are presented in Table 8.5-1. According to
Table 8.5-1(iii), the absolute value of t exceeds the critical value for contrasts 2 and 3. Hence,
the null hypotheses for μ.3 – μ.4 = 0 and (μ.1 + μ.2)/2 – (μ.3 + μ.4)/2 = 0 are rejected.
Table 8.5-1 ▪ Computational Procedures for Testing Contrasts Among Means When the
Sphericity Condition Is Not Tenable
It is evident from Table 8.5-1 that when the sphericity condition is not tenable, the procedures
for testing differences among means are more complex. Less evident is the loss of power in
testing a false null hypothesis that occurs when MSRESi is used in place of MSRES. In the
present example, MSRESi has 7 degrees of freedom—many fewer than the 21 associated with
MSRES. The advantage of using MSRESi is that the resulting test is exact. The formula for
computing MSRESi is an interaction formula. In the present example,
If the overall sphericity condition were tenable, these four residual mean squares would
estimate the same population error variance. In this example, the use of MSRES in place of
MSRES1 would have led to a negatively biased test; the use of MSRES in place of MSRES2
and MSRES3 would have led to positively biased tests.
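The interaction formula for MSRESi has a convenient equivalent: compute a contrast score Li = Σ cjYij for each block; then MSRESi = s²L/Σc²j, and the resulting t is identical to a one-sample t on the block contrast scores with n − 1 degrees of freedom. A hedged sketch (the function name and data are illustrative, not from Table 8.5-1):

```python
import numpy as np

def contrast_t(Y, c):
    """t for a contrast in an RB design using MSRES_i, the
    contrast-by-blocks interaction mean square (df = n - 1).

    Equivalent to a one-sample t on the per-block contrast scores
    L_i = sum_j c_j * Y_ij, because MSRES_i = var(L) / sum(c^2).
    """
    n = Y.shape[0]
    c = np.asarray(c, dtype=float)
    L = Y @ c                                      # per-block contrast scores
    ms_res_i = np.var(L, ddof=1) / np.sum(c ** 2)  # MSRES_i
    se = np.sqrt(ms_res_i * np.sum(c ** 2) / n)    # = s_L / sqrt(n)
    return L.mean() / se
```

For ψ1 = μ.1 − μ.2 one would pass c = [1, −1, 0, 0]; with n = 8 blocks the test has the 7 degrees of freedom noted in the text.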
If the treatment levels in an experiment differ quantitatively, additional insight concerning the
experiment can be obtained by performing a trend analysis. Procedures for carrying out a trend
analysis for a completely randomized design are described in Chapter 6. These procedures
generalize, with one modification, to a randomized block design. The modification has to do
with the choice of an error term. If the sphericity condition is tenable, MSRES can be used in
the denominator of the F statistic for testing the significance of the linear contrast, quadratic
contrast, and so on. For example,
If the assumption is not tenable, a residual mean square appropriate for the specific trend
contrast of interest should be used—for example, MSRESlin, MSRESquad, and so on.
Procedures for computing and MSRESlin are shown in Table 8.6-1. These procedures
can be used to obtain mean squares for other trends by replacing the linear coefficients with
the appropriate trend coefficients.
Table 8.6-1 ▪ Computational Procedures for Testing the Significance of the Linear
Contrast When the Sphericity Condition Is Not Tenable
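The same per-block device used for ordinary contrasts carries over to trend contrasts. The following sketch, with hypothetical data (the Table 8.6-1 values are not reproduced here), tests the linear trend for p = 4 using MSRESlin, the linear-contrast-by-blocks interaction mean square:

```python
import numpy as np

c_lin = np.array([-3.0, -1.0, 1.0, 3.0])   # orthogonal polynomial, p = 4

rng = np.random.default_rng(1)
# Hypothetical block-by-treatment scores with a built-in linear trend.
Y = rng.normal(size=(8, 4)) + np.array([1.0, 2.0, 3.0, 4.0])
n = Y.shape[0]

L = Y @ c_lin                                         # per-block trend scores
ss_lin = n * L.mean() ** 2 / np.sum(c_lin ** 2)       # SS_lin, df = 1
ms_res_lin = np.var(L, ddof=1) / np.sum(c_lin ** 2)   # MSRES_lin, df = n - 1
F_lin = ss_lin / ms_res_lin
print(round(F_lin, 2))
```

Replacing c_lin with the quadratic or cubic coefficients yields the corresponding trend tests, exactly as the text describes.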
A randomized block design enables a researcher to isolate variation associated with a nuisance
variable, thereby reducing the error variance. As discussed in Section 8.1, if the SSBLOCKS
accounts for an appreciable portion of the total sum of squares, a randomized block design is
more efficient than a completely randomized design.
A measure of the relative efficiency of the two designs, adjusted for differences in degrees of
freedom, is given by
where MSWG is computed from the mean squares in a randomized block design as follows:
where n = 8, MSBL = 4.500, p = 4, and MSRES = 1.405. The efficiency of the randomized block
design relative to that of the completely randomized design is
Because the ratio is greater than 1, the randomized block design is more efficient than the
completely randomized design. I can be more precise. The number of subjects, nj, in each
treatment level of a completely randomized design necessary to match the efficiency of a
randomized block design is
where n is the number of blocks in the randomized block design. The randomized block design
used np = (8)(4) = 32 observations; the completely randomized design would require nj p ≅ (13)
(4) = 52 observations. This is another indication that the selection of previous flying experience
as the nuisance variable was a good choice.
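The computation can be sketched numerically. Two assumptions are made because the displays above were not reproduced: MSWG is the blocks-plus-residual sums of squares pooled over p(n − 1) degrees of freedom, and the efficiency ratio carries Fisher's degrees-of-freedom correction. Under those assumptions the sketch reproduces the values quoted in the text (MSWG ≈ 2.18, nj ≅ 13):

```python
import math

# Relative efficiency of the RB-4 design for the altimeter data.
n, p = 8, 4
MSBL, MSRES = 4.500, 1.405

# Pool SSBL (df = n - 1) and SSRES (df = (n - 1)(p - 1)) over p(n - 1) df.
MSWG = ((n - 1) * MSBL + (n - 1) * (p - 1) * MSRES) / (p * (n - 1))
df_res = (n - 1) * (p - 1)     # 21
df_wg = p * (n - 1)            # 28
RE = ((df_res + 1) * (df_wg + 3)) / ((df_res + 3) * (df_wg + 1)) * MSWG / MSRES
nj = math.ceil(RE * n)         # subjects per level a CR-p design would need
print(round(RE, 2), nj)        # nj matches the text's value of 13
```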
The partition of the total sum of squares and degrees of freedom for the two designs was
shown in Figure 8.1-2. It is evident from this figure that SSRES for a randomized block design
will be smaller than SSWG if the block effects are not equal to zero. If the block effects are
equal to zero, SSRES = SSWG. In this case, the completely randomized design is more
efficient than a randomized block design because the error term is based on more degrees of
freedom—for example, p(n − 1) as opposed to (n − 1)(p − 1). It should be apparent that SSBL
in an RB-p design must account for an appreciable portion of the total sum of squares to
compensate for the loss of n − 1 degrees of freedom in the MSRES error term.
A researcher who uses a randomized block design hopes to be compensated for the additional
effort required to form homogeneous blocks by obtaining a smaller error term and greater
power to reject a false null hypothesis. The question of whether the effort involved in matching
subjects or obtaining their repeated participation is less than the effort required to run more
subjects in a completely randomized design should be raised for each proposed research
project. Scarcity of subjects makes a randomized block design an attractive alternative to a
completely randomized design when the nature of the treatment permits obtaining repeated
measures on the same subjects.
The cell means model I introduced in Section 7.7 has several advantages over the classical
overparameterized analysis of variance model. The cell means model enables a researcher to
test hypotheses about any linear combination of available cell means, and there is never any
ambiguity about the hypothesis that is tested. Furthermore, the cell means model can be used
when observations are missing.
There are two forms of the cell means model: one in which no restrictions are placed on the
population cell means and a second in which the population cell means are restricted so that
some interaction effects are set equal to zero. The unrestricted cell means model is
appropriate for designs in which population means are estimated from cells that contain n ≥ 2
observations, as in a completely randomized design. I described this model in Section 7.7. A
restricted cell means model should be used when (1) the population cell means are
estimated from cells that contain one observation and (2) an interaction (residual) is used to
estimate error variance, as in a randomized block design. A restricted cell means model also
can be used when the cells contain two or more observations and a researcher wants to test
hypotheses about, say, treatments A and B subject to the restriction that the two treatments do
not interact. This use of the restricted model is described in Section 9.13.
where εij is and the cell means, μij, are subject to the restrictions that
The restrictions on μij state that all block-treatment population interaction effects are equal to
zero. The restrictions are imposed because it is not possible to estimate error effects separately
from interaction effects. Recall from Section 8.5 that MSRES, which is used to estimate error
variance, is the block-treatment interaction mean square. The application of the restricted cell
means model is slightly more complicated than that for the unrestricted model. I introduce the
general features of the restricted cell means model here and then describe some simplifications
that can be used with the model.
The data in Table 8.8-1 are used to illustrate the computational procedures for the cell means
model. The use of only two blocks simplifies the presentation. The first step in analyzing the
data in Table 8.8-1 is to specify contrasts of population means that are relevant to the
researcher's scientific hypotheses. The hypotheses for treatment A and the blocks are
or
If the hypothesis for treatment A, for example, is not tenable, then it follows that the classical
overparameterized model hypothesis
is not tenable. When p ≥ 3 or n ≥ 3, the hypotheses can be expressed in a variety of ways. The
only requirement is that the hypothesis matrices, and , are of full row rank. This means
that each row of a hypothesis matrix must be linearly independent8 of every other row.
Hypothesis matrices are used to compute coefficient matrices, C′. I point out in Section 7.7 that
the hypothesis matrix, H′, and the coefficient matrix, C′, for a completely randomized design are
identical. This is not the case for a randomized block design. Coefficient matrices for a
randomized block design with no missing observations are computed from hypothesis matrices,
H′, and sum vectors, 1′, as follows:
where ⊗ denotes the Kronecker product, h = np = (2)(3) = 6 is the number of cell means, and
the vector of cell means is ordered as follows: . Notice in the mean
vector that I first sum through the block means and then the treatment means. The Kronecker
products are in the reverse order: .
The numbers in each row of and are coefficients, cj′ of mean contrasts. The numbers
satisfy the same necessary conditions as the coefficients, cij′ in the multiple comparison
contrasts that I describe in Section 5.1. For example, the numbers in the rows of sum to
zero, , and at least one number in a row is not equal to zero, cj ≠ 0, for some j.
However, in the cell means model, the sum of the absolute value of the numbers in a row
typically does not equal 2.
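The Kronecker-product construction can be sketched with numpy.kron. One assumption is made about the lost display: the cell means vector is ordered with the block index varying fastest within each treatment level, so that C′A = H′A ⊗ 1′n and C′BL = 1′p ⊗ H′BL, consistent with the "reverse order" remark above. The hypothesis matrices below are illustrative full-row-rank choices.

```python
import numpy as np

n, p = 2, 3
HA = np.array([[1.0, -1.0, 0.0],
               [0.0, 1.0, -1.0]])     # (p - 1) x p, treatment A
HBL = np.array([[1.0, -1.0]])         # (n - 1) x n, blocks
one_n = np.ones((1, n))
one_p = np.ones((1, p))

CA = np.kron(HA, one_n)               # (p - 1) x np coefficient matrix
CBL = np.kron(one_p, HBL)             # (n - 1) x np coefficient matrix
print(CA)
print(CBL)
```

Each row of CA contrasts sums of cell means across blocks (e.g., μ11 + μ21 − μ12 − μ22), and the row of CBL contrasts block sums across treatment levels, matching the scalar hypotheses described in the text.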
Next I show that the and matrices define the researcher's hypotheses in terms of
contrasts of sums of cell means.
In scalar notation, the hypotheses for treatment A and the blocks can be written as,
respectively,
Notice that the hypotheses are expressed as differences between the sums of cell means.
Alternatively, the hypotheses can be expressed as differences between the means of cell
means: for example, (μ11 + μ21)/2 – (μ12 + μ22)/2 = 0 or μ1 – μ2 = 0. The alternative ways of
expressing the hypotheses lead to identical results. Expressing the hypotheses with sums of
cell means avoids fractions. To better understand these hypotheses, the reader may find it
useful to see which cell means in Table 8.8-1 are involved in each contrast.
The restrictions on the model—that the block-treatment interaction effects are equal to zero—
also can be expressed in a number of ways. One way is as follows:
Figure 8.8-1 ▪ Two interaction terms of the form μij – μi′j – μij′ + μi′j′ are obtained from the
crossed lines by subtracting the two μijs connected by a dashed line from the two μijs
connected by a solid line. Another set of interaction terms is μ11 – μ21 – μ13 + μ23 and
where R′1 is a coefficient matrix that defines the restrictions on μ1 and s = h − n − p + 1. R′1 is
computed from the Kronecker product of H′A and H′BL as follows:
I want to test the null hypothesis for treatment A, subject to the restrictions that
. To accomplish this, I form an augmented matrix, , and an augmented vector,
, as follows:9
where contains the p − 1 rows of and the s rows of that are not identical to the
rows of , inconsistent with them, or linearly dependent on them. The augmented vector,
, contains s + p − 1 zeros. The hypothesis
combines the restrictions that the two block-treatment interaction effects are equal to zero with
the hypothesis that the two treatment A contrasts are equal to zero. The treatment A sum-of-
squares formula that is used to test this joint hypothesis is
where .
Similarly, I want to test the null hypothesis for the blocks, subject to the
restrictions that . The augmented matrix and augmented vector are as follows:
The hypothesis
combines the restrictions that the two block-treatment interaction effects are equal to zero with
the hypothesis that the block contrast is equal to zero. The block sum-of-squares formula that
is used to test this joint hypothesis is
When multiple comparison tests are performed, restricted cell means, , rather than
unrestricted cell means, , should be used.
The previous section provided an overview of the general features of the restricted cell means
model. Next I show that the formulas for computing the treatment and block sums of squares
can be simplified when there are no missing observations. I focus on the formula for SSA; the
same ideas apply to SSBL.
The sum-of-squares formula (8.8-1) for treatment A contains the augmented matrix. An
examination of reveals that the rows of are linearly independent of the rows of .
Indeed, for the data in Table 8.8-1, formulas (8.8-1) and (8.8-3) give the same sum-of-squares
value:
An examination of reveals that and also are orthogonal. Hence, the line of
reasoning that led to the simplified formula for treatment A also leads to the simplified formula
for the block sum of squares:
For the data in Table 8.8-1, formulas (8.8-2) and (8.8-4) give the same sum-of-squares value:
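The agreement of the simplified formula with the classical computation can be verified numerically. This is a sketch, not the book's Table 8.8-1: the RB-3 data below are hypothetical, the quadratic form SSA = (C′y)′(C′C)⁻¹(C′y) is the one-observation-per-cell case, and the cell means vector is assumed ordered with the block index varying fastest.

```python
import numpy as np

n, p = 2, 3
Y = np.array([[3.0, 5.0, 9.0],
              [4.0, 7.0, 8.0]])        # rows = blocks, columns = treatments

y = Y.T.reshape(-1)                    # (mu_11, mu_21, mu_12, mu_22, ...)
HA = np.array([[1.0, -1.0, 0.0],
               [0.0, 1.0, -1.0]])      # hypothesis matrix for treatment A
CA = np.kron(HA, np.ones((1, n)))      # coefficient matrix, (p - 1) x np

q = CA @ y
SSA_cm = q @ np.linalg.solve(CA @ CA.T, q)               # cell means model
SSA_classical = n * np.sum((Y.mean(axis=0) - Y.mean()) ** 2)  # classical ANOVA
print(SSA_cm, SSA_classical)           # identical with no missing data
```

For these data both computations give SSA = 25, illustrating the claim that the two approaches coincide when no observations are missing.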
In the previous section, I expressed the null hypotheses as contrasts of sums of cell means.
Alternatively, I can express the null hypotheses as contrasts of cell means. When no
observations are missing, the two ways of expressing the null hypotheses lead to identical
sums of squares. However, when observations are missing, the null hypotheses must be
expressed as contrasts of cell means. Here I use the data in Table 8.8-1 to show that the two
kinds of null hypotheses lead to the same sums of squares when there are no missing
observations.
The reader may find it helpful to refer to Table 8.8-1 to see which cell means are involved in
each contrast. The coefficient matrices that are used in testing these hypotheses are computed
from hypothesis matrices, H′, and identity matrices, I, as follows:
I want to test hypothesis (8.8-5) subject to the restrictions that . However, the rows of
the augmented matrix
are not linearly independent. For example, row 1 of is linearly dependent on rows 3 and 4
of :
In other words, the two restrictions in can be constructed from linear combinations of the
rows of . Hence, contains the restriction information in . As a result, the formula for
computing the treatment A sum of squares can be simplified. The formula is
However, when one or more observations are missing, formula (8.8-7) gives a different result.
The two restrictions in can be constructed from linear combinations of the rows of .
Hence, contains the restriction information in . As a result, the formula for computing
the block sum of squares also can be simplified. The formula is
When, as in Table 8.8-1, there are no missing observations, the three block formulas, (8.8-2),
(8.8-4), and (8.8-8), give the same result:
Suppose that observation Y22 in Table 8.8-1 is missing. For this case, one or more contrasts of
population cell means cannot be estimated. The first step in analyzing the data is to determine
which contrasts can be estimated. I begin by showing the cell mean hypotheses for the case in
which there are no missing observation and then modify the hypotheses to take into account
the missing observation.
Because observation Y22 is missing, I cannot test the treatment A cell mean null hypothesis:
Vertical and horizontal lines in the matrix and vectors identify (1) the missing cell mean, μ22, in
μ1; (2) the two contrasts, μ21 – μ22 and μ22 – μ23, in that cannot be estimated; and (3)
the corresponding 0s in 02(A). Although contrasts μ21 – μ22 and μ22 – μ23 in rows 2 and 4
cannot be estimated, the two rows can be added to obtain a new interpretable contrast, μ21 –
μ23 = 0, as follows:
Column 4 and rows 2 and 4 of as well as the corresponding 0s in 02(A) and the μ22 in μ1
must be deleted to obtain a modified matrix and modified vectors, 03(A) and μ2. The new
contrast with column 4 deleted can be added to ; a zero can be added to 03(A) for the new
contrast. The modified null hypothesis is
Because observation Y22 is missing, I cannot test the block null hypothesis:
Column 4 and row 2 of as well as the corresponding 0 in 02(bl) and μ22 in μ1 must be
deleted. The modified coefficient matrix and modified vectors are
or
also must be modified. The missing observation appears in both rows of the . Hence, neither
interaction effect can be estimated. However, a new interaction effect can be formed by adding
rows 1 and 2 of and deleting column 4 corresponding to the missing observation. The
modified restriction is
This restriction is linearly dependent on the three rows of and on the two rows of as
I show here:
Hence, and contain the restriction (8.8-9). The sums-of-squares formulas and the
values of treatment A and the blocks, subject to restriction (8.8-9), are as follows:
Table 8.8-2 ▪ ANOVA Table for RB-3 Design With a Missing Observation
According to Table 8.8-2, the null hypothesis for treatment A can be rejected. The next step in
the analysis is to determine which contrasts among the population means are not equal to zero.
The vector of restricted cell means is given by . The means are shown
in Table 8.8-3. One of the multiple comparison procedures in Chapter 5 can be used to test the
contrasts.
I have described the procedures for applying the restricted cell means model in detail because
the procedures generalize to other designs that have a block variable and use an interaction to
estimate error variance.
1.One treatment with p ≥ 2 treatment levels and one nuisance variable with w ≥ 2 levels.
2.Formation of w groups, each containing np homogeneous experimental units, where n ≥ 2.
The variability among the experimental units within each group should be less than the
variability among units in different groups.
3.Random assignment of the np units in each group to the treatment levels, with the
restriction that each treatment level contains n units. The design requires w sets of np
homogeneous experimental units—a total of N = npw units.
where
Yijz is the score for the ith experimental unit in cell jz.
μ is the grand mean of the population means, μ11, μ12, …, μpw. The grand mean is a
constant for all observations in the experiment.
αj is the treatment effect for population j and is equal to μj. – μ, the deviation of the
jth population mean from the grand mean. The jth treatment effect is a constant for all
observations in treatment population j.
ζz is the group effect for population z and is equal to μ.z – μ, the deviation of the
zth population mean from the grand mean. The group effect is a random variable that
is ; ζz is independent of εi(jz).
εi(jz) is the error effect associated with Yijz. The error effect is a random variable that is
.
This model, which is a mixed model (model III), is indistinguishable in form from a mixed model
for the completely randomized two-treatment factorial design (CRF-pq design) that is described
in Chapter 9. However, the two designs differ in the number of treatments and in the way the
experimental units are assigned to treatment levels (combinations). The GRB-p design has one
treatment, and the experimental units are randomly assigned to the treatment levels within
each group. The CRF-pq design has two treatments, and the experimental units are randomly
assigned to the pq treatment combinations. Many designs that are called completely
randomized factorial designs are actually generalized randomized block designs. The
confusion usually occurs when one of the factors in an experiment is an organismic variable.
The use of an organismic variable always restricts the assignment of experimental units to the
treatment combinations. For example, if the group variable is gender, a man cannot be
assigned to a treatment combination that calls for women; the man must be assigned to the
men group. The layouts for GRB-4 and CRF-22 designs are shown in Figure 8.9-1.
Figure 8.9-1 ▪ Layouts for GRB-4 and CRF-22 designs. For the GRB-4 design, the np = 8
subjects in each group are assigned randomly to the four treatment levels (Treat. Level),
aj. The dependent variable (Dep. Var.) is denoted by Yijz. For the CRF–22 design, the 32
subjects are randomly assigned to the four treatment combinations (Treat. Comb.), ajbk,
with the restriction that 8 subjects are assigned to each combination. The dependent
variable (Dep. Var.) is denoted by Yijk.
For purposes of illustration, assume that 32 helicopter pilots are available to participate in the
altimeter experiment described in Section 8.2. Eight of the pilots have less than 476 flight hours
(group 1), eight have between 477 and 1491 hours (group 2), eight have between 1492 and
2489 hours (group 3), and eight have 2490 or more hours (group 4). The eight pilots in each
group are randomly assigned to the levels of treatment A, with the restriction that each
treatment level contains two pilots. The appropriate design model is a mixed model because
the levels of treatment A (altimeters) are fixed, whereas those for groups (previous flying
experience) are random. If I replicated the experiment, I would use the same four altimeters, but
the pilots and the amounts of their flying experience would be different.
The first null hypothesis states that all population means for treatment A are equal. The second
null hypothesis states in effect that the means of the populations from which the four levels of
flying experience are random samples are equal. The third null hypothesis states that the
variables—altimeters and previous flying experience—do not interact. I have more to say about
interactions in Chapter 9. The .05 level of significance is adopted for the tests. The data and
computational procedures are given in Table 8.9-1; the results of the analysis are summarized
in Table 8.9-2.
According to Table 8.9-2, the hypothesis μ1. = μ2. = μ3. = μ4. is rejected. Because the
Treatment A × Groups interaction is nonsignificant, it is reasonable to believe that in the
population, the scores for each group have the same trend over the p treatment levels. The
interpretation of the results is more complicated when the interaction is significant; for a
discussion of the issues, see Sections 9.3 and 9.6. One feature in Table 8.9-2 deserves special
comment. The proper error term for testing MSA is MSG × A; MSWCELL is used to test MSG
and MSG × A. The choice of F statistic denominators is based on the principle that the
expected value of the denominator should contain all the terms in the numerator except the one
being tested. The expected values for the fixed-effects, random-effects, and mixed models are
given in Table 8.9-3. The sphericity assumption (see Section 8.4) is required for tests that use
MSA × G as the error term. Homogeneity of population within-cell variances is required for
those that use MSWCELL. The classical sum-of-squares computational procedures require
equal cell ns. If the cell ns are unequal, the cell means model approach described in Section
9.13 should be used.
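The denominator principle can be sketched with hypothetical mean squares (the Table 8.9-2 values are not reproduced here). Under the mixed model, with treatment A fixed and groups random, the expected mean squares dictate the pairings below:

```python
# Hypothetical mean squares for a GRB-4 design with w = 4 groups and
# n = 2 subjects per cell.  The denominator of each F is the mean square
# whose expectation contains every term in the numerator except the
# effect being tested.
p, w, n = 4, 4, 2
MSA, MSG, MSGxA, MSWCELL = 12.0, 4.5, 1.6, 1.4

F_A = MSA / MSGxA         # df = (p - 1), (p - 1)(w - 1)
F_G = MSG / MSWCELL       # df = (w - 1), p*w*(n - 1)
F_GxA = MSGxA / MSWCELL   # df = (p - 1)(w - 1), p*w*(n - 1)
print(F_A, F_G, F_GxA)
```

Note that only the test of MSA uses MSG × A; both MSG and MSG × A are tested against MSWCELL, as stated in the text.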
Tests of differences among treatment A means have the same general form as those in Section
8.5 for an RB-p design. If model I applies, the t statistic and the Fisher-Hayter qFH statistic are,
respectively,
with pw(n − 1) degrees of freedom. The use of MSWCELL is appropriate if the homogeneity of
variance assumption is tenable. If it isn't, the procedures described in Section 5.2 or 5.5 should
be used.
with (p − 1)(w − 1) degrees of freedom. The use of MSA × G is appropriate if the sphericity
assumption is tenable. If it isn't, the procedures described in Section 8.5 should be used. For
other models, the mean square that is used in the denominator of the omnibus F test should
be used in the denominator of the multiple comparison statistic.
where and .
where .
For a fixed effects model (model I), measures of strength of association are as follows:
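Because the displayed formulas were not reproduced above, the following sketch uses a standard, equivalent identity for partial omega squared in terms of an observed F, its effect degrees of freedom, and the total number of observations N; the function name and example values are illustrative.

```python
def partial_omega_sq(F, df_effect, N):
    """Partial omega squared from an observed F statistic (a standard
    identity; the book's own displays use equivalent mean-square forms).
    Negative estimates are conventionally set to 0."""
    est = df_effect * (F - 1.0) / (df_effect * (F - 1.0) + N)
    return max(est, 0.0)

# e.g., treatment A in a GRB-4 design with w = 4, n = 2 (N = 32) and a
# hypothetical F = 7.5 on 3 degrees of freedom:
print(round(partial_omega_sq(7.5, 3, 32), 3))
```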
Procedures for calculating power and the number of subjects necessary to achieve a desired
power follow those for a two-treatment completely randomized factorial design and are
described in Section 9.8. Also, the cell means model analysis procedures for a completely
randomized factorial design in Sections 9.13 and 9.14 can be used with the generalized
randomized block design.
The major advantages of the randomized block and generalized randomized block designs are
as follows:
1.Greater power relative to a completely randomized design for many research applications.
The designs permit a researcher to isolate the effects of one nuisance variable, thereby
reducing the experimental error and obtaining a more precise estimate of treatment effects.
2.Flexibility. Any number of treatment levels and blocks can be used in an experiment.
3.Simplicity in analysis of data.
4.The GRB-p design enables a researcher to test the significance of the treatment A × group
interaction. Furthermore, an appropriate error term is available for testing treatment A even
when the treatment and groups interact.
1.If a large number of treatment levels are included in the experiment, it becomes difficult to
form blocks or groups that have minimum within-block (group) variability.
2.In the fixed-effects and mixed (A random) models for an RB-p design, a test of treatment
effects is negatively biased if (πα)ij > 0.
3.The designs involve somewhat more restrictive assumptions—for example, sphericity—than
the completely randomized design.
1.Terms to remember:
a.subject matching (8.1)
b.repeated measures (8.1)
c. subjects-by-treatments design (8.1)
d.subject-by-trials design (8.1)
e.partial omega squared (8.2)
f. partial intraclass correlation (8.2)
g.covariance matrix (8.4)
h.compound symmetry (8.4)
i. type S matrix (8.4)
j. type H matrix (8.4)
k. sphericity condition (8.4)
l. conservative F test (8.4)
m.adjusted F test (8.4)
n.unrestricted cell means model (8.8)
o.restricted cell means model (8.8)
2.[4.1, 8.1] Compare the randomization procedures for CR-p and RB-p designs.
*3.[8.1] What are the four ways of forming homogeneous blocks in an RB-p design?
4.[8.1] For each of the following experiments, suggest an appropriate blocking variable:
a.The effect of jogging on heart rate
b.The effect of work table height on assembly line productivity
c. The effect of meaningfulness of nonsense syllables on the number of trials required to
learn the syllables
*5.[8.1] Show that
*6.[8.1] Derive the computational formulas for an RB-p design from the deviation formulas in
equation (8.1-3).
*7.It was hypothesized that brain-damaged patients would score lower on the Willner Unusual
Meanings Vocabulary Test (WUMV), which measures knowledge of unusual meanings of
familiar words, and the Willner-Sheerer Analogy Test (WSA) than on the vocabulary items of
the Wechsler Adult Intelligence Scale (WAIS). A random sample of 12 brain-damaged
patients took all three tests. The order of administration of the tests was randomized
independently for each patient. The dependent variable was the subject's standard score
on each test. According to the test norms, all three tests have a mean of 10 and a standard
deviation of 3. The following data were obtained. (Experiment suggested by Willner, W.
Impairment of knowledge of unusual meanings of familiar words in brain damage and
schizophrenia. Journal of Abnormal Psychology.)
*a.[8.2] Test the null hypothesis μ.1 = μ.2 = μ.3; let α = .05. Assume for this exercise and
for (b)–(d) that the additive model (Section 8.3) is appropriate and the sphericity
condition (Section 8.4) is tenable.
*b.[8.2] Calculate the power of the test of μ.1 = μ.2 = μ.3.
*c.[8.2] (i) For the null hypothesis μ.1 = μ.2 = μ.3, use the results of part (a) as a pilot
study and determine the number of blocks required to achieve a power of approximately
.88. (ii) Use Appendix Table E.13 to determine the number of subjects required to detect
a large effect; let 1 – β = .80 and assume that ρ = .80.
*d.[8.2] Compute and interpret the partial omega squared and the partial intraclass correlation for these data.
*e.[8.3] Use Tukey's procedure to determine whether the additive model is appropriate; let
α = .10.
*f.[8.3] Construct a figure like Figure 8.3-1 for and . Is the figure consistent with
Tukey's test for nonadditivity?
*g.[8.4] If you have access to a computer package that performs matrix operations,
compute the locally best invariant test statistic, V*; let α = .15. Interpret the result.
*h.[8.4] Compute the value of ε̂ and determine the critical value for an adjusted F test. The
use of F(adjusted) is not required for the test of μ.1 = μ.2 = μ.3; explain why this is so.
*i.[8.5] The null hypothesis in part (a) does not specifically address the researcher's
research interest. Formulate a more appropriate null hypothesis and test it using Dunn's
statistic; use MSRES in the formula and let α = .05. Use the TINV function in Excel (see
Section 5.4) to obtain the one-tailed critical value. Suggest at least one alternative
explanation for the observed differences among the means of the three tests.
*j.[8.7] (i) Determine the efficiency of the randomized block design relative to that of a
completely randomized design; use Fisher's correction. (ii) How many subjects would
be required in a completely randomized design to match the efficiency of the
randomized block design?
k. Prepare a “results and discussion section” for the Journal of Abnormal Psychology.
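The relative-efficiency computation called for in part (j) can be sketched in Python. The formula and Fisher's degrees-of-freedom correction follow the standard randomized block treatment; the mean squares and sample sizes below are hypothetical, not the exercise's data.

```python
# Sketch of the efficiency of an RB-p design relative to a completely
# randomized design, with Fisher's degrees-of-freedom correction.
# The numeric inputs are made up for illustration.

def relative_efficiency(ms_bl, ms_res, n, p):
    """Efficiency of an RB-p design (n blocks, p levels) relative to a CR-p design."""
    # Uncorrected relative efficiency: the CR-p error variance is estimated
    # from the RB-p block and residual mean squares.
    re = ((n - 1) * ms_bl + n * (p - 1) * ms_res) / ((n * p - 1) * ms_res)
    # Fisher's correction adjusts for the differing error degrees of freedom.
    df_rb = (n - 1) * (p - 1)   # RB-p residual degrees of freedom
    df_cr = p * (n - 1)         # CR-p within-groups degrees of freedom
    correction = ((df_rb + 1) * (df_cr + 3)) / ((df_rb + 3) * (df_cr + 1))
    return correction * re

# Hypothetical mean squares for n = 12 blocks and p = 3 treatment levels:
re = relative_efficiency(ms_bl=9.5, ms_res=2.0, n=12, p=3)
print(round(re, 2))  # → 2.12
```

A completely randomized design would need roughly re × n subjects per treatment level to match the precision of the randomized block design.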
*8.Six monkeys (Macaca mulatta) were trained to search a visual display for a target stimulus
(3-inch circle). The effects of zero, one, two, three, four, and five irrelevant stimuli
(geometric shapes) on the probability of a correct response were investigated. The subjects
had free access to a manipulandum with which they could turn on a visual display for 5
seconds. If they pressed the panel that contained the target stimulus, a banana-flavored
food pellet was delivered to a food cup. If the incorrect panel was pressed or if no response
was made, there was a 5-second period of darkness and time-out during which pressing
the manipulandum had no effect. It was hypothesized that the irrelevant stimuli that
appeared with the target stimulus would decrease the monkey's response accuracy. The
following data were obtained.
*a.[8.2] Test the null hypothesis μ.1 = μ.2 = … = μ.6; let α = .01. Assume for this exercise
and parts (b)–(d) that the additive model (Section 8.3) is appropriate.
*b.[8.2] Calculate the power of the test of μ.1 = μ.2 = … = μ.6.
*c.[8.2] For the null hypothesis μ.1 = μ.2 = … = μ.6, use the results of part (a) as a pilot
study and determine the number of blocks required to achieve a power of approximately
.94.
*d.[8.2] Compute and interpret the partial omega squared and the partial intraclass correlation for these data.
*e.[8.3] Use Tukey's procedure to determine whether the additive model is appropriate; let
α = .10.
*f.[8.3] Construct a figure like Figure 8.3-1 for and . Is the figure consistent with
Tukey's test for nonadditivity?
*g.[8.4] If you have access to a computer package that performs matrix operations,
compute the locally best invariant test statistic, V*; let α = .25. Interpret the result.
*h.[8.4] (i) Use the three-step testing procedure described in Section 8.4 to test the null
hypothesis μ.1 = μ.2 = … = μ.6. (ii) Compute the value of ε̂ and determine the critical
value for an adjusted F test.
*i.[8.5] Use Dunnett's statistic to determine which population means differ from the control
population (treatment level a1). (i) Compute the statistic using both MSRES and
MSRESi; use a one-tailed test and let α = .01. (ii) Does the choice of error terms affect
your decisions about differences among the population means? (iii) Based on part (g) or
(h), which error term is the most appropriate?
*j.[8.6] (i) Test the hypothesis that the relationship between the independent and
dependent variables is nonlinear. Use SSRES DEP. FROM LIN = SSRES – SSRESlin
with dfdep. from lin = dfRES – dfRESlin. (ii) What percent of the total trend in the sample
is accounted for by the nonlinear component?
*k.[8.8] (i) Determine the efficiency of the randomized block design relative to that of a
completely randomized design; use Fisher's correction. (ii) How many subjects would
be required in a completely randomized design to match the efficiency of the
randomized block design?
l. Prepare a “results and discussion section” for Perceptual and Motor Skills.
a.[8.2] Test the null hypothesis μ.1 = μ.2 = μ.3; let α = .01. Assume for this exercise and
for parts (b)–(d) that the additive model (Section 8.3) is appropriate and the sphericity
condition (Section 8.4) is tenable.
b.[8.2] Calculate the power of the test of μ.1 = μ.2 = μ.3.
c. [8.2] For the null hypothesis μ.1 = μ.2 = μ.3, use the results of part (a) as a pilot study
and determine the number of blocks required to achieve a power of approximately .97.
d.[8.2] Compute and interpret the partial omega squared and the partial intraclass correlation for these data.
e.[8.3] Use Tukey's procedure to determine whether the additive model is appropriate; let
α = .10.
f. [8.3] Construct a figure like Figure 8.3-1 for and . Is the figure consistent with
Tukey's test for nonadditivity?
g.[8.4] If you have access to a computer package that performs matrix operations,
compute the locally best invariant test statistic, V*; let α = .15. Interpret the result.
h.[8.4] (i) Use the three-step testing procedure described in Section 8.4 to test the null
hypothesis μ.1 = μ.2 = μ.3. (ii) Compute the value of ε̂ and determine the critical value
for an adjusted F test.
i. [8.5] (i) Use the Fisher-Hayter statistic to test all two-mean null hypotheses; let α = .01.
(ii) Justify the use of MSRES instead of MSRESi.
j. [8.6] (i) Test the hypothesis that the relationship between the independent and
dependent variables is nonlinear. Use both SSRES with df = (n − 1)(p − 1) and SSRES
DEP FROM LIN = SSRES – SSRESlin with dfdep from lin = dfRES – dfRESlin. (ii) Does
the choice of error terms affect your decision about nonlinearity in the population? (iii)
Based on parts (g)–(h), which error term is most appropriate? (iv) In this example,
SSDEP. FROM LIN = SSquad and SSRES DEP. FROM LIN = SSRESquad. Explain why this is
true. (v) What percent of the total trend in the sample is accounted for by the nonlinear
component?
k. [8.7] (i) Determine the efficiency of the randomized block design relative to that of a
completely randomized design; use Fisher's correction. (ii) How many subjects would
be required in a completely randomized design to match the efficiency of the
randomized block design?
l. Prepare a “results and discussion section” for the Journal of Speech and Hearing
Research.
10.
Pain thresholds to electrical stimulation were measured in 10 male volunteers using the
psychophysical method of limits. The independent variable was the rate of decrease in the
noxious stimulus in the descending trials; four rates were used ranging from slow to fast.
The subjects were instructed to say “pain gone” as soon as all of the pain had disappeared.
Following a series of practice trials, the subjects were given four trials with each of the four
rates. The order of presentation of the rates was randomized independently for each
subject. The dependent variable was the mean voltage applied when the subject said “pain
gone” and was measured in arbitrary units on a 0- to 100-point scale. (Experiment
suggested by Horland, A. A. & Wolff, B. B. Changes in descending pain threshold related
to rate of noxious stimulation. Journal of Abnormal Psychology.)
a.[8.2] Test the null hypothesis μ.1 = μ.2 = μ.3 = μ.4; let α = .05. Assume for this exercise
and for parts (b)–(d) that the additive model (Section 8.3) is appropriate and the
sphericity condition (Section 8.4) is tenable.
b.[8.2] Calculate the power of the test of μ.1 = μ.2 = μ.3 = μ.4.
c. [8.2] (i) For the null hypothesis μ.1 = μ.2 = μ.3 = μ.4, use the results of part (a) as a pilot
study and determine the number of blocks required to achieve a power of approximately
.84. (ii) Use Appendix Table E.13 to determine the number of subjects required to detect
a large effect; let 1 – β = .80 and assume that ρ = .60.
d.[8.2] Compute and interpret the partial omega squared and the partial intraclass correlation for these data.
e.[8.3] Use Tukey's procedure to determine whether the additive model is appropriate; let
α = .10.
f. [8.3] Construct a figure like Figure 8.3-1 for and . Is the figure consistent with
Tukey's test for nonadditivity?
g.[8.4] If you have access to a computer package that performs matrix operations,
compute the locally best invariant test statistic, V*; let α = .15. Interpret the result.
h.[8.4] (i) Use the three-step testing procedure described in Section 8.4 to test the null
hypothesis μ.1 = μ.2 = μ.3 = μ.4. (ii) Compute the value of ε̂ and determine the critical
value for an adjusted F test.
i. [8.5] (i) Use the Fisher-Hayter statistic to test all two-mean null hypotheses; let α = .05.
(ii) Justify the use of MSRES instead of MSRESi.
j. [8.6] (i) Assume that the levels of the independent variable differ by a constant amount.
Test the hypothesis that the relationship between the independent and dependent
variables is nonlinear. Use both SSRES with df = (n − 1) (p − 1) and SSRES DEP.
FROM LIN = SSRES – SSRESlin with dfdep from lin = dfRES – dfRESlin. (ii) Does the
choice of error terms affect your decision about nonlinearity in the population? (iii)
Based on part (g) or (h), which error term is most appropriate? (iv) What percent of the
total trend in the sample is accounted for by the nonlinear component?
k. [8.7] (i) Determine the efficiency of the randomized block design relative to that of a
completely randomized design; use Fisher's correction. (ii) How many subjects would
be required in a completely randomized design to match the efficiency of the
randomized block design?
l. Prepare a “results and discussion section” for the Journal of Abnormal Psychology.
a.[8.2] Test the null hypothesis μ.1 = μ.2 = … = μ.6; let α = .01. Assume for this exercise
and parts (b)–(d) that the additive model (Section 8.3) is appropriate.
b.[8.2] Calculate the power of the test of μ.1 = μ.2 = … = μ.6.
c. [8.2] (i) For the null hypothesis μ.1 = μ.2 = … = μ.6, use the results of part (a) as a pilot
study and determine the number of blocks required to achieve a power of approximately
.87. (ii) Use Appendix Table E.12 to determine the number of subjects required to detect
a large effect; let 1 – β = .87 and assume that ρ = .80.
d.[8.2] Compute and interpret the partial omega squared and the partial intraclass correlation for these data.
e.[8.3] Use Tukey's procedure to determine whether the additive model is appropriate; let
α = .10.
f. [8.3] Construct a figure like Figure 8.3-1 for and . Is the figure consistent with
Tukey's test for nonadditivity?
g.[8.4] If you have access to a computer package that performs matrix operations,
compute the locally best invariant test statistic, V*; let α = .25. Interpret the result.
h.[8.4] (i) Use the three-step testing procedure described in Section 8.4 to test the null
hypothesis μ.1 = μ.2 = … = μ.6. (ii) Compute the value of ε̂ and determine the critical
value for an adjusted F test. The use of F(adjusted) is required for the test of μ.1 = μ.2
= … = μ.6; explain why this is so.
i. [8.5] Use Dunnett's statistic to determine which population means differ from the control
population (treatment level a1). (i) Compute the statistic using both MSRES and
MSRESi; use a two-tailed test and let α = .01. (ii) Does the choice of error terms affect
your decision about differences among the population means? (iii) Based on part (g) or
(h), which error term is the most appropriate?
j. [8.7] (i) Determine the efficiency of the randomized block design relative to that of a
completely randomized design; use Fisher's correction. (ii) How many subjects would
be required in a completely randomized design to match the efficiency of the
randomized block design?
k. Prepare a “results and discussion section” for Perceptual and Motor Skills.
*12.
Following the procedure in Section 3.3 for a completely randomized design, derive the
following E(MS)s for a randomized block design. Assume an additive model.
*a.[8.3] E(MSA), E(MSBL), E(MSRES); assume that treatment A and blocks are fixed.
b.[8.3] E(MSA), E(MSBL), E(MSRES); assume that treatment A is fixed but blocks are
random.
13.
[8.3] Explain why the interpretation of an insignificant F statistic for treatment A for the
nonadditive model may be ambiguous.
14.
[8.3] Why is the .10 level of significance rather than a more stringent level of significance
often adopted for Tukey's test for nonadditivity of block and treatment effects?
*15.
[8.4] The error term for a randomized block design is normally smaller than that for a
completely randomized design. (a) What determines the relative size of the two error terms?
(b) Why might the error term for a randomized block design be larger than that for a
completely randomized design?
*b.[8.4] Assume that you want to test the following null hypothesis:
17.
[8.5] In computing multiple comparison statistics, either MSRES or MSRESi may be used. What are the relative merits of the two error terms?
*a.[8.8] Write (i) the classical experimental design model and (ii) the restricted cell means
model for this RB-3 design.
*b.[8.8] (i) For the restricted cell means model, write the null hypotheses for treatment A
and blocks using matrix notation. (ii) Use the Kronecker product to construct the
coefficient matrices for treatment A, blocks, and the treatment-block interaction.
*c.[8.8] Use the restricted cell means model to test the hypotheses in part (b)(i). Use the
formulas
Let α = .05.
*d.[8.8] Assume that Y21 is missing. Use the restricted cell means model to test the hypotheses in part (b)(i).
*19.
Exercise 8 described an experiment about the effect of irrelevant stimuli on the probability of
a correct response. Assume that a follow-up experiment was performed in which the
following data were obtained. This problem should be done with the aid of a computer and
matrix package.
*a.[8.9] Write (i) the classical experimental design model and (ii) the restricted cell means
model for this RB-3 design.
*b.[8.9] (i) For the restricted cell means model, write the null hypothesis for treatment A
and blocks using matrix notation. (ii) Use the Kronecker product to construct the
coefficient matrices for treatment A, blocks, and the treatment-block interaction.
*c.[8.9] Use the restricted cell means model to test the hypotheses in part (b)(i). Use the
formulas
Let α = .01.
*d.[8.9] Assume that Y41 is missing. Use the restricted cell means model to test the hypotheses in part (b)(i).
20.
Exercise 9 described an experiment concerning adaptation in the oral performance of
stutterers. Assume that a follow-up experiment was performed in which the following data
were obtained. This problem should be done with the aid of a computer and matrix
package.
a.[8.8] Write (i) the classical experimental design model and (ii) the restricted cell means
model for this RB-3 design.
b.[8.8] (i) For the restricted cell means model, write the null hypothesis for treatment A
and blocks using matrix notation. (ii) Use the Kronecker product to construct the
coefficient matrices for treatment A, blocks, and the treatment-block interaction.
c. [8.8] Use the restricted cell means model to test the hypotheses in part (b)(i). Use the
formulas
Let α = .01.
d.[8.8] Assume that Y32 is missing. Use the restricted cell means model to test the hypotheses in part (b)(i).
*21.
It was hypothesized that test anxiety and the resulting poorer performance among
schoolchildren is due to previous failures and negative evaluations in school. A
counterconditioning procedure was used to reduce test anxiety. Thirty children, 15 boys and
15 girls, were randomly assigned to one of three conditions with 5 in each condition. The
children in the counterconditioning group were shown neutral words, then pictures of
school-related scenes, and finally positive words. The children in a placebo condition saw
the neutral words, followed by school scenes, but no positive words. Those in the control
group did not experience any of the treatment procedures. The dependent variable was the
score on the Digit Span subtest of the Wechsler Intelligence Scale for Children. The
following data were obtained. (Experiment suggested by Parish, T. S., Buntman, A. D., &
Buntman, S. R. Effect of counterconditioning on test anxiety as indicated by digit span
performance. Journal of Educational Psychology.)
*c.[8.9] Compute and interpret for these data.
d.Prepare a “results and discussion section” for the Journal of Educational Psychology.
22.
In exercise 21, the effects of a counterconditioning procedure on test anxiety were
investigated. Suppose that a follow-up experiment was performed in which 30 boys
between ages 10 and 11 were assigned to five groups on the basis of the number of
negative school evaluations they had received. Six boys were assigned to each group. The
boys in each group were then randomly assigned to one of the three treatment conditions
described in exercise 21. The dependent variable was the score on the Digit Span subtest
of the Wechsler Intelligence Scale for Children. The following data were obtained.
a.[8.9] Test the null hypotheses: μ1. = μ2. = μ3., , and ; let α =
.05.
b.[8.9] Use the Fisher-Hayter statistic to test all two-mean null hypotheses for treatment
A.
*c.[8.9] Compute and interpret for these data.
d.Prepare a “results and discussion section” for the Journal of Educational Psychology.
3For a discussion of the relative merits of these statistics, see Keppel and Wickens (2004) and
4For alternative views of the mixed, nonadditive model, see Samuels, Casella, and McCabe
(1991) and the comments at the end of the article.
5More specifically, Tukey's procedure tests the hypothesis that a special type of contrast-
contrast interaction is equal to zero. For this contrast-contrast interaction, the coefficients of the
contrast are the and deviations. Contrast-contrast interactions are discussed in
Section 9.6.
6SPSS provides Mauchly's sphericity test, which is less powerful than the locally best invariant
test.
7This epsilon should not be confused with the epsilon that is used to denote an error effect.
The meaning of ε should be clear from the context.
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
10Two matrices, X1 and X2, that are conformable for multiplication (see Appendix D.2-4) are
orthogonal if and only if X1X2 = 0. If X1X2 = 0, the rows of X1 and X2 also are linearly
independent.
http://dx.doi.org/10.4135/9781483384733.n8
Ronald A. Fisher introduced the factorial design in 1926 (J. F. Box, 1978, p. 158). The design
represented a fundamental advance in the scientific method because it enabled researchers to
simultaneously investigate two or more treatments in the same experiment. It is the most widely
used design in the behavioral sciences, as an examination of recent volumes of behavioral
science journals will verify.
All factorial designs (1) have two or more treatments, (2) have two or more levels of each
treatment, and (3) combine the levels of the different treatments to form treatment combinations.
Factorial designs are constructed from three building block designs: the completely
randomized design, randomized block design, and Latin square design. These building block
designs are described in Section 2.2 and Chapters 4, 8, 14. The construction of factorial
designs by combining two completely randomized designs is described here. The construction
of other factorial designs is described in Chapters 10, 12, 13, 15, and 16.
One assumption that has guided the writing of this and subsequent chapters is that complex
designs are more easily understood when they are perceived as consisting of simple building
block designs. To help the reader identify the building block design, the letters CR, RB, or LS
are used in the designation of factorial designs. An exception is the split-plot factorial design,
whose designation is well established by contemporary usage.
The simplest factorial design, from the standpoint of data analysis and assignment of
experimental units to treatment combinations, is the completely randomized factorial design. A
design with two treatments is designated as a CRF-pq design, where the letters CR identify the
building block design, F indicates that it is a factorial design, and p and q denote the number of
levels of treatments A and B, respectively. A completely randomized factorial design is
appropriate for experiments that meet, in addition to the assumptions of the completely
randomized design described in Chapter 4, the following conditions:
1.Two or more treatments, with each treatment having two or more levels.
2.All levels of each treatment investigated in combination with all levels of every other
treatment. If there are p levels of one treatment and q levels of a second treatment, the
experiment contains p × q treatment combinations.
3.Random assignment of experimental units to the treatment combinations. Each
experimental unit receives only one treatment combination.
This is a good place to describe the scheme that I use to designate treatments in analysis of
variance (ANOVA) designs. Treatments are designated by the capital letters A through G. A
particular but unspecified level of a treatment is denoted by lowercase letters a through g and a
lowercase subscript—for example, aj and bk. If two unspecified levels of a treatment must be
differentiated, a prime is used after one of the subscripts. For example, two levels of treatment
A are aj and aj′. Specific levels of a treatment are denoted by number subscripts—for example,
a1 and a2. Besides specifying a particular treatment level, it is also necessary to indicate the
number of levels of a treatment. Seven lowercase letters are used for this purpose. The
complete designation scheme for treatments is as follows:
The letter S is used in the notation scheme to denote blocks or samples of subjects
(experimental units). The symbol si refers to a particular but unspecified block or subject. There
are n levels of si—that is, i = 1, …, n. This designation scheme uses 24 letters of the alphabet.
The remaining two letters, X and Y, are used to denote scores (observations).
A diagram of a CRF-23 design is shown in Figure 9.2-1. When all possible combinations of the
levels of treatments A and B occur together in an experiment, the treatments are said to be
completely crossed. A CRF-pq design has p × q treatment combinations: a1b1,
a1b2, …, apbq. The design requires n × p × q subjects, where n ≥ 1. The n × p × q subjects are
randomly assigned to the p × q treatment combinations, with the restriction that n subjects are
assigned to each combination. If one or more of the treatments is an organismic variable—for
example, gender, IQ, or year in school—it is necessary to deviate from this randomization
procedure. Obviously one cannot assign a woman to the treatment combination “man and high
reinforcement.” In such cases, the randomization procedure for a generalized randomized block
design in Section 8.9 is used in which one of the treatments corresponds to the w levels of the
group variable. The resulting design is a generalized randomized block design, GRB-p,
although researchers often incorrectly treat it as a completely randomized factorial design.
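The randomization procedure just described can be sketched in Python; the function name and subject labels are illustrative, not part of the text.

```python
import random

# Sketch of the randomization for a CRF-pq design: n * p * q subjects are
# randomly assigned to the p * q treatment combinations with the restriction
# that exactly n subjects are assigned to each combination.

def assign_crf(subjects, p, q, n, seed=None):
    if len(subjects) != n * p * q:
        raise ValueError("a CRF-pq design needs n * p * q subjects")
    combos = [(j, k) for j in range(1, p + 1) for k in range(1, q + 1)]
    shuffled = random.Random(seed).sample(subjects, len(subjects))
    # Each combination a_j b_k gets the next block of n shuffled subjects.
    return {c: shuffled[i * n:(i + 1) * n] for i, c in enumerate(combos)}

# CRF-23 design with n = 5 subjects per cell: 5 * 2 * 3 = 30 subjects.
assignment = assign_crf(list(range(1, 31)), p=2, q=3, n=5, seed=1)
for (j, k), cell in sorted(assignment.items()):
    print(f"a{j}b{k}: {sorted(cell)}")
```

As noted above, this unrestricted randomization is not appropriate when a treatment is an organismic variable such as gender; in that case, subjects are first grouped on the organismic variable.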
Figure 9.2-1 ▪ Layout for a completely randomized factorial design (CRF-23 design),
where n × 2 × 3 subjects are randomly assigned to the 2 × 3 = 6 combinations of
treatments A and B, with the restriction that n subjects are assigned to each
combination. The layout for this design is the same as that for a completely randomized
design except that subjects are assigned to treatment combinations, ajbk, instead of
treatment levels, aj.
In Figure 9.2-1, each group contains n subjects who are assigned to the same treatment
combination. The design in Figure 9.2-1 has 2 × 3 = 6 cells or groups. The minimum number of
subjects required for a CRF-23 design is n × 2 × 3 = 6, where n = 1. It is desirable, however, to
have more than one subject in each of the pq cells. If n = 1, it is not possible to compute a
within-cell estimate of error variance. In this case, the A × B interaction must be used to
estimate error variance under the often tenuous assumption that the interaction effects are
equal to zero. If n is equal to 5, the design in Figure 9.2-1 requires (5)(2)(3) = 30 subjects. As
the number of levels of a treatment increases, the number of subjects required increases
rapidly. For example, if treatment A had three instead of two levels, the required number of
subjects would increase from 30 to 45.
Recall from Chapter 4 that the numbers of subjects in each treatment level of a completely
randomized design do not have to be equal. This also is true for completely randomized
factorial designs. However, the analysis and interpretation are greatly simplified if the ns in each
cell are equal. Accordingly, researchers are encouraged to have equal ns.
Assume that a statistical consultant has been employed to help a police department evaluate
two features of its human relations course for new officers. The two features are the type of
beat to which officers are assigned during the course, treatment A, and the length of the
course, treatment B. Treatment A has three levels: upper-class beat, a1; middle-class beat, a2;
and inner-city beat, a3. Treatment B also has three levels: 5 hours of human relations training,
b1; 10 hours, b2; and 15 hours, b3. The dependent variable is the officer's attitude toward
minority groups following the course. A test developed and validated by the consultant is used
to measure the dependent variable.
The research hypotheses can be evaluated by testing the following null hypotheses:
The first two null hypotheses, H0: μ1. = μ2. = … = μp. and H0: μ.1 = μ.2 = … = μ.q, are the
familiar hypotheses that the means for the respective treatments are equal. The third null
hypothesis, H0: μjk − μj′k − μjk′ + μj′k′ = 0 for all j and k, states that all population interaction
effects, (αβ)jk, for treatments A and B are equal to zero. Two treatments are said to interact if
differences in performance under the levels of one treatment are different at two or more levels
of the other treatment. I have more to say about interactions shortly.
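The third null hypothesis can be made concrete with a short sketch: the interaction effect for cell jk is estimated from a p × q table of cell means as the cell mean minus the two marginal means plus the grand mean. The cell means below are hypothetical.

```python
import numpy as np

# Sketch of the interaction effects (alpha-beta)_jk computed from a p x q
# table of cell means as: cell mean - row mean - column mean + grand mean.

cell_means = np.array([[4.0, 6.0, 8.0],    # a1 at b1, b2, b3
                       [5.0, 7.0, 9.0]])   # a2 at b1, b2, b3

grand = cell_means.mean()
row = cell_means.mean(axis=1, keepdims=True)  # treatment A marginal means
col = cell_means.mean(axis=0, keepdims=True)  # treatment B marginal means
interaction = cell_means - row - col + grand

print(interaction)
# All entries are 0 here: the differences among the b's are the same at a1
# and a2, so these hypothetical treatments do not interact.
```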
To test the null hypotheses, 45 police recruits were randomly assigned to the nine treatment
combinations of the CRF-33 design with the restriction that five recruits were assigned to each
combination. All treatment levels of interest to the police commissioner were included in the
experiment. Thus, a fixed-effects model, model I, applies to the experiment.
Before testing the statistical hypotheses, descriptive statistics—means and standard deviations
—were computed and plots of standardized residuals, zi(jk), were constructed. Novice
researchers often omit the important steps of examining descriptive statistics and checking
model assumptions. Such an examination can turn up suspected data recording errors,
assumptions of the model that require additional scrutiny, and other anomalies that should be
looked into. Sometimes an examination of treatment means reveals only trivial effects that are
of no practical value even if they are statistically significant.
Descriptive statistics for the police attitude data (Table 9.3-2) are shown in Table 9.3-1.
Standardized residuals, zi(jk), for each of the nine treatment combinations are plotted in Figure
9.3-1(a). As discussed in Section 4.2, plots of standardized residuals are used to check for
outliers and to determine if the error effects, εi(jk), are normally distributed, have equal
variances, and are mutually independent. If the model assumptions for the error effects hold
and there are no outliers, then approximately 68% of the zi(jk) s should fall between −1 and +1,
approximately 95% between −2 and +2, and approximately 99.7% between −3 and +3. An
examination of Figure 9.3-1(a) suggests that there is no reason to question the normality
assumption. Furthermore, all of the zi(jk)s fall between −2.5 and +2.5; hence, there are no
outliers. The dispersions of the zi(jk) s appear fairly uniform, so the homogeneity of variance
assumption also appears tenable.
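The residual screening described above can be sketched in Python. Standardized residuals are the within-cell deviations divided by the estimated error standard deviation; the scores below are simulated, not the police attitude data.

```python
import numpy as np

# Sketch of the standardized-residual check for a CRF-33 layout with
# n = 5 subjects in each of the pq = 9 cells. Scores are simulated.

rng = np.random.default_rng(0)
scores = rng.normal(loc=30.0, scale=3.0, size=(5, 9))

resid = scores - scores.mean(axis=0)           # deviations from cell means
n, cells = scores.shape
mse = (resid ** 2).sum() / (cells * (n - 1))   # pooled within-cell variance
z = resid / np.sqrt(mse)                       # standardized residuals

# Under the model, roughly 68% of the z's fall in (-1, 1) and 95% in (-2, 2);
# values beyond 3 in absolute value would suggest outliers.
print(round(float(np.mean(np.abs(z) < 1)), 2),
      round(float(np.mean(np.abs(z) < 2)), 2))
```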
Figure 9.3-1 ▪ (a) The standardized residuals, zi(jk), are plotted for the nine treatment
combinations.
In Figure 9.3-1(b), the standardized residuals are plotted against the order in which the
observations were collected. If the independence assumption is tenable, the zi(jk) s should be
randomly scattered around zero with no discernible pattern. Nonindependence is indicated, for
example, if the zi(jk) s show a consistent downward or upward trend or have the shape of a
megaphone. Because the standardized residuals for combination a2b3 tend to increase as a
function of the order in which the measurements were collected, the statistical consultant would
want to review the data collection and recording procedures for this treatment combination.
After the consultant is satisfied with the exploratory analysis, the next step is a confirmatory
data analysis.
The layout of the CRF-33 design and computational procedures are shown in Table 9.3-2. The
labels of the ABS and AB summary tables are a convenient way of designating the sources of
variation represented by observations in these tables; the more letters in the label, the more
sources of variation. For example, both the ABS and AB summary tables contain information
about variation among the levels of treatments A and B, but information concerning subject
variation can be obtained only from the ABS summary table.
To compute the total sum of squares in a factorial design, construct a summary table that
contains all the sources of variation that the design estimates, for example, A, B, and S. This
summary table can be collapsed into smaller summary tables by summing the entries over one
or more sources of variation, a procedure that is continued until all required summary tables
have been constructed. For a CRF-pq design, only two summary tables are required: ABS and
AB.
The analysis of variance is summarized in Table 9.3-3. The table contains three a priori families
of contrasts: A, B, and AB. The null hypothesis for each family is evaluated at the α = .05 level of
significance. The conceptual unit for error rate in this example is the individual family. The
probability of falsely rejecting one or more omnibus null hypotheses (error rate experimentwise)
is 1 − (1 − .05)³ = .14. The relative merits of holding constant the error rate per family or some
larger conceptual unit, such as the experiment, have been widely debated. This issue is
discussed in Section 5.1. It can be shown that the three a priori families of contrasts are also
mutually orthogonal (see G. E. P. Box et al., 2005, chap. 7). The practice adopted here is to
make the family the conceptual unit for the error rate if, as in Table 9.3-3, the families are a
priori and orthogonal. Indeed, a CRF-pq design can be regarded as three sets of completely
randomized designs with prespecified orthogonal families of comparisons—that is, A, B, and
AB. According to the analysis in Table 9.3-3, the null hypotheses for treatment B and the AB
interaction are rejected. Procedures for following up a significant interaction are discussed in
the next section and in Section 9.6. If the AB interaction had not been significant, the
consultant would have performed multiple comparisons among the treatment B means. These
procedures are discussed in Section 9.5.
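The experimentwise error rate computed above follows the familiar relation 1 − (1 − α)^C for C independent families, which is simple to verify directly:

```python
# Experimentwise error rate for c independent families, each tested at
# level alpha: 1 - (1 - alpha)^c, as in the text's computation for the
# three families A, B, and AB.
def experimentwise_rate(alpha, c):
    return 1 - (1 - alpha) ** c

rate = experimentwise_rate(0.05, 3)  # three families: A, B, AB
```

Rounded to two places, the result is .14, matching the value in the text.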
A significant interaction is a signal that the interpretation of tests of treatments A and B must be
qualified. The first step in interpreting an interaction is to graph the means as in Figure 9.3-2.
An examination of Figure 9.3-2(a) reveals that the differences among the a1, a2, and a3 sample
means are quite small ( , and ), which is consistent with
the insignificant test for treatment A. But this tells only part of the story. If we examine
differences among the a1, a2, and a3 means at each level of treatment B, we discover some
large differences. For example, the differences between a1 and a3 at b1, b2, and b3 are,
levels of treatment A and large only at a3. As these examples illustrate, a significant interaction
is a red flag signaling that tests of treatments provide only a partial and often misleading
picture of what has happened in an experiment.
Additional examples of interactions are shown in Figure 9.3-3(a and b). As is customary, the
means are connected by straight lines. A significant interaction always appears as nonparallel
lines. If treatments do not interact, the lines are parallel as in Figure 9.3-3(c). A test of the null
hypothesis (αβ)jk = 0 for all j and k helps a researcher decide whether chance sampling error is
a reasonable explanation for the presence of nonparallel lines connecting sample means.
Additional steps in interpreting interactions are discussed in Section 9.6.
Figure 9.3-3 ▪ Treatments A and B are said to interact if any portions of the lines that
connect the means for aj and aj′ are not parallel. Parts (a) and (b) illustrate two
treatments that interact. Part (c) illustrates treatments that do not interact. The necessary
condition for the lines in (c) to be parallel is that (μ11 − μ21) − (μ12 − μ22) = 0 or μ11 − μ21
− μ12 + μ22 = 0.
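For a 2 × 2 subtable of cell means, the parallelism condition can be checked directly. A minimal sketch, using hypothetical means:

```python
# The no-interaction condition applied to a 2 x 2 subtable of cell means:
# (mu11 - mu21) - (mu12 - mu22) = 0 when the line segments are parallel.
# The means below are hypothetical.
def interaction_contrast(m11, m21, m12, m22):
    """Returns 0 when the two line segments are parallel."""
    return (m11 - m21) - (m12 - m22)

parallel = interaction_contrast(40, 30, 45, 35)     # 10 - 10 = 0
nonparallel = interaction_contrast(40, 30, 45, 25)  # 10 - 20 = -10
```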
A score, Yijk, in a completely randomized factorial design is a composite that reflects the effects
of treatments A and B, the AB interaction, and all other sources of variation that affect Yijk.
These latter sources of variation are collectively referred to as error variance. The experimental
design model for a score is
where
Yijk is the score for the ith experimental unit in the jkth treatment combination.
μ is the grand mean and is a constant for all observations in the experiment.
αj is the treatment effect for population j and is equal to μj. − μ, the deviation of the
jth population mean from the grand mean. The jth treatment effect is a constant for all
scores in treatment level aj and is subject to the restriction Σαj = 0.
βk is the treatment effect for population k and is equal to μ.k − μ, the deviation of the
kth population mean from the grand mean. The kth treatment effect is a constant for
all scores in treatment level bk and is subject to the restriction Σβk = 0.
(αβ)jk is the joint effect of treatment levels j and k (interaction of αj and βk) and is
equal to μjk − μj. − μ.k + μ. The interaction effect is subject to the restrictions
Σj(αβ)jk = 0 for all k and Σk(αβ)jk = 0 for all j.
εi(jk) is the error effect associated with Yijk and is equal to Yijk − μjk. The error effect
is a random variable that is NID(0, σ²ε).
The key features of the model can be summarized as follows: (1) The model is assumed to
reflect all sources of variation that affect Yijk. (2) Treatments A and B and the AB interaction
represent fixed effects; thus, equation (9.4-1) is a fixed-effects model (model I). (3) The error
effect, εi(jk), is (a) independent of other εs and is (b) normally distributed within each jk
treatment population with (c) mean equal to zero and (d) variance equal to σ²ε. Assumption (d)
can be restated as σ²11 = σ²12 = ⋯ = σ²pq = σ²ε, the familiar homogeneity of variance
assumption.
Procedures for using sample data to determine the tenability of some of these assumptions are
discussed in Sections 3.5, 4.2, and 9.3. Section 3.5 examines the consequences of violating the
assumptions.
The values of the parameters μ, αj, βk, (αβ)jk, and εi(jk) in equation (9.4-1) are unknown, but
they can be estimated from sample data as follows.
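The least-squares estimators replace each population mean with the corresponding sample mean: the grand mean estimates μ, marginal-mean deviations estimate αj and βk, and the cell-mean residue estimates (αβ)jk. A sketch with a hypothetical 3 × 3 table of cell means:

```python
# Least-squares estimates of the model terms from a hypothetical 3 x 3
# table of cell means (rows are levels of A, columns are levels of B).
means = [[33.0, 35.0, 38.0],
         [30.0, 31.0, 46.0],
         [32.0, 36.0, 49.0]]
p, q = len(means), len(means[0])

grand = sum(sum(row) for row in means) / (p * q)        # estimate of mu
a_eff = [sum(row) / q - grand for row in means]          # alpha-hat_j
b_eff = [sum(means[j][k] for j in range(p)) / p - grand  # beta-hat_k
         for k in range(q)]
ab_eff = [[means[j][k] - (grand + a_eff[j] + b_eff[k])   # (alpha-beta)-hat_jk
           for k in range(q)] for j in range(p)]
```

As required by the model restrictions, the estimated effects sum to zero over j, over k, and over each row and column of the interaction table.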
The partition of the total sum of squares, which is the basis for the analysis in Table 9.3-3, is
obtained by rearranging the terms in equation (9.4-2) as follows:
Squaring both sides of equation (9.4-3) and summing over all of the observations following the
examples in Sections 3.2 and 8.1 lead to the following partition of the total sum of squares.1
computational purposes. More convenient formulas are given in Table 9.3-2.2 Mean squares are
obtained by dividing the sums of squares by the degrees of freedom in Table 9.3-3.
Expected Values of Mean Squares for Fixed-Effects, Mixed, and Random-Effects Models
The police attitude experiment described in Section 9.3 involves a fixed-effects model (model I).
The particular levels of treatments A and B were included in the experiment because they were
the only levels of interest to the police commissioner. The expectations of the mean squares for
this model are shown in Table 9.4-1. If the levels of each treatment are selected randomly from
a population of levels, the model is called a random-effects model (model II). If the levels of one
or more, but not all, of the treatments are selected randomly, the model is a mixed model
(model III). Treatment effects are fixed or random depending on whether they were randomly
sampled. Interaction effects are random effects if they involve one or more random effects;
otherwise, they are fixed effects.
A comparison of the expectations of the mean squares, E(MS)s, for the fixed-effects, mixed,
and random-effects models is shown in Table 9.4-1. Much confusion has arisen concerning the
correct error term to use in testing main effects and interactions because of a failure to
distinguish among these three models. Recall from Section 4.8 that the numerator of an F
statistic should contain, in addition to the expected value terms in the denominator, one
additional term—the one being tested. Based on the E(MS)s in Table 9.4-1, the F statistics for
testing treatment A for models I and II are as follows:
The proper error term for other tests can be determined from Table 9.4-1.
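The choice of error term can be made concrete with hypothetical mean squares. Under the fixed-effects model (model I), treatment A is tested against MSWCELL; under the random-effects model (model II), the E(MS)s dictate MSAB as the denominator:

```python
# Error terms implied by the E(MS)s in Table 9.4-1, using hypothetical
# mean squares (these values are illustrative, not the book's data).
ms = {"A": 190.0, "B": 1543.0, "AB": 572.0, "WCELL": 91.0}

F_A_fixed = ms["A"] / ms["WCELL"]   # model I:  F = MSA / MSWCELL
F_A_random = ms["A"] / ms["AB"]     # model II: F = MSA / MSAB
```

Note that when MSAB exceeds MSWCELL, as here, the model II test of treatment A is less powerful than the model I test.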
The type of effect—fixed versus random—also determines the nature of the null hypothesis that
is tested. For example, the null hypotheses for treatment A, where the αjs are fixed effects or
random effects are, respectively,
The null hypothesis for fixed effects concerns only the j = 1, …, p population treatment effects
(α) that are represented in the experiment. For random effects, the null hypothesis concerns all
of the P population treatment effects, not just the p effects in the experiment, where p is less
than P. If the null hypothesis for a fixed-effects treatment is rejected, it can be concluded that
at least two of the p population treatment effects are not equal. If the null hypothesis for a
random-effects treatment is rejected, at least two of the treatment effects in the population of P
effects are not equal. The computation of sums of squares for models that involve fixed and
random effects is identical. The models differ in the selection of the treatment levels, expected
values of the MSs, and the nature of the null hypotheses that are tested.
where
Yijk is the score for the ith experimental unit in treatment combination ajbk.
αj is a random variable that is NID(0, σ²α).
βk is a random variable that is NID(0, σ²β).
(αβ)jk is a random variable that is NID(0, σ²αβ).
For the mixed model in which treatment A is fixed and treatment B is random,
Yijk is the score for the ith experimental unit in treatment combination ajbk.
βk is a random variable that is NID(0, σ²β).
In this model, the variance of (αβ)jk is defined as [p/(p − 1)]σ²αβ rather than as σ²αβ to
simplify the expected mean squares.
Tests of differences among means in a CRF-pq design have the same general form as those
given in Chapter 5. Recall from Section 5.1 that most multiple comparison procedures use one
of three test statistics: t, q, and F. Test statistics for making comparisons among treatment A
means assuming a fixed-effects model have the following form:
with v1 = s − 1 and v2 = pq(n − 1), where s denotes the number of means in a set.
Test statistics for making comparisons among treatment B means have the following form:
For random-effects and mixed models, the denominator of the F statistic that is used to test
treatments A and B should be used in the denominator of the t, q, and F multiple comparison
statistics.
The reader learned in Section 9.3 that if an interaction is significant, the interpretation of tests of
main effects must be qualified. When this occurs, a researcher may, in an attempt to better
understand the interaction, proceed to test hypotheses about simple main effects or
hypotheses about treatment-contrast interactions. I examine these two approaches next.
I begin by distinguishing between main effects and simple main effects. For purposes of
discussion, I assume a fixed-effects model. Main effects for treatments A and B have the
following form in which the grand mean is subtracted from a treatment mean:
or
Simple main effects have the following form in which a treatment mean is subtracted from a
treatment combination mean:
or
Simple main-effects hypotheses are deceptively simple. For example, H0: αj at b1 = 0 for all j
focuses on only the αj effects in treatment level b1. However, as I show later, a test of this
hypothesis is equivalent to testing a null hypothesis for a main effect plus an interaction effect:
H0: αj + (αβ)j1 = 0 for all j.
Formulas for computing mean squares for simple main effects are similar to those for main
effects.
More convenient computational formulas are given in Table 9.6-1. The police attitude data from
Table 9.3-1 are used to illustrate the computations. The results are summarized in Table 9.6-2.
By now the reader is familiar with partitioning sums of squares and degrees of freedom. It is
apparent from the data in Table 9.6-1 that
and that the degrees of freedom for SSA + SSAB are equal to the degrees of freedom for
SSA at bk:
and
The important point is that simple main-effects sums of squares represent a partition of a
treatment sum of squares plus an interaction sum of squares. A significant test of F = (MSA at
b1)/MSWCELL means that αj ≠ 0 for some j, or (αβ)j1 ≠ 0 for some j, or both—the interpretation
of the F test is ambiguous.
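This partition can be verified numerically. The sketch below uses a hypothetical 3 × 3 table of cell means with n = 5 and checks that the simple main-effects sums of squares for A, computed at each level of B, add up to SSA + SSAB:

```python
# Check that sum over k of SS(A at b_k) equals SS_A + SS_AB.
# Cell means are hypothetical; n = 5 observations per cell.
n = 5
means = [[33.0, 35.0, 38.0],
         [30.0, 31.0, 46.0],
         [32.0, 36.0, 49.0]]
p, q = 3, 3
grand = sum(sum(r) for r in means) / (p * q)
row = [sum(r) / q for r in means]                                # A marginal means
col = [sum(means[j][k] for j in range(p)) / p for k in range(q)]  # B marginal means

ss_a = n * q * sum((row[j] - grand) ** 2 for j in range(p))
ss_ab = n * sum((means[j][k] - row[j] - col[k] + grand) ** 2
                for j in range(p) for k in range(q))

# Simple main effects of A at each level of B
ss_a_at_b = [n * sum((means[j][k] - col[k]) ** 2 for j in range(p))
             for k in range(q)]
```

The equality holds exactly because the cross-product term between the A effects and the AB effects vanishes when summed over levels of B.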
In Chapter 4, I discussed the issue of the correct conceptual unit for error rate. Researchers
face the same issue in performing tests of simple main effects. In the absence of definitive
research, the best I can do is to describe approaches that are consistent with the testing
philosophy outlined in Section 5.1. A decision to perform simple main-effects tests is usually
made following an examination and statistical analysis of data. Furthermore, simple main-
effects sums of squares for a treatment are not mutually orthogonal. The procedure that I
recommend in such cases is to assign the same error rate to the collection of tests as that
allotted to the family or the experiment. In the police attitude example, the simple main-effects
sums of squares represent a partition of three families, A, B, and AB, with an overall experiment
error rate of .05 + .05 + .05 = .15. For a CRF-33 design, the number of simple main-effects
tests that can be performed is p + q = 3 + 3 = 6. The per-experiment error rate for the collection
of the six simple main-effects tests in this example can be held at or less than .15 by means of
Dunn's procedure, which is based on the Bonferroni inequality. Alternatively, I can use a
procedure described by Gabriel (1964, 1969), called the simultaneous test procedure, to
control the experimentwise error rate at or less than .15. This procedure is related to Scheffé's
method and Roy's (1953) union intersection principle. I illustrate the two procedures next.
With Dunn's procedure, each simple main-effects F statistic is evaluated at the αPF/(p + q) =
.15/6 = .025 level of significance. The critical value of the test statistic is F.15/6; 2, 36 = 4.10,
where v1 = 2 is the degrees of freedom of the simple main-effects mean square and v2 = 36 is
the degrees of freedom of the error mean square. With this criterion, two simple main-effects
hypotheses are rejected: H0: αj at b3 = 0 for all j and H0: βk at a3 = 0 for all k (see Table 9.6-2,
rows 6 and 9). A researcher would normally proceed next to evaluate contrasts among the
means embedded in these two hypotheses. These contrasts are called simple-effects
contrasts to distinguish them from main-effects contrasts. Scheffé's statistic can be used to
where the cjs and cks are coefficients that define the contrasts. The computational formulas for
Scheffé's statistic are
The second approach to controlling error rate, the simultaneous test procedure (STP), is similar
to Scheffé's procedure and is particularly well suited to data snooping. According to the
simultaneous test procedure, an F test statistic is significant if it exceeds (v1Fα; v1, v2)/v3, where α
is the significance level of the omnibus test, v1 is the degrees of freedom of the omnibus
numerator mean square, v2 is the degrees of freedom of the error mean square, and v3 is the
degrees of freedom of the mean square that is being tested.3 For example, the critical value for
F = (MSA at bk)/MSWCELL is (8F.15; 8, 36)/2 = (8)(1.64)/2 = 6.56. This follows because the
combined significance level of the omnibus tests of A, B, and AB is .05 + .05 + .05 = .15; the
degrees of freedom for the omnibus null hypothesis for A, B, and AB are 2 + 2 + 4 = 8; the
degrees of freedom of MSWCELL are 36; and the degrees of freedom of the simple main-
effects MSA at bk, v3, are p – 1 = 2. When the critical value 6.56 is used, the null hypothesis
H0: βk at a3 = 0 for all k is rejected. As a follow-up, all possible contrasts of the form
as the critical value. The values of the STP and Scheffé's statistics are equal when v3 is equal
to 1. In that case,
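The STP critical value is simple arithmetic once the tabled F value is in hand. The sketch below reproduces the worked computation, (v1Fα; v1, v2)/v3, using the tabled value F.15; 8, 36 = 1.64 cited in the text:

```python
# Simultaneous test procedure (STP) critical value: (v1 * F_alpha;v1,v2) / v3,
# where v1 is the combined omnibus numerator df and v3 is the df of the
# mean square being tested. The tabled F value 1.64 is from the text.
def stp_critical(f_table, v1, v3):
    return v1 * f_table / v3

crit = stp_critical(1.64, v1=8, v3=2)  # (8)(1.64)/2 = 6.56
```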
In concluding this discussion of simple main-effects tests, note that questions surrounding their
use have been the subject of continuing debate among statisticians. Boik (1975) has examined
some apparent inconsistencies and interpretation problems that can occur when the tests are
applied to designs with three or more treatments. Procedures for controlling the error rate and
achieving coherent tests have been discussed by Betz and Gabriel (1978). However, even with
refinements, a basic limitation of the simple main-effects approach remains: It partitions a
pooled sum of squares involving a treatment and an interaction. Tests of hypotheses about
simple main effects and contrasts that involve cell means may be interesting, but they do not
help us understand the interaction between two treatments. The approach described next
enables a researcher to gain a better understanding of the nature and sources of nonadditivity
in data.
Treatment-Contrast Interactions
Whenever two treatments interact, I know that some contrast for one treatment is different at
two or more levels of the other treatment. In other words, at least one contrast interacts with the
other treatment. Such interactions are called treatment-contrast interactions to distinguish
them from omnibus interactions.4 I now show how to follow up a significant omnibus interaction
by partitioning it into meaningful treatment-contrast interactions.
As usual, I begin by graphing the interaction. A graph of the AB interaction for the police
attitude experiment is given in Figure 9.3-2. The next step is to examine the data for interesting
contrasts of the form
and
Suppose that after examining Figure 9.3-2, the police commissioner is interested in the
following contrasts:
I know from the significant AB interaction that at least one contrast interacts with the other
treatment. To determine whether any of these interesting contrasts interact with the other
treatment, I can test the following treatment-contrast null hypotheses:
According to Table 9.6-3 (part v), hypotheses H0: αψ2(B) = δ for all j and H0: βψ2(A) = δ for all
k can be rejected. The significant test of H0: αψ2(B) = δ for all j tells me that contrast ψ2(B)—5
hours of human relations training versus 15 hours—is not the same at all levels of treatment A.
My interest turns to gaining an understanding of this treatment-contrast interaction. My next
question is as follows: Is contrast ψ2(B) different for the upper- and middle-class beats, ψ1(A)
= μ1k − μ2k, and is it different for the average of the upper- and middle-class beats and the
inner-city beat, ψ2(A) = [(μ1k + μ2k)/2 − μ3k]? To put it another way, does contrast ψ2(B)
interact with either of the two interesting treatment A contrasts? The null hypotheses are,
respectively,
Because these hypotheses are about the interaction of two contrasts, they are called contrast-
contrast interactions.
Similarly, the rejection of H0: βψ2(A) = δ for all k tells me that contrast ψ2(A)—average of the
upper- and middle-class beats versus the inner-city beat—is not the same at all levels of
treatment B. I wonder if contrast ψ2(A) is different for the 5- and the 10-hour human relations
courses, ψ1(B); the 5- and 15-hour courses, ψ2(B); and the 10- and 15-hour courses, ψ3(B)?
The corresponding null hypotheses are, respectively,
Note that the second hypothesis duplicates one of the preceding ones. Procedures for
computing contrast-contrast mean squares are illustrated in Table 9.6-4. The results of all tests
are summarized in Table 9.6-5. Notice that all contrast-contrast interaction mean squares have
one degree of freedom.
According to row 8 of Table 9.6-5, ψ2(A)—average of the upper- and middle-class beats versus
the inner-city beat—is not the same at all levels of treatment B. In following this up, I found, as
noted earlier, that the only interesting contrast that ψ2(A) interacts with is ψ2(B)—5 and 15
hours of human relations training (see row 10).
There is no reason to believe that two of the interesting contrasts interact with the other
treatment. These are ψ1(A) (see row 7 of Table 9.6-5) and ψ3(B) (see row 6 of Table 9.6-5). If
we are willing to act as if the null hypotheses for these treatment-contrast interactions are
tenable, we can go ahead and test the associated main-effects hypotheses: H0: ψ1(A) = μ1. −
μ2. = 0 (upper-class vs. middle-class beat) and H0: ψ3(B) = μ.2 − μ.3 = 0 (10 hours vs. 15
hours of training).5 Scheffé's statistics for testing these null hypotheses are
The critical value for Scheffé's test statistic is 2F.05; 2, 36 = 2(3.26) = 6.52. Neither null
hypothesis can be rejected. Many researchers would be reluctant to test hypotheses such as
H0: ψ1(A) = μ1. − μ2. = 0 and H0: ψ3(B) = μ.2 − μ.3 = 0, believing that the AB interaction
renders such contrasts meaningless. As I have shown, this is not the case. The significant AB
interaction indicates that at least one contrast is different at two or more levels of the other
treatment. It does not mean that all contrasts for one treatment interact with the other
treatment.
Earlier I showed that the sum of the simple main-effects sums of squares is not equal to the
omnibus interaction sum of squares. It is easy to show that for mutually orthogonal contrasts,
the sum of treatment-contrast interaction sums of squares is equal to the omnibus interaction
sum of squares. That is,
To illustrate, I use contrasts and that are defined by the coefficient vectors
The important point is that treatment-contrast interactions, unlike simple main effects, represent
a partition of the omnibus interaction.
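This partition can also be checked numerically. Using hypothetical cell means with n = 5 and the orthogonal linear and quadratic contrasts on treatment B, the two treatment-contrast interaction sums of squares add up to SSAB:

```python
# Check that treatment-contrast interaction sums of squares for a complete
# set of orthogonal contrasts on B sum to SS_AB. Cell means are hypothetical.
n = 5
means = [[33.0, 35.0, 38.0],
         [30.0, 31.0, 46.0],
         [32.0, 36.0, 49.0]]
p, q = 3, 3
grand = sum(sum(r) for r in means) / (p * q)
row = [sum(r) / q for r in means]
col = [sum(means[j][k] for j in range(p)) / p for k in range(q)]
ss_ab = n * sum((means[j][k] - row[j] - col[k] + grand) ** 2
                for j in range(p) for k in range(q))

def ss_a_by_psi(coef):
    """SS for the A x psi(B) interaction; psi is defined by coefficients on B."""
    psi = [sum(c * means[j][k] for k, c in enumerate(coef)) for j in range(p)]
    mean_psi = sum(psi) / p
    return n * sum((v - mean_psi) ** 2 for v in psi) / sum(c * c for c in coef)

linear, quadratic = (-1, 0, 1), (1, -2, 1)  # orthogonal trend contrasts on B
total = ss_a_by_psi(linear) + ss_a_by_psi(quadratic)
```

Because the q − 1 = 2 orthogonal contrasts span the zero-sum subspace for treatment B, the two interaction sums of squares exhaust the four-degree-of-freedom SSAB here contributed two degrees of freedom at a time.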
The presence of a significant interaction can certainly complicate a researcher's life. As the
reader has seen, the follow-up procedures are both complex and tedious. However, interactions
among variables are common and researchers must be prepared to deal with them. In this
section, I described two follow-up approaches: tests of simple main effects and tests of
treatment-contrast and contrast-contrast interactions. In the hands of a creative researcher, the
latter approach can provide useful insights into the nature and sources of nonadditivity in data.
When one or more treatments in a factorial experiment are quantitative, additional insight
concerning their effects can be obtained by partitioning the data into orthogonal trend
contrasts. Trend analysis was introduced in Chapter 6.
For quantitative treatments, a test of a treatment null hypothesis is also a test of the hypothesis
of no trend in the population means. Consider the police attitude data in Table 9.3-2. Treatment
A is not a quantitative variable and therefore is not suitable for trend tests. However, treatment
B is quantitative. A test of the hypothesis of no trend for treatment B is provided by
Because, according to Table 9.3-3, the F statistic for treatment B is significant, we can conclude
that the population dependent variable means are related in some manner to the independent
variable. A graph of the trend for the sample data is shown in Figure 9.7-1. If the omnibus test
for the presence of a trend is not significant, further tests for trends should not be made unless
a researcher has advanced a priori hypotheses concerning specific trends. A posteriori
procedures should be used for trend snooping, whereas a priori procedures can be used to
carry out planned trend tests.
A factorial experiment enables a researcher to make a trend test not previously described. This
test provides an answer to the following question: Is the trend of population means for one
treatment the same for all levels of a second treatment? This test has the form
According to the analysis of data presented in Table 9.3-3, the AB interaction is significant. A
graph illustrating the differences in trends for treatment B at the three levels of treatment A is
shown in Figure 9.3-2. If, as in the present example, the AB interaction is significant, interest
shifts from the omnibus treatment trend contrasts described in Sections 6.2 and 6.3 to
interaction trend contrasts.
I noted in Section 6.3 that SSBG in a completely randomized design can be partitioned into p −
1 trend contrasts. Similarly, if both treatments of, say, a CRF-33 design represent quantitative
variables, then the data can be partitioned into pq − 1 = 8 trend contrasts as shown in Table
9.7-1. The reader may have difficulty visualizing the meaning of linear × linear, linear ×
quadratic, and other trend contrasts. Suppose for illustrative purposes that treatment A in Table
9.3-2 also is a quantitative variable and that the levels are separated by equal intervals. A
response surface for these data is shown in Figure 9.7-2. I know from the significant AB
interaction that the profiles for treatment A at the q levels of B are not parallel and that the
profiles for treatment B at the p levels of A are not parallel. An examination of Figure 9.7-2
suggests that the profiles for treatment A, for example, are linear and quadratic in form. A
researcher might wonder whether the linear and quadratic components of the treatment A trend
interact with the linear component of treatment B; that is, are the linear × linear and quadratic ×
linear interactions significant? This type of analysis of the AB interaction provides the
researcher with an indication of the fit of differently shaped surfaces to the population means.
Table 9.7-1 ▪ Partition of Treatments and Interaction for CRF-33 Design When Both
Treatments Are Quantitative
If only one of the treatments in a CRF-33 design is quantitative, a different partition of the data
is required. For the data in Table 9.3-2, where only treatment B is quantitative, the partition in
Table 9.7-2 is appropriate. Profiles for treatment B at the three levels of A are shown in Figure
9.3-2. The question that the partition in Table 9.7-2 is designed to answer is, “Which trend
contrasts account for the differences in the profiles?” Procedures for carrying out these trend
analyses are presented in the following sections.
Table 9.7-2 ▪ Partition of Treatments and Interaction for CRF-33 Design When Only
Treatment B Is Quantitative
Although the AB interaction for the data in Table 9.3-2 is significant, a researcher might want to
know how much of the variation in the dependent variable is accounted for by the linear and
quadratic trend components in treatment B. We will assume that a researcher has advanced a
priori hypotheses with respect to both trend components. Alternatively, a researcher might
simply want to test the hypothesis that the trend is nonlinear. Procedures for testing these
hypotheses are identical to those in Section 5.5 and are not repeated here. Results of the
computations for treatment B are as follows:
It is apparent from the foregoing that the linear contrast is significant but the quadratic is not.
Therefore, the overall trend for treatment B shown in Figure 9.7-1 can be adequately described
by an equation of the form
Procedures for fitting a linear equation to the trend in Figure 9.7-1 are described in Section 6.2.
The linear component of the trend accounts for almost all of the variation in treatment B, as the
following computations show:
If, in a CRF-33 design, the two treatments represent quantitative variables, the four degrees of
freedom for the AB interaction can be partitioned as shown in Table 9.7-1. Procedures for
testing
are given in Table 9.7-3. The data are taken from Table 9.3-2; for purposes of illustration,
assume that both variables are quantitative and that the levels of treatment A also are
separated by equal intervals. The orthogonal coefficients in part (i) of Table 9.7-3 are obtained
from Appendix Table E.10.
In evaluating the significance of the F statistics in Table 9.7-3, we again face the issue of the
correct conceptual unit for the error rate. This issue is examined in detail in Sections 5.1 and
9.6, and there is little that I can add here. Because the four contrast-contrast interactions in
Table 9.7-3 are orthogonal, the question reduces to whether the tests are a priori or a posteriori.
Critical values for both cases are given in the table.
If one but not both treatments in a CRF-33 design represent a quantitative variable, the four
degrees of freedom for the AB interaction can be partitioned as shown in Table 9.7-2. A
significant test of F = MSAB/MSWCELL tells us that the trend for the quantitative variable is not
the same for all levels of the qualitative variable.
The treatment × trend-contrast interaction null hypotheses that I want to test are
where δ is a constant for a particular hypothesis. The analysis in Table 9.7-4 is used to
determine whether the AB interaction is due to differences in the linear or quadratic trend
components. The data are taken from Table 9.3-2. The hypothesis αψlin(B) = δ for all j is
rejected, which means that treatment A interacts with the linear contrast. Differences in the
linear contrast for the three levels of treatment A account for
9.8 Estimating Strength of Association, Effect Size, Power, and Sample Size
Strength of Association
A significant F statistic indicates that some relationship exists between the dependent variable
and the independent variable. Although this information is useful, the point was made in
Section 4.4 that an F statistic provides no information about the strength of the association.
Indexes of strength of association for a CRF-pq design are computed for the police attitude data
in Table 9.3-2. Because treatments A and B are fixed effects, the appropriate measure of
association is partial omega squared:
where
According to Cohen's guidelines, there is a large association between attitudes and (1)
treatment B and (2) the AB interaction. Because the F statistic for treatment A is not significant,
the omega squared for this treatment represents chance association.
Estimates of the variance components can be obtained from the expected values of the mean
squares in Table 9.4-1. According to the table, the expected values of the mean squares are
Indexes of strength of association for a mixed model in which the levels of treatment A are fixed
effects but those for treatment B are random effects are as follows:
where
Partial omega squared for a fixed-effects model also can be computed from a knowledge of F,
n, p, and q. The alternative formulas are
These formulas are useful for assessing the practical significance of published research where
only the F statistic and degrees of freedom are provided.
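For a fixed-effects treatment in a CRF-pq design, the alternative formula for partial omega squared takes the form df(F − 1)/[df(F − 1) + npq], where df is the degrees of freedom of the effect being tested. A sketch with a hypothetical F value:

```python
# Partial omega squared recovered from a published F statistic and the
# design dimensions. The F value below is hypothetical.
def partial_omega_sq(F, df_effect, n, p, q):
    num = df_effect * (F - 1)
    return num / (num + n * p * q)

w2_b = partial_omega_sq(F=16.0, df_effect=2, n=5, p=3, q=3)
```

With F = 16, 2 degrees of freedom, and npq = 45 observations, the estimate is 30/75 = .40, a large association by Cohen's guidelines.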
Effect Size
A second approach to assessing the practical significance of research results was introduced in
Section 4.4. This approach, popularized by Jacob Cohen (1988), uses a function of differences
among means called an effect size. You learned in Section 4.4 that a sample estimate of effect
size, denoted by , can be computed from omega squared. For example, the formula for
computing from the partial omega squared for treatment A is
The same formula with the appropriate partial omega squared can be used to compute for
treatment B and the AB interaction.
Procedures for calculating power and the number of subjects necessary to achieve a specified
power for a completely randomized design were described in Section 4.5. These procedures
generalize to a completely randomized factorial design. The police attitude data in Table 9.3-2
are used to illustrate the computation of power. Formulas for computing are as follows:
The last formula, , does not appear to follow the pattern for and
. The reader probably expected to see . The formula for has the general
form
Power is the probability of rejecting a false null hypothesis. Recall that the null hypothesis for
treatment A was not rejected. It is possible that the null hypothesis is really false but the test
lacked adequate power to reject the false null hypothesis. The power of the test of treatment A
can be determined from
where
and v1 = (3 − 1)(3 − 1) = 4 and v2 = (3)(3)(5 − 1) = 36. According to Tang's charts, the power of
the test of the AB interaction is approximately .85.
Tang's charts can be used to estimate the number of observations in a completely randomized
factorial design that are required to achieve a given power. To accomplish this, it is necessary to
specify the (1) degrees of freedom for the treatments and interaction; (2) level of significance,
α; (3) power, 1 – β; (4) size of the population variance, ; and (5) sum of the squared
population treatment effects and interaction effects: , and . From
trial and error, values of n can be inserted in the formulas
until the desired power is obtained. In all likelihood, the required ns for treatments A and B and
the AB interaction differ. The largest n should be selected. The number of subjects required for
the experiment is npq.
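Because Tang's charts are not reproduced here, the trial-and-error search for n can be sketched by Monte Carlo simulation instead. Everything below (the function names, the illustrative effect values, the replication count, and the simulated critical value) is my own scaffolding, not the book's chart-based procedure:

```python
import random

# Approximate the power of the test of treatment A in a CRF-33 design by
# parametric simulation: the 5% critical value is estimated from null-model
# simulations, and power is the rejection rate under the assumed effects.

def f_stat_for_a(data, p, q, n):
    """F = MSA / MSWCELL for a balanced two-treatment factorial."""
    grand = sum(y for row in data for cell in row for y in cell) / (n * p * q)
    a_means = [sum(y for cell in data[j] for y in cell) / (n * q)
               for j in range(p)]
    msa = n * q * sum((m - grand) ** 2 for m in a_means) / (p - 1)
    sswcell = sum((y - sum(cell) / n) ** 2
                  for row in data for cell in row for y in cell)
    return msa / (sswcell / (p * q * (n - 1)))

def simulate_f(alpha, p, q, n, reps, rng):
    out = []
    for _ in range(reps):
        data = [[[rng.gauss(alpha[j], 1.0) for _ in range(n)]
                 for _ in range(q)] for j in range(p)]
        out.append(f_stat_for_a(data, p, q, n))
    return out

def power_for_a(alpha, p, q, n, reps=2000, seed=1):
    rng = random.Random(seed)
    null = sorted(simulate_f([0.0] * p, p, q, n, reps, rng))
    crit = null[int(0.95 * reps)]          # simulated 5% critical value
    alt = simulate_f(alpha, p, q, n, reps, rng)
    return sum(f > crit for f in alt) / reps

alpha = [-0.49, 0.0, 0.49]   # roughly f = .40 with sigma = 1 (large effect)
for n in range(2, 12):       # trial and error over n
    if power_for_a(alpha, 3, 3, n) >= 0.80:
        required_n = n
        break
print(required_n)  # close to the n = 7 given by Appendix Table E.13
```

The simulated answer is subject to Monte Carlo error, so it lands near, not necessarily exactly on, the tabled n.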
A simpler method for determining the required n for treatments A and B is to specify the number of levels of the treatments, α, 1 – β, and either ω² or f. Suppose
a researcher specifies that p = 3, q = 3, α = .05, 1 – β = .80, and f = .40 (a large effect size).
The required sample size for treatment A can be determined from Appendix Table E.13 for v1 =
p − 1 = 3 − 1 = 2 and v2 = pq(n − 1), where v1 and v2 denote the
degrees of freedom for the F test of treatment A. The value of n is obtained from the column
headed by f* = f = .400 and the row labeled 1 − β = .80. According to Appendix Table E.13, the
sample n is 7. The experiment requires npq = (3)(3)(7) = 63 subjects. Because treatments A
and B have the same number of degrees of freedom, the required n for treatment B also is 7. If
n = 7 is used, the power of the test of the AB interaction is less than .80 because interaction
effects are computed from n observations, whereas treatment effects are computed from np or
nq observations.
Tang's charts in Appendix Table E.12 can be used with either f or ω². To use f, the charts are
entered with φ̂_A = f√(nq) or φ̂_B = f√(np), where v1 = p − 1 for treatment A and q − 1 for
treatment B, v2 = pq(n′ − 1), and n′ = a trial value of n. To use ω², the charts are entered with
Sometimes it is necessary to derive the E(MS)s for a complex design to determine the proper
denominator for an F statistic. I illustrated a tedious procedure for deriving expected values in
Section 3.3. Fortunately, a simple alternative procedure described here leads to the same
results. The procedure can be used with any design that is constructed from CR-p and RB-p
building block designs. An equal number of experimental units in each treatment level or
treatment combination are assumed.
For purposes of illustration, expected values of mean squares are derived for a CRF-pq design.
The procedure uses the following six rules.
Rule 2. Construct a two-way table. See Table 9.9-1 for details of the layout.
a. Row headings in part (i) of the table consist of the terms on the right side of the model equation.
Rule 3. Entries below each column heading in part (ii) are as follows:
a. If the column heading i, j, or k in part (ii) appears as a subscript of a row term in part (i) but
not in parentheses, enter in the row the sampling term—1 – n/N, 1 − p/P, or 1 – q/Q—
appropriate for that column. The lowercase letter in the sampling term stands for the
number of levels of the treatment that are in the experiment; the uppercase letter, the
number in the population.
b. If the column heading does not appear as a subscript of a row term, enter in the row the
number of levels, n, p, or q, appropriate for that column.
c. If the column heading appears in the subscript of a row term but within parentheses, enter the
number 1 in the row.
Rule 4. Entries in part (iii) for a given row consist of the variances for those terms in the model
that contain all of the subscripts of the part (i) row term. No distinction is made between
subscripts in parentheses and those not in parentheses. For example, the subscript of αj in row
1 is j. Variances for the terms in the model that contain the subscript j—εi(jk), (αβ)jk, and αj—
are σ²_ε, σ²_αβ, and σ²_α; these variances are entered in part (iii) of row 1.
Rule 5. Coefficients of the variances in part (iii) for a given row are obtained by covering up the
column(s) in part (ii) headed by subscripts(s) that appear in the row of part (i), but not including
subscripts in parentheses. Multiply the variances in part (iii) by the uncovered terms in part (ii)
from the row for that variance. Consider the first row; its subscript is j, so column j is covered,
leaving columns i and k uncovered. The coefficients for σ²_α are n and q from the first row, the
coefficients for σ²_αβ are n and 1 − q/Q from the third row, and the coefficients for σ²_ε are 1 − n/N
and 1 from the fourth row.
Rule 6. The sampling term—1 – p/P, 1 – q/Q—in part (iii) equals zero if the corresponding
treatment or interaction in the experimental design model represents a fixed effect and 1 if the
effect is a random effect. For example, if treatment A is a fixed-effects treatment, the number of
treatment levels in the experiment is equal to the number in the population: hence, p = P and 1
– p/P = 1 − 1 = 0. If treatment A is a random-effects treatment, the p treatment levels in the
experiment are assumed to be very small relative to the P levels in the population. Hence, p/P ≅
0 and 1 – p/P = 1 − 0 = 1. The error effect εi(jk) is considered a random variable; thus, 1 – n/N
is always equal to 1.
After the table is completed, the expected values of mean squares for fixed-effects, random-
effects, and mixed models can be readily determined. For example, if treatments A and B are
fixed effects, their respective sampling terms, 1 – p/P and 1 – q/Q, are equal to zero, and all
products that involve these sampling terms equal zero. If, however, treatments A and B are
random effects, their respective sampling terms, 1 – p/P and 1 – q/Q, are equal to 1.
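The six rules lend themselves to a short algorithm (a Cornfield-Tukey-style tabulation). The sketch below is my own scaffolding, not the book's table: it builds the part (ii) entries for a CRF-pq design and returns the E(MS) coefficients for fixed, random, or mixed models.

```python
def expected_ms(term, model, levels, random_subscripts):
    """Return {source: coefficient} for the E(MS) of `term`.

    model maps each source to (live_subscripts, nested_subscripts),
    where nested subscripts are the ones written in parentheses.
    For fixed effects, the returned variance component stands for the
    sum of squared effects divided by its degrees of freedom.
    """
    def entry(row, col):
        live, nested = model[row]
        if col in live:   # rule 3a with rule 6 applied: 1 - p/P is 0 or 1
            return 1 if col in random_subscripts else 0
        if col in nested:                 # rule 3c
            return 1
        return levels[col]                # rule 3b

    live_t, nested_t = model[term]
    all_t = set(live_t) | set(nested_t)
    uncovered = [c for c in levels if c not in live_t]   # rule 5
    ems = {}
    for row, (live, nested) in model.items():
        if all_t <= set(live) | set(nested):             # rule 4
            coeff = 1
            for c in uncovered:
                coeff *= entry(row, c)
            if coeff:
                ems[row] = coeff
    return ems

# CRF-pq with n = 5, p = 3, q = 3; subscripts: i = subjects, j = A, k = B
model = {"A": ("j", ""), "B": ("k", ""), "AB": ("jk", ""), "E": ("i", "jk")}
levels = {"i": 5, "j": 3, "k": 3}

print(expected_ms("A", model, levels, random_subscripts="i"))    # fixed model
print(expected_ms("A", model, levels, random_subscripts="ijk"))  # random model
```

For the fixed model the result is {"A": 15, "E": 1}, that is, E(MSA) = σ²ε + nq(Σα²j)/(p − 1); making A and B random adds the nσ²αβ term, in agreement with rule 6.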
In writing out the E(MS)s in part (iii) of Table 9.9-1, the variances for fixed effects should be
replaced by sums of squared effects divided by their respective degrees of freedom. For
example, because treatments A and B are fixed, σ²_α should be replaced by Σα²_j/(p − 1) and
σ²_β by Σβ²_k/(q − 1). If, as in this example, the interaction involves all fixed effects, it too should
be replaced by the sum of the squared interaction effects divided by the degrees of freedom;
that is, σ²_αβ is replaced by ΣΣ(αβ)²_jk/[(p − 1)(q − 1)]. The resulting expected values of mean squares
and the form of the F statistics for this fixed-effects model are as follows:
Suppose that A is a fixed effect and B is a random effect. Then the sampling term 1 – q/Q is
equal to 1; 1 – p/P is equal to zero as before. The expected values of mean squares and the
form of the F statistic for this mixed model are as follows:
Another example showing the application of the six rules appears in Table 9.9-2 for a CRF-pqr
design. This design is described in Section 10.2. The design has three treatments. The
experimental design model equation is
If treatments A, B, and C are random effects, the expected values of mean squares and the
forms of the F statistics are as follows:
For this random-effects model, all of the sampling terms are equal to 1. An examination of the
E(MS)s for this design reveals that the model does not provide error terms for testing the three
main effects. The mean square error term for testing MSA, for example, should estimate the
following terms:
σ²_ε + nσ²_αβγ + nrσ²_αβ + nqσ²_αγ
It is apparent that this mean square does not exist. There are several solutions to this problem,
depending on the outcome of tests of the interactions. If all of the interactions are significant, a
researcher would not ordinarily be interested in tests of main effects. Hence, for this case, the
problem of testing main effects does not come up. A second approach involves performing
preliminary tests on the model. Suppose that a test of, say, F = MSAB/MSABC is not significant
at the .25 level of significance. The inclusion of (αβ)jk in the model equation and, hence, of nrσ²_αβ
in the E(MS) for treatment A is open to question. If a researcher concludes that σ²_αβ = 0, the
proper error term for testing treatment A is MSAC, as the modified expectations show:
I discuss preliminary tests on the model and pooling procedures in Section 9.11. A third
approach to the problem involves piecing together an error term that has the proper form for
testing main effects. I discuss this procedure in Section 9.10.
The model equations for some of the designs that have been presented include a subscript
followed by one or more subscripts in parentheses. As I discussed in Section 2.2, this notation
indicates that an effect is nested—that is, restricted to one level or combination of other
treatments. For example, the subscript parentheses in εi(jk) for the model equation
indicate that the ith experimental unit is nested within the jkth treatment combination. However,
the subscripts in (αβ)jk are not followed by parentheses: hence treatments A and B are
crossed. Information about effects that are nested or crossed is used to derive the E(MS)s (see
rule 3).
In the previous section, I showed that mean squares having appropriate expected values for
constructing F statistics are not always available. Under these conditions, an error term that
does have the appropriate expected values can be pieced together. The resulting F statistics
are called quasi F statistics and are designated by the symbols F′ and F″.
Earlier I showed that the error term for testing MSA in a CRF-pqr design, assuming a random-
effects model, should have the following expected value:
σ²_ε + nσ²_αβγ + nrσ²_αβ + nqσ²_αγ
Although a mean square with this expected value does not exist, a composite mean square with
the required expected value can be pieced together. The error term, following the procedure
suggested by Satterthwaite (1946), is MSerror = MSAB + MSAC − MSABC. This can be shown
as follows:
E(MSAB + MSAC − MSABC) = (σ²_ε + nσ²_αβγ + nrσ²_αβ) + (σ²_ε + nσ²_αβγ + nqσ²_αγ) − (σ²_ε + nσ²_αβγ)
= σ²_ε + nσ²_αβγ + nrσ²_αβ + nqσ²_αγ
The resulting quasi F statistic has the general form
F′ = MS1/(MS2 + MS3 − MS4)
where MS1 represents the mean square that is tested. The degrees of freedom for the
denominator of this statistic, v2, are the nearest integral value of
v2 = (MS2 + MS3 − MS4)²/(MS2²/df2 + MS3²/df3 + MS4²/df4)
where df2, df3, and df4 are the degrees of freedom for the respective mean squares. The
degrees of freedom for the numerator of the F′ statistic are the regular df for MS1.
One problem inherent in the F′ statistic is the possibility of obtaining a negative denominator.
Cochran (1951) proposed that Satterthwaite's method be modified by adding mean squares to
the numerator and denominator rather than subtracting them from the denominator. This
statistic, designated as F″, has the general form
F″ = (MS1 + MS4)/(MS2 + MS3)
The degrees of freedom for the numerator and denominator are equal to the nearest integral
values, respectively, of
v1 = (MS1 + MS4)²/(MS1²/df1 + MS4²/df4)  and  v2 = (MS2 + MS3)²/(MS2²/df2 + MS3²/df3)
A quasi F″ statistic for testing treatment A for the CRF-pqr design just described is given by
F″ = (MSA + MSABC)/(MSAB + MSAC)
It can be shown that this statistic has the required form in terms of expected values of mean
squares: when σ²_α = 0, E(MSA + MSABC) = 2σ²_ε + 2nσ²_αβγ + nrσ²_αβ + nqσ²_αγ = E(MSAB + MSAC).
Quasi F′ and F″ statistics for testing treatments B and C in a CRF-pqr design, assuming a
random-effects model, have the following form:
The sampling distribution of a quasi F statistic when the null hypothesis is true is not central F,
although the latter distribution may be used as an approximation. Conditions under which the
approximation holds have been examined by Gaylor and Hopper (1969).
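The F′ and F″ computations, together with their approximate degrees of freedom, can be sketched as follows (the function names and the illustrative mean-square values are hypothetical):

```python
def quasi_f_prime(ms1, df1, ms2, df2, ms3, df3, ms4, df4):
    """F' = MS1/(MS2 + MS3 - MS4); denominator df by Satterthwaite (1946)."""
    denom = ms2 + ms3 - ms4            # can be negative, a known drawback
    v2 = denom ** 2 / (ms2 ** 2 / df2 + ms3 ** 2 / df3 + ms4 ** 2 / df4)
    return ms1 / denom, df1, round(v2)

def quasi_f_double_prime(ms1, df1, ms2, df2, ms3, df3, ms4, df4):
    """F'' = (MS1 + MS4)/(MS2 + MS3); both dfs are approximated."""
    num, den = ms1 + ms4, ms2 + ms3
    v1 = num ** 2 / (ms1 ** 2 / df1 + ms4 ** 2 / df4)
    v2 = den ** 2 / (ms2 ** 2 / df2 + ms3 ** 2 / df3)
    return num / den, round(v1), round(v2)

# Testing MSA with MS2 = MSAB, MS3 = MSAC, MS4 = MSABC (made-up values):
print(quasi_f_prime(40.0, 2, 12.0, 4, 10.0, 4, 4.0, 8))
print(quasi_f_double_prime(40.0, 2, 12.0, 4, 10.0, 4, 4.0, 8))
```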
The selection of an experimental design and associated design model is largely based on a
researcher's subject-matter knowledge. Other factors that influence the selection of a design
are discussed in Section 2.4. The model selected should include all sources of variation in
which a researcher is interested and that are expected to contribute to the total variation. All
sources of variation not specifically included in the model as treatment or nuisance effects
become a part of the error variance.
A factorial design should be used whenever a researcher expects that interaction terms are
important sources of variation. Unfortunately, factors other than an interest in higher-order
interactions may lead a researcher to use a complex factorial design: availability of a particular
computer software package and a familiarity with the layout and analysis of a particular
complex design. I have mentioned two situations. In the first, interactions are included in the
model by choice. In the second, they are included for convenience. Researchers in the second
category typically have little commitment to the experimental design model that is adopted.
Procedures discussed in this section are concerned with using data obtained in an experiment
to make preliminary tests on the appropriateness of a particular model. Statisticians disagree
on whether to adhere to the model specified at the beginning of an experiment, even though
the data suggest that it is incorrect, or whether it is permissible or even desirable to modify the
model along lines suggested by the data. The procedures recommended here represent a
middle-of-the-road position with respect to the issue of preliminary tests and pooling.
Assume that a CRF-pqr design has been used in an experiment and that the treatment levels
were selected randomly. The model equation for this design is
The experimental design model and expected values of the mean squares determine the form
of the tests of the null hypotheses. If there is a question about whether interaction terms should
appear in the model, preliminary tests can be performed. Such tests are designed to revise or
confirm the specification of parameters included in the model.
The first component usually tested is the highest-order interaction—in this case, the three-treatment
interaction (αβγ)jkl. According to the expected values of the mean squares for this design
in Section 9.9, the F statistic for the test of H0: σ²_αβγ = 0 has the form
F = MSABC/MSWCELL
If this test is insignificant, (αβγ)jkl can be deleted from the model equation. Because the term then
does not appear in the model, E(MSABC) = σ²_ε instead of σ²_ε + nσ²_αβγ.
The design provides two independent estimates of error variance: MSWCELL and MSABC.
When MSWCELL is based on a small number of degrees of freedom, it is often desirable to
combine MSABC with MSWCELL to form a pooled error term that has more degrees of
freedom. The resulting error term, called a residual error, is given by
with df = pqr(n − 1) + (p − 1)(q − 1)(r − 1). The expected values of mean squares for the revised
model are shown in Table 9.11-1.
Table 9.11-1 ▪ Expected Values of Mean Squares for Revised Model and F Statistics
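The pooled residual error term can be sketched numerically (the sums of squares below are made up; the degrees-of-freedom formula is the one given in the text):

```python
def residual_error(ss_wcell, ss_abc, p, q, r, n):
    """Pool SSWCELL with SSABC once MSABC is judged to estimate only error.

    df = pqr(n - 1) + (p - 1)(q - 1)(r - 1), as in the revised model.
    """
    df = p * q * r * (n - 1) + (p - 1) * (q - 1) * (r - 1)
    return (ss_wcell + ss_abc) / df, df

ms_res, df_res = residual_error(66.0, 22.0, p=3, q=3, r=2, n=2)
print(ms_res, df_res)  # (66 + 22)/22 = 4.0 with 22 df
```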
The next step in the preliminary test of the model is to determine whether the three two-treatment
interactions should be retained. According to Table 9.11-1, the F statistics have the
form
Assume that tests of two of the two-treatment interactions are insignificant at the .25 level, but the third
is significant. The final revised model equation under these conditions is
The expected values of mean squares for this revised model and F statistics are shown in Table
9.11-2. According to Table 9.11-2, the pooled error term for testing and is
Table 9.11-2 ▪ Expected Values of Mean Squares for Revised Model and F Statistics
One advantage of carrying out preliminary tests and pooling is readily apparent. If interaction
terms can be deleted from the model and pooled, the resulting error term has more degrees of
freedom. However, the potential disadvantages of carrying out preliminary tests and pooling,
such as pooling nonzero interaction terms with the error variance, may outweigh the
advantages if the unpooled error term is based on an adequate number of degrees of freedom
—say, 20 to 30.
A problem with preliminary tests and pooling is that subsequent tests are carried out as if no
preliminary tests had preceded them. The sampling distributions of statistics used in tests that
are preceded by preliminary tests differ from those that are not preceded by preliminary tests.
In general, researchers do not know the appropriate sampling distribution of the statistic at
each stage of preliminary testing of the model. The use of a sampling distribution that is not
appropriate for sequential decisions probably introduces a slight positive bias—that is, too often
rejecting a null hypothesis when it should not be rejected. Thus, pooling introduces
contingencies that are difficult to evaluate statistically. Adherence to the original model
eliminates this problem, although tests based on an incorrect model may be less powerful than
tests based on a revised model.
A researcher can take three positions with respect to conducting preliminary tests and pooling:
(1) never pool, (2) always pool, and (3) pool only when evidence indicates that the initial model
is incorrect. Both the first and third rules have much to recommend them. I favor the third. For
other statements on the issue of pooling, see N. H. Anderson (2001), Hines (1996), and
Montgomery (2009).
Researchers often conduct experiments with three or more treatments, each having many levels.
A CRF-354 design, which is not an unusually large experiment by current standards, has (3)(5)
(4) = 60 treatment combinations. To compute a within-cell error term with 2 subjects per cell, a
minimum of 120 subjects is required. However, a sample of this size may not be available or
considerations, such as time or cost, may preclude the use of such a large sample. Under
these conditions, several design alternatives are available to a researcher. One alternative that
is described here is to assign only one subject to each treatment combination. If this approach
is followed, there is no within-cell error term, and the highest-order interaction is used to
estimate error variance. The strategy permits a researcher to carry out a multitreatment
experiment with a minimum of subjects. Other design alternatives are described in Chapters 12,
15, and 16.
The first model includes a two-treatment interaction; the second does not. If the second model
provides an adequate description of the sources of variation in the experiment, then MSAB and
MSWCELL both estimate only error variance. For experiments in which n = 1, a mean square
within-cell error term cannot be computed. If MSAB estimates only error variance, it can be
used in place of MSWCELL. The basic question to be answered is the following: Which of the
two models is appropriate for the experiment?
Tukey's test for nonadditivity, described in Section 8.3, is helpful in choosing between the two
models. The model in which
is called an additive model. Computational procedures for Tukey's test for a randomized block
design are shown in Table 8.3-2. These procedures generalize with slight modification to
factorial experiments. A test of nonadditivity for a CRF-pq design requires the substitution in
Table 8.3-2 of an AB summary table in place of the AS summary table. The computational
formulas are
Tukey's test for nonadditivity in Section 8.3 can easily be modified for factorial experiments with
three or more treatments. For a CRF-pqr design, the AS summary table in Table 8.3-2 is
replaced by an ABC summary table. A djkl summary table must be substituted for the dij
summary table. The required computational formulas are
If FNONADD is insignificant at, say, α = .25, the nonadditive model is rejected in favor of the
additive model in which the interaction effects are assumed to equal zero.
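A sketch of the nonadditivity computation for a two-treatment design with one observation per cell, patterned on the Section 8.3 procedure (the function name and the tiny test layouts are mine, not the book's data):

```python
def tukey_nonadditivity(y):
    """Tukey-style 1-df nonadditivity partition for a p x q table, n = 1."""
    p, q = len(y), len(y[0])
    grand = sum(map(sum, y)) / (p * q)
    a = [sum(y[j]) / q - grand for j in range(p)]                 # row effects
    b = [sum(y[j][k] for j in range(p)) / p - grand for k in range(q)]
    num = sum(a[j] * b[k] * y[j][k] for j in range(p) for k in range(q))
    ss_nonadd = num ** 2 / (sum(x * x for x in a) * sum(x * x for x in b))
    ss_inter = sum((y[j][k] - grand - a[j] - b[k]) ** 2
                   for j in range(p) for k in range(q))
    # FNONADD = ss_nonadd / (ss_rem / df_rem) whenever ss_rem > 0
    ss_rem = ss_inter - ss_nonadd
    return ss_nonadd, ss_rem, (p - 1) * (q - 1) - 1

multiplicative = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]  # y_jk = (j+1)(k+1)
additive = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]        # y_jk = j + k + 1
print(tukey_nonadditivity(multiplicative))  # all interaction SS is nonadditivity
print(tukey_nonadditivity(additive))        # zero nonadditivity
```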
The classical analysis of variance model for these data contains 16 parameters: μ, α1, α2, α3, β1, β2, β3, (αβ)11, (αβ)12, …, (αβ)33.
However, only nine cell means are available to estimate these parameters.
Thus, the model is overparameterized: It contains more parameters than there are means from
which to estimate the parameters. I mentioned in Section 7.7 that statisticians have developed a
number of ways to get around this problem. Unfortunately, the solutions do not always work
well when there are empty cells—that is, cells with no observations.
The cell means model for a two-treatment completely randomized factorial design is
Yijk = μjk + εi(jk)
The model for the data in Table 9.13-1 contains nine parameters: μ11, μ12, …, μ33. Nine cell means are
available to estimate these parameters. Hence, the model is not overparameterized. A
population mean can be estimated for every cell that contains one or more observations. Unlike
the classical analysis of variance model, the cell means model does not impose a structure on
the analysis of data. Analysis procedures for the model can be used to test hypotheses about
any linear combination of cell means. It is up to the researcher to decide which tests are
meaningful or useful based on the original research hypotheses, the way the experiment was
conducted, and the data that are available. However, as I show in Section 9.14, if one or more
cells are empty, linear combinations of cell means must be chosen carefully because some
tests may be uninterpretable.
In this section, I show how to formulate coefficient matrices for testing a variety of null
hypotheses. The formulation of coefficient matrices for experiments with missing observations
and empty cells is discussed in Section 9.14.
The null hypotheses for treatments A and B for the police attitude experiment described in
Section 9.3 can be expressed as
The coefficient matrix for testing the treatment A null hypothesis is obtained from the Kronecker
product of a contrast matrix for treatment A and (1/q)1′, where 1′ denotes the sum vector and q is the number of levels of treatment B. It is assumed
that the vector of cell means is ordered as follows:
Notice the order of the cell means: Each level of treatment B appears with a1, followed by each
level with a2 and then a3. In the Kronecker product, treatment B—the treatment whose levels
are ordered first—postmultiplies treatment A. The order of the terms in the Kronecker product is
the reverse of the order of the cell means.
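The Kronecker-product construction can be sketched directly (the helper names are mine; the 3 × 3 design and the particular contrast matrices are illustrative):

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

c_a = [[1, -1, 0], [0, 1, -1]]      # contrasts among the p = 3 levels of A
c_b = [[1, -1, 0], [0, 1, -1]]      # contrasts among the q = 3 levels of B
mean_b = [[1 / 3, 1 / 3, 1 / 3]]    # (1/q) times the sum vector for B

# Treatment B comes last in the product because the cell means are ordered
# with B varying fastest (mu11, mu12, mu13, mu21, ...).
c_for_a = kron(c_a, mean_b)         # tests the treatment A null hypothesis
c_for_ab = kron(c_a, c_b)           # tests the AB interaction null hypothesis
print(c_for_a[0])
print(len(c_for_ab), len(c_for_ab[0]))  # 4 rows, 9 columns
```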
It is evident from an inspection of the matrix that the coefficients in the first row define
μ1. − μ2.; the coefficients in the second row define μ2. − μ3. The fractions can be avoided by
deleting the scalar 1/q as follows:
Note that because of the way the vector of cell means is ordered, the sum vector premultiplies the
contrast matrix in the Kronecker product.
The coefficient matrix for testing the AB interaction null hypothesis is obtained from the
Kronecker product as follows:
Each row of defines one of the sets of crossed lines in Figure 9.13-1. The cell means
model analysis of the police attitude data is illustrated in Table 9.13-2. The results of the
analysis in Table 9.13-2 are identical to those in Table 9.3-2, where the classical sum-of-
squares approach was used.
Figure 9.13-1 ▪ An AB interaction term of the form μjk − μj′k − μjk′ + μj′k′ is obtained from
the crossed lines by subtracting the two μjks connected by a dashed line from the two connected by a solid line.
Table 9.13-2 ▪ Computational Procedures for CRF-33 Design Using a Cell Means Model
Simple main-effects sums of squares for A at bk are easily obtained using the cell means
model. The required coefficient matrices are given by
The coefficient vector selects the kth level of treatment B for computing simple main
effects. If the first element of the vector is a 1 and the other elements are zeros, the first level of B is
selected; if the second element is a 1 and the other elements are zeros, the second level of B is
selected; and so on. The simple main effects sums of squares are
Each of these sums of squares has two degrees of freedom—the number of rows of the
associated C′ matrix. These sums of squares are identical to those in Table 9.6-1, where the
classical sum-of-squares approach was used.
A coefficient matrix for computing a simple main-effects sum of squares for B at a1 is given by
The coefficient matrices for SSB at a2 and SSB at a3 are obtained from and
, respectively. This computational procedure is very versatile. For example, a
simple main-effects sum of squares for treatment B that is based on treatment levels a1 and a2
is obtained by replacing with .
The coefficient matrix for computing the treatment A × contrast 1(B) interaction is given by
with two degrees of freedom, the number of rows of the matrix. According to row 4 of Table 9.6-5,
the null hypothesis for all j cannot be rejected. The value of contrast 1(B) at the
first level of treatment A, , is given by
The values of contrast at the second and third levels of treatment A are obtained by
replacing with and , respectively.
The coefficient matrix for computing the treatment B × contrast 1(A) interaction is given by
According to row 7 of Table 9.6-5, the null hypothesis for all k cannot be rejected.
The value of contrast 1(A), at the first level of treatment B, is given by
The values of contrast at the second and third levels of treatment B are obtained by
replacing with and , respectively. The values of the
three contrasts, and are consistent with the nonsignificant
test of H0: βψ1(A)k = 0 for all k.
The formulation of coefficient matrices for testing hypotheses about trends presents no new
problems. A trend is defined by a coefficient vector—for example,
,
, and so on. The coefficients in these vectors are obtained
from Appendix Table E.10. The coefficient matrices for computing the linear trend for treatments
A and B are given by, respectively,
and
I obtained the same sum-of-squares value in Section 9.7, where I used the classical sum-of-
squares approach.
In Table 9.7-4, sums of squares for the interaction of treatment A with the linear and quadratic
contrasts for treatment B were computed. For the cell means model, the coefficient matrices for
computing these sums of squares are given by
The versatility of the cell means model approach has been illustrated in this section. The
approach can be used for all of the sums of squares computed using the classical sum-of-
squares approach. An advantage of the cell means model approach is that the form of the
sum-of-squares formula is the same regardless of the hypothesis tested—namely,
SS = (C′μ̂)′[C′(X′X)⁻¹C]⁻¹(C′μ̂)
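A sketch of the quadratic-form sum of squares SS = (C′μ̂)′[C′(X′X)⁻¹C]⁻¹(C′μ̂), specialized to a two-row contrast matrix so that only a 2 × 2 inverse is needed (the cell means and n below are hypothetical; with equal cell ns the result should match the classical SSA):

```python
def ss_two_row_contrast(c, mu, n):
    """c: 2 x pq contrast matrix, mu: pq cell means, n: common cell n."""
    h = [sum(ci * m for ci, m in zip(row, mu)) for row in c]
    # M = C'(X'X)^-1 C, with (X'X)^-1 = diag(1/n) for equal cell ns
    m = [[sum(ci * cj for ci, cj in zip(r1, r2)) / n for r2 in c] for r1 in c]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    inv = [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]
    return sum(h[r] * inv[r][s] * h[s] for r in range(2) for s in range(2))

q, n = 3, 5
mu = [1, 2, 3, 4, 5, 6, 7, 8, 9]            # hypothetical 3 x 3 cell means
c_a = [[1 / 3] * 3 + [-1 / 3] * 3 + [0] * 3,   # mu1. - mu2.
       [0] * 3 + [1 / 3] * 3 + [-1 / 3] * 3]   # mu2. - mu3.
ssa = ss_two_row_contrast(c_a, mu, n)
row_means = [2, 5, 8]
classical = n * q * sum((m - 5) ** 2 for m in row_means)  # grand mean = 5
print(ssa, classical)  # the two values agree
```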
*9.14 Analysis of Completely Randomized Factorial Designs with Missing Observations and Empty Cells
It is usually desirable to have equal ns in each cell of a completely randomized factorial design;
however, this is not always possible. Two cases are examined: one in which the cell ns are
unequal but each cell contains at least one observation and a second in which one or more
cells are empty. The classical sum-of-squares approach described in Section 9.3 requires equal
cell ns or proportional cell ns.6 In the real world, the required proportionality rarely occurs. If
cell ns are unequal or if one or more cells are empty, the procedures described next can be
used.
When cell ns are unequal, a researcher has a choice between computing unweighted means or
weighted means. Unweighted means are simple averages of cell means. These means were
described in the previous section. Weighted means are weighted averages of cell means in
which the weights are the sample njks. The difference between the two kinds of means is
illustrated in Table 9.14-1, where observation Y511 for the police attitude data is missing.
Unweighted and weighted means for treatments A and B are computed as follows:
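The two kinds of means can be sketched numerically (the cell means and njks below are hypothetical, with n11 = 4 to mimic the missing Y511):

```python
def unweighted_mean(cell_means):
    """Simple average of the cell means."""
    return sum(cell_means) / len(cell_means)

def weighted_mean(cell_means, ns):
    """Average of the cell means weighted by the sample njks."""
    return sum(n * m for n, m in zip(cell_means, ns)) / sum(ns)

means_a1 = [24.0, 30.0, 33.0]   # hypothetical cell means at a1b1, a1b2, a1b3
ns_a1 = [4, 5, 5]               # n11 = 4 because one observation is missing
print(unweighted_mean(means_a1))          # 29.0
print(weighted_mean(means_a1, ns_a1))     # pulled toward the larger cells
```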
The value of weighted means is affected by the sample njks. Hence, the means are data
dependent, which is usually undesirable. However, weighted means may be preferred when
the sample njks are proportional to the population njks. Null hypotheses for treatments A and B
can be expressed as follows:
The coefficient matrices and procedures for testing the unweighted-means null hypotheses are
the same as those in Table 9.13-2. Of course, the number of rows in the y vector and X matrix
must correspond to the number of observations. For the data in Table 9.14-1, where Y511 is
missing, the y vector and X matrix contain only N = 44 rows. The coefficient matrix, , for
testing the unweighted-means null hypothesis for treatment A is
A test of contrasts of sums of cell means gives the same result as a test of the following
contrasts of treatment means:
The coefficient matrix for testing the weighted-means null hypothesis for treatment A is
obtained by replacing the nonzero coefficients in with cjk = ± njk/nj. as follows:
The coefficient matrix, C1(B), for the unweighted-means null hypothesis for treatment B is
A test of contrasts of sums of cell means gives the same result as a test of the following
contrasts of treatment means:
The coefficient matrix for testing the weighted-means null hypothesis for treatment B is
obtained by replacing the nonzero coefficients in with cjk = ± njk/n.k as follows:
There seems to be a consensus among statisticians that unweighted means should be used to
compute the sum of squares for the AB interaction. Thus,
In this section, I computed two SSAs; their values are 188.09 and 187.51. The two SSAs
provide tests, respectively, of two different null hypotheses:
The merits of these and other null hypotheses have been the subject of an extended debate.
Other things being equal, researchers should test hypotheses that involve unweighted means
because such hypotheses are not influenced by the cell ns.
Before turning to the empty-cell case, I briefly mention three approximate procedures that are
sometimes used when cell ns are unequal. For a fuller discussion of these and other
procedures, see Milliken and Johnson (2009), Searle (1997), and Shaw and Mitchell-Olds
(1993). One procedure involves making the cell ns equal by estimating the missing
observations and using the estimates in place of the missing observations. This procedure is
appropriate if most of the cell ns are equal and all of the interaction effects are equal to zero.
The procedure is often used for randomized block and Latin square designs, but it is not
recommended if the interaction effects are not equal to zero.
Another procedure is to randomly set aside data to reduce all cell ns to the same size. Then the
classical sum-of-squares formulas can be used. Questions concerning this method immediately
come to mind. For example, How much data should one be willing to set aside? Unfortunately,
a definitive answer is not available.
The third procedure, called the unweighted-means analysis, consists of first computing cell
means for each treatment combination. The cell means are then subjected to a classical sum-
of-squares analysis. The final step in the analysis is to multiply the treatment and interaction
sums of squares by the harmonic mean of the sample njks—that is, by ñ = pq/(1/n11 + 1/n12 +
…+ 1/npq). Unfortunately, the resulting mean squares do not have χ2 distributions, and hence
their ratios do not provide F statistics for testing null hypotheses. At a time when exact
procedures such as the cell means model are available, it is difficult to defend the use of these
approximate procedures.
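For reference, the harmonic mean ñ used by this approximate analysis is easy to compute (the njks below are hypothetical, with one cell reduced by a missing observation):

```python
def harmonic_mean_n(ns):
    """Harmonic mean of the cell ns: pq / (1/n11 + 1/n12 + ... + 1/npq)."""
    return len(ns) / sum(1 / n for n in ns)

cell_ns = [4, 5, 5, 5, 5, 5, 5, 5, 5]      # pq = 9 cells, n11 = 4
print(round(harmonic_mean_n(cell_ns), 4))  # 9/1.85, about 4.8649
```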
The cell means model can be used to test hypotheses about any linear combination of
population means that can be estimated from data. When one or more cells are empty, the
challenge facing a researcher is to formulate interesting and interpretable null hypotheses
using the means that are available. Consider the data for a CRF-33 design in Figure 9.14-1,
where two cells are empty. The experiment was designed to test, along with others, the
following null hypothesis for treatment A:
Figure 9.14-1 ▪ CRF-33 design with two empty cells: a1b3 and a2b2. It is assumed that the
loss of data is not related to the nature of treatments A and B and the AB interaction.
Unfortunately, this hypothesis is untestable because μ13 and μ22 cannot be estimated. The
hypothesis
is testable because data are available to estimate each of the population means. However, the
hypothesis is uninterpretable because different levels of treatment B appear in each row of
Figure 9.14-1: (b1 and b2) versus (b1 and b3) in the first row and (b1 and b3) versus (b1, b2,
and b3) in the second row. The following hypothesis is both testable and interpretable.
For a hypothesis to be interpretable, the estimators of population means for each contrast in
the hypothesis should share the same levels of the other treatment(s). For example, to estimate
μ1. = ½ (μ11 + μ12) and μ3. = ½ (μ31 + μ32), it is necessary to average over b1 and b2 and
ignore b3. The null hypothesis for treatment A can be expressed in matrix notation as
, where
For the data in Table 9.14-2 where Y511 is missing and two of the cells are empty, the sum of
squares for testing hypothesis (9.14-1) is
Table 9.14-2 ▪ Cell Means for the Data in Table 9.3-1; Cells a1b3 and a2b2 Are Empty and
These hypotheses involve simple main effects. Recall from Section 9.6 that tests of simple main
effects involve a main effect plus an interaction effect. Thus, if the null hypothesis is rejected, it
means that the treatment effect(s) is significant, the interaction effect(s) is significant, or both. If
this is what a researcher wants to know, then a test of simple main effects is interpretable;
however, the results of the test are ambiguous.
Testable and interpretable hypotheses for treatment B and the AB interaction are, respectively,
and
If there were no empty cells, the null hypothesis for the AB interaction would involve pq − p − q +
1 = 9 − 3 − 3 + 1 = 4 interaction components. However, because of the empty cells, only two of
the interaction components can be tested; these components are shown in Figure 9.14-2. If the
null hypothesis for the AB interaction is rejected, we can conclude that at least one function of
the form μjk − μj′k − μjk′ + μj′k′ does not equal zero. However, failure to reject the null
hypothesis does not imply that all functions of the form μjk − μj′k − μjk′ + μj′k′ equal zero
because we are unable to test two of the interaction components.
When a number of cells are empty, a researcher may discover that more useful information can
be salvaged from an experiment by partitioning the data into interpretable subsets than by
examining the data as a whole. Consider Figure 9.14-3. For the data in part (a), several
hypotheses are testable. However, the partitioned data in part (b) are easier to analyze and
interpret because there are no empty cells. Clearly, the analysis of experiments with empty
cells cannot be done in mechanical fashion; the researcher must search for the useful
information.
Figure 9.14-3 ▪ (a) CRF-43 design with three empty cells. (b) The data in part (a) are
partitioned into two subsets. A researcher may find each subset easier to analyze and
interpret than the unpartitioned data in part (a).
When there are empty cells, it is not always possible to formulate interpretable null hypotheses
that use most of the data. The configuration of means in Figure 9.14-4 is one example. In some
instances, the missing cell means can be estimated if a researcher is willing to assume that all
interaction effects are equal to zero.7 Once the missing cell means have been estimated, the
analysis proceeds as usual. Given that a researcher is rarely in a position to make the
gratuitous assumption that all interaction effects are equal to zero, nothing more is said about
this approach. The interested reader can consult Searle (1987). It should be apparent from the
foregoing that a researcher should make every effort to avoid empty cells.
Figure 9.14-4 ▪ For this CRF-43 design, no interpretable null hypotheses use a majority of
the data.
1.All subjects are used in simultaneously evaluating the effects of two or more treatments.
The effects of each treatment are evaluated with the same precision as if the entire
experiment had been devoted to that treatment alone. Thus, factorial experiments permit
the efficient use of resources.
2.They enable a researcher to evaluate interaction effects.
1.If numerous treatments are included in the experiment, the number of subjects required
may be prohibitive.
2.A factorial design lacks simplicity in the interpretation of results if interaction effects are
present. Unfortunately, interactions among variables are common in the behavioral
sciences, medical sciences, and education.
3.The use of a factorial design commits a researcher to a relatively large experiment. Small
exploratory experiments may indicate much more promising lines of investigation than those
originally envisioned. Relatively small experiments permit greater freedom in the pursuit of
serendipity.
4.Factorial experiments are generally less efficient in determining optimum levels of
treatments or treatment combinations than are a series of smaller experiments, each based
on the results of the preceding experiments.
1.Terms to remember
a.factorial design (9.1)
b.completely crossed treatments (9.2)
c. treatment combination (9.2)
d.simple main effects (9.6)
e.simultaneous test procedure (9.6)
f. simple-effects contrasts (9.6)
g.treatment-contrast interaction (9.6)
h.contrast-contrast interaction (9.6)
i. quasi F statistic (9.10)
j. nonadditive model (9.12)
k. additive model (9.12)
*a.[9.3] Perform an exploratory data analysis. Assume that the change scores for each
treatment combination are listed in the order in which they were obtained.
*b.[9.3] Test the following hypotheses: H0: μ1. = μ2., H0: μ.1 = μ.2, and H0: μ11 − μ21 −
μ12 + μ22 = 0; let α = .05.
*c.[9.3] Graph the AB interaction; interpret the graphs.
*d.[9.3] Calculate the power of the tests of treatment A and the AB interaction.
*e.[9.3] (i) Determine the value of n necessary to achieve a power of approximately .85 for
treatment A and (ii) the value of n necessary to detect a large association, ω2 = 0.138,
for A; let 1 − β = .80.
*f.[9.8] Calculate and .
g.Prepare a “results and discussion section” appropriate for the Journal of Experimental
Social Psychology.
6.It was hypothesized that persons who are less physically attractive believe that they have
less control over reinforcements in their lives than do those who are more attractive. To test
this hypothesis, 36 male college students were shown one of six photographs and asked to
fill out the Rotter I-E scale the way they thought the person in the photograph would.
Individuals who score at the internal end of the I-E continuum, low scores, perceive events
as contingent on their behavior. Those at the external end, high scores, perceive events as
the result of luck, chance, or powerful others. Thirty-six head and shoulder photographs
were obtained from college yearbooks. Half of the photographs were of men and half were of
women. This variable is treatment A. Prior to the experiment, the photographs were
assigned to one of three physical attractiveness categories: high, moderate, and low—
treatment B. The 36 subjects were randomly assigned to six subsamples containing 6
subjects each. The subsamples of subjects were randomly assigned to the six sets of
photographs. The following data for the Rotter I-E scale were obtained. (Experiment
a.[9.3] Perform an exploratory data analysis. Assume that the ratings for each treatment
combination are listed in the order in which they were obtained.
b.[9.3] Test the following hypotheses: H0: μ1. = μ2., H0: μ.1 = μ.2 = μ.3, and H0: μjk − μj′k
− μjk′ + μj′k′ = 0 for all j and k; let α = .05.
c. [9.3] (i) Graph the interaction between treatments A and B. (ii) Are the graphs consistent
with the test of the AB interaction?
d.[9.3] Calculate the power of the tests of treatments A and B.
e.[9.3] (i) Determine the value of n necessary to achieve a power of approximately .85 for
treatment A and (ii) the value of n necessary to detect a large treatment effect, f = .40,
for A and B; let 1 − β = .80.
f. [9.5] Use Tukey's statistic to determine which population means differ for treatment B.
g.[9.8] Calculate and .
h.Prepare a “results and discussion section” appropriate for the journal Perceptual and
Motor Skills.
*7.The experiment described in Exercise 6 was repeated using a random sample of 36 women
college students as raters. The following data were obtained. (Experiment suggested by
Miller, A. G. Social perception of internal-external control. Perceptual and Motor Skills.)
*a.[9.3] Test the following hypotheses: H0: μ1. = μ2., H0: μ.1 = μ.2 = μ.3, and H0: μjk − μj′k
− μjk′ + μj′k′ = 0 for all j and k; let α = .05.
*b.[9.3 and 9.6] (i) Graph the AB interaction. (ii) Test the following hypotheses: H0: αψ1(B)
= δ for all j, H0: αψ2(B) = δ for all j, and H0: αψ3(B) = δ for all j, where ψ1(B) = μ.1 −
μ.2, ψ2(B) = μ.1 − μ.3, and ψ3(B) = μ.2 − μ.3. Use ( ) / v3 as the critical value
for each test.
*c.[9.3] Calculate the power of the tests of treatment B and the AB interaction.
*d.[9.3] (i) Determine the value of n necessary to achieve a power of approximately .85 for
the AB interaction and (ii) the value of n necessary to detect a large treatment effect, f =
.40, for treatment B; let 1 − β = .80.
*e.[9.8] Calculate and .
f. Prepare a “results and discussion section” appropriate for the journal Perceptual and
Motor Skills.
a.[9.3] Perform an exploratory data analysis. Assume that the ratings for each treatment
combination are listed in the order in which they were obtained.
b.[9.3] Test the hypotheses: H0: μ1. = μ2., H0: μ.1 = μ.2 = μ.3 = μ.4, and H0: μjk − μj′k −
μjk′ + μj′k′ = 0 for all j and k; let α = .05.
c. [9.3 and 9.6] (i) Graph the AB interaction. (ii) Test the following hypotheses: H0: αψ1(B)
= δ for all j, H0: αψ2(B) = δ for all j, and H0: αψ3(B) = δ for all j, where ψ1(B) = μ.1 −
μ.2, ψ2(B) = μ.1 − μ.3, and ψ3(B) = μ.2 − μ.3. Use as the critical value for
each test.
d.[9.3] Calculate the power of the tests of treatment B and the AB interaction.
e.[9.8] Calculate and .
f. Prepare a “results and discussion section” appropriate for the Journal of Counseling
Psychology.
*9.Show that
*10.
[9.4] Derive the computational formulas for SSA and SSWCELL from the deviation formulas
in equation (9.4-4).
11.[9.6] If H0: αj at b1 = 0 for all j is rejected, the interpretation is ambiguous; explain.
12.
[9.6] The critical value for the simultaneous test procedure is . Identify v1, v2,
and v3.
*13.
[9.6] When is the critical value for Scheffé's procedure equal to that for the simultaneous
test procedure?
14.
[9.6] Sometimes a meaningful interpretation can be made of a main-effects contrast—say,
μj. − μj′.—even though the associated treatment interacts with another treatment; explain.
*15.
[9.6] When does
*16.
[9.9] Use the rules in Section 9.9 to derive the expected values of mean squares for the
following model:
where εi(jk) is , A is fixed, and B is random. Denote the mean squares by MSA,
MSB w. A, and MSWCELL. MSB w. A is read as mean square B within A and is described
in Chapter 11.
*17.
[9.10] Assume that the expected values of the mean squares for a design are as follows:
18.
[9.10] Assume that the expected values of mean squares for a design are as follows:
*19.
[9.10 and 9.11] Assume that the following data have been obtained for a CRF-322 design
(model II) with njkl equal to 2:
*a.Perform preliminary tests on the model and pool where appropriate; write out the
ANOVA table showing the values of mean squares and degrees of freedom.
*b.Indicate the expected values of the mean squares for the final model.
*c.Construct an F′ statistic for testing treatment C.
*20.
[9.13] Exercise 5 describes an experiment in which subjects rated a confederate in terms of
likability, intelligence, and personal adjustment.
21.
[9.14] Exercise 6 describes an experiment in which male subjects were shown photographs
and asked to fill out the Rotter I-E scale the way they thought the person in the photograph
would.
a.Assume that observation Y211 = 10 is missing. Use the cell means model approach to
test the null hypotheses for this CRF-23 design. (This problem requires the inversion of
small matrices. The computations, while tedious, can be done without the aid of a
computer.)
b.Assume that observation Y211 = 10 is missing and that cell a1b2 is empty. Use the cell
means model approach to test the null hypotheses. (This problem involves the inversion
of small matrices. The computations, while tedious, can be done without the aid of a
computer.)
c. How does the empty cell a1b2 affect the interpretation of the AB interaction?
22.
[9.14] Exercise 8 describes an experiment that involved four approaches to resolving
personal anger conflicts.
a.Assume that observation Y113 = 4 is missing. Use the cell means model approach to
test the null hypotheses. (This problem requires the inversion of small matrices. The
computations, while tedious, can be done without the aid of a computer.)
b.Assume that observation Y113 = 4 is missing and that cell a2b2 is empty. Use the cell
means model approach to test the hypotheses. (This problem requires the inversion of
small matrices. The computations, while tedious, can be done without the aid of a
computer.)
c. How does the empty cell a2b2 affect the interpretation of the AB interaction?
3In Gabriel's (1964, 1969) descriptions of the procedure, a sum of squares for the hypothesis
4Boik (1975, 1979) provides an excellent survey of the various types of interactions that can be
5The merits of behaving as if, for all practical purposes, the null hypothesis is true when the
probability associated with the test statistic is large (p > .25) have been the subject of much
debate. The practice is fairly common in some areas, such as performing preliminary tests on
the appropriateness of a particular model (see Section 9.11).
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
6Cell ns for a CRF-pq design are proportional if the number of observations in each of the jk
treatment combinations satisfies njk = nj.n.k / N, where N = n11 + n12 + … + npq.
7Weeks and Williams (1964) have developed a criterion, called geometric connectedness, for
determining whether all of the missing cell means can be estimated when the interaction
effects are assumed to equal zero.
http://dx.doi.org/10.4135/9781483384733.n9
Completely Randomized Factorial Design with Three or More Treatments and Randomized Block Factorial Design
This chapter describes the layout and analysis of a completely randomized factorial design with
three or more treatments. Also discussed is a factorial design that is constructed from two or
more randomized block designs.
The classical sum-of-squares analysis procedures described in Section 9.3 as well as those for
the cell means model in Sections 9.13 and 9.14 can be easily extended to experiments that
have three or more treatments. I begin with a three-treatment completely randomized factorial
design. A diagram of a CRF-222 design is shown in Figure 10.1-1. The experimental design
model is
Figure 10.1-1 ▪ Block diagram of a CRF-222 design. Subjects are randomly assigned to
the eight treatment combinations: a1b1c1, a1b1c2, …, a2b2c2. Each subject is observed
The interactions in a design can be determined by writing down all combinations of treatment
letters while preserving the alphabetical order of the letters. For example, the letters for
treatments A, B, and C can be combined as follows: AB, AC, BC, and ABC. Interactions that
involve two letters are called two-treatment interactions, first-order interactions, or double
interactions. If three letters are involved, the interaction is a three-treatment interaction, second-
order or triple interaction, and so on.
A completely randomized factorial design with four treatments is denoted by the letters CRF-
pqrt.
The number of interactions of each kind is given by t! / [l!(t − l)!], where t is the number of treatments in the design, and l is the number of letters in the
interaction. For example, a four-treatment design has six two-treatment interactions:
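The count of interactions of each order follows directly from the binomial coefficient; a minimal Python sketch (the function name is mine):

```python
from math import comb

# Number of interactions that involve l of the t treatments in a
# completely randomized factorial design: t! / (l!(t - l)!).
def n_interactions(t: int, l: int) -> int:
    return comb(t, l)

# A four-treatment design (CRF-pqrt) has:
print(n_interactions(4, 2))  # 6 two-treatment interactions
print(n_interactions(4, 3))  # 4 three-treatment interactions
print(n_interactions(4, 4))  # 1 four-treatment interaction
```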
I now illustrate the computational procedures for a CRF-222 design. For purposes of
illustration, assume that a researcher is interested in the effects of the magnitude of
reinforcement (treatment A), social deprivation (treatment B), and gender of the adult who
administers reinforcement (treatment C: c1 denotes a man, and c2 denotes a woman) on
children's choice behavior. The task performed by the children consisted of taking marbles from
a bin, one at a time, and inserting them into one of two distinctively marked holes in the top of a
box. The hole preferred by each child was determined during a preliminary familiarization
session. The dependent variable is the number of marbles inserted in the nonpreferred hole
during the last 10 minutes of a 15-minute experimental session.
Treatment A
Treatment B
Treatment C
AB interaction
AC interaction
BC interaction
ABC interaction
Assume that 32 five-year-old children are available to participate in the experiment. The children
are randomly assigned to the 2 × 2 × 2 = 8 treatment combinations with the restriction that four
children are assigned to each combination.
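The restricted random assignment described above can be sketched as follows (the subject IDs and the seed are illustrative, not from the text):

```python
import random

# Random assignment for the CRF-222 example: 32 children, 8 treatment
# combinations, with the restriction that n = 4 children per combination.
combinations = [(a, b, c) for a in (1, 2) for b in (1, 2) for c in (1, 2)]
subjects = list(range(1, 33))

rng = random.Random(1)   # fixed seed so the assignment is reproducible
rng.shuffle(subjects)

# Hand out the shuffled subjects four at a time, one group per combination.
assignment = {combo: subjects[i * 4:(i + 1) * 4]
              for i, combo in enumerate(combinations)}

for combo, group in assignment.items():
    print(combo, group)
```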
The layout for this CRF-222 design and computational procedures are shown in Table 10.2-1.
The analysis of variance is summarized in Table 10.2-2. According to Table 10.2-2, all two-
treatment interactions are significant. The researcher could use graphs to gain a better
understanding of the way the variables interact. If any of the treatments had contained more
than two levels, tests of hypotheses for treatment-contrast and contrast-contrast interactions
could be used to understand the sources of nonadditivity in the data. These procedures are
described in Section 10.4.
Procedures for testing differences among means for a CRF-pq design were described in
Section 9.5. These procedures generalize to completely randomized factorial designs with more
than two treatments. For example, t and q test statistics for making comparisons among
treatment A means for a CRF-pqr design assuming a fixed-effects model have the following
form:
with v = pqr(n − 1). The only difference between these formulas and those for a CRF-pq design
is the use of nqr instead of nq in the denominators. The change is necessary because the
treatment A means, for example, for a CRF-pqr design are computed from nqr observations.
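As a sketch, the q statistic for comparing two treatment A means in a CRF-pqr design can be computed as follows; the numeric values are invented, and the nqr term in the denominator is the point of the example:

```python
import math

# Studentized range statistic for two treatment A means in a CRF-pqr
# design: q = (Ybar_j - Ybar_j') / sqrt(MSWCELL / (nqr)).
def q_statistic(mean_j, mean_jp, ms_wcell, n, q_levels, r_levels):
    # Each treatment A mean is based on n*q*r observations, hence nqr
    # (not nq, as in a CRF-pq design) in the denominator.
    return (mean_j - mean_jp) / math.sqrt(ms_wcell / (n * q_levels * r_levels))

print(q_statistic(6.0, 4.0, 2.4, n=4, q_levels=2, r_levels=2))
```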
Simple-Effects Tests
If a researcher chooses to test hypotheses about simple main effects, simple simple main
effects, and so on, the following provide useful computational checks:
Recall from the discussion in Section 9.6 that the appropriate way to follow up significant
interactions is to test hypotheses for treatment-contrast and contrast-contrast interactions.
Procedures for estimating strength of association, effect size, power, and sample size for a
CRF-pq design were described in Section 9.8. These procedures generalize to completely
randomized factorial designs with more than two treatments. For example, partial omega
squared for treatment A for a CRF-pqr design is given by
where
or
A sample estimate of Cohen's effect size measure can be computed from omega squared.
For example, the formula for computing for treatment A is
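A numerical sketch of these estimates, using the standard equal-n forms ω̂² = (p − 1)(F − 1) / [(p − 1)(F − 1) + npqr] and f̂ = √(ω̂² / (1 − ω̂²)); the F value and design sizes below are invented:

```python
import math

# Partial omega squared for treatment A from its F statistic in a
# CRF-pqr design, and Cohen's f-hat derived from it.
def partial_omega_sq(F, p, q, r, n):
    num = (p - 1) * (F - 1)
    return num / (num + n * p * q * r)

def cohens_f(omega_sq):
    return math.sqrt(omega_sq / (1 - omega_sq))

w2 = partial_omega_sq(F=8.5, p=2, q=2, r=2, n=4)
print(round(w2, 3), round(cohens_f(w2), 3))
```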
As discussed in Section 9.8, these formulas also can be used to estimate sample size if a
researcher is able to specify the (1) level of significance, α; (2) power, 1 – β; (3) size of the
population variance, ; and (4) sum of the squared population treatment effects, for example,
. If a researcher isn't able to estimate and from a pilot study or previous
research, it may be possible to express the difference between the largest and smallest
population means that the researcher wants to detect as a multiple, d, of the population
standard deviation. This procedure was described in Section 4.5. The procedure involves using
Tang's charts in Appendix Table E.12 and trial and error to determine the required sample n. An
even simpler procedure for estimating sample size for treatments A, B, and C is to specify the
number of levels of the treatments, α, 1 – β, and f. This approach to estimating sample size
was illustrated in Section 9.8.
The reader may feel that a bewildering array of sum-of-squares formulas is used in analysis of
variance. At first glance, this appears to be true, but actually only four kinds of formulas are
used. These can be classified as (1) between sum of squares, (2) within sum of squares, (3)
interaction sum of squares, and (4) total sum of squares. Consider the formulas for CR-p, RB-p,
CRF-pq, and CRF-pqr designs. The between sum-of-squares formulas have the following form:
A mean square is obtained by dividing a sum of squares by its degrees of freedom. For a CR-p
design, for example, MSA = ([A] – [Y])/(p − 1).
The third kind of formula, the one for computing an interaction sum of squares, has the
following form:
CR-p design: no interaction sum of squares

RB-p design: one interaction (A × blocks) that is also called a residual:
[AS] – [A] – [S] + [Y]

CRF-pq design: one two-treatment interaction:
[AB] – [A] – [B] + [Y]

CRF-pqr design: three two-treatment interactions:
[AB] – [A] – [B] + [Y]
[AC] – [A] – [C] + [Y]
[BC] – [B] – [C] + [Y]
one three-treatment interaction:
[ABC] – [AB] – [AC] – [BC] + [A] + [B] + [C] – [Y]

CRF-pqrt design: six two-treatment interactions, four three-treatment interactions, and one four-treatment interaction:
[ABCD] − [ABC] − [ABD] − [ACD] − [BCD] + [AB] + [AC] + [AD] + [BC] + [BD] + [CD] − [A] − [B] − [C] − [D] + [Y]
If t is an even number, the sign of [Y] is positive; if t is an odd number, the sign is negative. This
is evident because the sign changes for each group of terms. In some designs, such as a
randomized block design, the interaction sum of squares is referred to as a residual sum of
squares. The residual designation is used whenever an interaction sum of squares is used to
estimate error variance. Recall that for a randomized block design, it is not possible to compute
a within-groups or within-cell mean square estimate of error variance.
The final kind of formula used in analysis of variance is the total sum-of-squares formula. This
formula has the form illustrated by the following examples:
The patterns that underlie the between, within, interaction, and total sums of squares should
be apparent. Complex designs described in subsequent chapters should appear less complex
when the reader realizes that all designs involve at most only four kinds of formulas and hence
only four kinds of variation.
The meaning of the terms in the abbreviated sum-of-squares formulas is relatively easy to
discern. Terms such as [A], [AB], and [ABS] in a CRF-pq design are given by
If the correction term [Y] is subtracted from [A], [B], [AB], or [ABS], a sum of squares is
obtained. Thus,
The last two formulas require a word of explanation. First, consider [AB] – [Y]. The term [AB]
can be thought of as including variation associated with treatments A and B and the AB
interaction. The term [AB] includes the AB interaction because treatments A and B are crossed
in a CRF-pq design. To obtain the AB interaction sum-of-squares formula, one must subtract
SSA = [A] – [Y] and SSB = [B] – [Y] from [AB] – [Y] as follows:
The reader should recognize SSAB = [AB] – [A] – [B] + [Y] as the formula for the AB
interaction.
Consider next the term [ABS] in [ABS] – [Y]. The [ABS] term includes variation due to A, B, S
(subjects), and the AB interaction—all the sources of variation in the CRF-pq design. The
reader may wonder why [ABS] does not also include variation due to the AS, BS, and ABS
interactions. The answer is that only treatments A and B are crossed; subjects (S) are nested
within the pq treatment combinations. Interactions do not exist for effects that are nested. I have
more to say about this in Chapter 11.
Within sum-of-squares formulas are used to compute variation due to subjects (S), also called
SSWCELL variation. Variation due to subjects is obtained from, say,
When [A] is subtracted from [AS] or when [AB] is subtracted from [ABS], the letter S remains.
Thus, these formulas reflect variation due to subjects.
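The four kinds of bracket terms and the sums of squares built from them can be checked numerically; the sketch below uses an invented equal-n CRF-22 data set:

```python
import numpy as np

# Bracket terms for a small equal-n CRF-pq layout. Y[j, k, i] holds the
# i-th observation in cell (a_j, b_k); the numbers are invented.
Y = np.array([[[3, 5], [4, 6]],
              [[7, 9], [8, 10]]], dtype=float)   # p = q = n = 2
p, q, n = Y.shape

Y_ = Y.sum() ** 2 / (n * p * q)                  # [Y], the correction term
A_ = (Y.sum(axis=(1, 2)) ** 2).sum() / (n * q)   # [A]
B_ = (Y.sum(axis=(0, 2)) ** 2).sum() / (n * p)   # [B]
AB_ = (Y.sum(axis=2) ** 2).sum() / n             # [AB]
ABS_ = (Y ** 2).sum()                            # [ABS]

SSA = A_ - Y_                     # between sum of squares for A
SSB = B_ - Y_                     # between sum of squares for B
SSAB = AB_ - A_ - B_ + Y_         # interaction sum of squares
SSWCELL = ABS_ - AB_              # within sum of squares
SSTO = ABS_ - Y_                  # total sum of squares
print(SSA, SSB, SSAB, SSWCELL, SSTO)
```

Note that SSA + SSB + SSAB + SSWCELL reproduces SSTO, as the partition requires.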
In Sections 9.13 and 9.14, I showed how to formulate coefficient matrices for the cell means
model when the experiment contains two treatments. The procedures generalize to experiments
with three or more treatments. Consider the CRF-232 design in Table 10.4-1. Because several
observations are missing, the classical sum-of-squares approach described in Section 10.2
cannot be used. Instead, the cell means model approach should be used. In the following
sections, I show how to use the Kronecker product to formulate coefficient matrices for a variety
of null hypotheses.
The null hypotheses for treatments A, B, and C for the data in Table 10.4-1 are, respectively,
The coefficient matrices— and —for testing these null hypotheses are obtained
from the Kronecker product, ⊗, as follows:
where , and denote sum vectors for treatments A, B, and C, respectively. Assume that
the vector of h = pqr = (2)(3)(2) = 12 cell means is ordered as follows:
where
The vector contains the cell means shown in Table 10.4-1. The columns of X represent the
12 treatment combinations in Table 10.4-1. A 1 or zero in a row indicates, respectively, the
presence or absence of an observation in a treatment combination. For example, the 1s in
column 1 and rows 1 and 2 indicate that there are two observations in treatment combination
a1b1c1, and so on. The degrees of freedom for each sum of squares are equal to the number
of rows in its coefficient matrix. The within-cell sum of squares is given by
The p values for treatments A, B, and C are, respectively, .001, .219, and .171.
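Assuming the cell means are ordered with the last subscript changing fastest and that simple pairwise contrasts are used (the contrast_rows helper is my shorthand, not Kirk's notation), the Kronecker-product construction can be sketched with NumPy:

```python
import numpy as np

# Coefficient matrices for a CRF-232 design (p = 2, q = 3, r = 2), with
# cell means ordered mu_111, mu_112, mu_121, ..., mu_232.
p, q, r = 2, 3, 2

def contrast_rows(levels):
    # (levels - 1) x levels matrix of pairwise contrasts:
    # level 1 vs. 2, level 1 vs. 3, ...
    C = np.zeros((levels - 1, levels))
    C[:, 0] = 1
    C[np.arange(levels - 1), np.arange(1, levels)] = -1
    return C

one_p = np.ones((1, p)) / p   # averaging vectors: sum vector divided by
one_q = np.ones((1, q)) / q   # the number of levels averaged over
one_r = np.ones((1, r)) / r

CA = np.kron(np.kron(contrast_rows(p), one_q), one_r)    # H0 for treatment A
CB = np.kron(np.kron(one_p, contrast_rows(q)), one_r)    # H0 for treatment B
CC = np.kron(np.kron(one_p, one_q), contrast_rows(r))    # H0 for treatment C
CAB = np.kron(np.kron(contrast_rows(p), contrast_rows(q)), one_r)

# Degrees of freedom equal the number of rows in each coefficient matrix.
print(CA.shape[0], CB.shape[0], CC.shape[0], CAB.shape[0])  # 1 2 1 2
```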
The preceding discussion focused on testing null hypotheses for unweighted means. If a
researcher wanted to test null hypotheses for weighted treatment means, the ±1 coefficients in
would be replaced with ±njkl/nj.., the ±1 coefficients in would be replaced with
±njkl/n.k., and the ±1 coefficients in would be replaced with ±njkl/n..l. For example, the
weighted-means coefficient matrix for treatment A is
The coefficient matrices for testing the AB, AC, BC, and ABC interaction null hypotheses are
The null hypothesis for a three-treatment interaction has the general form
This three-treatment interaction term looks a little forbidding, but the underlying pattern is quite
simple. A three-treatment interaction is equal to zero if the interaction between any two
treatments is the same at all levels of the third treatment. Consider the design in Figure 10.4-1.
Figure 10.4-1 ▪ Two-treatment interaction terms of the form μjkl − μj′kl − μjk′l + μj′k′l are
obtained from the crossed lines by subtracting the two cell means connected by a dashed line
from the two connected by a solid line (for example, μ111 − μ211 − μ121 + μ221). A three-
treatment interaction term is the difference between two such two-treatment terms (for
example, μ111 − μ211 − μ121 + μ221 minus μ112 − μ212 − μ122 + μ222).
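This zero condition can be verified numerically. In the sketch below, the 2 × 2 × 2 table of cell means is invented so that the AB interaction term is nonzero but identical at both levels of treatment C, which makes the ABC component zero:

```python
import numpy as np

# An ABC interaction component is zero when the AB interaction term
# mu_jkl - mu_j'kl - mu_jk'l + mu_j'k'l is the same at every level of C.
mu = np.array([[[10., 12.], [14., 16.]],
               [[11., 13.], [19., 21.]]])   # invented cell means, mu[j, k, l]

def ab_term(l):
    # AB interaction term (j = 1, j' = 2, k = 1, k' = 2) at level c_l
    return mu[0, 0, l] - mu[1, 0, l] - mu[0, 1, l] + mu[1, 1, l]

abc_component = ab_term(0) - ab_term(1)
print(ab_term(0), ab_term(1), abc_component)   # AB term is 4.0 at both levels
```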
Let us return to the CRF-232 design in Table 10.4-1 where observations Y3111, Y2112, Y3112,
and Y3231 are missing. When there are missing observations, a researcher also has a choice
among several null hypotheses that can be tested for the two-treatment interactions. The
hypotheses described earlier are usually the ones of most interest. However, a researcher
might want to test hypotheses involving weighted means where the weights are the cell njkls.1
The weighted-means null hypothesis for the AB interaction, for example, is
Coefficient Matrices for Simple Main Effects, Simple Interaction Effects, and Simple Simple Main Effects
The simple main-effects sum of squares for treatment A at b1, for example, is easily obtained
using the cell means model. The coefficient matrix for testing H0: αj at b1 = 0 for all j is given by
The coefficient vector selects the k th level of treatment B for computing simple main
effects. For example, selects the second level of treatment B, selects
The cell means model can be used to test hypotheses about simple interaction effects. For
example, the coefficient matrix for testing H0: (αβ)jk at c1 = 0 for all jk is given by
The cell means model also can be used to test hypotheses about simple simple main effects.
For example, the coefficient matrix for testing H0: α at b1c1 = 0 for all j is given by
In Sections 9.6 and 9.14, I described how to follow up a significant two-treatment interaction by
testing hypotheses about (1) treatment-contrast interactions, H0: αψi(B) = δ for all j and H0:
βψi(A) = δ for all k, and (2) contrast-contrast interactions, H0: ψi(A)ψi(B) = 0. If the AB
interaction in, say, a CRF-333 design is significant, a researcher might want to test H0: αψ1(B)
= δ for all j, where contrast ψ1(B) = (1)μ.1. + (–1)μ.2. + (0)μ.3. represents an interesting
contrast among the levels of treatment B. This null hypothesis states that the difference
between the first and second levels of treatment B is the same at all three levels of treatment A.
If this null hypothesis is rejected, a researcher might want to test contrast-contrast interactions
to determine whether the value of the treatment B contrast is the same for some interesting
treatment A contrasts—say, ψ1(A) = (1)μ1.. + (0)μ2.. + (–1)μ3.. and ψ2(A) = (0)μ1.. + (1)μ2.. +
(–1)μ3… The coefficient matrices for computing these treatment-contrast and contrast-contrast
interactions for the CRF-333 design are, respectively,
If the ABC interaction in the CRF-333 design is significant, a researcher might be interested in
testing hypotheses about (1) treatment-treatment-contrast interactions:
and (3) contrast-contrast-contrast interactions, H0: ψi(A)ψi(B)ψi(C) = 0. The coefficient matrices
for these interactions are given by the following Kronecker products:
The CRF-232 design in Table 10.4-1 has unequal sample ns. Because of this, the orthogonal
trend coefficients in Appendix Table E.10 cannot be used. Recall that the coefficients in
Appendix Table E.10 are appropriate for quantitative treatments whose levels have equal ns
and are separated by equal intervals. Appendix C describes procedures for deriving trend
coefficients for experiments in which one or both of these conditions are not satisfied. The
formulation of coefficient matrices for testing hypotheses about trends for a CRF-pq design was
described in Section 9.13. The reader should have no difficulty extending the procedures to
designs with three or more treatments.
A randomized block factorial design uses the blocking procedure described in connection with
a RB-p design to isolate variation attributable to a nuisance variable, differences among the
blocks, while simultaneously evaluating two or more treatments and associated interactions. An
experiment with two treatments is designated as an RBF-pq design. A diagram of an RBF-23
design is shown in Figure 10.5-1.
Figure 10.5-1 ▪ Block diagram of an RBF-23 design. The blocks represent n sets of pq
matched subjects or n subjects who receive all pq treatment combinations. In the former
case, the subjects in a block are randomly assigned to the treatment combinations. In the
latter case, the order in which the treatment combinations are presented is randomized
independently for each block.
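For the repeated measures case, the independent randomization of presentation order within each block can be sketched as follows (the block count and seed are illustrative):

```python
import random

# Independent randomization of treatment-combination order for each
# block in an RBF-23 design (repeated measures case).
p, q, n_blocks = 2, 3, 4
combos = [(a, b) for a in range(1, p + 1) for b in range(1, q + 1)]

rng = random.Random(7)
# rng.sample draws a fresh random permutation of the pq combinations
# for every block, so the orders are randomized independently.
orders = {block: rng.sample(combos, len(combos))
          for block in range(1, n_blocks + 1)}

for block, order in orders.items():
    print(f"block {block}:", order)
```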
The blocks in an RBF-pq design can consist of n random samples of pq matched experimental
units or a random sample of n experimental units who receive all pq treatment combinations. If
matched subjects are used, the subjects within a block are randomly assigned to the pq
treatment combinations, with the restriction that each subject receives only one combination. If
repeated measures are used, the order of presentation of the treatment combinations is
randomized independently for each experimental unit. The merits of using matched subjects or
repeated measures, discussed in Section 8.1, apply to randomized block factorial designs. If
the block is large, other designs may be more appropriate. For example, it may not be possible
to observe each subject under all pq combinations or to secure sets of pq matched subjects.
Under these conditions, a split-plot factorial design, described in Chapter 12, may be used.
where
(α × β)jk is the joint effect of treatment levels j and k and is subject to the restrictions
and
(αβ × π)jki is the joint effect of treatment combination jk and block i; (αβ × π)jki is
and independent of πi.
εijk is the error effect that is independent of πi. In this design, εijk
cannot be estimated separately from (αβ × π)jki.
Model (10.5-1) is called an additive model because all interactions involving blocks are
assumed to equal zero. A nonadditive model is examined later. Model (10.5-1), in which the
treatments are fixed but blocks are random (mixed model), is probably used most frequently for
an RBF-pq design. A residual mean square (MSRES), actually MSAB × BL, is used to estimate
error variance because a within-cell mean square is not available. The use of MSRES involves
somewhat restrictive assumptions. Before discussing these assumptions, I illustrate the layout
and computational procedures for an RBF-33 design.
In Section 9.3, a completely randomized factorial design was used to evaluate the effects of the
type of beat to which police officers are assigned during a human relations course and the
length of the course on attitudes toward minority groups. For purposes of illustration, I modify
the experiment slightly. Suppose that the officers' pretest attitude scores were used to form five
blocks such that the nine officers with the highest scores were assigned to block 1, the nine
with the next highest scores to block 2, and so on. Within each block, the officers were
randomly assigned to the nine treatment combinations. This design enables a researcher to
test the following null hypotheses for the blocks, treatments A and B, and the A × B interaction,
respectively:
The design has an important advantage over the completely randomized factorial design
described in Section 9.3. It isolates the nuisance variable of differences in initial attitudes
toward minority groups—the block variable.
The layout of the RBF-33 design and computational procedures are shown in Table 10.6-1.
The analysis of variance is summarized in Table 10.6-2. Notice that the A × B interaction is
significant. The procedures for testing treatment-contrast and contrast-contrast interactions,
discussed in Sections 9.6 and 10.4, can be used to gain a better understanding of the way the
two variables interact.
A randomized block factorial design is more powerful than a completely randomized factorial
design if the block effects in the former design are appreciably greater than zero. The reason
for this is obvious from an inspection of Figure 10.6-1, where the two designs are compared in
terms of the way the total sum of squares is partitioned. In choosing between the two designs,
one should consider not only their relative power but also the cost of forming blocks of matched
subjects or obtaining repeated measures on the subjects versus the simpler procedure of
randomly assigning subjects to treatment combinations.
Figure 10.6-1 ▪ Schematic partition of the total sum of squares and degrees of freedom
for CRF-33 and RBF-33 designs. The rectangles with the thicker lines identify the sums of
squares that are used to compute an estimate of error variance.
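The partition in Figure 10.6-1 can be sketched numerically. The data below are simulated for illustration, not the Table 10.6-1 scores, and the layout (rows are blocks, columns are the pq treatment combinations with B varying fastest) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 5, 3, 3                      # blocks, levels of A, levels of B
Y = rng.normal(size=(n, p * q))        # one observation per block-treatment cell

grand = Y.mean()
SSTO = ((Y - grand) ** 2).sum()
SSBL = p * q * ((Y.mean(axis=1) - grand) ** 2).sum()   # between blocks

cell = Y.mean(axis=0).reshape(p, q)    # means of the pq treatment combinations
SSA = n * q * ((cell.mean(axis=1) - grand) ** 2).sum()
SSB = n * p * ((cell.mean(axis=0) - grand) ** 2).sum()
SSTREAT = n * ((cell - grand) ** 2).sum()
SSAB = SSTREAT - SSA - SSB
SSRES = SSTO - SSBL - SSTREAT          # the AB x BL interaction (residual)

print(abs(SSTO - (SSBL + SSA + SSB + SSAB + SSRES)) < 1e-8)  # True
```

The residual term is what remains after blocks and the omnibus treatment variation are removed, which is why MSRES serves as the error term when no within-cell mean square exists.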
Measures of strength of association can be computed for the blocks, treatments A and B, and
the A × B interaction in Table 10.6-2. For example, partial omega squared for treatment A is
given by
where
The expected values of the mean squares are obtained from Table 10.7-2.
Partial omega squareds also can be computed from F statistics. For example, the formula for
treatment A for the data in Table 10.6-2 is
Cohen's effect size measure, , can be computed from partial omega squared. For example,
the effect size for treatment A is
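The two conversions just illustrated can be sketched in Python. The formula for partial omega squared from an F statistic is assumed to have the form used elsewhere in this book, effect degrees of freedom times (F − 1) divided by that same quantity plus the total number of observations N = npq; f follows Cohen's definition:

```python
import math

def partial_omega_squared(F, df_effect, N):
    """Partial omega squared from an observed F statistic.

    F: the F for the effect; df_effect: its degrees of freedom
    (e.g., p - 1 for treatment A); N: total observations (npq)."""
    num = df_effect * (F - 1.0)
    return num / (num + N)

def cohens_f(omega_sq):
    """Cohen's effect size f computed from partial omega squared."""
    return math.sqrt(omega_sq / (1.0 - omega_sq))
```

For example, `cohens_f(0.5)` returns 1.0, well beyond Cohen's f = .40 threshold for a large effect.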
The use of Tang's tables in Appendix E.12 for computing power and estimating sample size is
discussed in Sections 4.5, 8.2, and 9.8.
The simplest approach to estimating sample size is to specify (1) the number of levels of the
treatments; (2) level of significance, α; (3) power, 1 – β; (4) either the strength of association,
ω2, or the effect size, f, that is of interest; and (5) the population correlation among the pq
treatment combinations, ρ. For example, suppose that a researcher wants to use a significance
level of α = .05 and a power of 1 – β = .80. Suppose also that the researcher wants to detect a
large effect, f = .400, and thinks that the population correlation among the treatment
combinations is close to .40. The required number of blocks for treatment A, for example, can
be determined from Appendix Table E.13 for v1 = p − 1 = 3 − 1 = 2. For treatment A in a
randomized block factorial design, f* in Appendix Table E.13 is defined as f* = f/√(1 − ρ). The
value f* = .400/√(1 − .40) = .516 is between the columns headed by f* = .500 and f* = .655 in
the row labeled 1 – β = .80. By interpolation, the required number of blocks is 5. This example
assumes that the researcher
knows the population correlation coefficient. If, as is usually the case, the researcher doesn't
know the value of ρ, then Appendix Table E.13 can be used to estimate n for a range of values
of ρ—an optimistic value and a pessimistic value. This procedure enables a researcher to
narrow the choice for the number of blocks.
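The adjustment can be sketched as follows, assuming the definition f* = f/√(1 − ρ) for treatment A in a randomized block factorial design. Scanning an optimistic and a pessimistic value of ρ brackets the table entry:

```python
import math

def f_star(f, rho):
    """Adjusted effect size for entering the sample-size table,
    assuming the definition f* = f / sqrt(1 - rho)."""
    return f / math.sqrt(1.0 - rho)

# Optimistic and pessimistic guesses for the unknown correlation rho
for rho in (0.30, 0.40, 0.50):
    print(rho, round(f_star(0.400, rho), 3))
```

Larger assumed correlations yield larger f* and hence fewer required blocks, which is why bracketing ρ narrows the choice of n.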
Tang's charts in Appendix Table E.12 also can be used when a researcher specifies f. The
charts are entered with φ, v1, and v2, where v1 = p − 1 for treatment A and v1 = q − 1
for treatment B, v2 = (n′ − 1)(pq − 1), and n′ is a trial value of n.
The expected values of the mean squares for the additive model and the nonadditive model,
which is described in this section, can be determined from Table 10.7-1. The sampling fractions
1 – n/N, 1 – p/P, and 1 – q/Q in the table become zero if the corresponding terms, BL, A, and
B, represent fixed effects; they become 1 if the corresponding terms represent random effects.
Also, if BL, A, and B represent fixed effects, the corresponding variances, σ²π, σ²α, and σ²β,
should be replaced with Σπ²i/(n − 1), Σα²j/(p − 1), and Σβ²k/(q − 1), respectively. If both A and B
represent fixed effects, σ²αβ also should be replaced with ΣΣ(αβ)²jk/[(p − 1)(q − 1)]. The
expected values of the mean squares for the RBF-33 design in Table 10.6-2 are shown in Table
10.7-2.
If all interaction effects that involve blocks are equal to zero, the model equation for the RBF-pq
design is said to be additive; otherwise, the model is nonadditive. For the latter case, model
equation (10.5-1) should be amended as follows:
For the data in Table 10.6-1, the four error mean squares appear to be relatively homogeneous.
Two F statistics for testing treatment and interaction null hypotheses for an RBF-pq design
have been recommended in the literature. For convenience, I denote them by F× and F+. I
show next that a choice between F× and F+ is based on the tenability of various sphericity
conditions.
Sphericity Conditions
The assumptions underlying F tests in an RBF-pq design have been examined in detail by
Rouanet and Lépine (1970). In summarizing their findings, I assume that the reader is familiar
with the sphericity condition introduced in Section 8.4. The hypothesis that μ.11 = μ.12 = μ.13 =
… = μ.pq is referred to as the omnibus null hypothesis. If the omnibus null hypothesis is true,
all hypotheses embedded in the omnibus hypothesis also are true. This means that μ.1. = μ.2.
= … = μ.p. for treatment A, μ..1 = μ..2 = … = μ..q for treatment B, and μ.jk − μ.j′k − μ.jk′ + μ.j′k′ =
0 for all j and k for the A × B interaction. The latter hypotheses are referred to as local null
hypotheses. The statistic for testing the omnibus null hypothesis is F+ = MSTREAT/MSRES.
The necessary and sufficient condition for this F+ statistic to be distributed as F with v1 = pq −
1 and v2 = (n − 1)(pq − 1) degrees of freedom, given that the omnibus null hypothesis is true, is
the omnibus sphericity condition:
where C*′ is an orthonormal matrix representing the omnibus null hypothesis that all μ.jks are
equal and Σ is the population covariance matrix. If this omnibus sphericity condition is satisfied,
the sphericity condition also is satisfied for all local null hypotheses.3 This means that
are all distributed as F when the associated null hypotheses are true.
If omnibus sphericity is not satisfied, it may still be true that one or more of the following local
sphericity conditions are satisfied:
where C*′ is the orthonormal coefficient matrix for testing the local null hypothesis. When a
local sphericity condition is satisfied, the corresponding F× statistics—F× = MSA/MSA × BL, F× =
MSB/MSB × BL, and F× = MSA × B/MSA × B × BL—are distributed as F. Tests using F+ require
the strongest assumption, omnibus sphericity, but have more degrees of freedom than those
using F×, and consequently they are more powerful. Thus, F+ is the preferred test statistic
when C*′ΣC* satisfies the omnibus sphericity condition; F× can be used when C*′ΣC* satisfies
local sphericity. If C*′ΣC* does not satisfy the local sphericity condition, the sequential testing
strategy in Section 8.4 can be used.
A number of statistics have been proposed for testing the hypothesis that C*′ΣC* satisfies the
sphericity condition. Cornell et al. (1992) recommended the locally best invariant test (S. John,
1971, 1972; Nagao, 1973; Sugiura, 1972). The test was illustrated for an RB-p design in Table
8.4-1. The generalization to an RBF-pq design is straightforward. The test statistic is
If V* is greater than or equal to the critical value, , in Appendix Table E.18, a researcher
can conclude that the omnibus sphericity condition is not tenable. The V* statistic can be used
to test for local sphericity by using the appropriate orthonormal coefficient matrix, C*′. The
computation of the V* statistic is not illustrated for the police attitude example because the data
have too few blocks relative to the number of treatment combinations.
A restricted cell means model can be used to represent data for a randomized block factorial
design with and without missing observations. The restricted cell means model for a two-
treatment randomized block factorial design is
where εijk is NID(0, σ²ε) and μijk is subject to the following restrictions: (1) For the test of blocks,
all AB × BL population effects equal zero; (2) for the test of treatment A, all A × BL population
effects equal zero; (3) for the test of treatment B, all B × BL population effects equal zero; and
(4) for the test of the A × B interaction, all A × B × BL population effects equal zero.
The analysis procedures for the restricted cell means model are illustrated for the police
attitude data in Table 10.6-1. These data are for an RBF-33 design with five blocks. For the
purpose of forming coefficient matrices, C′, the null hypotheses for blocks, treatment A, and
treatment B can be expressed in matrix notation as follows:
where n = 5, p = 3, and q = 3. The omnibus treatment null hypothesis, H0: μ.11 = μ.12 = μ.13 =
μ.21 = … = μ.33, can be expressed as
The coefficient matrices, C′s, for computing sums of squares for the design in Table 10.6-1 are
given by the following Kronecker products:
The hypothesis states that the sum of the 15 population cell means for a1 is equal to the sum
for a2, and the sum of the 15 population cell means for a2 is equal to the sum for a3. In the
next section, I describe a design in which an observation is missing. When observations are
missing, the null hypothesis for treatment A, for example, must be expressed as contrasts of
cell means instead of contrasts of sums of cell means: μ111− μ121 = 0, μ211 – μ221 = 0, μ311
− μ321 = 0, …, μ523 − μ533 = 0.
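The Kronecker-product construction of coefficient matrices can be sketched with `numpy.kron`. The particular contrast matrix and cell ordering below (block varying slowest, B fastest, blocks and the other treatment averaged out) are illustrative assumptions, not the matrices of the text:

```python
import numpy as np

n, p, q = 5, 3, 3  # blocks, levels of A, levels of B

# An illustrative (non-normalized) contrast matrix for a 3-level factor
C3 = np.array([[1.0, -1.0, 0.0],
               [0.0, 1.0, -1.0]])
avg = lambda m: np.ones((1, m)) / m   # averaging row vector

# Hypothetical C' for treatment A: average over blocks and over B,
# contrast the levels of A (cell order: block slowest, B fastest)
C_A = np.kron(avg(n), np.kron(C3, avg(q)))
print(C_A.shape)  # (2, 45)
```

Each of the p − 1 rows sums to zero across the npq cell means, so C′μ = 0 expresses the equality of the treatment A marginal means.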
The sums-of-squares formulas and values for the data in Table 10.6-1 are
where .
The Yijks are given in Table 10.6-1. As discussed in Section 6.2, the J matrix can be obtained
by computing the product of h × 1 and 1 × h sum vectors: J = 11′. The sums of squares just
presented are identical to those in Table 10.6-1, which were obtained using the classical sum-
of-squares approach.
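Under the cell means model with one observation per block-treatment cell, each sum of squares is a quadratic form in the contrast values. A minimal sketch, assuming the balanced no-missing-data case where SS = (C′y)′(C′C)⁻¹(C′y):

```python
import numpy as np

def hypothesis_ss(C, y):
    """Sum of squares for the hypothesis C'mu = 0, computed as the
    quadratic form (C'y)'(C'C)^{-1}(C'y); C holds the contrast rows."""
    Cy = C @ y
    return float(Cy @ np.linalg.solve(C @ C.T, Cy))

# Toy check: with three cells and a full set of contrasts, the
# quadratic form reproduces the between-cells sum of squares
C = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
y = np.array([1.0, 2.0, 3.0])
print(hypothesis_ss(C, y))  # 2.0
```

The value equals Σ(yᵢ − ȳ)² = (1 − 2)² + (2 − 2)² + (3 − 2)² = 2, which is why the cell means and classical approaches agree when no observations are missing.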
The use of MSRES to estimate error variance is appropriate if the omnibus sphericity condition
is tenable. If the condition is not tenable but local sphericity is tenable, a researcher should
compute F× statistics as described in Section 10.7. The coefficient matrices for computing SSA
× BL, SSB × BL, and SSA × B × BL are, respectively,
Multiple comparison tests are performed with treatment means computed from restricted cell
means as specified by the restrictions on the model. Restricted cell means for treatment A are
given by
These cell means are used to obtain treatment means. For example, is given by
Multitreatment designs with a blocking variable are more difficult to analyze when one or more
observations are missing. As I discuss in Section 8.8, the null hypotheses must be specified as
contrasts of population cell means instead of contrasts of sums of population cell means. An
RBF-32 design with one missing observation, Y122, is used to illustrate the procedures. The
data are shown in Table 10.8-1. To simplify the presentation, the design has only two blocks.
Ordinarily, the design should have many more blocks.
In the following discussion, I first show the cell means null hypothesis for the case in which
there is no missing observation and then modify the hypothesis to take into account the missing
observation. The following hypothesis, H′, and identity, I, matrices are used to compute the
Each row of defines a population contrast of cell means. The number of cell means in
μ1(BL) is npq = (2)(3)(2) = 12. Vertical and horizontal lines in the matrix and vectors identify (1)
the missing cell mean, μ122, in μ1(BL); (2) coefficients for the cell means contrast, μ122 − μ222
in that cannot be estimated; and (3) the corresponding 0 in 01(BL). Because Y122 is
missing, column 7 and row 4 of and 0 in must be deleted. Also, μ122 in μ1(BL)
must be deleted. The modified null hypothesis is
Notice that the hypothesis does not include the μ122 − μ222 contrast.
When a source of variation in ANOVA has three or more levels, null hypotheses can be
expressed in several ways, as I show next for the omnibus treatment null hypothesis. Two
equivalent null hypotheses for contrasts of cell means are as follows:
where
and Because Y122 is missing, two contrasts cannot be estimated: μ121 − μ122
and μ122 − μ131 in rows 5 and 7, respectively. When column 7 and rows 5 and 7 of are
deleted, the modified matrix has only eight rows—one too few. This problem can be
solved by first deleting column 7 of and then adding rows 5 and 7 of the modified matrix.
The sum of rows 5 and 7 defines the interesting and interpretable contrast μ121 − μ131. The
original row 5 of is replaced with the sum of rows 5 and 7. The original row 7 is deleted to
obtain a 9 × 11 coefficient matrix that provides a test of the following null hypothesis:
The null hypothesis for treatment A can be written using contrasts of cell means as follows:
When there are no missing observations, a test of contrasts of cell means in hypothesis (10.8-
3) gives the same result as a test of contrasts of sums of cell means in hypothesis (10.8-4):
However, when one or more observations are missing, tests of the two hypotheses give different
results. In this case, contrasts of cell means should be tested.
The null hypothesis for the A × B interaction can be written using cell means as follows:
The coefficient matrix for computing the residual sum of squares is given by .
The coefficient matrix is
The interaction term μ122 − μ222 − μ132 + μ232, defined by row 4 of , cannot be
estimated. The modified coefficient matrix is
The interaction term μ121 − μ221 − μ122 + μ222 − μ131 + μ231 + μ132 − μ232, defined by row
2 of , cannot be estimated. The modified coefficient matrix is
The sums-of-squares formulas and the sums-of-squares values for the RBF-32 design are
As a starting point for determining which sums of squares to subtract, notice that the letter A in
treatment A appears in the three sums of squares: SSA × B, SSA × BL, and SSA × B × BL.
These sums of squares are subtracted from . Consider next the
formula
It can be shown that the rows of are linearly dependent on . Hence, also
represents the sources of variation in . The letters A and B of the A × B interaction
also appear in the designation for the SSA × B × BL interaction, so SSA × B × BL is subtracted
from .
The pattern just illustrated applies to all of the sums-of-squares formulas. However, the pattern
is not obvious for the omnibus treatment sum of squares, SSTREAT. The rows of are
linearly dependent on C2(T). An alternative label for SSTREAT is SSAB. An alternative label for
SSRES is SSAB × BL. In words, SSRES is the interaction of the omnibus treatment, AB, with
blocks, BL. The letters A and B of SSAB also appear in the designation for the SSAB × BL
interaction, so it is subtracted from .
Table 10.8-2 ▪ ANOVA Table for RBF-32 Design With Missing Observation
Multiple comparison tests are performed with treatment means computed from restricted cell
means as specified by the restrictions on the model. Restricted cell means for treatment A are
These cell means are used to obtain treatment means. For example, is given by
The analysis procedures for a randomized block factorial design with missing observations are
complex. The cell means model enables a researcher to test hypotheses for any linear
combination of population cell means that can be estimated from sample data. However, the
researcher must decide which, if any, hypotheses are both interesting and interpretable.
10.9 Minimizing Time and Location Effects by Using a Randomized Block Factorial Design
A useful application of an RBF-pq design may not occur to the reader. Situations arise in
behavioral, medical, and educational research in which all the pq treatment combinations can
be administered to only one set of pq subjects at a time. At a later time or possibly at a different
location, another set of subjects receives the same treatment combinations. In this type of
experiment, the subjects in a set are matched in the sense that they receive the treatment
combinations at the same time or at the same place. Differences among the blocks (sets) of
subjects represent a time or place variable in addition to any other conditions that are not
constant over the different treatment administrations. The use of a randomized block factorial
design permits the isolation of these nuisance variables. An alternative analysis procedure,
which is generally less desirable, is to pool the subjects who receive the same treatment
combinations and compute a within-cell error term. The data are analyzed as a completely
randomized factorial design. The disadvantage of this procedure is that any variation due to
nuisance variables inflates the estimate of the error variance. The two designs require the same
number of subjects, but the randomized block factorial design permits the isolation of nuisance
variables and is more congruent with the actual conduct of the experiment.
1.Terms to remember:
a.residual sum of squares (10.3)
b.correction term (10.3)
c. omnibus null hypothesis (10.7)
d.local null hypothesis (10.7)
e.omnibus sphericity (10.7)
f. local sphericity (10.7)
*2.[10.1] List the treatments and interactions that can be tested in the following designs:
*a.CRF-233 design
*b.CRF-3422 design
c. CRF-2232 design
d.CRF-34322 design
*3.[10.1] In a CRF-354334 design, how many interactions involve the following number of
treatments?
*a.Two
*b.Three
c. Four
d.Five
e.Six
*4.[10.2] An experiment was performed to determine the relative effectiveness of two
procedures for teaching an intermediate-level biology laboratory: the conventional method
in which students worked in pairs and performed each experiment by following a laboratory
manual (treatment level a1) and a second method in which the instructor performed the
experiment while the students in pairs watched and recorded the results (treatment level
a2). The duration and number of laboratory meetings per week also were investigated:
Condition b1 was three 50-minute sessions per week; b2 was two 75-minute sessions. The
laboratories were scheduled with one day between laboratories (level c1) or on consecutive
days (level c2). Students were randomly assigned to the eight treatment combinations with
the restriction that each combination contained 10 students. The dependent variable was
the mean performance of the members of a lab pair on the intermediate biology laboratory
final examination. The following data were obtained:
*a.Use the classical sum-of-squares approach to test the null hypothesis for treatments
and interactions; let α = .05.
*b.Graph the interactions that are significant in part (a); interpret the interactions.
c. Prepare a “results and discussion” section for the Journal of Educational Psychology.
*5.[10.3] For a CRF-343225 design, write the computational formulas using abbreviated
notation for the following sums of squares:
*a.SSABCD
*b.SSABCDE
c. SSBCDE
d.SSBCDEF
*6.[10.4]
*a.Write the cell means model for the CRF-222 design in Exercise 4.
*b.Use Kronecker products to formulate C′ matrices for testing the null hypotheses for
unweighted means; let .
*c.Use the cell means model approach to compute sums of squares and test the null
hypotheses.
7.[10.4] For the data in Exercise 4, assume that observation Y5211 is missing. Use the cell
means model approach to compute sums of squares and test the null hypotheses for
unweighted means.
8.[10.4] For a CRF-232 design, use Kronecker products to formulate C′ matrices for testing
the null hypotheses for unweighted means; let μ′ = [μ111 μ112 μ121 μ122 μ131 μ132 μ211
μ212 μ221 μ222 μ231 μ232].
9.[10.4] For a CRF-233 design, use Kronecker products to formulate C′ matrices for testing
the null hypotheses for unweighted means; let μ′ = [μ111 μ112 μ113 μ121 μ122 μ123 μ131
μ132 μ133 μ211 μ212 μ213 … μ233].
*10.
[10.6] Suppose that the following data have been obtained for an RBF-23 design. Assume
that the treatments are fixed effects and the blocks are random effects.
*a.[10.6] Use the classical sum-of-squares approach to test the null hypotheses; let α =
.05.
*b.Use the Fisher-Hayter statistic to test all two-mean contrasts among the treatment B
means. (Note: )
*11.
[10.8]
*a.For the data in Exercise 10, use Kronecker products to formulate C′ matrices for testing
the null hypotheses.
*b.Use the cell means model approach to compute sums of squares and test the null
hypotheses; let α = .05.
*c.Compute restricted cell means for treatments A and B and the A × B interaction.
*12.
[10.8]
*a.For the data in Exercise 10, assume that observation Y212 is missing. Formulate C′
matrices for testing the null hypotheses.
*b.Use the cell means model approach to compute sums of squares and test the null
hypotheses; let α = .05.
*c.Compute restricted cell means for treatments A and B and the A × B interaction.
13.
[10.8]
a.For the data in Exercise 10, assume that observation Y121 is missing. Formulate C′
matrices for testing the null hypotheses.
b.Use the cell means model approach to compute sums of squares and test the null
hypotheses; let α = .05.
c. Compute restricted cell means for treatments A and B and the A × B interaction.
14.
The effect of vigorous physical exercise on the performance of a line-matching task was
investigated. Male college students were shown four vertical lines projected on a screen
directly in front of them at eye level while they exercised on a motor-driven treadmill. On the
left of the screen was a single line; on the right were three lines numbered 1, 2, and 3. The
subjects were instructed to call out as quickly as they could the number of the line on the
right that corresponded in length to the one on the left. Treatment A was the speed of the
treadmill: a1 = 2.5 mph, a2 = 3.4 mph, and a3 = 4.3 mph. Treatment B was the grade of the
treadmill: b1 = 12% grade and b2 = 18% grade. The order in which the treatment
combinations were administered to the subjects was randomized independently for each
subject. The dependent variable was the number of correct line matches during the last 3
minutes of each session. Assume that treatments represent fixed effects and blocks
represent random effects. The following data were obtained:
a.[10.6] Use the classical sum-of-squares approach to test the null hypotheses for this
RBF-32 design; let α = .05. Use MSRES as the error term.
b.Compute . Was the use of the blocking variable effective?
Explain.
c. [10.7] Would the use of F× instead of F+ have led to different conclusions?
d.Use the Fisher-Hayter statistic to determine which population means differ for treatment
A. (Note: )
e.Prepare a “results and discussion section” for Perceptual and Motor Skills.
15.
[10.8]
a.For the data in Exercise 14, formulate C′ matrices for testing the null hypotheses.
b.Use the cell means model approach to compute sums of squares and test the null
hypotheses; let α = .05.
16.
[10.7] In the context of an RBF-33 design, use null hypotheses to distinguish omnibus
sphericity from local sphericity.
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
2It is customary to denote the interaction of, say, treatments A and B by either AB or A × B. I
depart from this convention for a randomized block factorial design and some other designs.
For a randomized block factorial design, I denote the combination of treatments A and B taken
together by AB. I denote the interaction of the treatments by A × B. Similar interpretations apply
to αβ and α × β.
*The latter portion of this section assumes a familiarity with the matrix operations in Appendix D
and the material in Section 8.4.
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
3These matrix operations are easily performed with programs such as SAS/IML. The author's
Web page provides the SAS/IML code.
4See also the explanation in Section 8.8 for a randomized block design.
http://dx.doi.org/10.4135/9781483384733.n10
Hierarchical Designs
In a hierarchical design, the levels of at least one treatment are nested in those of another
treatment, and the remaining treatments are completely crossed. If each level of, say, treatment
B appears with only one level of treatment A, B is said to be nested in A. This is denoted by
B(A) and read “B within A.” The distinction between nested and crossed treatments is
illustrated in Figure 11.1-1.
Figure 11.1-1 ▪ Comparison of designs with nested and crossed treatments. In part (a),
treatment B(A) is nested in treatment A because b1 and b2 appear only with a1, whereas
b3 and b4 appear only with a2. In part (b), treatments A and B are crossed because each
level of treatment B appears once and only once with each level of treatment A and vice
versa.
Experimental designs with one or more nested treatments are particularly well suited to
research in the behavioral and medical sciences, education, and industry. Consider an example
from education in which two types of instruction materials (treatment levels a1 and a2) are to be
evaluated using students in four sixth-grade classes (treatment levels b1, b2, b3, and b4). Two
classrooms are randomly assigned to use each type of instructional material. For administrative
reasons, all children in a particular room must use the same type of instruction material.
Assume that each classroom contains the same number of children. This hierarchical
experiment corresponds to the design illustrated in Figure 11.1-1(a). Each classroom, bk,
appears with only one level of programmed instruction. Thus, treatment B is nested in A.
There are problems with the design. The four classrooms represent four intact groups that are
randomly assigned to the treatment levels. However, the individual students, the observational
units,1 are not randomly assigned to the treatment levels. The use of intact groups can lead to
nonindependence of the observational units and to other problems. In Section 13.1, I discuss
these problems and an alternative design: the analysis of covariance. The hierarchical design
can be improved by randomly assigning students to the eight treatment-classroom
combinations. Random assignment helps to ensure that the eight classes of students are
probabilistically similar at the beginning of the experiment.
Nested treatments often occur in animal research where it is necessary to administer the same
treatment level to all animals in the same cage or housing compound. Consider an experiment
to investigate the effects of positive and negative ionization of air molecules (treatment levels a1
and a2, respectively) on the activity levels of rats. The animals are radiated continuously for a 3-
week period, after which their activity level is measured in an open field situation. Sixteen rats
are housed in four cages (treatment levels b1, b2, b3, and b4) with four rats per cage. Each
cage is equipped with ionizing equipment for producing the required condition in a cage. The
cages are randomly assigned to the ionization conditions, two to each condition, and the rats
are randomly assigned to the ionization and cage conditions. In this example, the cages are
nested in the ionization conditions. This experiment also corresponds to the design shown in
Figure 11.1-1(a). We assume that the activity level of a rat in the open field situation reflects the
effects of (1) the ionization condition received, treatment A; (2) the cage in which the rat is
housed, treatment B(A); and (3) experimental error: idiosyncratic characteristics of the rat and
other uncontrolled variables. These assumptions can be expressed more formally by means of
the model equation
where the symbol βk(j) indicates that the kth level of treatment B(A) is nested in the jth level of
A. Note that an A × B interaction term does not appear in the model.
In this and the programmed instruction experiment described earlier, treatment B(A) is really a
nuisance variable. That is, the variable of classrooms or cages is included in the design and
model equation because the researcher suspects that it might affect the dependent variable
and not because of an interest in the variable per se. One could argue that if the cages in the
ionization experiment are identical, this condition is constant for all subjects. This argument
ignores variables such as differences in the location of the cages in a room and possible
differences in the dispersion of air ions within the cages as well as differences in ambient
lighting, temperature, and humidity. Also ignored are differences in the social environment in
the cages due to the presence of dominant or neurotic rats in certain cages, undetected
infectious diseases, and so on. The use of a hierarchical design enables a researcher to isolate
the nuisance variables of cages, which might affect the rats' activity levels in the open field
situation.
A common error in the analysis of hierarchical experiments is to ignore the nuisance variable
and treat the design as if it were a CR-p design with nq(j) participants in each level of treatment
A. Here, q(j) denotes the number of levels of treatment B(A) that are nested in the jth level of
treatment A. This incorrect analysis may result in a biased test of treatment A. A correct
analysis takes into account the effects of the nested treatment so that the analysis is congruent
with the way the experiment was actually carried out. This point is discussed in Section 11.2.
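The correct analysis can be sketched for the balanced ionization layout (p = 2 ionization conditions, q(j) = 4 cages per condition, n = 4 rats per cage). The data below are simulated, and the error term reflects the usual mixed-model result that with B(A) random, treatment A is tested against MSB(A) rather than the within-cell mean square:

```python
import numpy as np

# Layout assumption: Y[j, k, i] = level j of A, cage k within j, rat i
rng = np.random.default_rng(1)
p, qj, n = 2, 4, 4
Y = rng.normal(size=(p, qj, n))

grand = Y.mean()
a_means = Y.mean(axis=(1, 2))          # means for a1, a2
cell_means = Y.mean(axis=2)            # means of the p*qj cages

SSA = n * qj * ((a_means - grand) ** 2).sum()
SSBwA = n * ((cell_means - a_means[:, None]) ** 2).sum()  # B within A

MSA = SSA / (p - 1)
MSBwA = SSBwA / (p * (qj - 1))
F_A = MSA / MSBwA   # df = p - 1 and p(qj - 1), not the within-cell df
```

Using the within-cell error instead would treat the 16 rats per condition as independent replicates and tends to bias the test of A when cage effects exist.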
Hierarchical designs that are constructed from completely randomized designs are appropriate
for experiments that meet, in addition to the assumptions of a CR-p design described in
Chapter 4, the following conditions:
1.There are two or more treatments, with each treatment having two or more levels.
2.The levels of at least one treatment are nested in those of another treatment; nonnested
treatments are crossed.
3.The levels (combinations) of the nested treatment(s) are randomly assigned to the
nonnested treatment (treatment combinations), and the experimental units are randomly
assigned to the treatment combinations.
Hierarchical designs are often used in research situations where the third requirement cannot
be met. For example, the programmed instruction experiment might be performed in four
schools using two sixth-grade classrooms in each school, a total of eight classrooms. In this
design, the four schools can be randomly assigned to the two types of instruction materials, but
obviously, classrooms in one school cannot be assigned to another school. Also, federal
integration guidelines would probably preclude assigning students randomly to the treatment
combinations. Many studies by their very nature preclude the kind of randomization described
in the third requirement. Campbell and Stanley (1966, p. 34) refer to experiments in which the
researcher has limited control over randomization as quasi-experiments.2 The interpretation of
the results of such experiments presents a real challenge. It seems that the difficulty in
interpreting the outcome of research varies inversely with the degree of control a researcher is
able to exercise over randomization. Certainly, quasi-experiments and ex post facto studies are
among the most difficult to interpret. Researchers will find the discussion by Shadish et al.
(2002) helpful.
24(A). A completely randomized hierarchical design with three treatments can take any of the
following forms: CRH-pq(A)r(AB), CRPH-pq(A)r, CRPH-pqr(A), CRPH-pqr(B), CRPH-pq(A)r(A),
and CRPH-pqr(AB). These designs are discussed in Section 11.6, so I limit my comments here
to the treatment designation scheme. In a CRH-pq(A)r(AB) design, treatments B(A) and C(AB)
are completely nested: B(A) is nested in A, and C(AB) is nested in A and B(A). The other
designs involve partial nesting, which is indicated by the use of P in their designation and are
referred to as completely randomized partial hierarchical designs. In the CRPH-pq(A)r
design, for example, treatment B(A) is nested in A, but treatment C is crossed with A and B(A).
In the CRPH-pqr(AB) design, treatment C(AB) is nested in A and B, but treatments A and B are
crossed. These designs are described in Section 11.6.
Hierarchical designs are balanced if they have an equal number of experimental units in each
treatment combination and an equal number of levels of the nested treatment in each level of
the other treatment; otherwise, they are unbalanced. The analysis and interpretation of the
results for unbalanced designs are more complex than the analysis and interpretation for
balanced designs.3 I describe the analysis of balanced designs first using the classical sum-of-
squares approach. The analysis of balanced and unbalanced designs using the cell means
model is discussed in Sections 11.7 and 11.8.
Assume that the ionization experiment described earlier has been performed. However, instead
of four cages, eight were used. The cages, treatment B(A), were randomly assigned to the
positive (a1) and negative (a2) ionization conditions, with the restriction that four cages were
assigned to each condition. Thirty-two rats were randomly assigned to the pq(j) = 2(4) = 8
treatment combinations with the restriction that four rats were assigned to each combination.
For a balanced CRH-28(A) design, the number of levels of treatment B(A) that are nested in
the jth level of treatment A, q(j) = 4, is constant for all j. The treatment effects for the ionization
conditions are fixed; those for cages are assumed to be random. This is a common assumption
for a nuisance variable. We are not interested in the eight cages used in the experiment but
rather in the population of cages from which the eight would have been a random sample if
random sampling had been used.
The research hypotheses leading to the experiment can be evaluated by testing the following
null hypotheses for treatments A and B(A), respectively: H0: μ1. = μ2. and H0: σ²β = 0.
If H0: σ²β = 0 is rejected, we conclude that σ²β > 0 for a1, or for a2, or both. The level of
significance adopted is .05. The computational procedures are illustrated in Table 11.2-1. The
results are summarized in Table 11.2-2. It is apparent that both null hypotheses can be
rejected.
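The computations summarized in Tables 11.2-1 and 11.2-2 can be sketched in a few lines of Python. The scores below are made up for illustration (they are not the book's data); the layout assumes p = 2 ionization conditions, q(j) = 4 cages per condition, and n = 4 rats per cage.

```python
import numpy as np

# Hypothetical scores for a balanced CRH-28(A) layout:
# axis 0 = treatment A (ionization), axis 1 = cages B(A), axis 2 = rats.
rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=(2, 4, 4))

p, q, n = y.shape
grand = y.mean()
a_means = y.mean(axis=(1, 2))            # treatment A means
cell_means = y.mean(axis=2)              # cage (cell) means

ss_a = n * q * np.sum((a_means - grand) ** 2)
ss_ba = n * np.sum((cell_means - a_means[:, None]) ** 2)
ss_wcell = np.sum((y - cell_means[:, :, None]) ** 2)

ms_a = ss_a / (p - 1)                    # df = p - 1
ms_ba = ss_ba / (p * (q - 1))            # df = p(q - 1)
ms_wcell = ss_wcell / (p * q * (n - 1))  # df = pq(n - 1)

# Mixed model (A fixed, cages random): MSB(A), not MSWCELL, is the
# error term for treatment A.
f_a = ms_a / ms_ba
f_ba = ms_ba / ms_wcell
```

The three sums of squares partition the total sum of squares exactly, which is a useful check on the arithmetic.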
From an examination of the data, we know that rats exposed to negative air ions are more active
than those exposed to positive ions. The finding that there are differences in the activity levels
among the various cages is of little substantive interest because the cages represent a
nuisance variable. The significant F statistic for cages indicates that the decision to include this
source of variation in the design was a good one.
Table 11.2-2 contains two error terms: one for testing MSA and another for testing MSB(A). The
test of MSA for model III (see Table 11.3-1) uses MSB(A) in the denominator of the F statistic.
For F to be distributed as central F when the null hypothesis is true, the population variance
estimator MSB(A) should be composed of homogeneous sources of variation. It can be shown
that SSB(A) represents a pooled sum of squares:
Actually, SSB(A) is the pooled simple main effects of treatment B at each level of treatment A.
The reader may recall from Section 9.6 that each simple main effect is itself a sum of
squares:
The other error-term estimator, MSWCELL, also should be composed of homogeneous sources
of variation. Procedures for testing this assumption are described in Section 3.5.
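The correspondence between SSB(A) and the factorial sums of squares can be verified numerically: with an equal number of nested levels in each group, relabeling the nested levels as if they were crossed recovers the pooled sum of squares. The scores here are made up for illustration.

```python
import numpy as np

# Check that SSB(A) = SSB + SSA x B for a balanced layout
# (p = 2, q = 4 nested levels per group, n = 4 scores per cell).
rng = np.random.default_rng(1)
p, q, n = 2, 4, 4
y = rng.normal(size=(p, q, n))

grand = y.mean()
cell = y.mean(axis=2)
a_m = y.mean(axis=(1, 2))
b_m = y.mean(axis=(0, 2))

ss_ba = n * np.sum((cell - a_m[:, None]) ** 2)            # nested SSB(A)
ss_b = p * n * np.sum((b_m - grand) ** 2)                 # "crossed" SSB
ss_ab = n * np.sum((cell - a_m[:, None] - b_m[None, :] + grand) ** 2)
```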
Typically, the number of degrees of freedom associated with MSB(A), the error term for testing
MSA, is quite small. If there is reason for believing that σ²β = 0, then MSB(A) and MSWCELL are
both estimators of the same population error variance and can be pooled. For a discussion of
the issues involved in performing preliminary tests and pooling, see Section 9.11. Pooling is not
appropriate for the ionization data in Table 11.2-1 because the hypothesis H0: σ²β = 0 was rejected.
An Incorrect Analysis
A common error in analyzing experiments that involve a nested nuisance variable is to ignore
the variable and use an analysis appropriate for a CR-p design. Instead of including the
variable of, say, cages or classrooms in the design, the researcher acts as if the nq(j)
experimental units were randomly assigned to each level of treatment A and the nq(j) units
were all treated alike. If σ²β > 0, as is often the case, then the test of treatment A using
MSWCELL as the error term is positively biased. An analysis of the data in Table 11.2-1,
ignoring the variable of cages, leads to the results shown in Table 11.2-3. The discrepancy
between the incorrect analysis (CR-2) and the correct analysis (CRH-28(A)) is quite large. The
two analyses lead to the same decision for treatment A if σ²β = 0 and the researcher uses a
pooled error term based on MSB(A) and MSWCELL. Then the test of A is given by F =
MSA/MSRES = 112.500/4.100 = 27.44, where
Table 11.2-3 ▪ Comparison of Correct and Incorrect Analyses for the Data in Table 11.2-1
As noted earlier, the use of a pooled error term for these data is not appropriate.
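When σ²β = 0 is tenable, the pooled error term mentioned above is formed by adding the sums of squares and the degrees of freedom. A sketch with illustrative numbers (not the values from Table 11.2-1):

```python
# Pooling MSB(A) and MSWCELL into a residual error term; the sums of
# squares and degrees of freedom below are hypothetical.
ss_ba, df_ba = 12.3, 6          # SSB(A) and p(q - 1)
ss_wcell, df_wcell = 98.4, 24   # SSWCELL and pq(n - 1)

ms_res = (ss_ba + ss_wcell) / (df_ba + df_wcell)
# Treatment A would then be tested by F = MSA / ms_res with
# df_ba + df_wcell denominator degrees of freedom.
```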
A score Yijk in a completely randomized hierarchical design is a composite that reflects the
effects of treatments A and B(A) plus all other sources of variation that affect Yijk. The
experimental design model for a CRH-pq(A) design is
Yijk = μ + αj + βk(j) + εi(jk)    (i = 1, …, n; j = 1, …, p; k = 1, …, q(j))    (11.3-1)
where
Yijk is the observation for the ith experimental unit in the jkth treatment combination.
αj is the treatment effect for population j and is equal to μj. − μ. The jth treatment
effect is a constant for all scores in treatment level aj and is subject to the restriction
Σj αj = 0.
βk(j) is the treatment effect for population k(j) and is equal to μjk − μj.. βk(j) is NID(0, σ²β).
εi(jk) is the error effect associated with Yijk and is equal to Yijk − μjk. εi(jk) is NID(0, σ²ε)
and independent of βk(j).
The values of the parameters μ, αj, βk(j), and εi(jk) in model (11.3-1) are unknown, but they
can be estimated from sample data as follows:
The partition of the total sum of squares, which is the basis for the analysis in Table 11.2-2, is
obtained by rearranging the terms in equation (11.3-2) as follows:
Squaring both sides of equation (11.3-3) and summing over all observations following the
examples in Sections 3.2 and 8.1 lead to the following partition of the total sum of squares:4
where
αj is a constant for all scores in treatment level aj and is subject to the restriction Σj αj = 0.
βk(j) is a constant for all scores in treatment level bk(j) and is subject to the restriction
Σk βk(j) = 0 for all j.
If treatments A and B(A) represent random effects, the assumptions are as follows:
αj is NID(0, σ²α).
βk(j) is NID(0, σ²β).
Experiments in which treatment A is random but B(A) is fixed are rare. The assumptions for this
mixed model are as follows:
αj is NID(0, σ²α).
The expected values for these models are given in Table 11.3-1. These expected values and
those for other hierarchical designs are easily derived using the rules in Section 9.9.
Tests of differences among means in a CRH-pq(A) design have the same general form as those
given in Chapter 5. The test statistics for making comparisons among treatment A means for a
mixed model (A fixed) have the form
For a fixed-effects model, the mean square in the denominator of the above statistics should be
replaced by MSWCELL. The general rule is that the error term that is used to test a treatment
null hypothesis should be used to test hypotheses for the associated population contrasts. The
number of degrees of freedom for these statistics is equal to the degrees of freedom of the
mean square used in the denominator: p(q(j) − 1) or pq(j)(n − 1).
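As a sketch, a pairwise contrast among treatment A means under the mixed model uses MSB(A) in the standard error; the numbers below are illustrative, not the book's.

```python
import numpy as np

# t statistic for a contrast among treatment A means (mixed model).
# Each A mean is based on n * q observations; the error term is MSB(A).
n, q, p = 4, 4, 2
a_means = np.array([8.0, 11.0])     # hypothetical treatment A means
c = np.array([1.0, -1.0])           # contrast coefficients (sum to 0)
ms_ba = 10.0                        # hypothetical MSB(A)
df_ba = p * (q - 1)                 # its degrees of freedom

psi_hat = c @ a_means
t = psi_hat / np.sqrt(ms_ba * np.sum(c ** 2) / (n * q))
```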
A researcher ordinarily is not interested in testing hypotheses for the nested treatment.
However, if this test is desired, the statistics have the following form for all four models:
11.5 Estimating Strength of Association, Effect Size, Power, and Sample Size
Strength of Association
The expectations of the mean squares in Table 11.3-1 are the basis for determining the terms
required to compute the measures of association. Procedures for using the E(MS)s to estimate
σ²α, σ²β, and so on were introduced in Section 4.4. If an estimated component is negative, that
estimate is set equal to zero. A mixed model (A fixed) is appropriate for the ionization
experiment described in Sections 11.1 and 11.2. Measures of partial omega squared and partial
intraclass correlation for the ionization data in Table 11.2-1 are as follows:
where
Both the ionization treatment and the cages account for an appreciable portion of the variance
in the dependent variable.
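One common way to form these estimators from the expected mean squares of the mixed model (A fixed, B(A) random) is sketched below; the mean squares are illustrative numbers, not those in Table 11.2-2, and negative component estimates are set to zero as noted above.

```python
# Variance-component estimators from E(MS) for the mixed model:
# E(MSA)     = var_e + n*var_b + n*q*(sum alpha^2)/(p - 1)
# E(MSB(A))  = var_e + n*var_b
# E(MSWCELL) = var_e
p, q, n = 2, 4, 4
ms_a, ms_ba, ms_wcell = 112.5, 10.0, 2.2   # hypothetical mean squares

var_a = max((p - 1) * (ms_a - ms_ba) / (n * q * p), 0.0)
var_b = max((ms_ba - ms_wcell) / n, 0.0)
var_e = ms_wcell

omega_sq_partial = var_a / (var_a + var_e)   # partial omega^2 for A
rho_i_partial = var_b / (var_b + var_e)      # partial intraclass for B(A)
```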
Effect Size
Cohen's measure of effect size can be computed from partial omega squared. For example, the
effect size for treatment A is f = √(ω̂² / (1 − ω̂²)).
Power and Sample Size
Procedures for estimating power and sample size are discussed in Section 4.5. Formulas for
computing the required quantities for a CRH-pq(A) design are
CRH-pq(A)r(AB) Design
Consider an experiment to evaluate the efficacy of a new drug, denoted by a1, and the current
drug, denoted by a2. Four hospitals (treatment B(A)) with two wards each (treatment C(AB))
are available to participate in the experiment. Because expensive equipment is needed to
monitor the side effects of the new drug, it was decided to use the new drug in two of the four
hospitals and the current drug in the other two hospitals. The hospitals were randomly
assigned to the drug conditions with the restriction that two hospitals were assigned to each
condition. A total of npq(j)r(jk) patients were randomly assigned to the pq(j)r(jk) = (2)(2)(2) = 8
treatment combinations, with the restriction that n patients were assigned to each combination.
The design is diagrammed in Figure 11.6-1. The experimental design model equation for this
completely nested design, in which hospitals are nested in the drugs and wards are nested in
hospitals and drugs, is
Figure 11.6-1 ▪ Diagram of CRH-24(A)8(AB) design. The four hospitals, treatment B(A), are
nested in the two drugs, treatment A; the eight wards, treatment C(AB), are nested in the
hospitals and drugs. Patients are randomly assigned to the pq(j)r(jk)= (2)(2)(2) = 8
treatment combinations, with the restriction that n patients are assigned to each
combination.
The computational formulas and so on for this design are given in Table 11.6-1. The sums of
squares SSB(A) and SSC(AB) represent pooled simple main effects and pooled simple simple
main effects, respectively. It can be shown that these sums of squares correspond to
in a factorial design. For MSB(A) and MSC(AB) to serve as denominators of F statistics, the
mean squares should be composed of homogeneous sources of variation, for example,
CRPH-pq(A)r Design
The experimental design model equation for a CRPH-pq(A)r design in which treatment B(A) is
nested in A but treatment C is crossed with A and B(A) is
The design is diagrammed in Figure 11.6-2; computational formulas are given in Table 11.6-2.
Figure 11.6-2 ▪ Diagram of CRPH-24(A)2 design. The four levels of treatment B(A) are
nested in the two levels of treatment A; treatment C is crossed with treatments A and
B(A).
in a factorial design.
CRPH-pqr(A) and CRPH-pqr(B) Designs
One variation on this design is to nest treatment C(A) in A and cross treatments A and B. The
other variation is to nest treatment C(B) in B and cross treatments A and B. The experimental
design model equations are, respectively,
These designs are diagrammed in Figures 11.6-3 and 11.6-4; the computational formulas are
given in Tables 11.6-3 and 11.6-4.
Figure 11.6-3 ▪ Diagram of CRPH-224(A) design. Treatments A and B are crossed; the four
levels of treatment C(A) are nested in A. Treatments B and C(A) are crossed.
Figure 11.6-4 ▪ Diagram of CRPH-224(B) design. Treatments A and B are crossed; the four
levels of treatment C(B) are nested in B. Treatments A and C(B) are crossed.
CRPH-pq(A)r(A) Design
The experimental design model equation for a CRPH-pq(A)r(A) design in which both treatments
B(A) and C(A) are nested in treatment A is
This design is diagrammed in Figure 11.6-5; computational formulas are given in Table 11.6-5.
Figure 11.6-5 ▪ Diagram of CRPH-24(A)4(A) design. The four levels of treatment B(A) are
nested in treatment A; the four levels of treatment C(A) are also nested in treatment A. The
levels of treatments B(A) and C(A) are crossed.
in a factorial design.
CRPH-pqr(AB) Design
In a CRPH-pqr(AB) design, treatment C(AB) is nested in both A and B, but treatments A and B
are crossed. For example, suppose a new drug education program for junior high students is to
be evaluated. I denote the program by a1 and the control condition by a2. Two schools,
treatment B, are available to the researcher. Eight social studies teachers, treatment C(AB), are
randomly assigned to the pq treatment combinations with the restriction that two teachers are
assigned to each combination. The design is diagrammed in Figure 11.6-6.
Figure 11.6-6 ▪ Diagram of CRPH-228(AB) design. Treatments A and B are crossed; the
eight levels of treatment C(AB) are nested in A and B.
The computational formulas are given in Table 11.6-6. The sum of squares SSC(AB)
corresponds to
in a factorial design.
I have described all of the three-treatment hierarchical designs that are based on a CR-p
design. The patterns that underlie hierarchical designs with four or more treatments are
straightforward extensions of those described. Accordingly, in the next section, I describe only
one of the four-treatment completely randomized partial hierarchical designs.
CRPH-pqrt(C) Design
The experimental design model equation for a CRPH-pqrt(C) design in which treatments A, B,
and C are crossed but treatment D(C) is nested in treatment C is
The design is diagrammed in Figure 11.6-7; computational formulas are given in Table 11.6-7.
Figure 11.6-7 ▪ Diagram of CRPH-2224(C) design. Treatments A, B, and C are crossed; the
four levels of treatment D(C) are nested in treatment C.
The sums of squares SSD(C), SSA × D(C), SSB × D(C), and SSA × B × D(C) correspond to
in a factorial design.
RBH-pq(A) Design
Hierarchical designs can be constructed using an RB-p design as the building block design. In
this section, I describe an RBH-pq(A) design in which the q levels of treatment B(A) are nested
in those of treatment A.
Suppose an educational researcher wants to compare the effectiveness of two ways of using
reading pacers in increasing sixth-graders' reading speed. In condition a1, the rate of
combinations. In the former case, the participants in a block are randomly assigned to
the treatment combinations. In the latter case, the order in which the treatment
combinations are presented is randomized independently for each block.
Consider a mixed model in which the levels of treatment A are fixed effects but the levels of the
blocks and treatment B(A) are random effects. According to Table 11.6-8, MSRES is used to
test the null hypothesis for blocks: F = MSBL/MSRES. However, MSB(A) is used to test
treatment A: F = MSA/MSB(A). As discussed in Section 11.2, MSB(A) should be composed of
homogeneous sources of variation: σ²β at a1 = σ²β at a2 = … = σ²β at ap. A sample estimator of
σ²β at aj is given by
Cochran's C test statistic, described in Section 3.5, provides a simple test of the homogeneity of
variance assumption.
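A minimal sketch, assuming the statistic is applied to the p simple main effects mean squares of B(A), one per level of treatment A; the scores are made up, and the observed C would be referred to a critical value from Cochran's tables.

```python
import numpy as np

# Cochran's C: the largest of several variance estimates divided by
# their sum, here across the p = 2 levels of A (q = 4 cages, n = 4).
rng = np.random.default_rng(2)
p, q, n = 2, 4, 4
y = rng.normal(loc=10.0, scale=2.0, size=(p, q, n))

cell = y.mean(axis=2)
a_m = y.mean(axis=(1, 2))
# Simple main effects MS of B at a_j: n * sum_k (cell_jk - a_j)^2 / (q - 1)
ms_b_at_a = n * ((cell - a_m[:, None]) ** 2).sum(axis=1) / (q - 1)
C = ms_b_at_a.max() / ms_b_at_a.sum()
```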
If the blocks do not interact with treatments A and B(A), the experimental design model
equation is
If the block-treatment interactions, A × BL and B(A) × BL, are not equal to zero, the model equation must
be amended as follows:
Computational procedures for the additive and nonadditive models are discussed in Section
10.7.
where
The expectations of the mean squares in Table 11.6-8 are the basis for determining the terms
required to compute the measures of association.
The construction of randomized block hierarchical and partial hierarchical designs with three or
more treatments follows the pattern illustrated earlier for completely randomized hierarchical
designs.
The cell means model can be used to analyze data for completely randomized hierarchical
designs. The model for a CRH-pq(A) design is
where εi(jk) is NID(0, σ²ε). An important advantage of this model over the classical sum-of-
squares approach is that it can be used when there are missing observations and missing cells.
The ionization data in Table 11.2-1 are used to illustrate the computational procedures for the
cell means model. For the purpose of forming coefficient matrices, C′, the null hypotheses for
this CRH-28(A) design can be expressed in matrix notation as follows:
The use of in the null hypothesis requires explanation. This null hypothesis is for those
levels of treatment B nested in either the a1 or the a2 level of treatment A. This is the first of
two modifications that are required for nested treatments. I show later that the use of to
obtain the coefficient matrix actually results in testing the treatment B(A) null hypothesis
for all levels of treatment A. The second modification involves the use of Ip (the p × p identity
matrix) instead of 1′p (the 1 × p sum vector) to obtain C′B(A). Because treatment B(A) is nested in
treatment A, C′B(A) is obtained from the Kronecker product Ip ⊗ H′B instead of from 1′p ⊗ H′B.
The coefficient matrices, C′, for computing sums of squares are obtained as follows:
The computational procedures for the CRH-28(A) design are shown in Table 11.7-1. The sums
of squares in Table 11.7-1 are identical to those in Table 11.2-1, where the classical sum-of-
squares approach was used. In the next section, the cell means model is used to analyze data
for an experiment that has a missing cell and two missing observations.
Table 11.7-1 ▪ Computational Procedures for CRH-28(A) Design Using the Cell Means
Model
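For the balanced case, the construction can be sketched with Kronecker products; the contrast matrix H′B used here is one conventional choice, and the scores are made up for illustration.

```python
import numpy as np

# Cell means model for a balanced CRH-28(A) design: p = 2, q = 4 levels
# of B(A) within each level of A, n = 4 scores per cell. For balanced
# data, (X'X)^-1 is I/n.
rng = np.random.default_rng(3)
p, q, n = 2, 4, 4
y = rng.normal(size=(p, q, n))
mu_hat = y.mean(axis=2).ravel()              # the pq = 8 cell means
XtX_inv = np.eye(p * q) / n

H_a = np.array([[1.0, -1.0]])                            # contrast on A
H_b = np.hstack([np.eye(q - 1), -np.ones((q - 1, 1))])   # contrasts on B

C_a = np.kron(H_a, np.ones((1, q)) / q)   # C'_A = H'_A (x) (1/q)1'_q
C_b = np.kron(np.eye(p), H_b)             # C'_B(A) = I_p (x) H'_B

def ss(C, mu, XtX_inv):
    # SS = (C'mu)' [C'(X'X)^-1 C]^-1 (C'mu)
    d = C @ mu
    return float(d @ np.linalg.solve(C @ XtX_inv @ C.T, d))

ss_a = ss(C_a, mu_hat, XtX_inv)
ss_ba = ss(C_b, mu_hat, XtX_inv)
```

The sums of squares computed this way agree with the classical formulas when the design is balanced.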
Suppose that the ionization experiment described in Sections 11.1 and 11.2 is unbalanced
because the equipment in the second cage malfunctioned and two animals in other cages died.
Assume that the deaths were unrelated to the nature of the treatments. The data for this
experiment are shown in Table 11.7-2. The four observations in cell a1b2(1) and observations
Y413 and Y427 are missing. To compute sums of squares for this design, coefficient matrices
The null hypothesis for treatment A involves a simple average of the means at each level of
treatment A. Alternatively, a hypothesis for treatment A that involves weighted means could be
tested (see Section 9.14). The coefficient matrices for testing treatments A and B(A) are
where μ′ = [μ11 μ13 μ14 μ25 μ26 μ27 μ28]′, p = 2, q(1) = 3, q(2) = 4, h = q(1)+ q(2) = 7, and N =
n11 + n13 + n14 + … + n28. The coefficients in C′A are cjk = ±1/q(j) or 0; the coefficients in
C′B(A) are cjk = ±1 or 0. The fractions in C′A can be avoided by replacing the nonzero
coefficients in a1 and a2 with, respectively, ±q(2) = ±4 and ±q(1) = ±3, as follows:
where
Table 11.7-3 ▪ ANOVA Table for CRH-27(A) Design With Missing Cell and Two Missing
Observations
The use of Kronecker products to formulate coefficient matrices for the cell means model is
illustrated for a CRH-236(B) design and a CRPH-2224(C) design. The pattern underlying the
formulation of coefficient matrices for any completely randomized hierarchical design should be
apparent from these examples. The coefficient matrices for a CRH-236(B) design are presented
first.
where
For this example in which treatment C(B) is nested in treatment B, the formulation of coefficient
matrices using Kronecker products requires two modifications. First, H′C(B) and 1′C(B) involve only
those levels of C(B) nested in the kth level of B. Second, in computing C′C(B), the sum vector,
1′q, for treatment B is replaced by an identity matrix, Iq. If treatment C(AB) had been nested in
both A and B, 1′p and 1′q would have been replaced by Ip and Iq, respectively. The sum-of-
squares formula for treatments and interactions has the general form
squares formula for treatments and interactions has the general form
The degrees of freedom for a treatment or interaction are equal to the number of rows in its
coefficient matrix. Only three formulas are required to analyze the data for any completely
randomized hierarchical design:
The coefficient matrices for a CRPH-2224(C) design in which the t = 4 levels of treatment D(C)
are nested in C are given by the following Kronecker products:
The cell means model analysis procedures described in Sections 8.8 and 10.8 for designs with
a block variable generalize to a randomized block hierarchical design. The restricted cell means
model for an RBH-pq(A) design is
where Yijk denotes an observation in treatment combination ajbk(j) and block i. μijk is the
population cell mean for the jkth treatment combination in block i, and εijk is the error effect
that is NID(0, σ²ε). μijk is subject to the following restrictions: (1) For the test of blocks, all
A[B(A)] × BL population effects equal zero; (2) for the test of treatment A, all A × BL population
effects equal zero; and (3) for the test of treatment B(A), all B(A) × BL population effects equal
zero.
Consider first the case in which there are no missing observations. For the purpose of forming
coefficient matrices, the null hypotheses for the blocks and treatments are expressed in terms
of block means, μi.., and treatment means, μ.j. and μ.jk. It can be shown that the rows of the
coefficient matrices that define the restrictions on , , and are
linearly independent of the rows of the coefficient matrices that define the corresponding null
hypotheses, , and . For this case, the computational procedures are
simplified, as I show in Section 8.8.
In Section 11.6, I described an experiment to compare the effectiveness of two ways of using
reading pacers to increase sixth-graders' reading speed. I use this experiment to illustrate the
computational procedures for an RBH-24(A) design. Treatment A has two levels (a1 and a2);
treatment B(A) has four levels (b1(1), b2(1), b3(2), b4(2)). The null hypotheses for this mixed
where denotes the omnibus treatment null hypothesis. Coefficient matrices for computing
sums of squares are obtained using Kronecker products, ⊗, of hypothesis matrices, H′, sum
vectors, 1′, and identity matrices, I, as follows:
In this example, the coefficient matrices provide tests of contrasts of sums of population
treatment means and sums of population block means. In the following subsection, I show that
when one or more observations are missing, the coefficient matrices are formulated so as to
test hypotheses for contrasts of both population treatment cell means and contrasts of
population block cell means. The total sum of squares is given by
Multiple comparison tests are performed with treatment means computed from restricted cell
means as specified by the restrictions on the model. Restricted cell means for treatment A are
given by
These cell means are used to obtain treatment means. For example, y.1. is given by
When one or more observations are missing, coefficient matrices must specify contrasts of cell
means instead of contrasts of sums of cell means, as I show next. The rationale for the analysis
procedures presented here is described in Sections 8.8 and 10.8.
Suppose that observation Y311 in Table 11.8-1 is missing. The following hypothesis matrices, H
′, sum vectors, 1′, and identity matrices, I, are used to compute the coefficient matrices, C′.
I first show the cell means null hypothesis for the case in which there is no missing observation
and then modify the hypothesis to take into account the missing observation.
The null hypothesis for blocks can be expressed in terms of contrasts of cell means as follows:
where Vertical and horizontal lines in the matrix and vectors identify (1)
the missing mean, μ311, in ; (2) the contrast, μ211 − μ311, in that cannot be
estimated; and (3) the corresponding 0 in 02(BL). Because Y311 is missing, column 3 and row
2 of must be deleted as well as the corresponding 0 in 02(BL) and μ311 in μ2(BL). The
modified null hypothesis for blocks is
Notice that the modified null hypothesis does not contain the μ211 − μ311 population contrast.
Sometimes a missing observation affects two or more population contrasts. Suppose, for
example, that Y212 instead of Y311 in Table 11.8-1 is missing. Observation Y212 is in column 5
and row 3 of .
Neither μ112 − μ212 in row 3 of nor μ212 − μ312 in row 4 can be estimated. This problem
can be solved by first deleting column 5 corresponding to the missing observation, Y212, and
then adding rows 3 and 4 to obtain a new interesting and testable contrast, μ112 − μ312, as
follows:
The original row 3 of is replaced with the sum of rows 3 and 4; the original row 4 is
deleted.
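The column deletion and row summing can be sketched on a small hypothetical contrast matrix (adjacent contrasts on four cell means, with the second mean missing):

```python
import numpy as np

# Adjacent contrasts on four cell means; mu_2 (column 1) is missing.
C = np.array([[1.0, -1.0, 0.0, 0.0],    # mu_1 - mu_2
              [0.0, 1.0, -1.0, 0.0],    # mu_2 - mu_3  (both use mu_2)
              [0.0, 0.0, 1.0, -1.0]])   # mu_3 - mu_4
missing_col = 1

# Rows 0 and 1 both involve the missing mean; their sum is the
# estimable contrast mu_1 - mu_3. Row 2 is unaffected.
merged = C[0] + C[1]
C_new = np.delete(np.vstack([merged, C[2]]), missing_col, axis=1)
```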
Table 11.8-3 ▪ ANOVA Table for the RBH-24(A) Design (Y311 Is Missing)
In general, the number of degrees of freedom for a sum of squares is equal to the number of
rows in its coefficient matrix. Exceptions to the general rule occur; is an example.
has seven rows, but the degrees of freedom for SSBL is 2. It can be shown that the five rows of
are linearly dependent on the seven rows of . Hence, following the discussion in
Section 8.8, SSBL is given by
The degrees of freedom for SSBL is equal to the number of rows in minus the number of
rows in .
Multiple comparison tests are performed with treatment means computed from restricted cell
means as specified by the restrictions on the model. Restricted cell means for treatment A are
These cell means are used to obtain treatment means. For example, . is given by
The major advantages of a hierarchical design are as follows:
1.All participants are used in simultaneously evaluating the effects of two or more treatments.
2.A researcher can isolate the effects of nuisance variables and evaluate treatments that
cannot be crossed with other treatments.
The major disadvantages are as follows:
1.If numerous treatments are included in the experiment, the number of participants required
may be prohibitive.
2.The power of certain tests for mixed- and random-effects models tends to be low because
the error term for these tests has a small number of degrees of freedom.
3.If, as is often the case, the nesting of treatment levels or the assignment of experimental
units to treatment combinations is not random, the interpretation of the results may be
ambiguous.
1.Terms to remember:
a.hierarchical design (11.1)
b.completely randomized partial hierarchical design (11.1)
c. balanced hierarchical design (11.1)
d.unbalanced hierarchical design (11.1)
2.[11.1] Distinguish between the following.
a.Hierarchical and partial hierarchical designs
b.Balanced hierarchical and unbalanced hierarchical designs
*3.An experiment was designed to evaluate the effectiveness of a pattern-practice approach in
modifying the nonstandard dialect of the Chicano children living in San Antonio, Texas. The
essential elements of the approach, denoted by a1, involved pattern drill in imitating audio-
recorded speech models and immediate feedback by a teacher. The control condition, a2,
involved reading stories aloud to the teacher. The children were praised for a good
performance, but no corrections or guidance was given. Twenty-four 12-year-old Chicano
children with similar scores on a standardized speech test were randomly assigned to six
miniclasses, with four children in each. The six miniclasses, treatment B(A), were randomly
assigned to the levels of treatment A. The dependent variable was the difference between
the children's pre- and posttest scores on the test. The following data were obtained:
*a.[11.2] Use the classical sum-of-squares approach to test H0: μ1. = μ2. and H0: σ²β = 0;
let α = .05.
*b.[11.2] Is it reasonable to assume that the population variance estimated by MSB(A) is
composed of homogeneous sources of variation? Use Cochran's C statistic to test σ²β
at a1 = σ²β at a2; let α = .05.
*c.[11.5] Calculate and .
d.Prepare a “results and discussion section” for the Journal of Educational Psychology.
*4.An industrial psychologist was interested in decreasing the time required to assemble an
electronic component. Three assembly fixtures, treatment A, including the one currently in
use, a1, were evaluated. Five operators at each of six workplaces in the plant, treatment
B(A), were selected randomly to participate in the experiment. The six workplaces were
randomly assigned to use the assembly fixtures, with the restriction that two workplaces
were assigned to each fixture. The operators were randomly assigned to the treatment
combinations. After a 3-week familiarization period, the following data on the number of
units assembled per hour were collected.
*a.[11.2] Use the classical sum-of-squares approach to test H0: μ1. = μ2. = μ3. and H0:
σ²β = 0; let α = .05.
*b.[11.2] Is it reasonable to assume that the population variance estimated by MSB(A) is
5.The effects of early environment on the problem-solving ability of rats at maturity were
investigated. Rats were raised in one of three environments: a1 was a normal environment,
a2 was an enriched environment, and a3 was a restricted environment. Nine cages were
randomly assigned to the levels of treatment A, with the restriction that three cages were
assigned to each level of A. Thirty-six rats were randomly assigned to the nine treatment
combinations, with four rats to a cage. The dependent variable was the number of trials
required to learn a visual discrimination task. The following data were obtained.
a.[11.2] Use the classical sum-of-squares approach to test H0: μ1. = μ2. = μ3. and H0:
σ²β = 0; let α = .05.
b.[11.2] Is it reasonable to assume that the population variance estimated by MSB(A) is
composed of homogeneous sources of variation? Use Cochran's C statistic to test σ²β
at a1 = σ²β at a2 = σ²β at a3; let α = .05.
c. [11.4] Use the Fisher-Hayter statistic to determine which treatment A population means
differ.
d.[11.5] Calculate and .
e.Prepare a “results and discussion section” for the Journal of Experimental Psychology:
Applied.
*7.[11.6] Identify the following designs; assume in each case that the building block design is
a CR-p design.
*9.[11.6] For each of the following designs, (i) write the experimental design model equation
and (ii) construct a diagram of the design. Use the minimum number of levels required for
each treatment.
*a.CRPH-pqrt(B)
*b.CRPH-pqrt(AB)
*c.CRH-pq(A)r(AB)t(ABC)
d.CRPH-pqrt(A)
e.CRPH-pq(A)r(A)t(A)
f. CRPH-pq(A)r(AB)t
*10.
[11.6] There is a correspondence between the sum of squares for a nested treatment and
the sums of squares in a factorial design. For example, SSB(A) = SSB + SSA × B and
SSC(AB) = SSC + SSA × C + SSB × C + SSA × B × C. (i) For each of the following,
indicate the correspondence.
*a.SSC(B)
*b.SSB × D(C)
*c.SSA × D(BC)
*d.SSD(ABC)
*e.SSB(A) × C(A)
*f.SSB(A) × C(A) × D(A)
g.SSD(BC)
h.SSE(ABD)
i. SSA × C(B)
j. SSD × E(ABC)
k. SSC(AB) × D
l. SSC(A) × E(D)
(ii) It is evident from part (i) that a simple rule governs the correspondence between a
nested sum of squares and the sums of squares in the factorial design. State the rule.
*11.
[11.7] Exercise 3 describes an experiment that was concerned with modifying the dialect of
Chicano children. Suppose that observation Y213 = 12 is missing.
*a.Use Kronecker products to formulate the C′ matrices for testing the null hypotheses for
treatments A and B(A).
*b.Formulate (X′X)−1.
*12.
[11.7] Exercise 4 describes an experiment that was concerned with decreasing the time
required to assemble an electronic component. Suppose that observation Y311 = 9 and cell
a1b2 are missing.
*a.Use Kronecker products to formulate the C′ matrices for testing the null hypotheses for
treatments A and B(A).
*b.Formulate (X′X)−1.
*c.Analyze the data using the cell means model approach.
13.
[11.7] Exercise 5 describes an experiment that investigated the effects of early environment
on the problem-solving ability of rats.
a.Use Kronecker products to formulate the C′ matrices for testing the null hypotheses for
treatments A and B(A).
b.Formulate (X′X)−1.
c. Analyze the data using the cell means model approach.
*14.
[11.8] According to the National Science Foundation, women are underrepresented in
scientific fields. In 2010, women constituted less than 25% of all physical scientists and less
than 10% of all engineers. Gender discrimination is believed to play a role in occupational
segregation. Sixteen girls who attended a 1-day conference entitled “Expanding Your
Horizons” participated in an experiment to determine the effects of knowledge about gender
discrimination on girls' attitudes toward and interest in science-related occupations. The
girls ranged in age from 11 to 13 with a mean age of 12.1 years. The girls were assigned to
one of four blocks of size four based on their score on Rosell and Hartman's Contemporary
Gender Discrimination Attitude Scale. Treatment level a1 was a 1-hour lecture on common
types of gender discrimination in scientific fields by the researcher. Treatment level a2 was
a 1-hour lecture on French cooking by the researcher, the control condition. Treatment B
was a 1-hour lecture by women scientists in one of four fields: b1(1) = physics, b2(1) =
biology, b3(2) = chemistry, and b4(2) = neuroscience. The women scientists described their
careers and the satisfaction that they derived from their research. The girls in each block
were randomly assigned to one of the four treatment combinations. At the conclusion of the
conference, the girls completed a number of measures of interest in science. Data for
Weisgram and Bigler's Utility Value of Science Scale are shown here.
*a.[11.2] Use the classical sum-of-squares approach to test the null hypotheses for blocks,
treatment A (H0: μ.1. = μ.2.), and treatment B(A); let α = .05. Assume a mixed model where the
levels of treatment A are fixed effects.
*b.[11.2] Is it reasonable to assume that the population variance estimated by MSB(A) is
composed of homogeneous sources of variation? Use Cochran's C statistic to test the
hypothesis at a2; let α = .05.
*c.[11.5] Calculate measures of strength of association.
d.Prepare a “results and discussion section” for the Psychology of Women Quarterly.
*15.
[11.8] Exercise 14 describes an experiment to investigate the effects of knowledge about
gender discrimination on girls' attitudes toward and interest in science-related occupations.
The data in Exercise 14 can be analyzed using the cell means model approach.
*a.Use H′ matrices to write the null hypotheses for the blocks, omnibus treatment,
treatment A, and treatment B(A).
*b.Write the C′ matrices for the blocks, omnibus treatment, treatment A, treatment B(A),
residual, A × BL, and B(A) × BL.
*c.Analyze the data using the cell means model approach. Compute F tests and p values
for the blocks and treatments A and B(A).
*d.Compute restricted cell means for treatments A and B(A).
*e.Compute restricted treatment means for treatments A and B(A).
*16.
[11.8] Exercise 14 describes an experiment to investigate the effects of knowledge about
gender discrimination on girls' attitudes toward and interest in science-related occupations.
Suppose that one of the girls did not complete the Utility Value of Science Scale.
Observation Y411 is missing. Use the cell means model approach to analyze the remaining
data.
*a.Use H′ matrices to write the null hypotheses for the blocks, omnibus treatment,
treatment A, and treatment B(A).
*b.Write the unmodified C′ matrices for the blocks, omnibus treatment, treatment A,
treatment B(A), residual, A × BL, and B(A) × BL where the missing observation is not
deleted. For each matrix, indicate the column and row that must be deleted. Write the
modified C′ matrices with observation Y411 deleted.
*c.Analyze the data using the cell means model approach. Compute F tests and p values
for the blocks and treatments A and B(A).
*d.Compute restricted cell means for treatments A and B(A).
*e.Compute restricted treatment means for treatments A and B(A).
17.
[11.8] Exercise 14 describes an experiment to investigate the effects of knowledge about
gender discrimination on girls' attitudes toward and interest in science-related occupations.
The participants also completed Weisgram and Bigler's Egalitarian Attitudes Toward
Science Scale. The data are shown here.
d.Prepare a “results and discussion section” for the Psychology of Women Quarterly.
18.
[11.8] Exercises 14 and 17 describe an experiment to investigate the effects of knowledge
about gender discrimination on girls' attitudes toward and interest in science-related
occupations. The data in Exercise 17 can be analyzed using the cell means model
approach.
a.Use H′ matrices to write the null hypotheses for the blocks, omnibus treatment,
treatment A, and treatment B(A).
b.Write the C′ matrices for the blocks, omnibus treatment, treatment A, treatment B(A),
residual, A × BL, and B(A) × BL.
c. Analyze the data using the cell means model approach. Compute F tests and p values
for the blocks and treatments A and B(A).
d. Compute restricted cell means for treatments A and B(A).
e.Compute restricted treatment means for treatments A and B(A).
19.
[11.8] Exercise 18 describes an experiment to investigate the effects of knowledge about
gender discrimination on girls' attitudes toward and interest in science-related occupations.
Suppose that one of the girls did not complete the Egalitarian Attitudes Toward Science
Scale. Observation Y124 is missing. Use the cell means model approach to analyze the
remaining data.
a.Use H′ matrices to write the null hypotheses for the blocks, omnibus treatment,
treatment A, and treatment B(A).
b.Write the unmodified C′ matrices for the blocks, omnibus treatment, treatment A,
treatment B(A), residual, A × BL, and B(A) × BL where the missing observation is not
deleted. For each matrix, indicate the column and row that must be deleted. Write the
modified C′ matrices with observation Y124 deleted.
c. Analyze the data using the cell means model approach. Compute F tests and p values
for blocks and treatments A and B(A).
d.Compute restricted cell means for treatments A and B(A).
e.Compute restricted treatment means for treatments A and B(A).
2The merits of experiments, quasi-experiments, and other research strategies are discussed in
Chapter 1.
3Guo, Billard, and Luh (2011) describe four statistics to test hypotheses for an unbalanced
CRH-pq(A) design with heterogeneous variances.
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
http://dx.doi.org/10.4135/9781483384733.n11
A split-plot factorial design is one of the most widely used designs in the behavioral sciences.
As you will see, it contains features of two building block designs: a completely randomized
design and a randomized block design. A key feature of the randomized block design
described in Chapter 8 is the blocking procedure that enables a researcher to isolate the
effects of a nuisance variable. The procedure consists of forming blocks of subjects so that the
subjects in a block are more homogeneous with respect to the nuisance variable than are
subjects in different blocks. Each block represents a level of the nuisance variable. A two-
treatment split-plot factorial design extends this procedure to p groups of blocks.
An examination of Figure 12.1-1(a) should help to clarify the distinctive character of a split-plot
factorial design. A block can contain q homogeneous subjects or one subject who is observed q
times. Suppose that repeated measures are obtained for each subject. According to Figure
12.1-1(a), the subjects in Group1 (blocks 1 through n1) receive treatment level a1 in
combination with all levels of treatment B; subjects in Group2 (blocks 1 through n2) receive
treatment level a2 in combination with all levels of treatment B. By contrast, in the RBF-23
design, Figure 12.1-1(b), each subject receives all combinations of the two treatments. The
block size of the randomized block factorial design is 6, whereas that for the split-plot factorial
design is only 3. The smaller block size of the split-plot factorial design makes it especially
attractive to researchers.
Figure 12.1-1 ▪ Comparison of two factorial designs. (a) The split-plot factorial design has
p = 2 levels of treatment A and q = 3 levels of treatment B; n blocks are randomly
assigned to each level of treatment A. The subjects in a block receive one level of
treatment A and all levels of treatment B. The number of observations in a block is q = 3.
(b) In the randomized block factorial design, the subjects in a block receive all levels
(combinations) of treatments A and B. The number of observations in a block is p × q =
(2)(3) = 6.
Split-plot factorial designs were originally used in agricultural research where the term plot
refers to a plot of land that is subdivided or split into subplots. We can think of the levels of
treatment A in Figure 12.1-1(a) as two plots that are subdivided into subplots. The three
subplots correspond to the levels of treatment B. The difference between levels a1 and a2 also
involves the difference between the two groups of blocks. Hence, the effects of treatment A are
called between-blocks or between-groups effects. Consider now treatment B. Differences
between any two levels of treatment B do not involve the difference between the blocks
because each level of B appears in each block. Hence, the effects of treatment B as well as
those for the AB interaction are called within-blocks or within-groups effects. The split-plot
factorial design is the first design that I have described that contains a mixture of between- and
within-blocks effects.
Consider treatment A again. It is apparent that the effects attributable to a1 and a2 are
indistinguishable from those due to the two groups of blocks. Treatment A and the groups of
blocks are said to be completely confounded. The effects of treatment B and the AB interaction
are free of such confounding. Group-treatment confounding does not affect the interpretability
of treatment effects, although, as we will see, it does affect precision and hence power. Group-
treatment confounding occurs in several experimental designs, including the completely
randomized design described in Chapter 4. There, the groups represent p samples of n
subjects per group. Two more types of confounding—group-interaction confounding and
treatment-interaction confounding—are described in Chapters 15 and 16, respectively. The type
of confounding in a design is one of the factors used in classifying the design. Group-treatment
confounding is characteristic of all split-plot factorial designs.
The advantage of a split-plot factorial design over a randomized block factorial design—smaller
block size—involves a trade-off that needs to be made explicit. In a split-plot factorial design,
the effects of the confounded treatment are estimated with less precision than is the same
treatment in a randomized block factorial design. The precision of an estimator is generally
measured by the standard error of its sampling distribution: The smaller the standard error, the
greater the precision. A randomized block factorial design in which the treatments represent
fixed effects uses the same error term for testing hypotheses for treatments A and B and the
AB interaction. The two-treatment split-plot factorial design, however, uses two error terms: one
for testing treatment A and a different and usually much smaller error term for testing treatment
B and the AB interaction. As a result, the power of the tests for B and AB is greater than that
for A. Hence, a split-plot factorial design is a poor design choice if a researcher's primary
interests involve treatments A and B. When one's interests center on treatments A and B, a
randomized block factorial design is the best choice if the larger block size is acceptable. An
alternative choice is the confounded factorial design described in Chapter 15. This design
achieves a reduction in block size by confounding an interaction with groups of blocks. As a
result, treatments A and B are evaluated with equal power, but the test of the confounded
interaction is less powerful. To summarize, a split-plot factorial design is a good design choice if
one's primary interests involve treatment B and the AB interaction but a poor choice if one's
primary interests involve A and B.
I denote a two-treatment split-plot factorial design by the letters SPF-p.q. The lowercase letter
preceding the dot indicates the number of levels of the between-blocks treatment (the
confounded treatment); the letter after the dot indicates the number of levels of the within-
blocks treatment (the unconfounded treatment). The design in Figure 12.1-1(a), for example, is
an SPF-2.3 design. The designation for a design with one between-blocks treatment, A with p
levels, and two within-blocks treatments, B and C with q and r levels, respectively, is SPF-p.qr.
A split-plot factorial design is appropriate for experiments that meet, in addition to the
assumptions of the experimental design model, the following conditions:
1.There are two or more treatments, where each treatment has two or more levels.
2.The number of treatment combinations is greater than the desired size of each block.
3.If repeated measurements on the subjects (experimental units) are obtained for treatment
B, each block contains one subject who is observed q times. If repeated measurements are
not obtained, each block contains q homogeneous subjects.
4.For the repeated measurements case, n blocks (subjects) are randomly assigned to the p
levels of treatment A. Then, the order of the presentation of the q levels of treatment B is
randomized independently for each block. Hence, there are two distinct randomizations.
Exceptions to these procedures are necessary if treatment A is an organismic variable such
as gender and if the nature of treatment B precludes randomization of the presentation
order.
5.For the nonrepeated measurements case, n blocks, each containing q homogeneous
subjects, are randomly assigned to each level of treatment A. The q subjects within each
block are then randomly assigned to the levels of treatment B.
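The two distinct randomizations described in condition 4 can be sketched as follows; the group sizes and the seed are illustrative only:

```python
import random

# Repeated measurements case: p = 2 levels of A, q = 4 levels of B,
# n = 4 blocks (subjects) per level of A. All values here are illustrative.
random.seed(1)
p, q, n = 2, 4, 4

# First randomization: np blocks (subjects) are randomly assigned
# to the p levels of treatment A, n blocks per level.
subjects = list(range(p * n))
random.shuffle(subjects)
groups = {j: subjects[j * n:(j + 1) * n] for j in range(p)}

# Second randomization: the presentation order of the q levels of
# treatment B is randomized independently for each block.
orders = {s: random.sample(range(q), q) for s in subjects}
```

For the nonrepeated measurements case (condition 5), the second step would instead randomly assign the q subjects within each block to the levels of treatment B.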
where μijk is the population mean in block i, treatment level aj, and treatment level bk. The
researcher's primary interest is in determining whether the two treatments interact. The level of
significance adopted for all tests is .05.
To test the hypotheses, a sample of eight subjects was obtained. The subjects were randomly
assigned to the two levels of treatment A, with the restriction that four subjects were assigned
to each level. The subjects were then observed under all four levels of treatment B. The
dependent variable is response latency to the auditory and visual signals. The response
latency scores were subjected to a logarithmic transformation, as described in Section 3.6. The
data that were analyzed are the means of transformed latency scores for each successive hour
during the 4-hour monitoring session.
The layout of the SPF-2.4 design and the computational procedures are shown in Table 12.2-1.
The procedures are appropriate if there are no missing observations and if the numbers of
blocks in each level of treatment A are the same. If either of these conditions is not met, the
restricted cell means model in Section 12.14 should be used. The analysis of variance is
summarized in Table 12.2-2. According to this analysis, the null hypotheses for treatment B and
the AB interaction are rejected. A graph of the AB interaction is shown in Figure 12.2-1 on page
548. The graph shows that the response latency is shorter for visual signals than for auditory
signals during the first 3 hours of monitoring but longer during the fourth hour. The best type of
signal depends on the particular time period.
Procedures for estimating strength of association, effect size, power, and sample size have
been described at length in previous chapters. These procedures can be applied to the
vigilance data in Table 12.2-1. Partial omega squareds for these data are as follows:
where
Alternatively, partial omega squareds can be computed from F statistics. The formulas are
If either treatment is a random effect, the appropriate measure of strength of association is the
partial intraclass correlation. Interactions that involve a random effect also are considered
random effects. If treatment B is a random effect, for example, the partial intraclass correlation
formulas are
The expected values of the mean squares in Table 12.3-1 provide the basis for computing
.
A sample estimate of effect size can be computed from omega squared. For example, the
formula for computing for treatment A is
As discussed in Sections 4.6, 8.2, and 9.8, these formulas also can be used to estimate sample
size if a researcher is able to specify the (1) number of levels of the treatments, p and q, and
the number of interaction components, p × q; (2) level of significance, α; (3) power, 1 – β; (4)
size of the population error variances; and (5) sums of the squared population
treatment effects, Σjα²j and Σkβ²k. If a researcher isn't able to estimate these variances and
effects from a pilot study or previous research, sample sizes for treatments A and B can be
determined by specifying an effect size, f, and the population correlation coefficient, ρ. ρ is the
mean correlation between the levels of treatment B.
Suppose that a researcher is interested in estimating sample size for the following conditions: p
= 2, q = 4, α = .05, 1 – β = .80, f = .40, and ρ = .30. The required number of blocks, n, in each
level of treatment A can be determined from Tang's charts in Appendix Table E.12 for v1 = p − 1
= 1 and v2 = p(n′ − 1) degrees of freedom, where n′ denotes a trial value of n. Treatment
A requires a larger number of blocks to achieve a power of .80 than does treatment B.
where
(αβ)jk is the joint effect of treatment levels j and k and is subject to the restrictions
(βπ)ki(j) is the joint effect of treatment level k and block i; (βπ)ki(j) is NID(0, σ²βπ) and
independent of πi(j).
εijk is the error effect that is NID(0, σ²ε) and independent of πi(j). In this design, εijk
cannot be estimated separately from (βπ)ki(j).
The values of the parameters in model (12.3-1) are unknown, but they can be estimated from
sample data as follows:
The B × Block w. A effect is used to estimate εijk under the assumption that the blocks do not
interact with treatment B; more is said about this later. The partition of the total sum of squares
is obtained by the now-familiar procedure of rearranging the terms in equation (12.3-2),
summing over all of the observations, and squaring both sides of the equation (see Sections
3.2 and 8.1 for the details).
The expected values for the mixed model in which treatments A and B represent fixed effects
are shown in Table 12.2-2. Before describing the expected values for other models, I examine
the nature of the two error terms in the SPF-p.q design.
Unlike the designs described previously, an SPF-p.q design uses two error terms: MSBL(A) for
testing the between-blocks term and MSB × BL(A) for testing the within-blocks terms. MSBL(A)
is read “mean square blocks within A;” MSB × BL(A) is read “mean square B times blocks
within A.” I show in the following paragraphs that MSBL(A) corresponds to MSWG in a
completely randomized design and that MSB × BL(A) corresponds to MSRESpooled, where the
residual mean squares for p randomized block designs are pooled. I consider MSBL(A) first.
The data in Table 12.2-1 can be laid out as a CR-2 design if we ignore treatment B. This layout
is shown in the following AS summary table and is appropriate because in the vigilance
experiment, the eight subjects were randomly assigned to the two levels of treatment A.
AS Summary Table
a1 a2
s1 = 21 s1 = 18
s2 = 27 s2 = 21
s3 = 23 s3 = 20
s4 = 20 s4 = 22
The between- and within-blocks mean squares for the CR-2 design are computed as follows.
The numbers in the AS summary table are based on the sum of four observations, for example,
s1 = 3 + 4 + 7 + 7 = 21. If we divide MSBG and MSWG by 4, we find that the resulting values
are equal to MSA and MSBL(A) in Table 12.2-2:
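This correspondence can be checked numerically from the block totals in the AS summary table above (each total is the sum of q = 4 observations):

```python
# Block totals from the AS summary table: s1..s4 at a1 and at a2.
a1 = [21, 27, 23, 20]
a2 = [18, 21, 20, 22]
p, n, q = 2, 4, 4

grand = sum(a1 + a2) / (p * n)
means = [sum(a1) / n, sum(a2) / n]

# Between- and within-groups mean squares for the CR-2 layout of the totals.
ss_bg = n * sum((m - grand) ** 2 for m in means)
ss_wg = (sum((y - means[0]) ** 2 for y in a1)
         + sum((y - means[1]) ** 2 for y in a2))
ms_bg = ss_bg / (p - 1)        # 12.5
ms_wg = ss_wg / (p * (n - 1))  # 6.25

# Each total is a sum of q observations, so dividing by q puts the mean
# squares on the per-observation scale: MSA and MSBL(A).
ms_a, ms_bl_a = ms_bg / q, ms_wg / q   # 3.125 and 1.5625
```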
Consider now the data at each level of aj in Table 12.2-1. These data can be laid out in two RB-
4 designs, as shown in the following BS summary tables:
This layout is appropriate because in the vigilance experiment, the blocks were randomly
assigned to the levels of treatment A, and the subjects were each observed four times. The
residual mean square for each RB-4 design can be computed as follows:
which is equal to MSB × BL(A) in the split-plot factorial design. Thus, MSB × BL(A) is the
pooled interaction of treatment B and the blocks that are nested within each level of A.
You learned in Section 8.4 that when MSRES is used to estimate the error variance in a
randomized block design, the necessary and sufficient condition for the F statistic to be
distributed as the F distribution, given that the null hypothesis is true, is the sphericity
condition. In Section 12.4, I show how this applies to a split-plot factorial design.
The expected values of the mean squares for models I, II, and III can be obtained from Table
12.3-1. The terms 1 – (n/N), 1 – (p/P), and 1 – (q/Q) in the table become zero if blocks,
treatment A, and treatment B represent fixed effects and 1 if the corresponding terms represent
random effects. If, for example, the effects for blocks and treatment B represent random effects
but those for treatment A represent fixed effects, the expected value of MSA is
The term σ²α in the table has been replaced by Σjα²j/(p − 1) because treatment A is a fixed
effect. It is apparent from Table 12.3-1 that the between-blocks error term MSBL(A) does not
include the required terms for testing MSA. An error term for testing MSA can be pieced
together using the procedure described in Section 9.10. The quasi F′ statistic for testing MSA is
given by
The degrees of freedom for the numerator is p − 1; the degrees of freedom for the denominator
is given by
A two-treatment split-plot factorial design involves two sets of assumptions: one set for the
between-blocks test and another set for the within-blocks tests. The assumptions for the
between-blocks test can be dispensed with in short order. As you saw in Section 12.3, the
statistic F = MSA/MSBL(A) in a split-plot factorial design is analogous to F = MSBG/MSWG in a
completely randomized design. It follows that the assumptions for the completely randomized
design in Section 3.3 apply to the between-blocks test in the split-plot factorial design. These
assumptions are discussed in detail in Section 3.5 and are not repeated here.
The assumptions for the within-blocks tests are similar to those described in Sections 8.4 and
10.7. The reader may want to review these sections. The key within-blocks assumption for
treatment B is multisample sphericity (Huynh, 1978). The assumption states that
where
This assumption contains two assertions: C′ΣjC is the same at each level of treatment A,
and each C′ΣjC satisfies the sphericity condition.
For the vigilance experiment described in Section 12.2, the population covariance matrices at
a1 and a2 can be represented as follows:
Ordinarily these population matrices are unknown, but they can be estimated from the data in
Table 12.2-1. The estimators are denoted by and . Formulas for computing the sample
variances, , and covariances, , for the jth level of treatment A are, respectively,
The sample covariance matrices for the data in Table 12.2-1 are
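The computation of a sample covariance matrix at one level of treatment A can be sketched as follows. The scores below are hypothetical (the Table 12.2-1 data are not reproduced here); rows are blocks and columns are the q levels of treatment B:

```python
import numpy as np

# Hypothetical n x q score matrix for one level of treatment A:
# rows = blocks (subjects), columns = levels of B.
scores_a1 = np.array([[2., 4., 6., 7.],
                      [5., 5., 8., 9.],
                      [3., 4., 6., 8.],
                      [4., 3., 7., 9.]])

# Columns are the variables, so set rowvar=False; ddof=1 gives the
# usual n - 1 divisor for sample variances and covariances.
S1 = np.cov(scores_a1, rowvar=False, ddof=1)   # a q x q matrix
```

The diagonal of `S1` holds the sample variances for each level of B; the off-diagonal entries hold the covariances between pairs of B levels. Repeating the computation at each level of A yields the matrices Σ̂1, Σ̂2, and so on.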
I need to specify one more matrix to compute the test statistic. It is the orthonormal coefficient matrix
for testing the null hypothesis that all treatment B means are equal.2 This matrix is
Given these sample matrices, the question arises as to the tenability of the multisample
sphericity assumption.
A variety of approaches have been used to assess the tenability of one or both parts of the
multisample sphericity assumption; see, for example, G. E. P. Box (1950), P. Harris (1984),
Huynh and Feldt (1970), Huynh and Mandeville (1979), and Mendoza (1980). The approach
that is adopted here is due to Harris (1984).3 The statistic for testing the multisample sphericity
assumption for treatment B is
where
The critical value of W, denoted by Wα; p, q−1, n, is given in Appendix Table E.19. For the data
in Table 12.2-1, the W statistic is
where
The critical value from Appendix Table E.19 is W.25; 2, 3, 4 = 11.878. Because W = 11.84 is less
than this critical value, the multisample sphericity assumption is tenable.
We saw in Section 8.4 that if the sphericity assumption is not tenable, the F test is positively
biased. Fortunately, the true distribution of the F statistic for an arbitrary covariance matrix
can be approximated by an F distribution with reduced degrees of freedom, ε(q − 1) and
εp(n − 1)(q − 1), where ε is a number that depends on the degree of departure of the population
variance-covariance matrix from the required form. The computation of two estimates of ε is illustrated in
Table 12.4-1.4
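One widely used estimate of ε is Box's ε̂ (the Greenhouse-Geisser estimate), computed here from a sample covariance matrix; the matrix used to exercise the function is illustrative, not the Table 12.4-1 data:

```python
import numpy as np

def box_epsilon(S):
    """Box's epsilon-hat for a q x q sample covariance matrix S.

    epsilon = 1 when sphericity holds; its lower bound is 1/(q - 1).
    """
    q = S.shape[0]
    mean_diag = np.trace(S) / q       # mean of the diagonal entries
    mean_all = S.mean()               # grand mean of all entries
    row_means = S.mean(axis=1)
    num = q ** 2 * (mean_diag - mean_all) ** 2
    den = (q - 1) * (np.sum(S ** 2)
                     - 2 * q * np.sum(row_means ** 2)
                     + q ** 2 * mean_all ** 2)
    return num / den

# A compound symmetric matrix satisfies sphericity, so epsilon-hat = 1.
S_cs = np.full((4, 4), 0.3) + 0.7 * np.eye(4)
```

In the three-step testing strategy, ε̂ multiplies the numerator and denominator degrees of freedom before the critical F value is obtained.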
If the multisample sphericity assumption had not been tenable, the three-step testing strategy
described in Section 8.4 for a randomized block design could have been used. The procedures
for treatment B and the AB interaction are the same except for the degrees of freedom. The
procedure is illustrated using treatment B; a fuller discussion is given in Section 8.4.
In the following discussion, assume that treatments A and B represent fixed effects and blocks
represent random effects. Two statistics for testing hypotheses about contrasts for treatment A
with p levels,
are
Two statistics for testing hypotheses about contrasts for treatment B with q levels,
are
can be obtained by replacing MSB × BL(A) in the denominator of the t and q statistics with a
mean square error term appropriate for the particular contrast. I denote the error term for the ith
contrast by ; i t h a s p(n − 1 ) d e g r e e s o f freedom. The computation of
for contrast ψ1(B) = μ..1 − μ..2 is illustrated in Table 12.5-1. The data are from
the vigilance experiment described in Section 12.2. According to Table 12.5-1, the value of
. The Fisher-Hayter statistic can be used to test the null hypothesis
ψ 1(B) = 0. This hypothesis cannot be rejected, as the following computations show:
The use of mean square error terms appropriate for specific contrasts was introduced in
Section 8.5. The advantage of using a contrast-specific error term instead of MSB × BL(A) is
that the resulting test is exact. If the multisample sphericity assumption is tenable, the use of
MSB × BL(A) is appropriate. The assumption is required whenever a within-blocks error term is
pooled over the p levels of treatment A.
Two statistics for testing hypotheses about simple-effects contrasts for treatment A at bk,
are
where
The use of MSWCELL in the denominator of these statistics is discussed later. We know from
the formula for MSWCELL just given that it is composed of between-blocks and within-blocks
variation. We also know that in a split-plot factorial design, the corresponding population error
variances are generally not homogeneous. Cochran and Cox (1957, pp. 100, 298) point out that
under these conditions, test statistics such as Student's t, Tukey's qT, and Fisher-Hayter's qFH
do not follow their distributions except in special cases. They propose a conservative test that
can be used whenever error terms estimating different sources of variability are pooled. The
critical value of the test statistic is equal to
where h denotes the critical value of the test statistic of interest, and h1 and h2 denote the
tabled values of the test statistic at the α level of significance for the df associated with MSBL(A)
and MSB × BL(A), respectively; for example, Tukey's statistic can be used to test ψi(A) at bk = 0.
The value of h will always be between those for h1 and h2 except when the degrees of
freedom for the two error terms are equal; then, h = h1 = h2. See Taylor (1950) for an extensive
discussion of standard error formulas and approximate degrees of freedom.
Two statistics for testing hypotheses about simple-effects contrasts for treatment B at aj,
are
In the following section, I show why MSWCELL is used in testing ψi(A) at bk = 0 but MSB ×
BL(A) is used in testing ψi(B) at aj = 0.
Rule Governing Choice of Error Term for Simple-Effects Means and Simple Main Effects
An easy-to-follow rule governs the choice of error terms for testing hypotheses about simple-
effects contrasts and simple main effects (see Section 12.6) when the design contains
between- and within-blocks effects. Recall from Section 9.6 that
According to Table 12.2-2, when A and B represent fixed effects and blocks represent random
effects, the error terms for testing treatment and interaction null hypotheses are
The rule governing the choice of error terms states that if the treatment(s) and interaction(s)
that equal the sum of simple main effects have different error terms, as in the case of treatment
A at bk, then the error terms should be pooled in testing null hypotheses for the associated
simple-effects contrasts and simple main effects. Thus, because A and AB have different error
terms, the error term for testing ψi(A) at bk = 0 is
It turns out that this pooled error term is actually MSWCELL, which, for purposes of testing
treatments A and B, was partitioned into between- and within-blocks components.
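The pooling rule can be sketched numerically. The SS values below are illustrative; the degrees of freedom follow an SPF-2.4 design, where dfBL(A) = p(n − 1) = 6 and dfB×BL(A) = p(n − 1)(q − 1) = 18:

```python
# Illustrative sums of squares for the two error terms (hypothetical values).
ss_bl_a, df_bl_a = 6.25, 6            # SSBL(A), between-blocks error
ss_b_bl_a, df_b_bl_a = 10.75, 18      # SSB x BL(A), within-blocks error

# MSWCELL pools the two error terms: sum the SS, sum the df, then divide.
ms_wcell = (ss_bl_a + ss_b_bl_a) / (df_bl_a + df_b_bl_a)
```

Pooling at the sum-of-squares level, rather than averaging the two mean squares, weights each error term by its degrees of freedom.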
12.6 Procedures for Testing Hypotheses About Simple Main Effects and Treatment-Contrast Interactions
The following simple main-effects hypotheses can be tested for an SPF-p.q design.
For a mixed model in which treatments A and B represent fixed effects and blocks represent
random effects, the statistics for testing these hypotheses are, respectively,
where
The rationale for using MSWCELL in testing αj at bk = 0 for all j was discussed in Section 12.5.
Treatment-Contrast Interactions
If the AB interaction is significant, considerable insight into the sources of nonadditivity in the
data can be obtained by testing the following treatment-contrast interaction hypotheses:
where
and δ is a constant for a given hypothesis. An examination of the sample data usually suggests
a number of interesting ψi(B)s and ψi(A)s. Statistics for testing the preceding hypotheses are
If the sphericity assumption is not tenable, MSB × BL(A) in the F statistics should be replaced
by error terms appropriate for the specific treatment-contrast interactions—for example:
where SSB × BL(a1, a2,…, ap) is computed for the levels of treatment A that are involved in the
contrast, and p′ denotes the number of levels of treatment A involved in the contrast. The
computation of and is illustrated in Table 12.5-1; the computation of
I conclude this section with a summary comment about the choice of error terms. In Section
9.6, you learned that treatment-contrast interactions, unlike simple main effects, represent a
partition of an omnibus interaction. The error mean square that is used in testing the omnibus
interaction also should be used in testing the various treatment-contrast interactions if the
sphericity assumption is tenable. If the assumption is not tenable, the error mean square that is
used to test the omnibus interaction should be partitioned into mean squares appropriate for
the specific treatment-contrast interactions—that is, into either or MSB×BL(a1,
a2,…, ap).
Earlier, you learned that a split-plot factorial design is one alternative to a randomized block
factorial design when the block size must be kept small. A numerical index of the relative
efficiency of the two designs, disregarding differences in degrees of freedom, is given by the
following formulas (Federer, 1955, p. 274). The data in this example are from Table 12.2-2.
In this example, a test of treatment A is less than half as efficient in the split-plot factorial
design as it is in the randomized block factorial design. On the other hand, the tests of B and
AB are more efficient in the split-plot factorial design. The average standard error of a
difference is the same for the two designs. The increased precision in testing the B and AB
effects is obtained at the expense of the test of the A effects.
To get a better feeling for the relative precision of the between- and within-blocks estimators in
an SPF-p.q design, it helps to examine the expected values of the two error terms. It can be
shown for the mixed model, where treatments A and B represent fixed effects and blocks
represent random effects, that E(MSBL(A)) = σ²[1 + (q − 1)ρ] and E(MSB × BL(A)) = σ²(1 − ρ),
where σ² is the population variance and ρ is the population correlation.6 The larger ρ is, the
smaller is the within-blocks error relative to the between-blocks error. Also, MSB × BL(A) is
based on p(n − 1)(q − 1) degrees of freedom, whereas MSBL(A) is based on only p(n − 1)
degrees of freedom.
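A quick numerical illustration of this trade-off, using the standard mixed-model expressions E(MSBL(A)) = σ²[1 + (q − 1)ρ] and E(MSB × BL(A)) = σ²(1 − ρ) (an assumption consistent with the surrounding discussion):

```python
sigma2, q = 1.0, 4   # illustrative population variance and number of B levels

def error_terms(rho):
    """Expected between- and within-blocks error terms for correlation rho."""
    between = sigma2 * (1 + (q - 1) * rho)   # E(MSBL(A))
    within = sigma2 * (1 - rho)              # E(MSB x BL(A))
    return between, within
```

At ρ = 0 the two expectations coincide; as ρ grows, the within-blocks error shrinks while the between-blocks error grows, which is why tests of B and AB gain power relative to the test of A.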
From the foregoing, it is apparent that a split-plot factorial design is an excellent design choice
when a researcher's primary interests involve treatment B and the AB interaction. However, a
design that provides adequate power for testing B and AB may not provide adequate power for
testing A. Of course, the power of the A test can always be increased, if necessary, by securing
additional blocks.
The general analysis procedures for a two-treatment factorial design can be extended to
designs that have three or more treatments. Numerical examples for the more frequently used
split-plot factorial designs and rules for expanding the design to any combination of between-
and within-blocks treatments are provided in the following sections.
The design described here and diagrammed in Figure 12.8-1 is an SPF-22.4 design. It has two
between-blocks treatments, A and C, and one within-blocks treatment, B. For the repeated
measurements case, npr blocks (subjects) are randomly assigned to the ajcl treatment
combinations, with the restriction that n blocks are assigned to each combination. The
sequence of administration of the levels of treatment B is randomized independently for each
block. For the nonrepeated measurements case, npr blocks, each containing q matched
subjects are randomly assigned to the ajcl combinations. Following this, the q subjects within
each block are randomly assigned to the levels of treatment B. The experimental design model
equation is
Figure 12.8-1 ▪ SPF-22.4 design; the subjects in a block receive one combination of
treatments A and C and all levels of treatment B.
The layout and computational procedures for an SPF-22.4 design are given in Table 12.8-1.
The results of the analysis are summarized in Table 12.8-2.
The between-blocks mean square error term MSBL(AC) is composed of variation pooled over
the pr combinations of treatments A and C. For the data in Table 12.8-1:
For example,
For the between-blocks F statistics to be distributed as the F distribution when the null
hypothesis is true, MSBL(a1c1), MSBL(a1c2), and so on must estimate the same population
variance. This homogeneity of variance assumption can be tested by
The within-blocks mean square error term MSB × BL(AC) is also a pooled term. For each of the
pr combinations of treatments A and C, we can compute a q × q covariance matrix . The key
assumption associated with testing the within-blocks null hypotheses is the multisample
sphericity assumption
that was introduced in Section 12.4. The W statistic (P. Harris, 1984) can be used to test the
tenability of the assumption. The formula and critical value for W are, respectively,
and Wα; q−1, n. If the tenability of the multisample sphericity condition is in doubt, the three-
step testing strategy described in Section 12.4-1 can be used. The formula for computing is
the same as that in Table 12.4-1 for an SPF-p.q design; the formula for
Procedures for testing hypotheses about means were described in detail in Chapter 5 and
Sections 9.5 and 12.5. These procedures generalize to an SPF-pr.q design. The computational
formulas are given in Table 12.8-3.
Table 12.8-3 ▪ Procedures for Testing Hypotheses About Means for SPF-pr.q Design
The use of MSB × BL(AC) as the error term in a test statistic is appropriate if the sphericity
assumption is tenable. If it is not tenable, an error term appropriate for the specific contrast
should be used—that is, MSψ i(B) × BL(AC). The computation of such error terms is discussed
in Section 12.5. When the error mean square in the denominator of a test statistic is composed
of pooled mean squares, the critical value of the test statistic is approximated by as
discussed in Section 12.5.
The expected values of mean squares for models I, II, and III can be determined from Table
12.8-4 on page 579. The terms 1 – (n/N), 1 – (p/P), 1 – (q/Q), and 1 – (r/R) become zero if the
corresponding terms BL, A, B, and C represent fixed effects and become 1 if the corresponding
terms represent random effects. Also, the variances should be replaced by terms of the form
, and if the corresponding treatments represent fixed
effects.
Figure 12.9-1 ▪ SPF-222.3 design; the subjects in a block receive one combination of
treatments A, C, and D and all levels of treatment B.
An outline of the computational procedures appears in Table 12.9-1. The F statistics are
appropriate for a mixed model in which treatments A, B, C, and D represent fixed effects and
blocks represent random effects.
Pattern Underlying Computational Formulas for the Between- and Within-Blocks Error Mean Squares
In Section 10.3, I examined the pattern underlying the four kinds of formulas used in analysis of
variance. Here I illustrate the patterns for the between- and within-blocks error sums of squares
and the degrees of freedom for a split-plot factorial design that has any number of between-
blocks treatments. The patterns are as follows:
The between-blocks mean square error term MSBL(ACD) is composed of variation pooled over
the prt combinations of treatments A, C, and D. We assume that MSBL(a1c1d1),
MSBL(a1c1d2), …, MSBL(apcrdt) estimate the same population variance. This homogeneity of
variance assumption can be tested by means of the Fmax statistic or one of the other statistics
described in Section 3.5.
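Hartley's Fmax statistic, invoked here and again for the other pooled between-blocks error terms, is simply the largest of the group variances divided by the smallest. A minimal sketch (the variances are hypothetical):

```python
def f_max(variances):
    # Hartley's Fmax: ratio of the largest to the smallest variance among
    # the groups whose mean squares are being pooled into the error term.
    return max(variances) / min(variances)

# Hypothetical block variances for the prt treatment combinations;
# compare the statistic to the tabled Fmax critical value.
stat = f_max([2.0, 3.5, 8.0])
```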
The within-blocks mean square error term MSB × BL(ACD) is also a pooled term. For each of
the prt combinations of treatments A, C, and D, we can compute a q × q covariance matrix
. The key assumption associated with testing the within-blocks null hypotheses is the
multisample sphericity assumption:
that was introduced in Section 12.4. The W statistic (P. Harris, 1984) can be used to test the
tenability of the assumption. The formula and critical value for W are, respectively,
and Wα; q−1, n.
A split-plot factorial design can be used in research situations that involve two or more within-
blocks treatments. One variation of this design, an SPF-p.qr design, has one between-blocks
treatment and two within-blocks treatments. An example of this design is diagrammed in Figure
12.10-1. For the repeated measurements case, np blocks (subjects) are randomly assigned to
the levels of treatment A, with the restriction that n blocks are assigned to each level of A. The
Figure 12.10-1 ▪ SPF-2.22 design; the subjects in a block receive one level of treatment A
and all combinations of treatments B and C.
The computational procedures for an SPF-p.qr design are illustrated in Table 12.10-1. The
analysis is summarized in Table 12.10-2.
The between-blocks mean square error term MSBL(A) is composed of variation pooled over the
p levels of treatment A. We assume that MSBL(a1), MSBL(a2),…, MSBL(ap) all estimate the
same population variance. This homogeneity of variance assumption can be tested by means of
the Fmax statistic described in Section 3.5.
Four within-blocks mean square error terms can be computed for an SPF-p.qr design: MSB ×
BL(A), MSC × BL(A), MSB × C × BL(A), and MSBC × BL(A). Mendoza, Toothaker, and Crain
(1976) have examined the necessary and sufficient conditions for the within-blocks F statistics
to be distributed as F, given that the associated null hypotheses are true. The key assumption
associated with testing the within-blocks null hypotheses is multisample sphericity.
For an SPF-2.22 design, the qr × qr population covariance matrix, , has the form
where, for example, is the population variance for treatment combination b1c1 and Σ12 is the
population covariance for treatment combination b1c1 and b1c2. In this example, and are
one-by-four vectors. When C*′ has one row, the multisample sphericity condition is
automatically satisfied. If treatment B, for example, had more than two levels, the tenability of
the local sphericity condition could be determined by using the W statistic (P. Harris, 1984). The
formula and critical value for W are, respectively,
and Wα; q−1, n.
where C*′ is an orthonormal coefficient matrix for the within-blocks omnibus null hypothesis:
If the omnibus multisample sphericity assumption is tenable, a fourth mean square error term,
with p(n − 1)(qr − 1) degrees of freedom can be used in testing all within-blocks null
hypotheses. The resulting F tests are more powerful than tests that assume only local
multisample sphericity. However, omnibus multisample sphericity is a much more restrictive
assumption. Consequently, the usual practice is to partition MSBC × BL(A) into three within-
blocks mean square error terms as follows; the data are from Table 12.10-2:
If one or more of the local multisample sphericity assumptions are not tenable, the three-step
testing strategy described in Section 12.6 can be used. The formulas for computing and that
are given in Table 12.4-1 can be used for an SPF-p.qr design by replacing q with qr.
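The degrees-of-freedom bookkeeping for partitioning MSBC × BL(A) can be checked directly: its p(n − 1)(qr − 1) degrees of freedom split into p(n − 1)(q − 1) for MSB × BL(A), p(n − 1)(r − 1) for MSC × BL(A), and p(n − 1)(q − 1)(r − 1) for MSB × C × BL(A). A quick check with hypothetical design sizes:

```python
# Hypothetical sizes for an SPF-p.qr layout
p, n, q, r = 2, 4, 3, 2

omnibus_df = p * (n - 1) * (q * r - 1)          # df for MSBC x BL(A)
partition = [p * (n - 1) * (q - 1),             # df for MSB x BL(A)
             p * (n - 1) * (r - 1),             # df for MSC x BL(A)
             p * (n - 1) * (q - 1) * (r - 1)]   # df for MSB x C x BL(A)

assert omnibus_df == sum(partition)  # 30 == 12 + 6 + 12
```

The identity holds in general because (qr − 1) = (q − 1) + (r − 1) + (q − 1)(r − 1).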
Procedures for testing hypotheses about means are similar to those described in Section 12.8.
The formulas for testing hypotheses about main-effects means are as follows:
The expected values of mean squares for models I, II, and III can be determined from Table
12.10-3. The terms 1 – (n/N), 1 – (p/P), 1 – (q/Q), and 1 – (r/R) become zero if the
corresponding effects represent fixed effects and become 1 if they represent random effects.
Also, the variances should be replaced by terms of the form , , and
, if the corresponding treatments represent fixed effects.
Figure 12.11-1 on page 592 shows a diagram of an SPF-2.222 design. The experimental design
model equation is
Figure 12.11-1 ▪ SPF-2.222 design; the subjects in a block receive one level of treatment A
and all combinations of treatments B, C, and D.
An outline of the computational procedures appears in Table 12.11-1. The F statistics are
appropriate for a mixed model in which treatments A, B, C, and D represent fixed effects and
blocks represent random effects.
By now, the reader should have gained insight into the patterns underlying the analysis of all
split-plot factorial designs. A diagram for one more design, an SPF-pr.qt design, is shown in
Figure 12.12-1. The computational formulas for this design are given in Table 12.12-1. The
experimental design model equation is
Figure 12.12-1 ▪ SPF-22.22 design; the subjects in a block receive one combination of
treatments A and C and all combinations of treatments B and D.
When repeated measurements are obtained, the procedure recommended in Section 12.1 is to
randomize the order of administration of the levels independently for each subject. If the
administration of one treatment level affects a subject's performance on subsequent levels, the
experiment is said to contain carryover effects. That portion of carryover effects attributable to
the treatment levels being administered in a particular order is referred to as sequence effects.
Randomizing the order of administration of the levels independently for each subject is an
effective way to control sequence effects.
An alternative way to control sequence effects is to include the effects as one of the treatments
in the design. For example, an SPF-2.3 design has 3! = (3)(2)(1) = 6 possible sequences in
which treatment B can be administered. If we let the levels of treatment C correspond to the six
sequences, the design can be analyzed as an SPF-26.3 design. The experiment would require
a minimum of 24 subjects, with 2 subjects assigned to each ajcl treatment combination. When
the number of levels of the repeated measurements treatment is greater than 3, the number of
subjects can be kept to a manageable size by using a random sample of the q! possible
sequences. The advantages of including sequence effects as one of the treatments are that (1)
the effects are controlled and (2) a researcher can determine whether they affect the
dependent variable.
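The bookkeeping in this paragraph is easy to reproduce: with q = 3 within-blocks levels there are 3! = 6 sequences, and for larger q a random sample of the q! orders keeps the design manageable. A sketch (the seed and sample size are arbitrary choices for illustration):

```python
from itertools import permutations
import math
import random

# All possible administration orders of treatment B's three levels
levels = ["b1", "b2", "b3"]
sequences = list(permutations(levels))
assert len(sequences) == math.factorial(3)  # 3! = 6 sequences

# For q > 3, sample a manageable subset of the q! orders to serve as the
# levels of an additional (sequence) treatment in the design.
random.seed(0)  # arbitrary seed, for reproducibility only
subset = random.sample(list(permutations(range(5))), 10)  # 10 of 5! = 120
```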
A restricted cell means model can be used to represent data for a split-plot factorial design.
Three cases are examined: (1) a design with no missing observations and an equal number of
blocks in the levels of treatment A, (2) a design with no missing observations and an unequal
number of blocks in the levels of treatment A, and (3) a design with missing observations. The
restricted cell means model for a two-treatment split-plot factorial design is
where εijk is NID(0, ) and μijk is subject to the restrictions that all B × BL(A) population effects
equal zero for tests that involve treatment B and the AB interaction.
The analysis procedures for the restricted cell means model are illustrated for the vigilance data
in Table 12.2-1. Treatment A has p = 2 levels, treatment B has q = 4 levels, and there are n = 4
blocks in each level of treatment A. For the purpose of forming coefficient matrices, C′, the
following hypotheses are expressed with vectors and matrices:
The hypothesis for the blocks within the jth level of treatment A, , requires a
word of explanation. We are interested in the blocks at both levels of treatment A, not just at
the jth level. However, the blocks are nested in treatment A. In such cases, the hypothesis must
involve only those blocks that are nested in one level of the other treatment—in this example,
The C′ coefficient matrices for computing the sums of squares are given by the following
Kronecker products:
where
As I discuss in Section 11.8, the identity matrix is used to compute a coefficient matrix when an
effect is nested. Because blocks are nested in treatment A, the coefficient matrix is
obtained by using a p × p identity matrix instead of a 1 × p vector of 1s. Thus,
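The nesting rule can be illustrated with NumPy's Kronecker product. The block-contrast coefficients below are placeholders chosen for illustration, not Kirk's exact rows; the point is only that a p × p identity matrix, rather than a 1 × p vector of 1s, confines the block contrasts to a single level of treatment A:

```python
import numpy as np

p, n = 2, 4  # levels of A and blocks per level, as in the vigilance example

# (n - 1) x n contrast coefficients among the n blocks (illustrative rows)
c_blocks = np.hstack([np.ones((n - 1, 1)), -np.eye(n - 1)])

# Nested effect: identity over A's levels keeps each contrast within a level
C_bl_within_a = np.kron(np.eye(p), c_blocks)       # shape (p(n-1), pn)

# A crossed effect would instead average over A with a row of 1s
C_blocks_crossed = np.kron(np.ones((1, p)), c_blocks)
```

Each row of `C_bl_within_a` has nonzero coefficients only among the blocks of one level of A, which is exactly what the nested hypothesis requires.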
The sums of squares and degrees of freedom are as follows, where h = npq = (4)(2) (4) = 32:
The J matrix in SSTO is obtained by computing the product of h × 1 and 1 × h sum vectors: J =
11′. The sums of squares and degrees of freedom are identical to those in Table 12.2-2 and
Section 12.3. It can be shown, following the procedures in Section 8.8, that the tests of
treatment B and the AB interaction are subject to the restriction that all B × BL(A) population
effects equal zero.
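The sums of squares produced by these coefficient matrices are quadratic forms in the observation vector. As a generic sketch (not Kirk's exact notation), the sum of squares for a hypothesis with coefficient matrix C′ is the squared length of the projection of y onto the row space of C′:

```python
import numpy as np

def sum_of_squares(y, C):
    """SS for the hypothesis C'mu = 0, computed as y'C (C'C)^{-1} C'y --
    the squared projection of y onto the row space of C' (generic sketch)."""
    Cy = C.T @ y
    return float(Cy @ np.linalg.solve(C.T @ C, Cy))

# Tiny illustration with made-up data: one contrast among four observations
y = np.array([1.0, 2.0, 3.0, 4.0])
C = np.array([[1.0, -1.0, 0.0, 0.0]]).T  # contrast between the first two
ss = sum_of_squares(y, C)                # (1 - 2)^2 / 2 = 0.5
```

When the rows of C′ are orthonormal, the (C′C)⁻¹ factor drops out and the sum of squares is simply the squared length of C′y.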
Adjusted Cell Means. Tests that involve simple simple main-effects means for treatment B
should be performed using adjusted cell means, . The vector of adjusted cell means for the
data in Table 12.2-1 is given by
where is the coefficient matrix that defines the restrictions on the cell means and
is the vector of unadjusted cell means. When there are no missing observations, adjusted and
unadjusted treatment B main-effects means and simple main-effect means are equal. For
example, the adjusted and unadjusted main-effects means for b1 are, respectively,
and
Treatment Contrast Interactions. In Sections 9.13 and 10.4, you saw that the cell means
model can be used to compute sums of squares for simple main effects, treatment-contrast
interactions, contrasts-contrast interactions, and so on. The computations are more
straightforward for the cell means model approach than for the classical sum-of-squares
approach. Consider, for example, the data in Tables 12.5-1 and 12.6-1, where I used the
classical sum-of-squares approach to compute , and SSB ×
BL(a1,a2). These sums of squares can be obtained with the cell means model as follows:
This 3 × 3 selection matrix will produce a coefficient matrix with at least one row of zeros. These
rows must be deleted.
SPF-p.q designs sometimes have a different number of blocks in the p levels of treatment A.
Two cases need to be distinguished. In the first, the researcher intended to have an equal
number of blocks, but for one reason or another—the equipment malfunctioned, an
appointment was missed, the wrong treatment condition was administered—the numbers of
blocks in the levels of treatment A are not the same. In the second case, the researcher
deliberately selected different numbers of blocks so that the number in each level of A is
proportional to the number in the corresponding population. The two cases have implications
for the kind of hypotheses that a researcher would want to test for treatment B: unweighted-
means hypotheses versus weighted-means hypotheses. In the first case, the researcher would
want to weight the equally in computing the . In the second case, the should be
weighted so that the reflect the number of blocks and hence the size of the respective
populations.
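The distinction can be made concrete with a toy computation (the cell means and block counts below are hypothetical): the unweighted marginal mean averages the level means of A equally, whereas the weighted mean pools them in proportion to the number of blocks.

```python
import numpy as np

cell_means = np.array([3.0, 5.0])  # mean of b_k at a1 and a2 (hypothetical)
n_blocks = np.array([4, 3])        # unequal numbers of blocks in a1 and a2

# Unweighted-means hypothesis: each level of A counts equally
unweighted = cell_means.mean()

# Weighted-means hypothesis: levels count in proportion to block numbers
weighted = (n_blocks * cell_means).sum() / n_blocks.sum()
```

Here the unweighted mean is 4.0 while the weighted mean is 27/7 ≈ 3.86; the two hypotheses coincide only when the block numbers are equal.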
Table 12.14-1 ▪ SPF-2.3 Design With an Unequal Number of Blocks in the Levels of
Treatment A
The following matrices and vectors are used to compute the coefficient matrices, C′:
Obtaining the other coefficient matrices is less straightforward. First, I obtain the C′ matrices as
if the number of blocks in each level of treatment A is equal to . I refer to this matrix as the
unmodified matrix. I then delete observations to reflect the actual number of blocks in each
level of treatment A. This is the modified coefficient matrix.
The vertical lines in enclose the coefficients for the three observations in a1 that do not
exist: Y311, Y312, and Y313. These coefficients must be deleted. When the coefficients are
deleted, the remaining coefficients do not sum to zero and thus do not define a contrast
between the two levels of treatment A. To obtain contrasts that give equal weight to the blocks
in a1 and a2, the nonzero coefficients in a1 and a2 are replaced with, respectively,
as follows:
Notice that the number of coefficients in is 15, which is the number of observations in
Table 12.14-1. This coefficient matrix provides a test of the following null hypothesis:
The fractions in can be avoided by replacing the nonzero coefficients in a1 and a2 with,
respectively, and as follows:
The vertical lines in enclose the coefficients for the three observations in a1 that don't
exist. The coefficients in the three columns must be deleted as well as row 2, which is enclosed
with horizontal lines. Row 2 defines a contrast that cannot be estimated because of the
nonexistent observations. The modified matrix is
The computation of the remaining coefficient matrices follows the pattern that has been
illustrated for , and .
A test of unweighted means can be obtained by replacing the nonzero coefficients in with
and as follows:
The fractions in can be avoided by replacing the nonzero coefficients in a1bk and a2bk
The modified coefficient matrix is obtained by deleting the coefficients enclosed in the vertical
columns and replacing the nonzero coefficients in a1 and a2 with, respectively,
and as follows:
The fractions in can be avoided by replacing the nonzero coefficients in a1bk and a2bk
The sums of squares and degrees of freedom for the data in Table 12.14-1 are
The computations for the SPF-2.3 design are summarized in Table 12.14-2.
When one or more observations are missing, the analysis is similar to that described in the
previous subsection. Consider the SPF-2.3 design in Table 12.14-3 where Y221 is missing. The
following steps are used to obtain coefficient matrices:
The following matrices and vectors are used to compute the coefficient matrices:
The vertical lines identify the column that must be deleted and the row of that contains the
contrast that cannot be estimated. Because contrast Y121 – Y221 does not exist, the 1 in row 3
and column 3 of that corresponds to Y121 is replaced with 0. The modified coefficient
matrix is as follows:
When the missing observation in column 4 is deleted, the remaining coefficients do not sum to
zero and thus do not define a contrast between the two levels of treatment A. To obtain
contrasts that give equal weight to a1 and a2, the nonzero coefficients in a1bk and a2bk are
replaced with, respectively, cijk = ±1/( in bk) and ±1/( in bk) as follows:
The fractions in can be avoided by replacing the nonzero coefficients in a1bk and a2bk
When the missing observation in column 4 is deleted, the remaining coefficients in row 1 do not
sum to zero and thus do not define an interaction contrast. To obtain contrasts that give equal
weight to a1 and a2, the nonzero coefficients in a1bk and a2bk are replaced with, respectively,
The fractions in can be avoided by replacing the nonzero coefficients in a1bk and a2bk
with, respectively, cijk = ±( in bk) and ±( in bk) as follows:
Figure 12.14-1 illustrates the interaction contrasts that are defined by rows 1 and 2 of
Figure 12.14-1 ▪ Two AB interaction terms are obtained from the crossed lines by
subtracting the observations in the cells connected by a dashed line from the
observations in the cells connected by a solid line.
Figure 12.14-2 illustrates the interaction contrasts that are defined by rows 1, 2, and 3 of
Figure 12.14-2 ▪ Three B × BL(A) interaction terms are obtained from the crossed lines by
subtracting the observations connected by a dashed line from the observations
connected by a solid line.
The computation of sums of squares and degrees of freedom for the data in Table 12.14-1 is
described here.
The coefficient matrix gives a sum of simple simple main-effects sums of squares,
at siaj. It follows from Section 10.2 that at siaj = SSB + SSAB + [SSB ×
BL(A)]. Hence, to obtain SSB, it is necessary to subtract SSAB and SSB × BL(A).
It can be shown that the rows of are linearly dependent on the rows of . SSB ×
BL(A) is subtracted from SSBL(A) because also contains the information in .
where h = npq – mo. The computations for the SPF-2.3 design are summarized in Table 12.14-
4. As the reader sees, when one or more observations are missing, the analysis procedures are
complex. Researchers are encouraged to avoid missing observations if possible.
Table 12.14-4 ▪ ANOVA Table for SPF-2.3 Design With Missing Observation (Y221)
A split-plot factorial design is appropriate for many research problems. The most likely
alternative design when repeated measures or matched subjects are used is a randomized
block factorial design. The advantages and disadvantages of a split-plot factorial design relative
to a randomized block factorial design are as follows:
1. A split-plot factorial design requires a smaller block size than a randomized block factorial
design.
2. A split-plot factorial design can be used in experiments where it is not possible to administer
all treatment combinations within each block.
3. In an SPF-p.q design, the power of tests for treatment B and the AB interaction is usually
greater than that for treatment A. The power of within-blocks tests in a split-plot factorial
design is usually greater than that for the corresponding tests in a randomized block
factorial design. However, the power of between-blocks tests is usually less than that for
the corresponding tests in a randomized block factorial design.
4. The analysis and randomization procedures for a split-plot factorial design are more
complex than those for a randomized block factorial design.
1. Terms to remember:
a. plot (12.1)
b. between-blocks effects (12.1)
c. within-blocks effects (12.1)
d. precision of an estimator (12.1)
e. multisample sphericity (12.4)
f. carryover effects (12.13)
g. sequence effects (12.13)
*2.[12.1] Draw diagrams (see Figure 12.1-1) for split-plot and randomized block factorial
designs with the following levels of treatments A and B.
*a.n = 3, p = 3, q = 2
*b.n = 2, p = 3, q = 3
c. n = 2, p = 4, q = 2
d.n = 2, p = 4, q = 3
3.[12.1] If repeated measurements are obtained, how many times must a subject be observed
for the designs in Exercise 2?
4.[12.1] How does the confounding of treatment A with groups of blocks affect the test of
treatment A?
5.[12.1] Explain why an RBF-23 design is a better choice than an SPF-2.3 design if a
researcher is particularly interested in testing treatment A.
6.[12.1] Give a rule of thumb for determining whether a treatment is a between- or within-
blocks treatment.
*7.[12.2] Three procedures for inspecting printed circuit boards were investigated. The circuit
boards were projected on a screen in front of an inspector. Condition a1 was the standard
condition in which the circuit board being inspected could be compared with the picture of a
perfect circuit board by pressing a button. Condition a2 provided a perfect circuit board
overlay in a contrasting color. Condition a3 alternated the circuit board being inspected with
a picture of a perfect board every 200 milliseconds. Three types of defects occurred in the
circuit boards: Condition b1 was a filled hole in the board (fill defect), b2 was a gap in one
of the lines on the board (gap defect), and b3 was a short between two lines of the circuit
(short defect). Eighteen inspectors were randomly assigned to one of three groups with six
in each group. The groups were randomly assigned to the three levels of treatment A. The
subjects in each group received all three levels of treatment B. The order in which the levels
of treatment B were presented was randomized independently for each subject. The
dependent variable was the time in seconds required to detect a defect. The following data
were obtained. (Experiment suggested by Liuzzo, J. G., & Drury, C. G. An evaluation of
blink inspection. Human Factors.)
*a.Test the null hypotheses for treatments A and B and the AB interaction; let α = .05.
*b.Graph the AB interaction and interpret the graph.
c. Prepare a “results and discussion section” for the Human Factors journal.
8.[12.2] The performances of 15 secretaries on three date-sorting tasks were compared at two
times of day: 10:00 a.m., treatment level b1, and 4:00 p.m., treatment level b2. The task
involved sorting a list of random dates written in British form (e.g., 30 1 74) into two, three,
or four accounting periods—treatment levels a1, a2, and a3, respectively. The secretaries
were randomly assigned to the groups. The tasks were performed for 1 hour on consecutive
days—Thursday and Friday. The day on which a level of treatment B was presented was
randomized independently for each secretary. The dependent variable was the number of
dates sorted minus 2,500. The following data were obtained. (Experiment suggested by
Monk, T. H., & Conrad, M. C. Time of day effects in a range of clerical tasks. Human
Factors.)
a.Test the null hypothesis for treatments A and B and the AB interaction; let α = .05.
b.Graph the AB interaction and interpret the graph.
c. Prepare a “results and discussion section” for the Human Factors journal.
*10.
[12.3] Derive the computational formula for SSBL(A) from
*11.
[12.4]
*a.Use the Fmax statistic to test the hypothesis for the data in Exercise 7.
*b.Test the multisample circularity assumption. Let α = .25 and
*12.
[12.4]
*a.Compute for the data in Exercise 7.
*b.Perform an adjusted F test for treatment B and the AB interaction in Exercise 7.
*c.How do the results of the adjusted test compare with the results of the conventional F
test?
*13.
[12.6] For the data in Exercise 7, test the following hypotheses:
*a.H0: αψ1(B) = δ for all j
*b.H0: αψ2(B) = δ for all j
*c.H0: βψ1(A) = δ for all k
*d.H0: βψ2(A) = δ for all k
where ψ1(B) = μ.j1 − μ.j2, ψ2(B) = (μ.j1 + μ.j2)/2 – μ.j3, ψ1(A) = μ.1k – μ.2k, and ψ2(A)
= μ.1k – μ.3k. Compute error terms appropriate for the specific treatment-contrast
interactions. Use the simultaneous test procedure and let α = .05 for the collection of all
possible tests.
14.
[12.4] Use the Fmax test to test the hypothesis . for the data in Exercise 8.
*15.
[12.2] For the data in Exercise 7, determine the powers of the tests of A, B, and the AB
interaction.
16.
[12.6] For the data in Exercise 8, test the following hypotheses.
a.H0: βψ1(A) = δ for all k
*17.
[12.7] For the data in Exercise 7, compute the relative efficiency of the SPF-3.3 and RBF-33
designs for the following treatments and interaction.
*a.treatment A
*b.treatment B and the AB interaction
18.
[12.7] For the data in Exercise 8, compute the relative efficiency of SPF-3.2 and RBF-32
designs for the following treatments and interaction.
a.treatment A
b.treatment B and the AB interaction
19.
[12.2] For the data in Exercise 8, determine the powers of the tests of A, B, and AB.
20.
[12.4]
a.Specify the omnibus multisample circularity assumption for an SPF-p.qr design.
b.If this assumption is tenable, what error term is used to test the within-blocks null
hypotheses?
21.
[12.14] Describe the modification necessary to evaluate sequence effects for the following
designs.
a.SPF-3.2 design
b.SPF-2.3 design
c. SPF-3.5 design
*22.
[12.14] Exercise 7 describes an experiment to evaluate three procedures for inspecting
printed circuit boards. Assume that in a pilot study, the following data were obtained.
*23.
[12.6] For the data in Exercise 22, use Kronecker products to formulate C′ matrices for the
following hypotheses:
*a.H0: αψ1(B) = δ for all j
*b.H0: αψ2(B) = δ for all j
*c.H0: βψ1(A) = δ for all k
*d.H0: βψ2(A) = δ for all k
where ψ1(B) = μ.j1 − μ.j2, ψ2(B) = (μ.j1 + μ.j2)/2 − μ.j3, ψ1(A) = μ.1k – μ.2k, and ψ2(A)
= μ.1k – μ.3k. Compute error terms appropriate for the specific treatment-contrast
interactions. Use the simultaneous test procedure and let α = .05 for the collection of all
possible tests. This problem requires a computer with a matrix package.
24.
[12.14] Exercise 8 describes an experiment to compare date-sorting performances at two
times of day and three levels of task difficulty. Assume that in a pilot study, the following
data were obtained:
25.
[12.6] For the data in Exercise 24, use Kronecker products to formulate C′ matrices for the
following hypotheses:
a.H0: βψ1(A) = δ for all k
b.H0: βψ2(A) = δ for all k
c. H0: βψ3(A) = δ for all k
where ψ1(A) = μ.1k − μ.2k, ψ2(A) = μ.1k − μ.3k, and ψ3(A) = μ.2k − μ.3k. Use the
pooled error term MSB × BL(A) in testing these hypotheses. Also use the simultaneous
test procedure and let α = .05 for the collection of all possible tests. This problem
requires a computer with a matrix package.
*26.
[12.14] Female college students were exposed to either an aggressive or an erotic film,
treatment levels a1 and a2, respectively, and then alternately provoked, treatment level b1,
and not provoked, treatment level b2, in a series of games that provided opportunity for
retaliation. Retaliation consisted of delivering a noxious noise to the opponent. The
dependent variable was the duration of the noise presentation. Four students were
randomly assigned to the levels of treatment A, with the restriction that two students were
assigned to each level. The order in which the levels of treatment B were presented was
randomized independently for each subject. The following data were obtained:
*27.
[12.14] Assume that because of equipment failure, the data in block a1s2 in Exercise 26
were lost.
*a.Specify, using matrix notation, the hypotheses for Between Blocks, treatment A, Blocks
w. aj, Within Blocks, and treatment B.
*b.Compute and use Kronecker products to formulate C′ matrices for testing the
hypotheses.
*c.Perform the analysis using the cell means model approach; let α = .05. (This problem
involves simple matrix multiplication and can be done without a computer.)
28.
[12.14] Assume that because of equipment failure, the data in block a2s1 in Exercise 26
were lost.
a.Specify, using matrix notation, the hypotheses for Between Blocks, treatment A, Blocks
w. aj, Within Blocks, and treatment B.
b.Compute and use Kronecker products to formulate C′ matrices for testing the
hypotheses.
c. Perform the analysis using the cell means model; let α = .05. (This problem involves
simple matrix multiplication and can be done without a computer.)
29.
[12.14] Assume that because of equipment failure, observation Y122 in Exercise 26 was
lost.
a.Compute and use Kronecker products to formulate C′ matrices for testing the
hypotheses. Use unweighted means.
b.Perform the analysis using the cell means model approach; let α = .05. (This problem
involves simple matrix multiplication and can be done without a computer.)
*30.
[12.14] Assume that because of equipment failure, observation Y222 in Exercise 26 was
lost.
*a.Compute and use Kronecker products to formulate C′ matrices for testing the
hypotheses. Use unweighted means.
*b.Perform the analysis using the cell means model approach; let α = .05. (This problem
involves simple matrix multiplication and can be done without a computer.)
1 Portions of this section assume a familiarity with Sections 8.4, 10.7, and the matrix operations
in Appendix D. The essential ideas can be grasped without this background.
7 The latter portion of this section assumes a familiarity with Sections 8.4 and 10.7.
* The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
9 The same procedure was used in Section 11.8 for a hierarchical design in which treatment B is
nested in treatment A.
http://dx.doi.org/10.4135/9781483384733.n12
Analysis of Covariance
The emphasis in previous chapters has been on the use of experimental control to reduce
error variance and obtain unbiased estimates of treatment effects. Experimental control can
take various forms, such as random assignment of subjects to treatment levels, stratification of
subjects into homogeneous blocks, and refinement of techniques for measuring a dependent
variable. An alternative approach to reducing error variance and obtaining unbiased estimates
of treatment effects involves the use of statistical control. This approach also enables a
researcher to remove sources of bias from an experiment, biases that are difficult or impossible
to eliminate by experimental control.
The statistical control described in this chapter is analysis of covariance; it combines regression
analysis with analysis of variance. The procedure involves measuring one or more concomitant
variables (also called covariates) in addition to the dependent variable. The concomitant
variable represents a source of variation that has not been controlled in the experiment and one
that is believed to affect the dependent variable. Through analysis of covariance, the
dependent variable can be adjusted to remove the effects of the uncontrolled source of
variation represented by the concomitant variable. The potential advantages are (1) a reduction
in error variance and, hence, increased power and (2) a reduction in bias caused by differences
among experimental units where those differences are not attributable to the manipulation of
the independent variable.
The goals of analysis of covariance, ANCOVA, are to reduce error variance, remove sources of bias from an experiment, and obtain adjusted estimates of population means. Error variance, σ²ε, can be reduced if a portion of the dependent-variable error variance is predictable from a knowledge of the concomitant variable. Removing this predictable portion from σ²ε results in a smaller error variance and, hence, a more powerful test of a false null hypothesis. The rationale for this is developed in considerable detail in Section 13.2. Here I simply introduce some basic ideas and describe several applications of ANCOVA.
Consider an analysis of covariance experiment with two treatment levels, a1 and a2. The
dependent variable is denoted by Y and the concomitant variable by X. The relationship
between X and Y for a1 and a2 might look like that shown in Figure 13.1-1. Each subject in the
experiment contributes one point, determined by his or her X and Y scores, to the figure. The
points form two scatter plots—one for each treatment level. These scatter plots are represented
in the figure by ellipses. Through each ellipse I have drawn a line that represents the regression of Y on X. In the typical ANCOVA model, it is assumed that each regression line is a straight line and that the lines have the same slope. The size of σ²ε, the error variance, in ANOVA is determined by the dispersion of the marginal distributions (see Figure 13.1-1). The size of σ²ε in ANCOVA is determined by the dispersion of the conditional distributions (see Figure 13.1-1). The higher the correlation between X and Y, in general, the narrower the ellipses and the greater is the reduction in σ²ε due to using analysis of covariance.
Figure 13.1-1 ▪ Scatter plots for the two treatment levels. The size of the error variance in
ANOVA is determined by the dispersion of the marginal distributions. The size of the
error variance in ANCOVA is determined by the dispersion of the conditional
distributions. The higher the value of the correlation between X and Y, the greater is the
reduction in the error variance due to using analysis of covariance.
Figure 13.1-1 depicts the case in which the concomitant-variable means, X̄.1 and X̄.2, are equal.
If subjects are randomly assigned to the treatment levels, on average, one can expect the
concomitant-variable means to be equal. If random assignment is not used, differences among
the means can be sizable, as in Figure 13.1-2. This figure illustrates the effects of unequal
concomitant-variable means. In Figure 13.1-2(a) and (b), the absolute difference between adjusted dependent-variable means is smaller than that between unadjusted means. In part (c), the absolute difference between adjusted means is larger than that
between unadjusted means. Whenever the regression lines have the same slope and intercept,
as in part (b), the difference between adjusted means will equal zero.
Figure 13.1-2 ▪ When the concomitant-variable means, X̄.1 and X̄.2, differ, the absolute
difference between adjusted means for the dependent variable can be less than that
between unadjusted means, as in (a) and (b), or larger, as in (c).
Analysis of covariance is often used in three kinds of research situations that involve unequal
concomitant-variable means. One of these involves the use of intact groups, which is common
in educational and industrial research. This situation can be illustrated by an experiment
designed to evaluate four methods of teaching arithmetic. It is impractical for administrative
reasons to assign four different teaching methods to the same classroom. An alternative is to
assign the four teaching methods randomly to four classrooms, a design with a serious defect.
If differences in learning abilities or related characteristics exist among the classes prior to the
introduction of the four teaching methods, these extraneous variables will bias the evaluation. In
this example, it is possible to administer a test of intelligence prior to the beginning of the
experiment to obtain an estimate of learning aptitude. If the classes differ in intelligence, one
can, if certain assumptions are tenable, adjust the dependent variable, achievement in
arithmetic, for differences in the concomitant variable, learning ability.
A note of caution concerning the use of intact groups is needed here. Such experiments are
always subject to interpretation difficulties that are not present when subjects are randomly
assigned to experimental groups or conditions. Analysis of covariance equates intact groups on
a particular concomitant variable. There is no assurance that the particular concomitant variable
represents the only dimension on which intact groups differ or that it is the most important
dimension on which they differ. Thus, even when analysis of covariance is skillfully used, we can never be certain that some variable that has been overlooked will not bias the evaluation of the treatments.
Analysis of covariance is not limited to the use of only one concomitant variable. In the teaching
example, differences in arithmetic achievement also may exist among the children prior to the
beginning of the experiment. If concomitant measures of arithmetic achievement and
intelligence are obtained prior to the introduction of the treatment, the dependent variable can
be adjusted for both potential sources of bias.
A second situation in which analysis of covariance is often used is illustrated in the following
example. It may become apparent during the course of an experiment that the subjects in the
treatment groups were not equivalent on some relevant variable at the beginning of the
experiment, even though random assignment was employed. For example, an experiment
might be designed to evaluate the effects of different drugs on stimulus generalization in rats.
At the beginning of the experiment, the rats are randomly assigned to the experimental groups,
and a bar-pressing response is shaped by operant conditioning procedures. If the experimental
groups require different amounts of training to establish a stable bar-pressing response, it is
likely that differences in learning ability exist among the groups. A researcher may find, at the
conclusion of the experiment, that the amount of stimulus generalization is related to the
amount of training necessary to establish the stable bar-pressing response. If certain
assumptions are tenable, the generalization scores of the groups can be adjusted for
differences in learning ability. Thus, unsuspected differences present at the beginning of the experiment can be taken into account in the analysis.
Analysis of covariance may be useful in yet another research situation. In the example
concerning the evaluation of four methods of teaching arithmetic, a third variable may bias the
evaluation. This variable is the number of hours spent in study by students in the four
classrooms. Variations in the daily schedules of the classrooms may provide more study periods
for students in one class than for students in other classes. It would be difficult to
experimentally control the amount of time available for arithmetic study. A practical alternative is
to record each day the amount of study time available to the students. This information can be
used to make appropriate adjustments in the dependent variable. In this example, variation in
the concomitant variable of study time did not occur until after the beginning of the experiment.
It is assumed that the amount of study time available during school is not influenced by the
treatment. This assumption would be incorrect if a teacher allocated additional study periods for
arithmetic because students were unable to master the material using the assigned teaching
method. If the amount of study time was influenced by the treatment, it would be incorrect to
adjust the dependent variable for this concomitant variable.
Statistical control and experimental control are not mutually exclusive approaches for reducing
bias and increasing precision. It may be convenient to control some variables by experimental
control and others by statistical control. In general, a researcher should attempt to use
experimental control whenever possible. Statistical control is based on a series of assumptions,
discussed in Section 13.4, that may prove untenable in a particular experiment.
Concomitant variables in analysis of covariance should be selected with care. Effects eliminated
by a covariance adjustment must be irrelevant to the objectives of the experiment. Analysis of
covariance can be used in conjunction with each of the experimental designs described in this
book. An analysis of covariance design that is based on a completely randomized building
block design is denoted by the letters CRAC-p, one based on a randomized block building
block design is denoted by RBAC-p, and so on. A covariance adjustment is appropriate for
experiments that meet, in addition to the assumptions discussed in Section 13.4, the following
conditions:
1. The experiment contains one or more extraneous sources of variation believed to affect the dependent variable and considered irrelevant to the objectives of the experiment.
2. Experimental control of the extraneous sources of variation is either not possible or not feasible.
3. It is possible to obtain a measure of the extraneous variation that does not include effects attributable to the treatment. Any one of the following situations will generally meet this third condition:
a. The concomitant observations are obtained prior to presentation of the treatment levels.
b. The concomitant observations are obtained after the presentation of the treatment levels but before the treatment levels have had an opportunity to affect the concomitant variable.
c. It can be assumed that the concomitant variable is unaffected by the treatment.
If the third condition is not satisfied and the concomitant variable is affected by the treatment,
the adjustment made on the dependent variable is biased. Consider the drug example cited
previously. If the drugs affected both the learning of the bar-pressing response (covariate) and
the amount of generalization (dependent variable), it would not be possible to adjust the
generalization scores correctly. An adjustment of the generalization scores for differences in
learning the bar-pressing response would also remove the effects of drugs from the dependent
variable. This follows because both the covariate and the dependent variable reflect the effects
of the treatment.
The residual sum of squares is

SSres = Σj Σi (Yij − Ŷij)²,

where Yij − Ŷij is the deviation of the ijth score from the predicted score. This sum of squares represents the variation among the dependent scores that is not associated with the linear regression of Y on X. This is the sum of squares that is of interest in analysis of covariance. If the least squares estimate of the slope is substituted for the slope parameter in equation (13.2-2), the residual sum of squares can be shown to equal

Tyy − (Txy)²/Txx.

The term on the far right, (Txy)²/Txx, represents an adjustment that is made to the total sum of squares for Y, Tyy, that removes the linear effect of the covariate. This adjustment will always reduce the sum of squares for Y. Consequently, the sum of squares on the left side of equation (13.2-3) is sometimes called the reduced sum of squares. This sum of squares is referred to in this chapter as an adjusted total sum of squares and denoted by Tyy(adj) or by Tadj when it is clear from the context that it refers to the dependent variable. Unadjusted sums of squares for Y and X are denoted by Tyy and Txx, respectively. The use of Tyy, Txx, and so on instead of SS to denote a sum of squares is common practice in discussions of analysis of covariance.
The slope of the regression line, β̂T, used in predicting Y from X can be expressed as

β̂T = Txy/Txx.

The term Txy is called the sum of squares for the cross-product of X and Y. It can be shown that the coefficient β̂T provides the best-fitting line according to the least squares criterion for the np pairs of observations. A brief discussion of the method of least squares is given in Section 7.4. The adjusted total sum of squares in equation (13.2-3) can be written using the abbreviated notation just described as

Tyy(adj) = Tyy − (Txy)²/Txx.

The degrees of freedom for Tyy(adj) are np − 2. The degrees of freedom have been reduced by 1 because of the linear restriction imposed on the sum of squares whereby the deviations are computed from the regression line.
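The identity Tyy(adj) = Tyy − (Txy)²/Txx can be sketched numerically. The scores below are made up for illustration; they are not the Table 13.2-1 data:

```python
# Hypothetical covariate (X) and dependent (Y) scores -- not the
# textbook data -- used to illustrate the adjusted total sum of squares.
X = [3, 5, 6, 8, 2, 4, 7, 9]
Y = [4, 6, 5, 9, 3, 5, 8, 10]
n_total = len(X)                      # np pairs of observations

x_bar = sum(X) / n_total
y_bar = sum(Y) / n_total

# Unadjusted sums of squares and the cross-product sum of squares
Txx = sum((x - x_bar) ** 2 for x in X)
Tyy = sum((y - y_bar) ** 2 for y in Y)
Txy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

Tyy_adj = Tyy - Txy ** 2 / Txx        # adjusted (reduced) total sum of squares
df_adj = n_total - 2                  # np - 2 degrees of freedom

print(round(Tyy_adj, 4), df_adj)
```

Because (Txy)²/Txx is nonnegative, Tyy(adj) can never exceed Tyy, which is the sense in which the adjustment always reduces the total sum of squares.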
The total sum of squares for Y can be partitioned into between-groups and within-groups components, Tyy = Ayy + Eyy. Similarly, the total sum of squares for X and the cross-product of X and Y can be partitioned as follows:

Txx = Axx + Exx and Txy = Axy + Exy.

An adjusted within-groups sum of squares, following the procedures for computing Tyy(adj), is

Eyy(adj) = Eyy − (Exy)²/Exx.
Interpretation of β̂T, β̂W, and β̂B
If data consist of np paired observations for p treatment levels, a number of different regression
lines can be identified. Two regression lines, with slopes β̂T and β̂W, have been mentioned in this section. A third regression line can be identified—a between-groups regression line with slope β̂B. The interpretation of these three regression lines can be clarified by a simple numerical example and diagrams. The data in Table 13.2-1 are used for purposes of illustration. Data for the three treatment levels are plotted in Figure 13.2-1. If all 12 points in Figure 13.2-1 are plotted as if they represent one treatment level instead of three levels, a single regression line with slope equal to β̂T = Txy/Txx can be drawn.
Figure 13.2-1 ▪ Regression lines for three treatment levels in Table 13.2-1.
This regression line is shown in Figure 13.2-2. For convenience, the X and Y scores are expressed as deviations from their respective grand means; that is, each point is plotted as (Xij − X̄.., Yij − Ȳ..). The within-groups regression coefficient for each of the treatment levels shown in Figure 13.2-1 is given by β̂Wj = Exyj/Exxj. By computing a weighted mean of these three coefficients, one obtains the within-groups regression coefficient β̂W:

β̂W = (Exx1β̂W1 + Exx2β̂W2 + Exx3β̂W3)/(Exx1 + Exx2 + Exx3).

The weights in this formula are the corresponding values of Exxj. This formula is less convenient for computational purposes than the formula given previously, but it shows that β̂W is a weighted mean of the individual within-groups coefficients.
One of the assumptions underlying the adjustment of the within-groups sum of squares is that the within-groups regression coefficients are all estimates of the same common population regression coefficient; that is, βW1 = βW2 = ⋯ = βWp = βW. The third coefficient, β̂B = Axy/Axx, is the between-groups regression coefficient and refers to the line that fits the means of the three treatment levels. This regression line is shown in Figure 13.2-4, where each treatment-level mean is expressed as a deviation from the grand mean and plotted as (X̄.j − X̄.., Ȳ.j − Ȳ..). Earlier, I noted that the total sum of squares for Y, X, and XY can be partitioned into between- and within-groups sums of squares. Similarly, the total regression coefficient β̂T is equal to the weighted mean of β̂B and β̂W, where β̂B and β̂W are weighted by the corresponding sums of squares for X; that is,

β̂T = (Axxβ̂B + Exxβ̂W)/(Axx + Exx).
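The partition of the sums of squares and the weighted-mean relation among the three slopes can be checked numerically. The group scores below are invented; only the algebraic identities matter:

```python
# Made-up (X, Y) scores for three treatment levels; the point is the
# identity b_T = (Axx*b_B + Exx*b_W) / (Axx + Exx), not the numbers.
groups = [
    ([2, 4, 3, 5], [3, 5, 4, 6]),
    ([4, 6, 5, 7], [6, 8, 7, 9]),
    ([6, 8, 7, 9], [5, 7, 6, 8]),
]

all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
gx, gy = sum(all_x) / len(all_x), sum(all_y) / len(all_y)

def cp(u, v, ubar, vbar):
    """Sum of cross-products of deviations."""
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v))

Txx, Txy = cp(all_x, all_x, gx, gx), cp(all_x, all_y, gx, gy)

# Within-groups sums pool deviations from each group's own means
Exx = Exy = 0.0
for xs, ys in groups:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    Exx += cp(xs, xs, mx, mx)
    Exy += cp(xs, ys, mx, my)

Axx, Axy = Txx - Exx, Txy - Exy       # between-groups components

b_T, b_W, b_B = Txy / Txx, Exy / Exx, Axy / Axx
weighted = (Axx * b_B + Exx * b_W) / (Axx + Exx)
print(abs(b_T - weighted) < 1e-10)    # the identity holds
```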
The reader may wonder why Ayy(adj) is obtained by subtraction and not by

Ayy − (Axy)²/Axx.

This latter formula is analogous to the formulas used to compute Tyy(adj) and Eyy(adj). Upon reflection, it is apparent that β̂B, as defined in equation (13.2-5), is affected by differences among the means for the dependent variable. The adjustment that is made to the dependent variable, however, must be independent of the differences to be tested. Consequently, β̂B is unsatisfactory for making this adjustment. The coefficient β̂W could be considered for this adjustment because it is independent of the differences to be tested. However, β̂W should not be used for adjusting both the numerator and the denominator of an F statistic. This is a consequence of the requirement that an F statistic must be the ratio of two independent chi-square variables, each divided by its respective degrees of freedom. The computation of Ayy(adj) by subtraction circumvents the preceding problems. The degrees of freedom for Ayy(adj) are p − 1, and not p − 2, because the between-groups regression line did not enter into the calculation of the adjusted sum of squares.
The experimental design model equation for a completely randomized analysis of covariance design is

Yij = μ + αj + βW(Xij − X̄..) + εi(j),

where μ, αj, and εi(j) are the familiar terms for a completely randomized design. The terms βW, Xij, and X̄.. refer, respectively, to the within-groups regression coefficient, the value of the covariate for subject i in treatment level j, and the grand mean of the covariate scores. The difference between two observations Yij and Yij′ is an estimate of

(αj − αj′) + βW(Xij − Xij′) + (εi(j) − εi(j′)).

It is apparent that unless the two covariates Xij and Xij′ are equal, the difference (Xij − Xij′) affects the observed difference between Yij and Yij′. The terms in the linear model equation for analysis of covariance can be rearranged by subtracting βW(Xij − X̄..) from Yij. The result of this subtraction is an adjusted score. Thus,

Yadjij = Yij − βW(Xij − X̄..) = μ + αj + εi(j).
The adjusted score is free of the effects of the covariate. Furthermore, the adjusted score
provides an estimate of the familiar terms of the model equation for a completely randomized
design. However, εi(j) in ANCOVA will ordinarily be smaller than εi(j) in ANOVA. The magnitude
of the reduction is discussed in Section 13.3.
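The adjustment step can be sketched in a few lines; the slope and scores below are assumed values, not estimates from real data:

```python
# Assumed within-groups slope and made-up scores, purely illustrative.
b_W = 0.8
X = [10, 12, 8, 14, 11, 9]
Y = [20, 25, 18, 27, 22, 19]
x_grand = sum(X) / len(X)             # grand mean of the covariate

# Adjusted score: Yadj_ij = Y_ij - b_W * (X_ij - x_grand)
Y_adj = [y - b_W * (x - x_grand) for x, y in zip(X, Y)]

# Because the deviations X_ij - x_grand sum to zero, the adjustment
# changes individual scores but leaves the grand mean of Y intact.
print(round(sum(Y_adj) / len(Y_adj), 6))
```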
Assume that the experiment described in Section 13.1 for evaluating four methods of teaching
arithmetic has been carried out. Thirty-two students were randomly assigned to four classrooms
with eight students in each room. The teaching methods were randomly assigned to the four
classrooms. An intelligence test was administered to each student at the beginning of the
experiment. These data were used to adjust arithmetic achievement scores obtained at the
conclusion of the experiment for differences in intelligence among the students. If much larger
samples had been used, random assignment should have resulted in negligible differences in
intelligence among the four classrooms, in which case an ANOVA would have been appropriate.
When a small number of subjects is used, random assignment may not result in similar
samples.
The research hypothesis can be evaluated by means of a statistical test of the following null hypothesis:

H0: α1 = α2 = α3 = α4 = 0.
The layout of the design and computational formulas are shown in Table 13.3-1. The analysis is
summarized in Table 13.3-2. The sums of squares in Table 13.3-2 have been adjusted for
intellectual differences among the students. It is obvious that the null hypothesis cannot be
rejected. The reader may wonder what conclusion would have been drawn if the arithmetic achievement scores had not been adjusted for the nuisance variable of intelligence. The required sums of squares for this test also appear in Table 13.3-1. The resulting F statistic is significant beyond the .05 level. Thus, if the covariate adjustment had not
been used, the researcher would have drawn an erroneous conclusion with respect to the four
teaching methods. We infer from the F statistic for adjusted mean squares that there are no
differences in arithmetic achievement among the four population means apart from differences
in intelligence. A test of the hypothesis that the intelligence test scores of students in the four groups are equal also can be performed. The null hypothesis for the intelligence scores can be rejected beyond the .05 level.
Three correlation coefficients can be computed for the paired observations in Table 13.3-1:

rT = Txy/√(Txx Tyy), rB = Axy/√(Axx Ayy), and rW = Exy/√(Exx Eyy),

where rT denotes the overall correlation between X and Y, rB denotes the correlation between the treatment means for X and Y, and rW denotes the weighted average correlation between X and Y for the four treatment levels.
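The three coefficients are ordinary product-moment correlations computed from the total, between-groups, and within-groups sums of squares. A sketch with invented group data, chosen so that the group means of Y fall as the group means of X rise while the within-group relation is positive:

```python
import math

# Invented (X, Y) scores for three groups: within each group Y rises
# with X, but the group means of Y fall as the group means of X rise,
# so r_B is negative while r_W is positive.
groups = [
    ([2, 4, 3, 5], [6, 8, 6, 9]),
    ([5, 7, 6, 8], [4, 6, 5, 7]),
    ([8, 10, 9, 11], [3, 5, 4, 6]),
]

def sums(pairs):
    """Return (Sxx, Syy, Sxy) for a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy

all_pairs = [(x, y) for xs, ys in groups for x, y in zip(xs, ys)]
Txx, Tyy, Txy = sums(all_pairs)

Exx = Eyy = Exy = 0.0
for xs, ys in groups:
    sxx, syy, sxy = sums(list(zip(xs, ys)))
    Exx, Eyy, Exy = Exx + sxx, Eyy + syy, Exy + sxy

Axx, Ayy, Axy = Txx - Exx, Tyy - Eyy, Txy - Exy

r_T = Txy / math.sqrt(Txx * Tyy)
r_B = Axy / math.sqrt(Axx * Ayy)      # correlation of the group means
r_W = Exy / math.sqrt(Exx * Eyy)      # pooled within-groups correlation
print(r_B < 0 < r_W)
```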
If rB is larger than rW , the reduction in the variation attributable to the treatment can be large
relative to the reduction in the error variation. Under this condition, the F statistic in analysis of
covariance is smaller than the corresponding F statistic in analysis of variance. The data
summarized in Table 13.3-2 illustrate this situation. In general, if rB is negative and rW is positive, the F statistic in analysis of covariance is greater than the corresponding F statistic in analysis of variance.
The reduction in the error term MSEadj that occurs from the use of analysis of covariance is
determined by the size of rW . The larger this correlation, the greater is the reduction in the error
term. An alternative way of computing the adjusted mean square error term illustrates this fact.
If MSE is the error term when no covariance adjustment is used, this term reduces to

MSEadj = MSE(1 − r̂²W)[dferror(ANOVA)/dferror(ANCOVA)]   (13.3-1)

by the use of analysis of covariance. Here r̂²W is the squared estimator of the population within-groups correlation coefficient, and dferror(ANOVA) and dferror(ANCOVA) are the error-term degrees of freedom, respectively, for ANOVA and ANCOVA. For the data in Table 13.3-1, this formula yields a value that is equal to MSEadj. It is apparent from formula (13.3-1) that the reduction in the error term due to the covariance adjustment is primarily a function of the size of r̂²W.
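Formula (13.3-1) can be verified against the direct route Eyy(adj)/dferror(ANCOVA); the numeric inputs below are hypothetical, not the Table 13.3-1 values:

```python
# Hypothetical inputs: p levels, n per level, within-groups SS for Y,
# and the within-groups correlation between X and Y.
p, n = 4, 8
Eyy = 120.0
r_W = 0.6

df_anova = p * (n - 1)                # 28
df_ancova = p * (n - 1) - 1           # 27

MSE = Eyy / df_anova                  # ANOVA error term
MSE_adj = MSE * (1 - r_W ** 2) * (df_anova / df_ancova)   # formula (13.3-1)

# Direct route: Eyy(adj) = Eyy * (1 - r_W**2), divided by its df
MSE_adj_direct = Eyy * (1 - r_W ** 2) / df_ancova
print(abs(MSE_adj - MSE_adj_direct) < 1e-9)
```

The shrink factor is essentially (1 − r²W); the degrees-of-freedom ratio only corrects for the one degree of freedom lost to estimating the slope.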
Let Yij be a score for experimental unit i in treatment level aj. For the fixed-effects experimental design model, it is assumed that

Yadjij = Yij − βW(Xij − X̄..) = μ + αj + εi(j),

where

Yadjij is the adjusted dependent score for experimental unit i in treatment level aj.
Yij is the unadjusted dependent score for experimental unit i in treatment level aj.
βW is the common population within-groups regression coefficient for the p treatment levels.
Xij is the covariate measure for experimental unit i in treatment level aj.
X̄.. is the covariate sample mean.
μ is the grand mean of the treatment-level population means.
αj is the treatment effect for population aj and is subject to the restriction Σj αj = 0.
εi(j) is the error effect that is NID(0, σ²ε).
For the F statistic in Table 13.3-2 to be distributed as the F distribution, the assumptions of a
completely randomized design described in Section 3.5 must be met. In addition to these
assumptions, the following assumptions also must be tenable:
1. The population within-groups regression coefficients are homogeneous; that is, βW1 = βW2 = ⋯ = βWp = βW.
The values of X that are represented in an experiment are determined by the sample of
subjects. Conclusions from an analysis of covariance are restricted to the X values in the
experiment; extrapolations to other levels of X must be made on nonstatistical grounds. It is
assumed that the regression of Y on X is linear. Of course, analysis of covariance is not
appropriate unless the effects eliminated by the covariate adjustment are irrelevant to the
objectives of the experiment.
In general, tests of significance in the analysis of covariance are robust with respect to violation
of the assumptions of normality and homogeneity of variance. Less is known about the effects
of violating the added regression assumptions.3 We know that departure from linearity results
in biased estimates of treatment effects and reduced efficiency of the covariance analysis.4
Measurement errors in the covariate may either obscure true differences among the dependent
variable means or create the illusion of differences where none exists. The effects of such
errors are minor if the variance of the measurement error is small relative to the variance of X.
Finally, we know that heterogeneity of the population within-groups regression coefficients,
βWjs, results in some loss of power and makes the interpretation of the test of treatment A
difficult. The general consensus is that the effects of heterogeneity of the regression
coefficients are typically small and in a conservative direction. Hence, the use of analysis of
covariance is probably appropriate in the face of small departures from homogeneity of the
regression coefficients.
An F statistic for testing the hypothesis of homogeneity of the within-groups regression coefficients is given by

F = [E2/(p − 1)] / [E1/(p(n − 2))],

where

E1 = Σj [Eyyj − (Exyj)²/Exxj] and E2 = Eyy(adj) − E1.

E1 corresponds to the variation of the individual observations around the unpooled within-groups regression lines. E2 is the variation of the p within-groups regression coefficients around the pooled within-groups regression coefficient—that is, around β̂W = Exy/Exx. The larger this latter source of variation relative to E1, the less tenable is the assumption of homogeneity of the within-groups regression coefficients. The F statistic for the data in Table 13.3-1 is less than the tabled value, F.10; 3, 24 = 2.33. Thus, the assumption of homogeneity of
the regression coefficients is tenable. If this key assumption had not been tenable, an
alternative procedure such as the P. O. Johnson and Neyman (1936) technique or a procedure
by Rogosa (1980) that does not assume homogeneity of regression coefficients could be used.
Excellent descriptions of these procedures are given by, respectively, Huitema (2011, chap. 11)
and Maxwell and Delaney (2004, pp. 460–467).
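The E1/E2 decomposition can be sketched directly. The group scores below are invented, and p = 3, n = 4 here rather than the values in the worked example:

```python
# Invented scores for p = 3 groups of n = 4; used only to show how
# E1, E2, and the homogeneity F statistic fit together.
groups = [
    ([1, 2, 3, 4], [2, 4, 5, 7]),
    ([1, 2, 3, 4], [1, 2, 4, 5]),
    ([1, 2, 3, 4], [3, 3, 5, 6]),
]
p, n = len(groups), len(groups[0][0])

Exx = Eyy = Exy = E1 = 0.0
for xs, ys in groups:
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    Exx, Eyy, Exy = Exx + sxx, Eyy + syy, Exy + sxy
    E1 += syy - sxy ** 2 / sxx        # residual around this group's own line

Eyy_adj = Eyy - Exy ** 2 / Exx        # residual around the pooled slope
E2 = Eyy_adj - E1                     # cost of forcing a single slope

F = (E2 / (p - 1)) / (E1 / (p * (n - 2)))
print(round(F, 3))
```

A large F signals that the separate group slopes fit markedly better than one pooled slope, which would undercut the homogeneity assumption.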
An approximate test of the hypothesis that the overall regression line is linear is given by
with degrees of freedom equal to 2(p − 1) and p(n − 2). The terms E1 and E2 have already
been defined; E3 and E4 are given by
If the relationship between X and Y is not linear, it may be possible to transform the variables
so that the resulting relationship is linear. Transformations are discussed in Section 3.6.
Sometimes an appropriate transformation cannot be found, in which case a polynomial
ANCOVA model can be used. This approach is described by Huitema (2011, chap. 12).
Before comparisons among means can be made, the means must be adjusted for the concomitant variable. An adjusted mean for treatment level aj is given by

Ȳadj.j = Ȳ.j − β̂W(X̄.j − X̄..),

where β̂W = Exy/Exx and X̄.. is the mean of all the X observations. I use the data in Table 13.3-1 to illustrate the computation of adjusted means for treatment levels a1 and a2. The adjusted means Ȳadj.1 and Ȳadj.2 can be interpreted as the means that would be expected if treatment levels a1 and a2 had the same concomitant-variable mean. Notice that the difference between the adjusted means, 5.31 − 5.33 = −0.02, is smaller than the difference between the unadjusted means, 2.75 − 3.50 = −0.75.
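A minimal sketch of the adjustment, with an assumed slope and assumed group means (the Table 13.3-1 values are not reproduced here):

```python
# Assumed within-groups slope, grand covariate mean, and group means;
# all values are hypothetical.
b_W = 1.25
x_grand = 5.0
means = {"a1": (4.0, 2.75), "a2": (6.0, 3.50)}   # (Xbar_j, Ybar_j)

# Ybar_adj_j = Ybar_j - b_W * (Xbar_j - x_grand)
adj = {j: ybar - b_W * (xbar - x_grand) for j, (xbar, ybar) in means.items()}

# a1 lies below the grand covariate mean, so its mean is adjusted up;
# a2 lies above it, so its mean is adjusted down.
print(adj["a1"], adj["a2"])            # prints 4.0 2.25
```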
You learned in Section 5.4 that Student's t test is recommended for testing hypotheses for p − 1 a priori orthogonal contrasts among p means. The t statistic with adjusted means, ψ̂ = Σj cjȲadj.j, is

t = ψ̂ / √( MSEadj[Σj c²j/nj + (Σj cjX̄.j)²/Exx] ).

The critical value for this statistic is obtained from Student's t distribution in Appendix Table E.3 and is denoted by tα/2, v for a two-sided null hypothesis and by tα, v for a one-sided hypothesis. The statistic has v = p(n − 1) − 1 degrees of freedom. The computation of a t
statistic is illustrated for the hypothesis H0: μadj1 − μadj2 = 0. The data are from Tables 13.3-1
and 13.3-2:
Because | t | < t.05, 27 = 2.052, the null hypothesis for this contrast cannot be rejected.
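A sketch of the contrast test: the two adjusted means for a1 and a2 are taken from the text, while the remaining inputs (the other adjusted means, covariate means, MSEadj, and Exx) are invented stand-ins for the Table 13.3-1 quantities:

```python
import math

c       = [1, -1, 0, 0]                # coefficients for psi = mu_adj1 - mu_adj2
y_adj   = [5.31, 5.33, 5.10, 5.60]     # first two adjusted means from the text
x_means = [4.2, 4.8, 5.1, 5.9]         # hypothetical covariate means
n, p    = 8, 4                         # subjects per level, number of levels
MSE_adj = 0.52                         # hypothetical adjusted error term
Exx     = 40.0                         # hypothetical within-groups SS for X

psi_hat = sum(cj * yj for cj, yj in zip(c, y_adj))
var_psi = MSE_adj * (sum(cj ** 2 for cj in c) / n
                     + sum(cj * xj for cj, xj in zip(c, x_means)) ** 2 / Exx)
t = psi_hat / math.sqrt(var_psi)
df = p * (n - 1) - 1                   # 27, matching t.05,27 = 2.052
print(round(t, 3), df)
```

The second term in the standard error grows with the distance between the covariate means of the contrasted groups, so unequal covariate means cost precision.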
The Dunn-Šidák and Holm tests, described in Section 5.4, are recommended for testing C a
priori nonorthogonal contrasts among p means. The statistic in equation (13.5-1) can be used
with both tests. The critical value for the Dunn-Šidák test, denoted by tDSα/2; C, v, is obtained
from Appendix Table E.15. The critical values for the Holm test for Ci = C, C − 1,…, 2 also are obtained from Appendix Table E.15. The critical value for C = 1 is equal to tα/2, v and is obtained from Student's t distribution in Appendix Table E.3. The degrees of freedom for the Dunn-Šidák and Holm tests are v = p(n − 1) − 1.
A variety of procedures are recommended for testing hypotheses for all pairwise contrasts
among p means. The Fisher-Hayter test is the simplest of the more powerful procedures. Recall
that the Fisher-Hayter test has two steps. The first step is a test of the omnibus null hypothesis
using either an F or q statistic. If this test is not significant, the omnibus null hypothesis is not
rejected, and no further tests are performed. If the omnibus null hypothesis is rejected, each of
the pairwise contrasts is tested. The test statistic, denoted by qFH, is
If the covariate is a fixed effect, the statistic in equation (13.5-2) is referred to the Studentized
range distribution. The critical value, denoted by qα; p–1, v, is obtained from Appendix Table
E.6. The test has v = p(n − 1) −1 degrees of freedom. Note that Appendix Table E.6 is entered
for p − 1 means instead of p means.
If, as is usually the case, the covariate is a random effect, the statistic in (13.5-2) is referred to
the generalized Studentized range distribution. The resulting test, called the Bryant-
Paulson procedure (Bryant & Paulson, 1976), is denoted by qBP. The Bryant-Paulson
procedure was developed as a generalization of Tukey's HSD test. The critical value of the
Bryant-Paulson test is denoted by qBPα; C, p, v, where C is the number of covariates, p is the number of treatment levels, and v = p(n − 1) − 1. The critical value is obtained from Appendix
Table E.17.
If a researcher is interested in testing hypotheses for pairwise and nonpairwise contrasts that
are suggested from an inspection of the data, Scheffé's procedure should be used. The test
statistic is
with v1 = p − 1 and v2 = p(n − 1) − 1 degrees of freedom. The critical value for the Scheffé test is (p − 1)Fα; p − 1, p(n − 1) − 1. The critical value is obtained from Appendix Table E.4.
The procedures described for one covariate can be extended to experiments with two or more
covariates. If two covariates, denoted by X and Z, are available, it is possible to adjust the
dependent variable Y so as to remove the extraneous variation associated with X and Z. An
example illustrating the use of two covariates is presented in this section. The computation for
the two-covariate case is relatively simple. If more than two covariates are used, the
computations should be performed with a statistical package or either the regression model
approach or the cell means model approach in Sections 13.10 and 13.11, respectively.
If the two covariates represent good choices, the inclusion of a third covariate generally will not
add appreciably to the reduction in either error variance or bias. The analogous situation occurs
in multiple regression: There is a point beyond which adding predictors does not appreciably
improve prediction.
Assume that in the experiment analyzed in Section 13.3, two covariates were measured prior to
the beginning of the experiment—intelligence and arithmetic achievement. These covariates are
denoted by X and Z, respectively. The layout of the design, computational tables, and formulas
are shown in Table 13.6-1. The data in Table 13.6-1 for Y and X are identical to the data in
Table 13.3-1. All the computational symbols in this latter table are required in the analysis. To
simplify the presentation, only the new symbols appear in Table 13.6-1. The analysis of the
experiment is summarized in Table 13.6-2. The inclusion of a second covariate in the
experiment has reduced the error variance and provided a better estimate of treatment effects
relative to the use of only one covariate. As a result, the null hypothesis can be rejected. When
both intelligence and initial arithmetic achievement are taken into account, the four teaching
methods produce different results with respect to the dependent variable. The assumptions for
this analysis are similar to those described in Section 13.4 for the one-covariate case.
Table 13.6-1 ▪ Computational Procedures for CRAC-4 Design With Two Covariates
Table 13.6-2 ▪ ANCOVA Table for CRAC-4 Design With Two Covariates
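The two-covariate test summarized above can be sketched as a full-versus-reduced model comparison. The data below are hypothetical (not the values in Tables 13.6-1 and 13.6-2); the F ratio uses the error degrees of freedom p(n − 1) − 2 given in the text for two covariates.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 8                        # treatment levels, subjects per level
group = np.repeat(np.arange(p), n)
X = rng.normal(100, 15, p * n)     # covariate 1 (e.g., intelligence)
Z = rng.normal(50, 10, p * n)      # covariate 2 (e.g., achievement)
Y = 0.3 * X + 0.5 * Z + 2.0 * group + rng.normal(0, 5, p * n)

def sse(design, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

ones = np.ones(p * n)
dummies = (group[:, None] == np.arange(p - 1)).astype(float)
full = np.column_stack([ones, dummies, X, Z])   # treatments + both covariates
reduced = np.column_stack([ones, X, Z])         # covariates only

df1, df2 = p - 1, p * (n - 1) - 2               # "- 2" because two covariates
F = ((sse(reduced, Y) - sse(full, Y)) / df1) / (sse(full, Y) / df2)
print(round(F, 2))
```

The gain in fit from adding the treatment dummies, relative to the full-model error, is exactly the adjusted treatment test of the ANCOVA table.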
Tests of differences among means are carried out with adjusted means. An adjusted mean for
treatment level aj is given by
where and are defined in Table 13.6-1. For the data in Table 13.6-1, adjusted means for
treatment levels a1 and a2 are
The adjusted means and can be interpreted as the means that would be expected if
treatment levels a1 and a2 had the same X and Z concomitant-variable means.
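The adjustment just described can be computed directly: the pooled within-groups slopes bX and bZ solve a 2 × 2 system built from Exx, Ezz, Exz, Exy, and Ezy, and each treatment mean is then adjusted to the grand covariate means. The scores below are hypothetical, not those of Table 13.6-1.

```python
import numpy as np

# hypothetical scores: rows = treatment levels, columns = subjects
Y = np.array([[3., 5., 4.], [6., 7., 8.], [4., 6., 5.]])
X = np.array([[10., 12., 11.], [14., 15., 16.], [11., 13., 12.]])
Z = np.array([[20., 23., 21.], [24., 25., 27.], [22., 21., 24.]])

def within_sscp(A, B):
    """Pooled within-groups sum of cross products."""
    return sum(((a - a.mean()) * (b - b.mean())).sum() for a, b in zip(A, B))

Exx, Ezz = within_sscp(X, X), within_sscp(Z, Z)
Exz, Exy, Ezy = within_sscp(X, Z), within_sscp(X, Y), within_sscp(Z, Y)

# pooled within-groups slopes from the 2 x 2 normal equations
bX, bZ = np.linalg.solve([[Exx, Exz], [Exz, Ezz]], [Exy, Ezy])

# each group mean, adjusted to the grand covariate means
Y_adj = Y.mean(axis=1) - bX * (X.mean(axis=1) - X.mean()) \
                       - bZ * (Z.mean(axis=1) - Z.mean())
print(Y_adj.round(2))
```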
Procedures for testing hypotheses for contrasts among means are extensions of the
procedures described in Section 13.5. Unfortunately, when a contrast involves more than two
means, the formulas become relatively complex. The formulas can be simplified by using
where
The values of the covariate means, and , and the within-groups sums of squares, Exx,
Ezz, Exz, can be obtained from Table 13.6-1. The t statistic has p(n − 1) − 2 degrees of
freedom. The computations are illustrated for the hypothesis H0: μadj1 − μadj2 = 0. The
required matrices are
The t statistic is
Because | t | < t.05,26 = 2.056, the null hypothesis for this contrast cannot be rejected.
The statistic in equation (13.6-1) can be used with both the Dunn-Šidák and Holm tests. The
test statistic for the Fisher-Hayter test, which always involves two means, is
with p(n − 1) − 2 degrees of freedom. The Fisher-Hayter test is appropriate if the covariates are
fixed effects. If the covariates are random effects, equation (13.6-2) should be referred to the
generalized Studentized range distribution (see Appendix Table E.17).
The computational procedures for a CRAC-p design can be easily extended to other
experimental designs, such as the randomized block analysis of covariance design, the RBAC-
p design. The model equation for an RBAC-p design is
The computational procedures for an RBAC-p design are presented in Table 13.7-1. The
formulas for the unadjusted sums of squares are those for a randomized block design, but the
adjusted formulas follow the pattern shown in Table 13.3-1 for a completely randomized
analysis of covariance design. For example,
where
The procedures shown in Table 13.7-1 are analogous to the adjustment procedures for a
completely randomized analysis of covariance design. This is more apparent if the adjustment
procedure for a CRAC-p design is presented in the following way:
This follows because Tyy = Ayy + Eyy, Txy = Axy + Exy, and Txx = Axx + Exx, where A and E
denote treatment and error sums of squares, respectively. The adjustment formula must be
modified for a randomized block analysis of covariance design because in this design, Tyy
includes a block sum of squares in addition to treatment and error sums of squares. I show in
subsequent sections that whenever the total sum of squares contains variation other than
treatment and error sums of squares, a subtotal consisting of only the latter sources of variation
must be used in the analysis of covariance.
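The subtotal rule can be written out explicitly. Restating the relations Tyy = Ayy + Eyy, Txy = Axy + Exy, and Txx = Axx + Exx from above in one display:

```latex
% CRAC-p adjustment expressed with the treatment-plus-error subtotal
SS_{A\,\mathrm{adj}}
  = \left[(A_{yy}+E_{yy})
      - \frac{(A_{xy}+E_{xy})^{2}}{A_{xx}+E_{xx}}\right]
    - \left[E_{yy}-\frac{E_{xy}^{2}}{E_{xx}}\right]
```

For an RBAC-p design, Tyy also contains a block sum of squares, so the subtotal Ayy + Eyy (rather than Tyy) must appear in the first bracket.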
The assumptions associated with tests of significance are those for a randomized block design
(Section 7.4) and the analysis of covariance (Section 13.4). Procedures for making comparisons
among adjusted treatment means are, with the following exception, identical to those described
in Section 13.5 for a CRAC-p design. The exception: The error term MSRadj must be
substituted for MSEadj.
When one or more observations are missing, the cell means model approach that is described
in Section 13.11 can be used.
It is apparent from this model equation that an adjusted observation estimates the familiar
terms of the linear model equation for a CRF-pq design. For the F statistics in Table 13.8-1 to
be distributed as the F distribution, the assumptions of the CRF-pq design in Section 9.4 must
be tenable. In addition, the following assumptions must be tenable: (1) The jk population within-
cell regression coefficients are homogeneous, (2) X is measured without error, and (3) the
residuals (deviations from regression) are NID with mean equal to zero and common variance. A
test of the hypothesis that
is given by
where
The cell means model approach or the regression model approach described in Sections 13.11
and 13.10, respectively, can be used when the cell ns are unequal.
Contrasts Among Means. Adjusted means for Ay, By, and ABy are given by, respectively,
where .
The statistics in equations (13.8-1), (13.8-2), and (13.8-3) can be used with both the Dunn-
Šidák and Holm procedures. Either procedure can be used to test C a priori nonorthogonal
contrasts.
The statistics for the Fisher-Hayter test, which are used to test all pairwise contrasts, are
Recall that these statistics are referred to the Studentized range distribution if the covariate is a
fixed effect. If the covariate is a random effect, the statistics should be referred to the
generalized Studentized range distribution (Appendix Table E.17). The resulting procedure is
called the Bryant-Paulson procedure. The degrees of freedom for both procedures are pq(n −
1) − 1.
Scheffé's statistic, which is used to test all contrasts suggested by an inspection of the data, is
equal to the square of the statistics in equations (13.8-1), (13.8-2), and (13.8-3). The critical
values for Scheffé's test are, respectively, (p − 1)Fα; p−1, pq(n−1)−1, (q − 1)Fα; q−1, pq(n−1)−1,
and (p − 1)(q − 1)Fα; (p−1)(q−1), pq(n−1)−1.
The collection of concomitant observations in a split-plot factorial design can take one of two
forms, as shown in Figure 13.8-1. In Figure 13.8-1(a), a single covariate measure is associated
with all the dependent observations for a subject. In this case, it is assumed that the covariate
is obtained prior to administration of any of the treatment combinations. In Figure 13.8-1(b),
each dependent observation for a subject is paired with a unique covariate measure. The
design in part (a) can be considered a special case of part (b), where the covariate is identical
for each criterion measure. For this reason, only computational procedures for the design in
part (b) are given.
Figure 13.8-1 ▪ The SPFAC-2.3 designs illustrate two forms for the concomitant variable.
(a) A single covariate is associated with all of the criterion measures; (b) each criterion
measure is paired with a unique covariate.
If the regression for the between-blocks variation, βB, is different from the regression for the
within-blocks variation, βW, the following model equation for the design in part (b) is
appropriate:
If it can be assumed that βB = βW, the latter regression coefficient can be used to adjust the
between-blocks and within-blocks sums of squares. The simplified model (13.8-5) described
earlier is appropriate for this case. A test of the hypothesis that βB = βW is given by
Because the variances in the denominator of this t′ statistic are likely to be heterogeneous, t′ is
not distributed as the t distribution. The sampling distribution of t′ is approximately a t
distribution whose degrees of freedom are obtained from a Satterthwaite-type combination of
dfE and dfR, where dfE and dfR refer to the degrees of freedom for MSEadj and MSRadj, respectively. A
numerically large level of significance (α = .10 or .25) should be used for this test to avoid a
Type II error. If the hypothesis that βB = βW is tenable, the adjustment procedures shown in
Table 13.8-2 for the between-blocks sums of squares can be modified to reflect this. The
adjustment procedures for the within-blocks sums of squares are unchanged. The formulas for
computing adjusted sums of squares for treatment A and blocks within groups are
or
According to Winer et al. (1991, pp. 827–828), the latter formulas are preferred. Adjusted
means for A and AB, respectively, are given by
I noted earlier that the design in Figure 13.8-1(a), which involves a single covariate measure for
all criterion measures, is a special case of the design in Figure 13.8-1(b). Only the between-
blocks terms are adjusted for the covariate. This follows because the within-blocks adjustments
are equal to zero; that is, Bxx, ABxx, Rxx, Bxy, ABxy, and Rxy equal zero.
The extension of these procedures to other analysis of covariance designs should be apparent
to the reader. Space limitations prevent a discussion of all such designs. Federer (1955,
chap. 16) describes the application of analysis of covariance to a variety of designs.
If concomitant variables are obtained prior to the presentation of a treatment, several alternative
research strategies should be considered by a researcher. The alternative described in this
chapter is to statistically remove the variation in the dependent variable that is associated with
the variation in the covariate. Another alternative is to use the covariate to form homogeneous
blocks by assigning subjects with the highest covariate to one block, subjects with a lower
covariate to a different block, and so on. If this type of stratification is employed, the data are
analyzed by means of a randomized block design in which variation associated with blocks is
isolated from error variation. In general, this design strategy is preferable to analysis of
covariance if a researcher's principal interest is in reducing the error variance rather than in
removing bias from estimates of treatment effects. Note that a randomized block analysis
requires somewhat less restrictive assumptions than a covariance analysis.
The analysis of covariance design assumes that the correct regression equation has been fitted
and that the within-treatment regression coefficients are homogeneous. The analysis is greatly
simplified if the relationship between Y and X is linear. A randomized block design is essentially
a function-free regression scheme and is appropriate even though the relationship between Y
and X is nonlinear. Of course, it is assumed that the interaction of blocks and treatment levels
is zero in a randomized block design. This assumption is similar to the assumption of
homogeneity of regression coefficients in a covariance analysis.
Covariance and stratification are almost equally effective in removing the effects of an
extraneous source of variation from the error term if the relationship between X and Y is linear.
If it is possible to assign subjects to blocks so that the X values are equal within a block, the
use of stratification in a randomized block design reduces the error term to approximately
where is the error variance when no stratification is used, and is the within-groups
correlation coefficient. The corresponding reduction due to analysis of covariance is
Feldt (1958) found that an analysis of covariance design is more precise than a randomized
block design when the true correlation between X and Y is greater than 0.6. The randomized
block design is more precise when the correlation is less than 0.4.
In analysis of covariance, the relationship between Y and X is estimated from the data. As I
noted, the computations required for analysis of covariance can be involved and laborious.
Under certain conditions, the use of difference scores in a conventional analysis of variance
achieves some of the same advantages as analysis of covariance without the laborious
computations. This procedure is applicable to situations in which the concomitant variable is
measured on the same scale as the dependent variable. For example, the covariate might be
an initial score Xij obtained on a test, and the dependent variable might be a score Yij obtained
on the test after administration of a treatment. The difference score, Dij, is defined as
and reflects the change in a subject's test performance that is attributable to treatment aj. The
analysis of variance is performed with the difference score. The ANOVA model equation for
difference scores, assuming that subjects have been randomly assigned to p treatment levels,
is
If the regression of Y on X is linear and βW = 1.0, the analysis of difference scores gives the
same estimate of treatment effects as analysis of covariance. In this case, the ANOVA model for
difference scores and the ANCOVA model are essentially equivalent. To illustrate, an adjusted
which, except for the constant , is the same as the ANOVA model for difference scores.
The ANCOVA model and the difference-score approach assume that the relationship between Y
and X is linear. This assumption may or may not be tenable for a particular experiment.
However, the analysis of covariance does not require an a priori assumption concerning the
value of βW but instead estimates the value of βW from the data. A moderate departure of βW
from the required value of 1 does not result in a serious loss of precision when difference
scores are used. Reichardt in Cook and Campbell (1979) provides an excellent summary of the
relative merits of analysis of covariance, stratification, and difference scores.
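The equivalence claimed above for βW = 1 is easy to verify numerically: with the slope fixed at 1, a contrast between adjusted means reduces algebraically to the same contrast between mean difference scores. The pretest/posttest values below are hypothetical.

```python
import numpy as np

X = np.array([[12., 15., 11., 14.],    # pretest scores, group a1
              [13., 10., 16., 12.]])   # pretest scores, group a2
Y = np.array([[18., 22., 16., 20.],    # posttest scores, group a1
              [15., 13., 19., 14.]])   # posttest scores, group a2
D = Y - X                              # difference scores D_ij = Y_ij - X_ij

# adjusted means with the slope fixed at 1 (the value difference scores imply)
Y_adj = Y.mean(axis=1) - 1.0 * (X.mean(axis=1) - X.mean())

contrast_adj = Y_adj[0] - Y_adj[1]                   # adjusted-mean contrast
contrast_D = D.mean(axis=1)[0] - D.mean(axis=1)[1]   # difference-score contrast
print(contrast_adj, contrast_D)                      # the two contrasts coincide
```

When βW is estimated from the data instead of fixed at 1, the two analyses diverge to the extent that the estimate departs from 1.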
The regression model approach that I introduced in Chapter 7 can be used to estimate the
parameters of a completely randomized analysis of covariance design,
and to test interesting hypotheses about these parameters. Consider the CRAC-4 design for
evaluating four methods of teaching arithmetic described in Section 13.3. A qualitative
regression model with h = (p − 1) + 1 = 4 independent variables (Xi1, Xi2, Xi3, and Xi4) can be
formulated for this design as follows:
where εi is NID(0, ). The independent variables of the regression model can be coded using
dummy coding as follows.
It can be shown, following the procedure in Section 7.5, that this coding scheme results in the
following correspondence between the parameters of the regression model and those of the
CRAC-4 experimental design model:
The regression sum of squares reflecting the contribution of X1, X2, and X3 over and above
that due to including X4 in the model is given by
where SSR and SSE denote, respectively, the regression and error sums of squares.
Procedures for computing SSR(X1X2X3 | X4) and other terms required in testing the
hypothesis β1 = β2 = β3 = 0 are illustrated in Table 13.10-1. The data are from Table 13.3-1. A
comparison of the mean squares and F statistic in part (iii) of Table 13.10-1 with those in Table
13.3-2, where the classical sum-of-squares approach was used, reveals that both approaches
lead to the same results: MSR(X1X2X3 | X4) = MSAadj, MSE = MSEadj, and F = 2.29.
Table 13.10-1 ▪ Computational Procedures for CRAC-4 Design Using a Regression Model
With Dummy Coding
I can formulate a regression model for the data in Table 13.10-1 that allows for p − 1 = 3
different within-group regression coefficients as follows:
are obtained by multiplying the coefficients for X4 by those for X1, X2, and X3. A test of the
hypothesis of no interaction
is equivalent to testing the hypothesis that the within-group regression coefficients are equal. To
test β5 = β6 = β7 = 0, I compare the fit of model (13.10-2), the full model, with model (13.10-1),
now called the reduced model. The test statistic is
where
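The full-versus-reduced comparison for homogeneity of within-group regression coefficients can be sketched as follows. The data are hypothetical; the full model simply appends group-by-covariate product columns to the reduced design matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 10
group = np.repeat(np.arange(p), n)
X = rng.normal(0, 1, p * n)
Y = 2.0 * X + 1.5 * group + rng.normal(0, 1, p * n)  # common slope built in

def sse(design, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    r = y - design @ beta
    return r @ r

ones = np.ones(p * n)
dummies = (group[:, None] == np.arange(p - 1)).astype(float)
reduced = np.column_stack([ones, dummies, X])                     # one pooled slope
full = np.column_stack([ones, dummies, X, dummies * X[:, None]])  # separate slopes

df1 = p - 1                 # extra slope parameters in the full model
df2 = p * n - 2 * p         # full-model error df (p intercepts + p slopes)
F = ((sse(reduced, Y) - sse(full, Y)) / df1) / (sse(full, Y) / df2)
print(round(F, 2))          # refer to F with (df1, df2) degrees of freedom
```

A nonsignificant F supports pooling the within-group slopes, as the text's reduced model assumes.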
The regression model can easily be modified for the case in which there are several covariates.
The model for the data in Table 13.6-1, where two covariates were used, is
where εi is NID(0, ). The first four independent variables of the regression model are coded
as before; the fifth independent variable is coded as . The regression sum of
squares reflecting the contribution of X1, X2, and X3 over and above that due to including X4
and X5 in the model is given by
The computation of these terms follows the procedures shown in Table 13.10-1.
I described the cell means model approach for a completely randomized design in Section 7.7.
Here I extend the approach to a completely randomized analysis of covariance design with one
covariate. The cell means model is
where Yij is the observation for subject i in treatment level j, μj is the mean of the jth population,
βW is the pooled within-groups regression coefficient, Zij is the covariate for subject i in
treatment level j, is the mean of the np covariate observations, and εi(j) is the error effect that
is NID(0, ). I use the data for the four methods of teaching arithmetic in Table 13.3-1 to
illustrate the computations. The null hypothesis for these data can be expressed as
The computational procedures are shown in Table 13.11-1. The values of the mean squares
and F statistic in part (ii) are the same as those in Table 13.3-2, where the classical sum-of-
squares approach was used.
Table 13.11-1 ▪ Computational Procedures for CRAC-4 Design Using the Cell Means
Model Approach
The extension of the CRAC-p design for the case in which there are two or more covariates is
described by Kirk and Bekele (2009).
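As a minimal sketch of the cell means model with one covariate, the cell means μj and the pooled slope βW can be estimated jointly by ordinary least squares, using one indicator column per cell plus the centered covariate. The data below are hypothetical, not those of Table 13.3-1.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 3, 6
group = np.repeat(np.arange(p), n)
Z = rng.normal(0, 1, p * n)
Y = np.array([1.0, 2.0, 4.0])[group] + 0.8 * (Z - Z.mean()) \
    + rng.normal(0, 0.5, p * n)

# design: one indicator column per cell mean, plus the centered covariate
Xmat = np.column_stack(
    [(group == j).astype(float) for j in range(p)] + [Z - Z.mean()]
)
est, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
mu_hat, beta_w = est[:p], est[p]
print(mu_hat.round(2), round(beta_w, 2))
```

Hypotheses about the μj are then expressed as linear contrasts on mu_hat, in the manner of Section 7.7.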
The restricted cell means model for a randomized block analysis of covariance design is
where μij is subject to the restrictions that for all i, i′, j, and j′. Yij is the
observation in block i and treatment level j, μij is the dependent variable population mean for
the ijth cell, βW is the pooled within-treatment levels regression coefficient, Zij is the covariate
for the ijth cell, is the mean of the np covariate observations, and εij is the error effect that is
NID(0, ).
where , and are coefficient matrices that define the null hypotheses, μy is a vector
of dependent variable cell means, and 0A and 0BL are null vectors. The restrictions on μij can
be expressed as follows:
where
The procedures described in Section 8.8 can be used to compute , and R′ when one or
more observations are missing. The analysis procedures described here for CRAC-p a n d
RBAC-p designs generalize to other designs that are formed from these building block designs.
Kirk and Bekele (2009) describe the cell means model approach to analyzing data for a
completely randomized factorial analysis of covariance design.
The major advantages of analysis of covariance are as follows:
1.Enables a researcher to reduce the error variance and remove one or more sources of bias
from estimates of treatment effects that are difficult or impossible to eliminate by
experimental control.
2.Provides approximately the same reduction in error variance as the use of stratification
(blocking) of subjects. However, covariance analysis can be used after the data have been
collected, whereas stratification of subjects into homogeneous blocks must be performed at
the beginning of the experiment.
The major disadvantages are as follows:
1.Computations are more laborious than those for a corresponding analysis of variance.
2.Analysis of covariance requires a restrictive set of assumptions that may prove untenable in
a particular research application.
3.Computational formulas for estimating missing observations and carrying out comparisons
among means for some analysis of covariance designs are relatively complex.
1.Terms to remember:
a.experimental control (13.1)
b.statistical control (13.1)
c. concomitant variable (13.1)
d.covariate (13.1)
e.conditional distribution (13.1)
f. reduced sum of squares (13.2)
g.adjusted total sum of squares (13.2)
h.adjusted score (13.2)
i. generalized Studentized range distribution (13.5)
j. Bryant-Paulson procedure (13.5)
k. difference score (13.9)
*2.[13.1] For each of the following experiments, identify the (i) independent variable, (ii)
dependent variable, and (iii) concomitant variable.
*a.The effects of four diets on the weight gains of rats were investigated. Data were
recorded for the amount of each diet that was consumed.
*b.The effects of three reading programs on the reading performance of fourth-graders
were investigated. A reading achievement test was administered prior to beginning the
programs.
c. The effects of four workspace arrangements on employee productivity were
investigated. Production records of the employees prior to the experiment were
obtained.
3.[13.1] Analysis of covariance is sometimes used to statistically equate intact groups that are
known to differ on one or more relevant variables. What interpretation problem does this
pose?
4.[13.1] In an ANCOVA design, how can one be certain that the measurement of the
concomitant variable does not include effects attributable to the treatment?
5.[13.2] Why is referred to as a reduced sum of squares?
*6.[13.2] Show that Txy = Exy + Axy by replacing in the formula for Txy with
and with .
*7.[13.2] For the data below compute the following.
*a.
*b.
*c.
d.Construct figures like Figures 13.2-2 to 13.2-4.
8.[13.2] For the data in Exercise 7, compute the following.
a.
b.
10.
Exercise 4 in Section 4.8 described an experiment to evaluate the effects of written
instructions designed to maximize a subject's attention to hypnotic facilitative information.
The subjects, 36 hypnotically naive male and female college students, were randomly
assigned to one of four groups with nine subjects in each group. Following the assignment,
the subjects took the Harvard Group Scale of Hypnotic Susceptibility. The following data
were obtained, where Y denotes a subject's score on the Stanford Hypnotic Susceptibility
Scale, Form C, and X denotes a subject's score on the Harvard Scale.
*11.
Exercise 8 in Section 8.11 described an experiment to determine the effects of 0, 1, 2, 3, 4,
and 5 irrelevant stimuli on the probability of correct response in a visual search task. During
the 5-second viewing time, the length of time each animal actually scanned the display was
recorded. The following data were obtained, where Y denotes the probability of a correct
response and X denotes the scanning time in seconds.
*12.
Exercise 5 in Section 9.16 described an experiment in which college students administered
a series of either painful or mild electric shocks as feedback for errors made by a White or
Black male confederate. Assume that at the beginning of the experiment, the subjects took
the Test Anxiety Scale for Adults (TASA). The following data were obtained, where Y
denotes the change in ratings from the pretest to the posttest on likability, intelligence, and
personal adjustment of the confederate and X denotes the TASA score.
*a.[13.8] Use ANCOVA to test H0: μ1. = μ2., H0: μ.1 = μ.2, and H0: μjk − μj′k − μjk′ + μj′k′ =
0 for all j and k; let α = .05.
*b.[13.8] Compare the results for ANCOVA with those for ANOVA.
*c.[13.3] (i) Compute rB for treatments A and B and the AB interaction; also compute rW.
(ii) Are these correlations consistent with the results in Exercise 12(b)?
*d.[13.8] Test the hypothesis of homogeneity of within-cell regression coefficients; let α =
.25.
*e.[13.8] Compute the four adjusted and four unadjusted treatment combination means.
f. (i) Graph the AB interaction in terms of adjusted and unadjusted treatment combination
means. (ii) How does the use of adjusted instead of unadjusted treatment combination
means alter your interpretation of the AB interaction?
g.Prepare a “results and discussion section” for the Journal of Experimental Social
Psychology.
13.
Exercise 6 in Section 9.16 described an experiment to test the hypothesis that persons who
are less physically attractive believe that they have less control over reinforcements in their
lives than those who are more attractive. Assume that ratings by three judges of the
physical attractiveness of the participants in the experiment were also obtained during the
experiment. The following data were obtained, where Y denotes a score on the Rotter I-E
scale and X denotes the mean physical attractiveness rating of a participant:
a.[13.8] Use ANCOVA to test H0: μ1. = μ2., H0: μ.1 = μ.2 = μ.3, and H0: μjk − μj′k − μjk′
+ μj′k′ = 0 for all j and k; let α = .05.
b.[13.8] Compare the results for ANCOVA with those for ANOVA.
c. [13.3] (i) Compute rB for treatments A and B and the AB interaction; also compute rW.
(ii) Are these correlations consistent with the results in Exercise 13(b)?
d.[13.8] Test the hypothesis of homogeneity of within-cell regression coefficients; let α =
.25.
e.[13.8] Compute the adjusted means for treatments A and B.
f. [13.8] (i) Use the Bryant-Paulson statistic to determine which population means differ for
treatment B. (ii) If you did Exercise 6(e) in Section 9.16, compare the two sets of tests
for treatment B. (iii) Do the results of the ANCOVA support the original research
hypothesis?
g.Prepare a “results and discussion section” for Perceptual and Motor Skills.
*14.*a.[13.10] For the data in Exercise 9 of this section, write a regression model equation.
*b.Assume that dummy coding is used and . Give the correspondence
between the parameters of the regression model and those of the CRAC-3 experimental
design model.
*c.[13.10] Write the reduced model equation for testing H0: β1 = β2 = 0.
*d.[13.10] Use the regression model approach to test the null hypothesis in part (c). (This
exercise requires the inversion of a 4 × 4 matrix. A computer with a matrix inversion
program should be used.)
*e.[13.10] Write full and reduced regression model equations for testing the hypothesis of
homogeneity of within-group regression coefficients.
*f.[13.10] Use the regression model approach to test the homogeneity of within-group
regression coefficients; let α = .25. (This exercise requires the inversion of a 6 × 6 matrix.
A computer with a matrix inversion program should be used.)
15. a.[13.10] For the data in Exercise 10 of this section, write a regression model equation.
b.[13.10] Assume that dummy coding is used and . Give the correspondence
between the parameters of the regression model and those of the CRAC-4 design.
c. [13.10] Write the reduced model equation for testing H0: β1 = β2 = β3 = 0.
d.[13.10] Use the regression model approach to test the null hypothesis. (This exercise
requires the inversion of a 5 × 5 matrix. A computer with a matrix inversion program
should be used.)
e.[13.10] Write full and reduced regression model equations for testing the hypothesis of
homogeneity of within-group regression coefficients.
f. [13.10] Use the regression model approach to test the homogeneity of the within-group
regression coefficients; let α = .25. (This exercise requires the inversion of an 8 × 8
matrix. A computer with a matrix inversion program should be used.)
*16.*a.[13.11] For the data in Exercise 9 of this section, write the cell means model equation
and the null hypothesis.
*b.Use the cell means model approach to test the null hypothesis.
17. a.[13.11] For the data in Exercise 10 of this section, write the cell means model equation
and the null hypothesis.
b.Use the cell means model approach to test the null hypothesis.
*18.*a.[13.11] For the data in Exercise 11 of this section, write the cell means model equation
and the null hypotheses.
*b.Use the cell means model approach to test the null hypotheses.
1For discussions of this problem, see Huitema (2011) and Shadish, Cook, and Campbell (2002,
chap. 8). Both references provide extensive treatments of the interpretation problems in
ANCOVA.
2Shadish et al. (2002, chap. 9) provide an excellent discussion of what to do when random
assignment does not result in equivalent groups.
3An excellent summary of the ANCOVA assumptions is provided by Huitema (2011, chap. 8).
The special problems posed by quasi-experimental design are examined by Shadish et al.
(2002, chaps. 4 and 5) and West, Biesanz, and Pitts (2000). Maxwell, Delaney, and O'Callaghan
5Vector and matrix operations are discussed in Section 7.2 and Appendix D.
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
http://dx.doi.org/10.4135/9781483384733.n13
A Latin square has p rows and p columns with p ordered letters assigned to the cells of the
square so that each letter appears once in each row and once in each column. The Latin
square got its name from an ancient puzzle that dealt with the number of ways that Latin letters
could be arranged in a square matrix.1 Let A, B, and C be three ordered letters. One of the 12
arrangements of a 3 × 3 Latin square is the following:
Ronald A. Fisher is responsible for promoting the use of experimental designs based on a Latin
square. I denote a Latin square design with p treatment levels by LS-p. In Chapter 7, I describe
a randomized block design that enables a researcher to isolate the effects of one nuisance
variable: variation among blocks. A Latin square design extends this procedure to two nuisance
variables: One nuisance variable is assigned to the rows of a square, and a second nuisance
variable is assigned to the columns of the square. The p levels of a treatment are assigned to
the cells of the square. By isolating two nuisance variables, Latin square designs are generally
more efficient than completely randomized and randomized block designs. For example,
Cochran (1938, 1940) reported that during a 7-year period at the Rothamsted Experimental
Station and associated centers, the efficiency of the Latin square design relative to those of
completely randomized and randomized block designs was 222% and 137%, respectively.
A Latin square design is used often in agricultural and industrial research. The design is used
less frequently in behavioral and educational research for reasons that are discussed later. The
design is appropriate for experiments that meet, in addition to the general assumptions of
analysis of variance, the following conditions:
1.One treatment with p ≥ 2 treatment levels and two nuisance variables with p levels each.
The p levels of the treatment are assigned to the cells of the square. The levels of one
nuisance variable are assigned to the p rows of the square; the levels of the other nuisance
variable are assigned to the p columns. A Latin square design with p < 5 is not practical
because of the small number of degrees of freedom that are available for estimating error
variance unless more than one experimental unit is assigned to each cell of the square or
unless several squares are combined. This latter procedure is described in Chapter 15.
2.No interactions among the rows, columns, and treatment levels of the square. If this
assumption is not met, a test of one or more of the corresponding effects is biased.
3.Random assignment of treatment levels to the p2 cells of the square, with the restriction
that each treatment level must appear only once in a row and once in a column. The design
requires a total of N = np2 experimental units, where n ≥ 1.
A Latin square is a standard square if the first row and the first column are ordered
alphabetically or numerically. The following are examples of standard squares:
Two squares are conjugate if the rows of one are identical to the columns of the other. For
example, the conjugate of square 5 × 5(2) is
A square is self-conjugate if the same square is obtained when its rows and columns are
interchanged. For example, squares 2 × 2, 3 × 3, 4 × 4(1, 2, 3, and 4), and 5 × 5(1) are self-
conjugate because the same arrangement of Latin letters results when the rows and columns
are interchanged. Squares 5 × 5(2) and 6 × 6 are not self-conjugate, as illustrated for square 5
× 5(2).
The simplest way to construct Latin squares is to perform p − 1 one-step cyclic permutations of
a sequence of letters. The process involves successively moving the first letter in a sequence—
say, A B C D—to the extreme right and simultaneously moving all other letters one position to
the left. For example, a one-step cyclic permutation of the letters A B C D gives B C D A, a
second cyclic permutation gives C D A B, and a third gives D A B C. Rows 2, 3, and 4 of the
following Latin square are produced by the three one-step cyclic permutations.
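The cyclic construction just described is easy to sketch in code:

```python
def cyclic_latin_square(letters):
    """Build a p x p Latin square from p - 1 one-step cyclic permutations."""
    rows, row = [], list(letters)
    for _ in range(len(letters)):
        rows.append(row)
        row = row[1:] + row[:1]   # move first letter to the extreme right
    return rows

for r in cyclic_latin_square("ABCD"):
    print(" ".join(r))
# A B C D
# B C D A
# C D A B
# D A B C
```

Each letter appears exactly once in every row and every column, so the result is always a Latin square.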
The letters in a Latin square can be rearranged to produce a total of p!(p − 1)! Latin squares,
including the original square. An enumeration of Latin squares of size 7 × 7 and smaller has
been made by Fisher and Yates (1934), Norton (1939), and Sade (1951). The number of
arrangements of Latin squares is as follows:
1. 2 × 2 Latin square. One standard square (self-conjugate) and one nonstandard square, obtained by interchanging either rows or columns of the standard square. There are a total of two arrangements.
2. 3 × 3 Latin square. One standard square (self-conjugate) and (3!)(2!) = (3 × 2 × 1)(2 × 1) = 12 arrangements of this standard square.
3. 4 × 4 Latin square. Four standard squares (all self-conjugate) with (4!)(3!) = 144 arrangements of each standard square. The total number of arrangements is equal to 4(144) = 576.
4. 5 × 5 Latin square. Twenty-five standard squares and their conjugates, plus 6 self-conjugate squares, for a total of 56 standard squares. The total number of arrangements is 56(5!)(4!) = 161,280.
5. 6 × 6 Latin square. 9408 standard squares with 9408(6!)(5!) = 812,851,200 arrangements.
6. 7 × 7 Latin square. 16,942,080 standard squares.
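These counts follow directly from the rule that each standard square yields p!(p − 1)! arrangements. A quick numerical check (function name is mine):

```python
from math import factorial

def total_latin_squares(p, n_standard):
    """Total number of p x p Latin squares: each of the n_standard
    standard squares yields p!(p - 1)! arrangements."""
    return n_standard * factorial(p) * factorial(p - 1)
```

For example, total_latin_squares(4, 4) returns 576 and total_latin_squares(5, 56) returns 161,280, agreeing with the enumeration above.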
The number of standard squares increases sharply as the dimensions of the square increase.
Rules for randomizing Latin squares have been described by Fisher and Yates (1963). In
theory, a researcher should randomly select a square from the population of all possible
squares of the proper dimension. However, this is not practical for squares larger than 5 × 5.
The following randomization rules are adequate for most research applications:
For example, to randomize a 4 × 4 Latin square, draw nine random digits between 1 and 4 from
a table of random numbers. Assume that the digits are 2, 3, 1, 2, 4, 4, 3, 1, and 2. Because the
first digit is 2, the second 4 × 4 square shown at the beginning of this section is selected. This
square is
The rows are ordered according to the second through the fifth random digits (3, 1, 2, 4) to
yield the following square:
Finally, the columns are ordered according to the sixth through the ninth random digits (4, 3, 1,
2). This gives the following square:
For squares of size 2 × 2, 3 × 3, and 4 × 4, the randomization procedures described here select
a Latin square at random from the set of all squares. For 5 × 5 and larger squares, a square is
selected at random from a subset of squares. Although the procedure does not select larger
squares from all possible squares, according to Cox (1958), the procedure is suitable for both
practical and theoretical purposes.
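The three randomization steps—select a standard square, reorder its rows, reorder its columns—can be sketched as follows. This is a minimal illustration, assuming that repeated digits within each group of draws are discarded so that each group forms a permutation; the function name is mine:

```python
import random

def randomize_square(standard_squares, rng=None):
    """Randomize a Latin square: (1) select a standard square at random,
    (2) reorder its rows at random, (3) reorder its columns at random.
    This mirrors drawing non-repeating digits from a random-number table."""
    rng = rng or random.Random()
    p = len(standard_squares[0])
    square = [row[:] for row in rng.choice(standard_squares)]
    square = [square[i] for i in rng.sample(range(p), p)]  # reorder rows
    cols = rng.sample(range(p), p)                         # reorder columns
    return [[row[j] for j in cols] for row in square]
```

Because row and column permutations preserve the Latin property, the result is always a valid Latin square.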
Assume that I want to evaluate automobile tires by means of a road test. The independent
variable is four kinds of rubber compounds, denoted by a1, a2, a3, and a4, used in the
construction of the tires. The dependent variable is the thickness of the tread that remains on
each tire after 10,000 miles of driving. Random samples of tires made from each rubber
compound were obtained from the production line. The tires were mounted on four cars
according to the randomization plan shown in Figure 14.3-1. The cars are designated by c1, c2,
c3, and c4 and the wheel positions by b1, right front; b2, left front; b3, right rear; and b4, left
rear. The rubber compounds and wheel positions represent fixed effects; the automobiles are
regarded as random effects. Thirty-two professional drivers were randomly assigned to the 16
treatment combinations in Figure 14.3-1, with the restriction that two drivers were assigned to
each combination. Thus, there were two observations in each cell. The layout for this design is
shown in Figure 14.3-2.
Figure 14.3-1 ▪ Randomization plan for evaluating four kinds of rubber compounds (a1, a2, a3, and a4)
Figure 14.3-2 ▪ Layout for a Latin square design (LS-4 design) where 32 subjects were randomly assigned to the 16 combinations of ajbkcl from the Latin square in Figure 14.3-1
An examination of Figure 14.3-1 reveals that each kind of tire was used equally often on each
car and appears the same number of times at each wheel position. The Latin square design
enables me to isolate the effects of two nuisance variables—differences among automobiles
and wheel positions—while evaluating the effects of the rubber compounds. The Latin square
in Figure 14.3-1 is identical to the 4 × 4 square obtained by the randomization procedure
illustrated in Section 14.2. To maintain a consistent notation, the treatment levels A, B, C, and D
are denoted by a1, a2, a3, and a4, respectively. Treatment effects are denoted by αj = μj.. – μ,
row effects by βk = μ.k. − μ, and column effects by γl = μ..l – μ, where μ is the grand mean.
The level of significance adopted is .05. The data and computational procedures are given in
Table 14.3-1. The results of the analysis are summarized in Table 14.3-2. The first test that
should be performed is a test of the residual mean square, F = MSRES/MSWCELL. If this test
is significant, it indicates that MSRES and MSA or MSB or MSC or possibly all three of the
latter mean squares estimate, in addition to main effects, one or more interaction components.2
When this occurs, a test of A, B, or C or a combination of these is positively biased. A test is
positively biased if it yields too many significant results. For example, if a test of treatment A is
significant but positively biased, we have no way of knowing whether the significance is due to
treatment A or to a component of the BC or ABC interactions. Because the test of MSRES in
Table 14.3-2 is close to 1 and not significant, it is reasonable to assume that the rows, columns,
and treatment levels do not interact and hence that tests of A, B, and C are not positively
biased. I conclude from the analysis in Table 14.3-2 that the treatment effects associated with
the rubber compounds are not all equal to zero. Procedures for testing hypotheses about
contrasts among the population means are described in Section 14.6.
When n is equal to or greater than 2 and the assumption of no interactions among treatment
levels, rows, and columns is tenable, the analysis provides two independent estimates of error
variance: MSWCELL and MSRES. Most writers recommend pooling these two mean squares to
obtain a better estimate of the population error variance as follows:
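The pooling formula itself is not reproduced in this excerpt, but the usual form divides the combined sums of squares by the combined degrees of freedom, (p − 1)(p − 2) for MSRES and p²(n − 1) for MSWCELL. A sketch under that assumption, with hypothetical argument names:

```python
def pooled_error(ss_res, ss_wcell, p, n):
    """Pooled error estimate for an LS-p design with n > 1 observations
    per cell: combined sums of squares over combined degrees of freedom.
    Appropriate only when rows, columns, and treatments do not interact."""
    df_res = (p - 1) * (p - 2)    # residual degrees of freedom
    df_wcell = p ** 2 * (n - 1)   # within-cell degrees of freedom
    return (ss_res + ss_wcell) / (df_res + df_wcell)
```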
Despite a researcher's best efforts, occasionally one or more observations are missing. This
presents a problem because the analysis procedures in Table 14.3-1 require that each cell of a
Latin square design contain the same number of observations. The statistical analysis can be
carried out using the cell means model or the regression model. The cell means model is
described in Section 14.9.
Procedures for estimating strength of association and effect size are described in Sections 4.5
and 8.2. The rubber compounds data in Table 14.3-1 are used to illustrate the computation of
partial omega squared and the partial intraclass correlation for a Latin square design. Recall
that the rubber compounds and wheel positions represent fixed effects, whereas the
automobiles are regarded as random effects. Furthermore, I assume that treatment levels,
rows, and columns do not interact. The measures of strength of association, partial omega
squared and partial intraclass correlation, are as follows:
where
The computations of , and so on are based on the expected values of the mean
squares in Table 14.3-2. The value of is unusually high; it is rare to find
treatments that account for more than 25% to 45% of the variance.
Partial omega squared for treatment A also can be computed from FA, n, and p. The alternative
formula is
where FA, n, and p are obtained from Table 14.3-2. This formula is useful for assessing the
practical significance of published research where only the F statistic and degrees of freedom
are provided.
Partial omega squared can be converted into Cohen's measure of effect size, f. The formula for
treatment A is
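The exact algebra of these two formulas is not visible in this excerpt. The sketch below assumes the common general form for a partial omega squared computed from an F statistic, df_A(F − 1)/[df_A(F − 1) + N] with N = np², together with the standard conversion to Cohen's f; the function names are mine:

```python
from math import sqrt

def partial_omega_squared(f_stat, p, n):
    """Partial omega squared for treatment A from F, p, and n.
    Assumes the general form df_A*(F - 1) / (df_A*(F - 1) + N),
    where df_A = p - 1 and N = n * p**2."""
    df_a = p - 1
    return df_a * (f_stat - 1) / (df_a * (f_stat - 1) + n * p ** 2)

def cohens_f(omega_sq):
    """Cohen's effect size f from partial omega squared."""
    return sqrt(omega_sq / (1 - omega_sq))
```

This is the form that makes the conversion useful for published research, where only F and its degrees of freedom are reported.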
Procedures for calculating power and the number of subjects necessary to achieve a specified
power for a completely randomized design are described in Section 4.5. These procedures
generalize to a Latin square design with two or more observations per cell. The rubber
compounds data in Table 14.3-1 are used to illustrate the computation of power for treatment A.
Recall that treatment A is a fixed effect. The value of is
Tang's charts also can be used to estimate the number of observations in a Latin square design
that are required to achieve a given power. To accomplish this, it is necessary to specify the (1)
level of significance, α; (2) power, 1 – β; (3) size of the population variance, ; and (4) sum of
the squared population treatment effects, . If it is not possible to estimate and
from a pilot study or previous research, the required sample size can be approximated by
specifying the number of treatment levels, α, 1 – β, and either ω2 or f. Suppose that a
researcher wants to determine the required sample size for a LS-4 design and specifies that α
= .05, 1 – β = .80, and f = .40 (a large effect size). The required sample size can be determined
from Appendix Table E.13 for v1 = p − 1 = 4 − 1 = 3 and
If the road test had not been repeated, there would be no within-cell error term. In this case,
MSRES can be used as an error term for testing the treatment and nuisance effects if it can be
assumed that MSRES estimates only error variance. To state it another way, MSRES is an
appropriate error term if and only if all of the interactions are equal to zero (i.e.,
). Tukey's test for nonadditivity described next provides a partial test of
this assumption.
Computational procedures for the case where n = 1 are very similar to those shown in Table
14.3-1. However, there are several important differences: (1) A score is denoted by Yjkl; (2)
there is no ABCS summary table, [ABCS] term, and MSWCELL term; (3) SSTO = [ABC] – [Y];
and (4) the summary tables and computational symbols should be modified by deleting n and
wherever they appear.
In Section 14.1, I observed that a Latin square design is not practical for p < 5 unless more than one observation per cell is obtained. This follows because for 2 × 2, 3 × 3, and 4 × 4 Latin squares, MSRES has only 0, 2, and 6 degrees of freedom, respectively. A 5 × 5 square has 12 degrees of freedom. Although marginal, this number of degrees of freedom is acceptable.
As I have discussed, if n = 1, there is no within-cell error term. To use MSRES as an error term,
all interactions among rows, columns, and treatment levels must equal zero. A partial test of
the additivity assumption when n = 1 can be made by means of Tukey's (1955) test for
nonadditivity, which is described in Section 8.3 in connection with a randomized block design.
For purposes of illustration, assume that the data in the ABC summary table of Table 14.3-1 are
based on an n of 1 instead of 2. Computational procedures for Tukey's test are shown in Table
14.4-1.
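Tukey's statistic for the simpler two-way case (the randomized block layout of Section 8.3) can be sketched as follows; Table 14.4-1 extends the same idea to the rows, columns, and treatments of a Latin square. The function name and data layout are mine:

```python
def tukey_nonadditivity_ss(table):
    """Tukey's one-degree-of-freedom sum of squares for nonadditivity in
    a two-way table with one observation per cell: SS = D**2 divided by
    (sum of squared row effects) * (sum of squared column effects),
    where D weights each score by the product of its row and column
    effects."""
    r, c = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (r * c)
    row_eff = [sum(row) / c - grand for row in table]
    col_eff = [sum(table[i][j] for i in range(r)) / r - grand for j in range(c)]
    d = sum(table[i][j] * row_eff[i] * col_eff[j]
            for i in range(r) for j in range(c))
    denom = sum(e * e for e in row_eff) * sum(e * e for e in col_eff)
    return d * d / denom if denom else 0.0
```

For a perfectly additive table (every score the sum of a row effect and a column effect), the statistic is zero; a multiplicative pattern produces a positive value.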
It is apparent from the analysis that the assumption of additivity is tenable. A relatively low level
of significance (α = .10) was adopted for the FNONADD test. The selection of this numerically
large significance level reflects a desire to increase the power of the test. A plot of the residual
for each observation, , against the estimated value,
, is shown in Figure 14.4-1. The absence of an
association between and supports the tenability of the additive model.
A significant test of nonadditivity means that MSRES estimates components of the two- and
three-factor interactions and that combinations of these components appear in MSA, MSB, and
MSC. Under these conditions, tests of main effects and nuisance effects are biased in a
complicated manner. For a discussion of this point, see Scheffé (1959, pp. 154–158) and Wilk
and Kempthorne (1957). If F = MSRES/MSWCELL or FNONADD is significant, a transformation
may be found that will produce additivity.
The classical experimental design model for the rubber compounds experiment described in
Section 14.3 is
where
Yijkl is the score for the ith experimental unit in treatment level j, row k, and column l.
μ is the grand mean of the population means. The grand mean is a constant for all observations in the experiment.
αj is the treatment effect for population j and is equal to μj.. − μ, the deviation of the grand mean from the jth population mean. The jth treatment effect is a constant for all observations in treatment level aj and is subject to the restriction Σαj = 0.
βk is the row effect for population k and is equal to μ.k. − μ, the deviation of the grand mean from the kth population mean. The kth row effect is a constant for all observations in row bk and is subject to the restriction Σβk = 0.
γl is the column effect for population l and is equal to μ..l − μ, the deviation of the grand mean from the lth population mean. The column effect is a random variable that is NID(0, σ²γ).
εjkl is the residual error effect for the jth treatment level, kth row, and lth column and is equal to (μjkl − μj.. − μ.k. − μ..l + 2μ). If all interaction effects are equal to zero, εjkl is NID(0, σ²res) and independent of γl and εi(jkl).
εi(jkl) is the within-cell error effect and is equal to (Yijkl − μjkl). εi(jkl) is NID(0, σ²ε) and independent of γl and εjkl.
This model is called a mixed model because the treatment and row effects are fixed effects, but
the automobiles (columns) are regarded as random effects. If I repeated the experiment, I
would obtain a new sample of automobiles. The expected values of the mean squares for this
model are given in Table 14.3-2. Expected values for other models are readily obtained by
replacing a variance term by a sum of squared effects divided by its degrees of freedom when
the effects are fixed and vice versa when the effects are random. For example, if the column
effects in Table 14.3-2 were regarded as fixed, σ²γ would be replaced by Σγ²l/(p − 1) and subject to the restriction Σγl = 0.
The residual error effect, εjkl, in model (14.5-1) is so named because it represents that portion
of a score that remains after the other terms in the model have been subtracted from it, that is,
If each cell of a Latin square design contains one observation, then a residual mean square
must be used to estimate the error variance because it is not possible to compute a within-cell
estimate of the error variance.
The values of the parameters μ, αj, βk, γl, εjkl, and εi(jkl) in model (14.5-1) are unknown, but
they can be estimated from sample data as follows.
The partition of the total sum of squares into SSA + SSB + SSC + SSRES + SSWCELL is
obtained by rearranging the terms in equation (14.5-2) following the examples in Sections 3.2
and 8.1.3 The derivation of the expected values of the mean squares follows the procedures
illustrated in Section 3.3 for a completely randomized design.
Statistics for testing hypotheses about contrasts in a Latin square design have the same
general form as those in Chapter 5. An a priori t statistic for the ith contrast among the
treatment A means has the following form:
with p²(n − 1) degrees of freedom. To reject the null hypothesis that ψi = c1μ1.. + c2μ2.. + … + cpμp.. = 0, the absolute value of t must equal or exceed tα/2, v for a two-tailed test or tα, v for a
one-tailed test, and in the latter case, the value of the contrast, , must be consistent with the
alternative hypothesis.
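The t statistic itself is not visible in this excerpt; the sketch below assumes its usual form, with the contrast estimate divided by its standard error, MSWCELL as the error term, and np observations per treatment mean. The function name is mine:

```python
from math import sqrt

def contrast_t(means, coeffs, ms_error, n_per_mean):
    """A priori t statistic for a contrast among treatment A means:
    psi-hat over its standard error, with ms_error the error term
    (MSWCELL when n >= 2) and n_per_mean = n*p observations per mean."""
    psi_hat = sum(c * m for c, m in zip(coeffs, means))
    se = sqrt(ms_error * sum(c * c for c in coeffs) / n_per_mean)
    return psi_hat / se
```

For example, a pairwise contrast of means 10 and 12 with MS_error = 4 and 8 observations per mean gives t = −2.0.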
The Fisher-Hayter (Hayter, 1986) and Scheffé (1959) a posteriori test statistics are, respectively,
with (p − 1) and p²(n − 1) degrees of freedom. The null hypothesis for a contrast is rejected if |qFH| ≥ qα; p−1, v for the Fisher-Hayter test. For Scheffé's test, the hypothesis is rejected if FS ≥ (p − 1)Fα; (p−1), p²(n−1).
A Latin square design enables a researcher to isolate the effects of two nuisance variables
while evaluating treatment effects. If the row and column effects are appreciably greater than
zero, a Latin square design is generally more powerful than completely randomized and
randomized block designs. Next, I show how to estimate the relative efficiencies of Latin square,
completely randomized, and randomized block designs.
The relative efficiency of completely randomized and Latin square designs, ignoring differences
in degrees of freedom, is given by
MSRES is the error term for the Latin square design. MSWG for the completely randomized
design is computed from the mean squares in a Latin square design as follows:
If the data in Table 14.3-1 were based on n = 1, the required mean squares for B, C, and the
residual can be obtained from Table 14.3-1 and are, respectively, 6.167, 5.167, and 1.667.
These mean squares are computed as if the values in the ABC summary table of Table 14.3-1
are based on n = 1 observation instead of two observations. The formulas for [B], [C], and so on
are modified accordingly. The value of MSWG required in equation (14.7-1) is
The efficiency of the Latin square design relative to that of a completely randomized design is
This number indicates that if the effects of the nuisance variables had not been isolated and if
the p² = 16 experimental units had been randomly assigned to the treatment levels in a
completely randomized design, the resulting error variance would be 1.96 times larger than that
of the Latin square design. A correction developed by Fisher (1935a) for the smaller error
degrees of freedom for the Latin square design can be incorporated in the efficiency index as
follows:
The efficiency of a Latin square design relative to that of a randomized block design can be
estimated in two ways depending on whether the rows or the columns of the square are
considered to be replicates. In the rubber compounds experiment, automobiles, the column
variable, represent replications and correspond to blocks in a randomized block design. The
relative efficiency of the two designs, with the columns considered to be replications, is given
by
where MSRESRB for a randomized block design is computed from the mean squares in a Latin
square design as follows:
Again, with the data in Table 14.3-1 considered to be based on n = 1, the relative efficiency is
Hence, the error variance of the randomized block design would be 1.56 times larger than that
of the Latin square design if the nuisance variable of wheel positions, variable B, had been
ignored. This estimate of the relative efficiency of the LS-p and RB-p designs includes a
correction (0.933) for the smaller error degrees of freedom for the Latin square design.
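The mean-square recombination formulas referred to above are not reproduced in this excerpt. The sketch below uses the standard forms, which reproduce the values 1.96, 0.933, and 1.56 reported in this section when MSB = 6.167, MSC = 5.167, and MSRES = 1.667; the function names are mine:

```python
def relative_efficiencies(ms_b, ms_c, ms_res, p):
    """Efficiency of a Latin square design relative to a completely
    randomized design (first value) and to a randomized block design
    with columns as replications (second value, uncorrected)."""
    ms_wg = (ms_b + ms_c + (p - 1) * ms_res) / (p + 1)  # CR within-groups estimate
    ms_res_rb = (ms_b + (p - 1) * ms_res) / p           # RB residual estimate
    return ms_wg / ms_res, ms_res_rb / ms_res

def fisher_df_correction(df_ls, df_other):
    """Fisher's correction for the Latin square's smaller error df."""
    return (df_ls + 1) * (df_other + 3) / ((df_other + 1) * (df_ls + 3))
```

With the Latin square's 6 residual degrees of freedom and the randomized block design's 9, the correction factor is 84/90 ≈ 0.933.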
If the rows of the Latin square are considered to be replications, MSRESRB is estimated by
The efficiency of a Latin square design relative to that of a randomized block design, with rows
considered to be replications, is
Comparison of the Partition of the Total Sum of Squares for Three Designs
At the beginning of this section, I observed that a Latin square design is generally more
powerful than either a completely randomized or a randomized block design if the row and
column effects in the Latin square design are appreciably greater than zero. I illustrate this
point with a schematic comparison of the total sum of squares and degrees of freedom for the
three designs. Assume that an experiment has four treatment levels with four observations in
each level. A partition of the total sum of squares and degrees of freedom for the three designs
is presented in Figure 14.7-1. It is evident from this figure that SSRES for the Latin square
design will be less than SSWG for a completely randomized design if either the row or column
effects are greater than zero. Also, SSRES for the Latin square design will be less than SSRES
for the randomized block design if the column effects are greater than zero. However, if the row
and column effects in the Latin square design are both equal to zero, the design is less
powerful than the completely randomized and randomized block designs because the error
terms of the latter designs have more degrees of freedom.
Figure 14.7-1 ▪ Partition of the total sum of squares and degrees of freedom for three
designs. The rectangles with the thicker lines indicate the sums of squares that are used
to compute an estimate of error variance.
I describe a variety of analysis of covariance designs in Chapter 13. The procedures there
generalize to a Latin square design. Computational procedures for a LSAC-p design are shown
in Table 14.8-1. The experimental design model equation for the design where n ≥ 2 is
The assumptions associated with the model equation are those for a Latin square design
described in Sections 14.1 and 14.5 and the analysis of covariance described in Section 13.4.
Comparisons among adjusted means follow the procedures in Section 13.5 for a CRAC-p
design.
If a Latin square design contains one observation per cell, the error term for testing treatment A
and the two nuisance variables is MSRES rather than MSWCELL. In this case, the formulas for
the adjusted sums of squares are as follows:
The cell means model approach I introduced in Section 7.7 can be used to analyze data for a
Latin square design. Two cases need to be distinguished: the restricted model and the
unrestricted model. If each cell of a Latin square contains only one observation, a restricted cell
means model should be used. If each cell contains two or more observations, either a restricted
or an unrestricted cell means model can be used. The unrestricted model is discussed first.
where εi(jkl) is NID(0, ). The analysis procedures are illustrated using the data in Table 14.3-1
for the rubber compounds experiment. In addition to computing SSTO and SSWCELL, I need
to formulate coefficient matrices, C′, for computing SSA, SSB, SSC, and SSRES. As an aid to
understanding the null hypotheses and associated coefficient matrices, the reader may find it
helpful to refer to Tables 14.9-1 and 14.9-2.
Table 14.9-1 ▪ Parameters Used to Construct Coefficient Matrices for an LS-4 Design
where
and h = p2. The coefficients, cjkl = ±1/p or 0, in have been multiplied by p = 4 to avoid
fractions. This does not affect the nature of the hypothesis that is tested.
where
where
The hypothesis that is the basis for formulating the residual coefficient matrix is
or
This hypothesis is the only one in the Latin square design that is unintelligible. In words, it
states that the sources of variation that correspond to components of two- and three-treatment
interactions are all equal to zero. An algorithm for obtaining the residual matrix for any Latin
square design is described later.
Formulas for computing sums of squares, F statistics, and cell means are given in Table 14.9-2.
A comparison of these sums of squares with those in Table 14.3-1, where the classical sum-of-
squares approach is used, reveals that they are identical. The cell means model can be used to
analyze data for a Latin square design with missing observations and cells. The required
modifications are similar to those presented in Section 9.13 for a completely randomized
factorial design.
Table 14.9-2 ▪ Computational Procedures for LS-4 Design Using a Cell Means Model
The hypothesis that interactions involving A, B, and C are all equal to zero is
where R′ is a coefficient matrix and θ is a vector of zeros. I now describe an algorithm for
obtaining R′. The 3 × 3 and 4 × 4 Latin squares in Figure 14.9-1 are used for illustrative
purposes.
Figure 14.9-1 ▪ Layout and mean vectors for 3 × 3 and 4 × 4 Latin squares.
Step 1 Partition μ into p groups according to the levels of treatment A (see Figure 14.9-1).
Step 2 Form residual vectors, , for p − 2 arbitrarily selected means in each of p − 1
arbitrarily selected groups. (In Figure 14.9-1, the p − 1 groups and the p − 2 means within the
groups that have been selected are circled.) Coefficients for a particular residual vector are
determined as follows:
a. The circled mean is given the coefficient (p² − 3p + 2).
b. Means that have a subscript in common with the circled mean are given the coefficient −(p − 2).
c. All other means are given the coefficient 2.
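Steps a through c can be sketched programmatically. The helper below (its name is mine) returns the p² coefficients for one circled cell; note that the coefficients always sum to zero, since the circled cell's coefficient (p − 1)(p − 2) is balanced against the 3(p − 1) cells sharing a subscript and the remaining cells:

```python
def residual_coefficients(square, row, col):
    """Coefficients of the residual vector for the circled cell
    (row, col) of a p x p Latin square, following steps a-c:
    circled mean -> p**2 - 3p + 2; means sharing a row, column, or
    treatment letter with it -> -(p - 2); all other means -> 2."""
    p = len(square)
    letter = square[row][col]
    coeffs = []
    for i in range(p):
        for j in range(p):
            if (i, j) == (row, col):
                coeffs.append(p * p - 3 * p + 2)
            elif i == row or j == col or square[i][j] == letter:
                coeffs.append(-(p - 2))
            else:
                coeffs.append(2)
    return coeffs
```

For a 4 × 4 square the circled mean receives 6, nine means receive −2, and six means receive 2.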
Example for
Example for
Step 3 If desired, linearly transform the residual vectors to reduce the coefficients to the 1, 0,
−1 form. For example,
I can follow the same steps to obtain R′ for the 4 × 4 Latin square in Figure 14.9-1. Following
step 2, I obtain (p − 1)(p − 2) = 6 residual vectors:
The vectors can be transformed to the 1, 0, −1 form as follows. The transformation that is used
must be one that preserves the linear independence of the residual vectors:
Graeco-Latin and hyper-Graeco-Latin square designs are described in Sections 14.10 and
14.11, respectively. Rules similar to those just described can be used to obtain R′ for these
designs. Let k denote the number of orthogonal Latin squares in a hyper-Graeco-Latin square
design. The three-step algorithm is as follows:
a. The circled mean is given the coefficient (p − 1)(p − k − 1).
b. Means that have a subscript in common with the circled mean are given the coefficient −(p − k − 1).
c. Means that do not have a subscript in common with the circled mean are given the coefficient (k + 1).
Step 3 If desired, linearly transform the residual vectors to reduce the coefficients to the 1, 0,
−1 form.
A restricted cell means model should be used if the cells of a Latin square contain only one
observation. For this case, a within-cell estimate of the population error variance cannot be
computed. Instead, the residual mean square is used to estimate the population error variance
under the assumption that interactions among A, B, and C are all equal to zero. A restricted
model also can be used when n > 1 and a researcher wants to pool MSRES and MSWCELL
under the assumption that all interactions are equal to zero. The modifications of the
computational formulas for this case follow those shown in Section 9.13.
where εjkl is NID (0, ) and μjkl is subject to the restrictions that all interaction components
involving A, B, and C are equal to zero. These restrictions can be expressed as
where s = (p − 1)(p − 2). The coefficient matrices, R′, , and so on, for the restricted model
are identical to those defined earlier for the unrestricted model.
I want to test the null hypotheses for treatment A and nuisance variables B and C, subject to
the restrictions that R′ μ = θ. This is accomplished by forming augmented matrices of the form
where C′ is the coefficient matrix for one of the null hypotheses and 0 is the null vector
associated with C′.
The computational procedures for the restricted cell means model are illustrated in Table 14.9-3
using the data from the ABC summary table of Table 14.3-1. I show in Section 8.8 that if a
restriction matrix, R′, is orthogonal to a coefficient matrix, C′, then the general computational
formula simplifies to . Recall from Section
8.8 that R′ and C′ are orthogonal if R′C = 0. For the data in Table 14.9-3, the general and
simplified formulas give the same results.
Table 14.9-3 ▪ Computational Procedures for LS-4 Design Using a Restricted Cell Means
Model With No Missing Observations (Data from ABC summary table of Table 14.3-1.)
The restricted cell means model approach can be used to analyze data for a Latin square
design with missing cells. The required modifications are similar to those presented in Section
8.8 for a randomized block design.
I have shown that a Latin square design enables a researcher to isolate variation due to two
nuisance variables while evaluating treatment effects. A Graeco-Latin square design, denoted
by GLS-p, enables a researcher to isolate three nuisance variables. A Graeco-Latin square
design consists of two superimposed orthogonal Latin squares. For example, if I denote four
levels of a treatment by the Latin letters A, B, C, and D and four levels of a third nuisance
variable by the Greek letters α, β, γ, and δ, the GLS-4 design has the following form:
Two Latin squares are orthogonal if, when they are superimposed, each letter of one square
occurs once and only once with each letter of the other square. The 4 × 4 square just shown
satisfies this requirement.
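The orthogonality requirement is straightforward to check by machine. The sketch below uses a standard orthogonal pair of order 4 for illustration (not necessarily the square printed in the text); the function name is mine:

```python
def are_orthogonal(square1, square2):
    """True if superimposing the squares yields every ordered pair of
    letters exactly once -- the defining property of a Graeco-Latin square."""
    p = len(square1)
    pairs = {(square1[i][j], square2[i][j]) for i in range(p) for j in range(p)}
    return len(pairs) == p * p

# An illustrative orthogonal pair of order 4:
latin = [list("ABCD"), list("BADC"), list("CDAB"), list("DCBA")]
greek = [list("abcd"), list("cdab"), list("dcba"), list("badc")]
```

Superimposing a square on itself fails the check, since only p of the p² ordered pairs are distinct.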
The computational procedures for a Graeco-Latin square design are similar to those for a Latin
square design. The computational formulas, degrees of freedom, and expected values of the
mean squares for a fixed-effects model are given in Table 14.10-1. The analysis follows that for
the Latin square design in Table 14.3-1; however, the ABCS and ABC summary tables are
replaced by ABCDS and ABCD tables, respectively. Also, a D summary table for the third
nuisance variable is required. The entry in this table is for dm, where m = 1,…,
p.
Complete sets of orthogonal Latin squares of sizes 3, 4, 5, 7, 8, 9, and 10 are given by Fisher
and Yates (1963, pp. 25, 86–89). It has been shown (Fisher & Yates, 1934) that no orthogonal
pair of 6 × 6 Latin squares exists. Orthogonal squares are known to exist for all odd prime
numbers, powers of primes, and multiples of four. Bose, Shrikhande, and Parker (1960) proved
that orthogonal squares also exist for squares of the size 4s + 2, where s > 1.
Graeco-Latin square designs have not proven very useful in the behavioral sciences for two
reasons. First, the design is not appropriate if any of the variables interact. In the planning
stages of an experiment, it is difficult to predict with any degree of confidence that all
interaction effects for four variables are equal to zero. A second reason for the design's lack of
popularity is the restriction that each variable must have the same number of levels. Achieving
this balance can be difficult. Graeco-Latin squares are most useful as parts of more inclusive
designs.
If one or more orthogonal Latin squares are superimposed on a Graeco-Latin square, the
resulting square is called a hyper-Graeco-Latin square. The number of orthogonal Latin
squares that can be combined to form hyper-squares is limited. For example, no more than p − 1 = 3 orthogonal 4 × 4 squares can be combined because the five modes of classification then account for all of the (4)² − 1 = 15 degrees of freedom. For a 5 × 5 square, the p² − 1 = 24
degrees of freedom are accounted for by six modes of classification. Hence, no more than p − 1
= 4 orthogonal 5 × 5 squares can be combined. The residual degrees of freedom and residual
sum-of-squares formulas for hyper-Graeco-Latin squares with five and six modes of
classification are, respectively,
and
The sum-of-squares formulas for rows, columns, and so on have the same general pattern as
the corresponding formulas in a Latin square design.
A crossover design, also called a changeover design, is so named because subjects are
administered first one treatment level and then “crossed over” to receive a second and perhaps
a third or even a fourth treatment level. The simplest crossover design has two treatment levels,
denoted by a1 and a2. Half of the subjects receive a1 followed by a2; the other half receive a2
followed by a1. This design can be used when it is reasonable to assume that subjects revert to
their original state before the second treatment level is administered. For example, the effects
of a drug should be completely eliminated before a second drug is administered. Carryover
effects—treatment effects that continue after a treatment has been discontinued—are a
potential threat to internal validity. Sometimes carryover effects can be eliminated or at least
minimized by inserting a rest or washout period between administrations of the treatment
levels. Alternatively, a more complex crossover design can be used that provides a statistical
adjustment for the carryover effects of the immediately preceding treatment level. These
designs are discussed by Jones and Kenward (2003).
Crossover designs have features of randomized block and Latin square designs. Each subject
receives all p treatment levels and serves as his or her own control, as in a randomized block
design. And, as in a Latin square design, each treatment level occurs an equal number of times
in each time period, and the effects of two nuisance variables, blocks and time periods, can be
isolated. Crossover designs, denoted by CO-p, are often used in medical clinical trials,
agricultural research, and marketing research. The layouts for two simple crossover designs
with two and three treatment levels are shown in Figure 14.12-1. Such designs are appropriate
for experiments that meet, in addition to the general assumptions of analysis of variance, the
following conditions.
Figure 14.12-1 ▪ (a) Layout for a CO-2 design. (b) Layout for a CO-3 design. Each block of
the crossover designs contains p treatment levels, and each treatment level occurs an
equal number of times in each time period. The subjects are randomly assigned to the
blocks.
1.There is one treatment with p ≥ 2 treatment levels and two nuisance variables. One of the
nuisance variables is experimental units (blocks or replications); the other nuisance variable
is periods of time and must have p levels.
2.Each treatment level must occur an equal number of times in each time period. Hence, the
design requires n = kp blocks, where k is a positive integer.
3.Experimental units are randomly assigned to n blocks. Each experimental unit is observed p
times.
4.The variability within each of the n blocks should be less than the variability among units in
different blocks.
5.The effects of a treatment level must dissipate before another treatment level is
administered.
6.It must be reasonable to assume that there are no interactions among the blocks, periods of
time, and treatment levels. If this assumption is not met, a test of one or more of the
corresponding effects is biased.
where
Yijk is the score in the ith block, jth treatment level, and kth time period.
μ is the grand mean of the population means, μ111, μ221, …, μnpp. The grand mean is a constant for all observations in the experiment.
αj is the treatment effect for population j and is equal to μ.j. − μ, the deviation of the jth population mean from the grand mean. The treatment effect is a constant for all observations in treatment level aj and is subject to the restriction Σαj = 0.
βk is the time-period effect for population k and is equal to μ..k − μ, the deviation of the kth population mean from the grand mean. The time-period effect is a constant for all observations in column bk and is subject to the restriction Σβk = 0.
πi is the block effect for population i and is equal to μi.. − μ, the deviation of the ith population mean from the grand mean. The block effect is a random variable that is NID(0, σ²π).
εijk is the residual error effect that is equal to Yijk − μ.j. − μ..k − μi.. + 2μ. The error effect is a random variable that is NID(0, σ²ε); εijk is independent of πi.
The values of the parameters in equation (14.12-1) for this mixed model are unknown, but they
can be estimated from sample data as follows:
The partition of the total sum of squares into SSA + SSB + SSBL + SSRES is obtained by
rearranging the terms in equation (14.12-2) following the examples in Sections 3.2 and 8.1. The
derivation of the expected values of the mean squares follows the procedures illustrated in
Section 3.3 for a completely randomized design.
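The partition can be sketched numerically. The data, layout, and treatment sequence below are hypothetical and chosen only to illustrate how SSTO decomposes into SSA + SSB + SSBL + SSRES for a minimal CO-2 layout:

```python
import numpy as np

# Hypothetical CO-2 data: Y[i, k] is the score of block i in time period k,
# and treat[i, k] is the treatment level administered to block i in period k.
Y = np.array([[2., 5.], [4., 6.], [6., 4.], [7., 6.]])  # n = 4 blocks, p = 2 periods
treat = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])      # half receive a1->a2, half a2->a1

n, p = Y.shape
grand = Y.mean()

ssto = ((Y - grand) ** 2).sum()                        # total sum of squares
ssbl = p * ((Y.mean(axis=1) - grand) ** 2).sum()       # blocks
ssb = n * ((Y.mean(axis=0) - grand) ** 2).sum()        # time periods
a_means = np.array([Y[treat == j].mean() for j in range(p)])
ssa = n * ((a_means - grand) ** 2).sum()               # treatment A
ssres = ssto - ssa - ssb - ssbl                        # residual

print(ssto, ssa, ssb, ssbl, ssres)  # 18.0 8.0 0.5 9.0 0.5
```

Note that each treatment level is observed once in every block, so the treatment means are taken over n scores collected across both time periods.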
The test statistic for treatment A is given by F = MSA/MSRES. If the null hypothesis is true, the necessary and sufficient condition for F = MSA/MSRES to be distributed as F with (p − 1) and (n − 2)(p − 1) degrees of freedom is the sphericity condition (see Section 8.4).
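As a sketch with hypothetical mean squares, the statistic and its degrees of freedom for the p = 3, n = 6 case discussed next would be computed as:

```python
# Hypothetical mean squares for a CO-3 design with n = 6 blocks;
# the MSA and MSRES values are illustrative only.
p, n = 3, 6
msa, msres = 24.0, 4.0

F = msa / msres
df1 = p - 1              # numerator degrees of freedom
df2 = (n - 2) * (p - 1)  # denominator degrees of freedom

print(F, df1, df2)  # 6.0 2 8
```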
A crossover design may be considerably more powerful than completely randomized and
randomized block designs. Consider an experiment in which p = 3 and n = 6. A schematic
comparison of the partition of the total sum of squares and degrees of freedom for these three
designs is shown in Figure 14.12-2. If the block and column effects in the crossover design are
appreciably greater than zero, the crossover design is more powerful than the other two
designs.
Figure 14.12-2 ▪ Partition of the total sum of squares and degrees of freedom for three
designs. The rectangles with the thicker lines indicate the sums of squares that are used
to compute an estimate of error variance.
The null hypothesis for treatment A is rejected. To determine which population means differ, the
Fisher-Hayter multiple comparison procedure can be used. The critical difference that a
contrast among means must exceed (see Section 5.5) is
According to Table 14.12-3, the use of color resulted in better recognition than the other two
highlighting techniques. The portion of variance accounted for by treatment A is
where
The formulas for computing and are based on the expected values of the mean
squares in Table 14.12-2.
The major advantages of designs that are based on a Latin square are as follows:
1.Greater power for many research applications than completely randomized and randomized
block designs. The designs permit a researcher to isolate the effects of two or more
nuisance variables and thereby reduce the error variance and obtain a more precise
estimate of treatment effects.
2.Simplicity in the analysis of data.
The major disadvantages of the designs are as follows:
1.The number of levels of each nuisance variable must equal the number of treatment levels.
Because of this requirement, Latin squares larger than 8 × 8 are seldom used. In a
crossover design, this restriction applies only to treatments and periods of time.
2.Latin squares smaller than 5 × 5 are not practical because of the small number of degrees
of freedom unless each cell contains more than one experimental unit.
3.The assumption of the model that there are no interactions among any of the variables is
quite restrictive. The complex designs described in Chapter 15 that use a Latin square as
the building block do permit interactions among some variables.
4.Randomization is relatively complex.
1.Terms to remember:
a.standard square (14.2)
b.conjugate squares (14.2)
c. self-conjugate squares (14.2)
d.positively biased test (14.3)
e.carryover effects (14.12)
2.[14.1] What is the principal advantage of a Latin square design over completely randomized
and randomized block designs?
*3.[14.1] Why is a Latin square design rarely used in behavioral and educational research?
4.[14.2]
a.Construct a 5 × 5 standard Latin square.
b.Is the square self-conjugate?
c. How many arrangements of this square are possible?
5.[14.2] Randomize the Latin square from Exercise 4(a); describe each of the steps in the
randomization procedure.
*6.The effects of three levels of staff disciplinary intervention on patient morale in a large
Veterans Administration hospital were investigated (a1 = low intervention, a2 = moderate
intervention, a3 = high intervention). The nuisance variables were length of the intervention
(b1 = 6 weeks, b2 = 12 weeks, b3 = 18 weeks) and degree of heterogeneity of social
activities available (c1 = low heterogeneity, c2 = medium heterogeneity, c3 = high
heterogeneity). Twenty-seven chronic schizophrenic patients were randomly assigned to the
cells of the following 3 × 3 Latin square. At the beginning and end of the study, each
patient completed a 20-item morale inventory. The dependent variable was the difference
between the pre- and posttest morale scores (Yposttest – Ypretest). The following data
were obtained; the larger the number, the greater was the improvement in morale.
(Experiment suggested by Jennings, R. D. Three levels of staff intervention and their effect
on inpatient small group morale, leadership, and performance. Journal of Abnormal
Psychology.)
*a.[14.3] Test the hypotheses μ1.. = μ2.. = μ3.., μ.1. = μ.2. = μ.3., and μ..1 = μ..2 = μ..3;
use MSWCELL as the error term and let α = .05.
*b.[14.3] Calculate the power of the test of μ1.. = μ2.. = μ3...
*c.[14.3] For the hypothesis μ1.. = μ2.. = μ3.., determine the number of subjects required
to achieve a power of approximately .82.
*d.[14.3] Does it make sense to compute measures of strength of association? Explain.
e.Prepare a “results and discussion section” for the Journal of Abnormal Psychology.
where c1j denotes the linear trend coefficient for the jth treatment level. (ii) What
percent of the treatment trend is accounted for by the nonlinear component?
*g.[14.7] Determine the efficiency of the randomized block design relative to that of a
completely randomized design; use Fisher's correction.
h.Prepare a “results and discussion section” for the Journal of Experimental Psychology.
8.The Von Restorff effect, which is the facilitation in learning a particular item in a list due to
making the item distinctive from other items, was investigated using different degrees of
color saturation. Three degrees of color saturation were used: a1 was high saturation, a2
was medium saturation, and a3 was low saturation; a4 was a control condition in which no
color was used. Sixteen first-grade students were assigned to one of four groups on the
basis of their performance on the verbal subtests of the Wechsler Intelligence Scale for
Children. The four children with the highest scores were assigned to row 1 of a 4 × 4 Latin
square, the four with the next highest scores to row 2, and so on. The columns of the Latin
square represented stimulus items selected from artificial alphabets that had low, medium,
high, and very high resemblance to English letters. The children were required to associate
nouns with the stimulus items using a paired-associates learning paradigm. One stimulus
item in each paired-associate list was printed in color, except for the control condition. A
computer was used to present the lists. The dependent variable was the number of trials
required to learn the noun associated with the colored stimulus or control stimulus. The
following data were obtained:
a.[14.4] Test the hypotheses μ1.. = μ2.. = μ3.. = μ4.., μ.1. = μ.2. = μ.3. = μ.4., and μ..1 = μ..2 = μ..3 = μ..4; let α = .01.
b.[14.3] Calculate the power of the test of μ1.. = μ2.. = μ3.. = μ4...
c. [14.3] Calculate and interpret the statistics , and .
d.[14.4] Use Tukey's procedure to determine whether the additive model is appropriate; let
α = .05.
e.[14.6] Use Dunnett's statistic to determine which treatment A population means differ
from the mean of the control population; test nondirectional hypotheses.
f. [14.7] Determine the efficiency of the randomized block design relative to that of a completely randomized design.
9.[14.4] If each cell of a Latin square design contains one experimental unit, it is necessary to
use MSRES to test hypotheses about treatment and nuisance effects. What assumption
must be tenable to use the residual mean square as an error term? How can the tenability
of this assumption be determined?
*10.
[14.5] For a Latin square design, write the expected values of the mean square for the
following:
11.
is equal to
plus 10 cross-product terms that equal zero. Recall that the subscripts j, k, and l are redundant. Once the row and column, for example, are specified, the treatment level is determined. Summation is performed over only np² scores rather than the np³ scores that summing over all three subscripts would suggest. Derive the computational formulas for SSA and SSWCELL from the deviation formulas above.
12.
[14.6] Under what conditions is it appropriate to use a pooled error term in testing a
hypothesis about a population contrast?
*13.
[14.7] Suppose that the efficiency of a LS-4 design relative to that for a CR-4 design was
found to be 1.35. Interpret this relative efficiency index.
*14.
[14.9] Exercise 6 described an experiment about the effects of three levels of staff
disciplinary intervention on patient morale.
*15.
[14.9] Exercise 7 described an experiment about the effects of five levels of associative
strength of adjective-noun pairs on recognition threshold.
16.
[14.9] Exercise 8 described an experiment that investigated the Von Restorff effect.
*17. [14.10]
square.
c. Construct a 5 × 5 Graeco-Latin square.
*18.
[14.10] A number of rules of thumb have been developed for determining the existence of
various-size orthogonal Latin squares. For squares of size 10 × 10 to 20 × 20, use the rules
to support the existence of orthogonal Latin squares; in each case, cite the rule you used.
*19.
The effect of strength of association between two words on learning to read the second
word in a two-word pair was investigated. A pretest was used to select 32 first-graders who
were able to read the first word of each pair but not the second word. Following the pretest,
word association training was given in which the response words were repeated 30 times
after the stimulus words, treatment level a1; 10 times, a2; 5 times, a3; and 0 times, a4. The
children were then shown the stimulus-response words printed on 5 × 8 index cards and
given a series of reading training trials. The four two-word pairs, corresponding to the levels
of variable B, were presented in four different sequences (variable C). Variable D
represented the four assistants who conducted the reading training. The children were
randomly assigned to the cells of the following 4 × 4 Graeco-Latin square. The dependent
variable was the number of words recognized on a word-recognition test. (Experiment
suggested by Samuels, S. J., & Wittrock, M. Word-association strength and learning to
read. Journal of Educational Psychology.)
*20.
[14.9 and 14.10]
*a.Write an unrestricted cell means model for the GLS-4 design in Exercise 19.
*b.Write the C′ and R′ matrices for testing the null hypotheses for A, B, C, D and the
residual. Assume that μ′ = [μ1111, μ1223, μ1334, μ1442, μ2122, μ2214, …, μ4413].
*c.Use the cell means model approach to test the null hypotheses for A, B, C, D and the
residual; let α = .05.
*d.Use Dunnett's multiple comparison statistic to determine which treatment A population
means differ from the control population mean, μ4; test nondirectional hypotheses.
21.
Make a schematic comparison like Figure 14.7-1 for CR-5, RB-5, LS-5, and GLS-5 designs;
let n = 5.
*22.
The effects of oxygen gel, treatment level a1, versus a placebo gel, treatment level a2, on
gingivitis were investigated. Five of 10 patients were randomly assigned to receive treatment
level a1 followed by a2; the other 5 received a2 followed by a1. The patients were measured
on an index of oral hygiene at the beginning and end of each treatment period. The
dependent variable was the change in the index of oral hygiene. The treatment periods
were separated by a washout period of 3 months. The following data were obtained; the
larger the score, the greater was the improvement in the index of oral hygiene.
Subject  Sequence   Period 1  Period 2
  …         …          8         3
  s5      a1 a2       15         9
  s6      a2 a1       10        11
  s7      a1 a2        8         2
  s8      a1 a2       11        12
  s9      a2 a1        9        15
  s10     a2 a1        6         5
(rows for subjects s1–s4 truncated in source)
*a.[14.12] Test the hypotheses μ.1. = μ.2., μ..1 = μ..2, and σ²π = 0; let α = .05.
*b.[14.12] Calculate the power of the test of μ.1. = μ.2..
*c.[14.12] Calculate and interpret , and .
d.Prepare a “results and discussion section” for the Journal of Oral Hygiene.
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
http://dx.doi.org/10.4135/9781483384733.n14
The designs in this chapter use the technique of group-interaction confounding to achieve a
reduction in the number of treatment combinations that must be assigned to blocks. The
technique was first described by Sir Ronald A. Fisher in 1926, and it was used as early as 1927
in agricultural research at the Rothamsted Experimental Station. I have discussed the
advantages of using a blocking procedure to isolate a nuisance variable at length in Chapters
8, 10, and 12. Unfortunately, when the number of treatment combinations is large, it can be
difficult to secure enough homogeneous subjects or experimental units to form the blocks.
Even a relatively small design, such as an RBF-33 (randomized block factorial) design, requires
3 × 3 = 9 subjects per block. Obtaining repeated observations on the same subjects is no
solution because there is a practical limit to the number of times a subject can participate in an
experiment. And the nature of the treatments often precludes obtaining more than one
measurement per subject. The split-plot factorial design described in Chapter 12 provides one
solution to the problem of unwieldy block size: Assign only a portion of the treatment
combinations to each block. For example, an SPF-3.4 design has 12 treatment combinations,
but only four are assigned to a block. The reduction in block size is achieved by confounding
the effects of treatment A with the effects of groups of blocks. Unfortunately, the test of
treatment A usually is less powerful than tests of treatment B and the AB interaction. This is
satisfactory if a researcher's primary interest is in B and AB. In most experiments, however,
treatments A and B, not the interaction, are of primary interest. The confounded factorial
designs that are described in this chapter achieve a reduction in block size by confounding one
or more interactions with groups of blocks. This reduces the number of treatment combinations
within each block and, in addition, has an important advantage over the confounding scheme of
a split-plot factorial design: Treatments A and B are tested using the same within-blocks error
term. Ordinarily, the use of a within-blocks error term results in a more powerful test than the
use of a between-blocks error term. Confounded factorial designs are particularly appropriate if
an interaction is expected to be negligible. This interaction can be confounded with groups,
thereby achieving a reduction in block size without sacrificing power in evaluating the
treatments.
Confounded factorial designs can be constructed from either randomized block or Latin square
designs. The letters RBCF and LSCF denote factorial designs in which an interaction is
completely confounded with groups of blocks. If an interaction is partially confounded with
groups, the design is denoted by the letters RBPF. The latter design provides partial within-
blocks information about the confounded interaction. The designation RBCF-pk indicates that
the design is restricted to the case in which k treatments each have p levels. For example, an
RBCF-32 design has two treatments, A and B, each having three levels.
A comparison of two factorial designs is shown in Figure 15.1-1. Suppose that repeated
measurements are obtained on the subjects. In the confounded factorial design, the subjects in
each block receive only three treatment combinations, and each block contains three levels of
treatment A and three levels of treatment B. Later, I show that this design confounds the AB
interaction with groups, but it does not confound either treatment with groups. In the split-plot
factorial design, the subjects in each block also receive only three treatment combinations.
However, each block contains only one level of treatment A and three levels of treatment B. As
you learned in Chapter 12, this scheme confounds treatment A with groups.
A randomized block confounded factorial design and a Latin square confounded factorial
design are appropriate for experiments that meet, in addition to the assumptions of the
experimental design model, the following conditions:
1.There are two or more treatments, where each treatment has p levels (p ≥ 2). Exceptions to
the general requirement that all treatments must have p levels are discussed in Sections
15.11 and 15.12.
2.The number of treatment combinations is greater than the desired size of each block.
3.Variation between groups is confounded with one or more interactions. Because
confounded effects are usually evaluated with less power than nonconfounded effects, the
interaction that is confounded with groups should be one that is believed to be negligible or
one in which there is little interest.
4.If repeated measurements on the subjects or experimental units are obtained, each block
contains one subject who is observed v times, where v is the number of treatment
combinations in a block. If repeated measurements are not obtained, each block contains v
homogeneous subjects.
5.For the repeated measurements case, nw blocks (subjects) are randomly assigned to the w
groups with n in each group. The order of administration of the v treatment combinations
within a block is randomized independently for each block.
6.For the nonrepeated measurements case, nw blocks, each containing v matched subjects,
are randomly assigned to the w groups. Following this, the v matched subjects within a
block are randomly assigned to the v treatment combinations.
7.It must be possible to administer the levels of each treatment in every possible sequence.
This requirement precludes, for example, the use of treatments whose levels consist of
successive periods of time.
The construction of randomized block confounded factorial designs is described in the first part
of this chapter. Confounding by means of a Latin square is discussed in Section 15.13.
The construction of RBCF-pk and RBPF-pk designs requires a scheme for assigning treatment
combinations to groups of blocks so that the variation between groups is confounded with one
or more interactions or interaction components. Several schemes have been devised for this
purpose (Bailey, 1977; J. A. John & Dean, 1975; Kempthorne, 1947, 1952; Patterson & Bailey,
1978; Yates, 1937). One scheme that is applicable to designs of the form pk, where p is a prime
number, is relatively simple.1 This scheme, which uses modular arithmetic, is described next.
Let I and m be any integers, with m > 0. If I divide I by m, I obtain a quotient q and a remainder z because I = qm + z, where 0 ≤ z < m. For example, dividing I = 17 by m = 3 gives q = 5 and z = 2 because 17 = 5(3) + 2. In modular arithmetic, the remainder is the term of interest. Consider now dividing the integer J = 5 by the same m = 3. The remainder also is equal to 2, because 5 = 1(3) + 2.
Note that 17 and 5 leave the same remainder when divided by 3. Two integers, I and J, that leave the same remainder when divided by a positive integer m are said to be congruent integers with respect to the modulus m. This relation—congruence—can be written I ≡ J (mod m).
On reflection, it should be apparent that any integer I is always congruent to its remainder z; that is, I ≡ z (mod m). For example, I = 17 and z = 2 are congruent modulo 3 because when 17 and 2 are reduced modulo 3 (divided by the modulus 3), they leave the same remainder: 17 = 5(3) + 2 and 2 = 0(3) + 2.
The possible values of the remainder z are 0, 1, 2,…, m − 1. Thus, any integer is always
congruent to one of 0, 1, 2,…, m − 1, where m is the modulus. To illustrate, consider the
remainders for the integers 0 to 6:
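These relations are easy to verify numerically; the snippet below (function name is mine) is a minimal illustration:

```python
# Congruence modulo m: two integers are congruent (mod m) when they leave
# the same remainder on division by m.
def congruent(i, j, m):
    return i % m == j % m

assert 17 % 3 == 2 and 5 % 3 == 2   # both leave remainder z = 2
assert congruent(17, 5, 3)          # 17 is congruent to 5 (mod 3)
assert congruent(17, 2, 3)          # any integer is congruent to its remainder

# Remainders for the integers 0 to 6, modulus 3:
print([i % 3 for i in range(7)])  # [0, 1, 2, 0, 1, 2, 0]
```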
Two operations of modular arithmetic are used in constructing confounded factorial designs:
modular addition and modular multiplication. The operation of addition is illustrated by the
following examples.
To add two integers aj and bk, one obtains their sum and reduces it modulo m—that is,
expresses it as a remainder with respect to the modulus m. I use this operation later to
confound an interaction with groups of blocks. I do this by letting aj, bk, z, and m correspond to
properties of an experimental design as follows:
The second operation of modular arithmetic used in constructing factorial designs is modular
multiplication. This operation is illustrated by the following examples:
To multiply two integers aj and bk, one obtains their product and expresses it as a remainder
with respect to the modulus m.
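A minimal sketch of the two operations (function names are mine):

```python
# Modular addition and multiplication: compute the sum or product of two
# integers and reduce it modulo m, i.e., keep the remainder.
def mod_add(aj, bk, m):
    return (aj + bk) % m

def mod_mul(aj, bk, m):
    return (aj * bk) % m

print(mod_add(2, 2, 3))  # (2 + 2) mod 3 = 1
print(mod_mul(2, 2, 3))  # (2 * 2) mod 3 = 1
```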
Designs described in the first part of this chapter are limited to those of the form pk, where p, the number of levels of each treatment, is a prime number. Recall that a prime number is an integer greater than 1 that is divisible only by 1 and itself. Examples of prime numbers are 2, 3, 5, 7, and 11. To use modular arithmetic in assigning treatment
combinations to groups, I have to use a new scheme for denoting the levels of treatments,
groups, and so on. According to this scheme, the first level of a treatment is denoted by the
subscript 0 instead of the usual 1. For example, the treatment levels of an RBCF-32 design are
denoted by a0, a1, a2, b0, b1, and b2. The nine treatment combinations and their corresponding designations are shown in Table 15.2-1. The digit in the first position denotes the level of treatment A; the digit in the second position denotes the level of treatment B.
This scheme leads to an odd-looking notation for sums of observations. For example, the sum of i = 0, …, n − 1 observations is written Σ_{i=0}^{n−1} Yi.
To keep the notation for summation consistent with that used in Chapters 1 to 14, I use the new notation only to identify the levels of treatments, blocks, and groups. When summation is performed, I revert to the use of 1 for the first level of treatments and so on. Thus, i ranges over 1, …, n and not 0, …, n − 1. It is understood that in writing Σ_{i=1}^{n} Yi, Y1 denotes the first value of Y. This lets me write Σ_{i=1}^{n} Yi instead of Σ_{i=0}^{n−1} Yi.
A randomized block factorial design with two levels of treatments A and B has four treatment
combinations—a0b0, a0b1, a1b0, and a1b1 or 00, 01, 10, and 11—and requires blocks of size
4. Suppose that it is possible to observe a subject only twice and that the researcher's primary
interest is in the two treatments. The block size can be reduced from 4 to 2 by confounding the
AB interaction with groups of blocks. The interaction, AB in this example, that is used to assign
the treatment combinations to groups of blocks is called the confounding contrast o r
defining relation. Modular arithmetic is used to determine which treatment combinations are
assigned to each group of blocks. Let aj denote the jth level of treatment A and bk the kth level
of treatment B. If AB is the confounding contrast, all treatment combinations that satisfy the relation aj + bk = z (mod 2), where z is equal to 0, are assigned to group 0. Those that satisfy the relation where z is equal to 1 are assigned to group 1. Modulus 2 is used because treatments A and B each have two levels. The range of z is 0 and 1 because all integers are congruent to 0, 1, …, m − 1, and m is equal to 2 in this example.
Thus, treatment combinations 00 and 11 are assigned to group 0; combinations 01 and 10 are
assigned to group 1. The notation (AB)z is an alternative way to denote the treatment
combinations that are assigned to group z = 0, 1. The treatment combinations of this RBCF-22
design with n = 4 blocks are shown in Figure 15.2-1.
Figure 15.2-1 ▪ Treatment combinations of an RBCF-22 design with blocks of size 2. The
AB interaction is confounded with groups of blocks. The numbers in the cells denote the
levels of treatments A and B.
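The assignment rule can be sketched in a few lines: group z collects the combinations whose level sum reduces to z modulo 2.

```python
from itertools import product

# Assign the four treatment combinations of an RBCF-2^2 design to groups,
# using AB as the confounding contrast: combination aj bk goes to the
# group z for which (aj + bk) mod 2 = z.
m = 2
groups = {z: [] for z in range(m)}
for aj, bk in product(range(m), repeat=2):
    groups[(aj + bk) % m].append(f"{aj}{bk}")

print(groups)  # {0: ['00', '11'], 1: ['01', '10']}
```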
I now show that the arrangement in Figure 15.2-1 confounds the AB interaction with groups.
Let μijkz denote the population mean for the ith block, jkth treatment combination, and zth
group. By definition, a two-treatment interaction effect has the form μjk − μj′k − μjk′ + μj′k′. The difference between the means of group 0 and group 1 has this form;
it is equal to the AB interaction effect. Thus, the AB interaction effect and the groups effect are
completely confounded—that is, the effects of the two sources of variation cannot be
distinguished.
I have now described the basic concepts underlying the construction of randomized block
completely confounded factorial designs where each treatment has two levels. Before
describing partially confounded factorial designs and designs with more than two treatment
levels, I illustrate the computational procedures for RBCF-22 and RBCF-23 designs.
Assume that an experiment has been designed to evaluate the relative effectiveness of several
procedures for using computer-aided instruction material. The material was prepared to
acquaint mechanics with servicing procedures for a new airplane engine. The criterion used to
assess the effectiveness of the material was the number of simulated malfunctions in an engine
that trainees were able to diagnose. The instruction material was presented to trainees by
means of a computer. Treatment A consisted of two presentation rates for the material. Level a0
was an unpaced rate in which trainees pressed the “return” key on the computer when they
were ready to view the next frame of information. Level a1 was a paced presentation with 30
seconds between successive frames of information. The second variable investigated was the
type of response that the trainees made to each frame of information. Two types of responses
were investigated: one in which trainees responded to each frame of information by touching
the appropriate area of the computer screen, b0, and a second in which trainees typed a
response using the computer keyboard, b1.
The research hypotheses leading to this experiment can be evaluated by means of statistical
tests of the following null hypotheses:
Treatment A
Treatment B
AB interaction
The researcher's primary interest is in evaluating hypotheses for treatments A and B. The level
of significance adopted for all tests is .05.
To evaluate these hypotheses, a random sample of 16 trainees was obtained. Aptitude-test data
were used to assign the trainees to eight blocks of size 2 so that those in a block had similar
aptitude-test scores. The eight blocks were randomly assigned to two groups, with four blocks
in each group. Following this, the matched trainees in a block were randomly assigned to the
ajbk treatment combinations. Ordinarily, each group should contain at least 12 blocks to have
an adequate number of degrees of freedom for error variance.
Treatment combinations 00 and 11 satisfy the first relation and were assigned to the blocks in group 0. Treatment combinations 01 and 10 satisfy the second relation and were assigned to the blocks in group 1. The layout of the RBCF-22 design and computational procedures are
shown in Table 15.3-1. The analysis of variance is summarized in Table 15.3-2. According to the
analysis, the null hypotheses for both treatments were rejected. On the basis of the information
in Tables 15.3-1 and 15.3-2, we can conclude that the paced presentation rate, a1, is superior
to the unpaced rate, a0, and that typing a response at the terminal, b1, is better than touching
the appropriate area of the computer screen, b0.
In Section 15.12, I use the cell means model approach to analyze these data. The approach is
particularly useful if one or more observations is missing or if the ns in the groups are
unequal.2
where
εijkz is the error effect that is NID(0, σ²ε); εijkz is independent of πi(z). In this design, εijkz cannot be estimated separately from (αβ × π)jki(z).
Two sets of assumptions underlie the F tests for a randomized block confounded factorial
design: one set for the between-blocks test and a second set for the within-blocks tests. The
situation is similar to that described in Section 12.4 for a split-plot factorial design. The
assumptions underlying the between-blocks test are the same as those for a completely
randomized design (see Section 3.3). The key assumption is that the population variances for
g0 and g1 are homogeneous. Sample estimators of the two variances are given by
and
Procedures for testing the assumption of homogeneity of variance are described in Section 3.5.
The assumptions for the within-blocks tests include those described in Section 8.4 for a
randomized block design and the multisample sphericity assumption that is described in
Section 12.4. The mean square within-blocks error term MSAB × BL(G) is a pooled term that is
equal to [SSAB × BL(g0) + SSAB × BL(g1)]/ w(n − 1)(v − 1). When the block size is equal to 2,
MSAB × BL(g0) and MSAB × BL(g1) are each equal to the average of two variances minus a
covariance. That is,
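This identity can be checked numerically with hypothetical data:

```python
import numpy as np

# For blocks of size 2, verify numerically that the within-group interaction
# mean square equals the average of the two variances minus their covariance.
rng = np.random.default_rng(1)
y = rng.normal(size=(6, 2))  # hypothetical scores: n = 6 blocks, v = 2

# Direct computation of MSAB x BL within one group: df = (n - 1)(v - 1) = n - 1.
resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0) + y.mean()
ms_int = (resid ** 2).sum() / (y.shape[0] - 1)

s = np.cov(y.T)  # 2 x 2 variance-covariance matrix of the two columns
print(np.isclose(ms_int, (s[0, 0] + s[1, 1]) / 2 - s[0, 1]))  # True
```

The identity follows because, with only two observations per block, the interaction residuals reduce to half the deviations of the within-block difference scores.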
The key assumption for the within-blocks tests is that the population interactions estimated by MSAB × BL(g0) and MSAB × BL(g1) are equal. When the block size is equal to 2, the sphericity assumptions are necessarily satisfied because only one difference between observations can be formed within each block.
The advantage of an RBCF-22 design over an RBF-22 design is that it enables a researcher to
reduce the block size from 4 to 2. The advantage of an RBCF-22 design over an SPF-2.2 design
is that it enables a researcher to test treatments A and B using the same within-blocks error
term. The use of a within-blocks error term ordinarily results in more powerful tests than the use
of a between-blocks error term.
The computational procedures for a randomized block confounded factorial design with two
treatments, each having two levels, can be easily extended to designs with three or more
treatments. I now describe the layout and analysis for an RBCF-23 design. In this design, there
are four interactions: AB, AC, BC, and ABC. The size of a block can be reduced from 8 to 4 by
confounding one of these interactions with groups of blocks. The interaction that is chosen for
this purpose should be relatively unimportant or thought to be negligible. Usually, this is the
highest-order interaction—in this example, the ABC interaction.
Modular arithmetic is used to determine which treatment combinations are assigned to each
group of blocks. Let aj denote the jth level of treatment A, bk the kth level of treatment B, and cl
the lth level of treatment C. The ABC interaction can be confounded with groups by assigning
the treatment combinations that satisfy the relation aj + bk + cl = z (mod 2) to group z, for z = 0 and 1.
The treatment combinations of this design are shown in Figure 15.5-1. To have an adequate
number of degrees of freedom for the error term, each group should contain at least five
blocks.
Figure 15.5-1 ▪ Treatment combinations of an RBCF-23 design with blocks of size 4. The
ABC interaction is confounded with groups of blocks. The numbers in the cells denote
the levels of treatments A, B, and C.
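The modular assignment can be sketched in a few lines of Python (an illustrative aside, assuming the standard relation aj + bk + cl = z (mod 2) for confounding ABC with groups):

```python
from itertools import product

# Confound the ABC interaction with groups of blocks: group z receives
# the treatment combinations that satisfy a + b + c = z (mod 2).
groups = {0: [], 1: []}
for a, b, c in product((0, 1), repeat=3):
    groups[(a + b + c) % 2].append((a, b, c))

print(groups[0])  # combinations assigned to Group0
print(groups[1])  # combinations assigned to Group1
```

Under this relation, Group0 receives combinations 000, 011, 101, and 110, and Group1 receives the remaining four.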
An interaction other than ABC also can be confounded with groups. To confound the AB interaction, the combinations of treatments A and B that satisfy the relation aj + bk = z (mod 2) are identified first. Next, the levels 0 and 1 of treatment C are added to these combinations in a balanced manner.
The procedures just described can be used to confound the AC interaction with groups. I begin
with the treatment combinations that satisfy the relation aj + cl = z (mod 2). For z = 0, the a and c combinations are 00 and 11; for z = 1, they are 01 and 10.
Next, the levels 0 and 1 of treatment B are added to these combinations in a balanced manner, so that each group contains four treatment combinations of A, B, and C.
The computational procedures for an RBCF-23 design in which the ABC interaction is
confounded with groups are shown in Table 15.5-1. The results of the analysis are summarized
in Table 15.5-2. These computational procedures are easily extended to confounded factorial
designs with four or more treatments.
The two designs described thus far are examples of completely confounded factorial designs.
In these designs, the AB or ABC interaction is confounded with groups of blocks. As I have
shown, confounded effects are usually evaluated with less power than unconfounded, within-
blocks effects. In designs that have more than two treatments, each with two levels, it is
possible to confound one interaction in one group of blocks, a second interaction in a second
group of blocks, and so on. This confounding scheme has the advantage of providing some
within-blocks information about an interaction from the blocks in which the interaction is not
confounded. The procedure is called partial confounding, and the design is called a
randomized block partially confounded factorial (RBPF-pk) design.
Consider the RBPF-23 design in Figure 15.6-1. The AB interaction is confounded with the
blocks in Group0, the AC interaction is confounded with the blocks in Group1, the BC
interaction is confounded with the blocks in Group2, and the ABC interaction with the blocks in
Group3. Because the AB interaction is confounded only in group 0, within-blocks information
on this interaction is available from groups 1, 2, and 3. It should be apparent from an
examination of Figure 15.6-1 that within-blocks information from three of the four groups also is
available for the AC, BC, and ABC interactions. The advantage of partial confounding is that
the block size can be reduced, and we can still obtain partial within-blocks information for each
of the confounded interactions. If an interaction is known to be insignificant, complete
confounding in which the interaction is confounded in all of the groups is preferable to partial
confounding.
Figure 15.6-1 ▪ Treatment combinations of an RBPF-23 design with blocks of size 4. The
numbers in the cells denote the levels of treatments A, B, and C. This design confounds
a different interaction in each group of blocks.
Federer (1955, p. 230) distinguishes between balanced partial confounding and unbalanced
partial confounding. The former designation refers to designs in which all effects of a
particular order—for example, all two-treatment interactions—are confounded with blocks an
equal number of times. The RBPF-23 design just described, in which AB, AC, and BC were
each confounded in one group of blocks, illustrates balanced partial confounding. If all effects
of a particular order are confounded with blocks an unequal number of times, the arrangement
is described as unbalanced partial confounding. For example, the AB and AC interactions
could be confounded in groups 0 and 1, respectively, and the ABC interaction in group 2.
Because AB and AC are each confounded once but the BC interaction is not confounded, the
design is said to involve unbalanced partial confounding.
Here I describe the computational procedures for the randomized block partially confounded
factorial design shown in Figure 15.6-1. In this design, the AB, AC, BC, and ABC interactions
have been confounded with the blocks in groups 0 to 3, respectively. The procedures described
in Section 15.5 were used to determine which treatment combinations to assign to each block.
If repeated measurements are obtained on the subjects, the design requires eight subjects who
are randomly assigned to four groups, each containing two subjects. Each subject corresponds
to a block. The sequence of the administration of the four treatment combinations within a block
is randomized independently for each block. For the nonrepeated measurements case, a
minimum of 32 subjects is required. Eight blocks, each containing four matched subjects, are
formed. The eight blocks are randomly assigned to four groups. The four matched subjects in a
block are then randomly assigned to the four treatment combinations in a group. More than
eight blocks can be used in the design. However, the number of blocks in each group must be
a multiple of 2.
The layout and computational procedures for this design are shown in Table 15.7-1; the
analysis is summarized in Table 15.7-2. The analysis can be compared with that in Table 15.5-
2, where the ABC interaction was completely confounded with groups. The advantage of partial
confounding is that a within-blocks estimate of the ABC interaction can be computed. The
reader has undoubtedly noted that this advantage is gained at the price of greater complexity in
the analysis of the data. Furthermore, the AB, AC, BC, and ABC interactions are computed
from only three-fourths of the groups. The ratio 3/4 is called the relative information on the
confounded effects (Yates, 1937). The choice between a completely confounded and a partially
confounded design rests in part on the researcher's expectations about the interactions. If an
interaction is known to be insignificant, a completely confounded design is the better choice.
Numerous design possibilities are inherent in partial confounding for researchers who are
knowledgeable in a research area. For example, a researcher could choose to confound the AB
interaction in groups 0 and 1 and the ABC interaction in groups 2 and 3. This is an example of
unbalanced partial confounding, which I described in Section 15.6. If sufficient subjects are
available for five groups, a researcher could choose to confound AB, AC, and BC in groups 0
through 2, respectively, and ABC in groups 3 and 4. This design provides as much information
on the AB, AC, and BC interactions as the completely confounded design summarized in Table
15.5-2. In addition, it provides three-fifths relative information on the ABC interaction.
Computation of Means
Main-effects means for the data in Table 15.7-1 can be computed from summary tables based
on all four groups because the treatments are not confounded with groups.
Thus far, I have focused on designs in which each treatment has two levels. In this section, I
show how to analyze designs in which each treatment has three levels.
The AB interaction in an RBCF-32 design can be partitioned into two orthogonal components
as follows:
SS df
AB (3 − 1)(3 − 1) = 4
(AB) (3 − 1) = 2
(AB2) (3 − 1) = 2
The two orthogonal interaction components (AB) and (AB2) have no special significance
apart from their use in partitioning the interaction. Treatment combinations that satisfy the
relation aj + bk = z (mod 3) constitute the (AB) component of the interaction. For z = 0, 1, and 2, these treatment combinations are 00, 12, 21; 10, 22, 01; and 20, 02, 11, respectively.
The (AB2) interaction component is composed of the treatment combinations that satisfy the
relation aj + 2bk = z (mod 3), where the powers of A and B are used as the coefficients of aj and bk. For z = 0, 1, and 2, these treatment combinations are 00, 11, 22; 10, 21, 02; and 20, 01, 12, respectively.
By convention, the power of A is always equal to 1. This is necessary to uniquely define the
interaction components because (A2B) and (AB2) define the same treatment combinations; that
is, 2aj + bk = z (mod 3) and aj + 2bk = z (mod 3) define the same three classes of treatment combinations, with only the labels z permuted. One strategy is to confound the (AB) component with groups of blocks, as shown in Figure 15.8-1(a). An alternative strategy is to confound (AB) in one group of blocks and (AB2) in a second group of blocks, as shown in Figure 15.8-1(b). This confounding scheme provides within-blocks information on
both components of the AB interaction. The number of blocks in each group must equal 3 or a
multiple of 3.
shown in Figure 15.8-1(b). This confounding scheme provides within-blocks information on
both components of the AB interaction. The number of blocks in each group must equal 3 or a
multiple of 3.
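The partition of the AB interaction can be checked computationally. The sketch below (illustrative Python; the defining relations aj + bk = z mod 3 for (AB) and aj + 2bk = z mod 3 for (AB2) are the standard ones) also verifies that (A2B) and (AB2) define the same classes:

```python
from itertools import product

def component(weight, modulus=3):
    """Partition the p x p treatment combinations by a + weight*b = z (mod p)."""
    classes = {z: set() for z in range(modulus)}
    for a, b in product(range(modulus), repeat=2):
        classes[(a + weight * b) % modulus].add((a, b))
    return classes

ab = component(1)    # (AB):  a + b  = z (mod 3)
ab2 = component(2)   # (AB2): a + 2b = z (mod 3)

# (A2B), defined by 2a + b = z (mod 3), yields the same three classes
# as (AB2); only the z labels are permuted.
a2b = {z: {(a, b) for a, b in product(range(3), repeat=2)
           if (2 * a + b) % 3 == z}
       for z in range(3)}
assert {frozenset(s) for s in a2b.values()} == {frozenset(s) for s in ab2.values()}
```

Each component yields three classes of three combinations, accounting for 2 degrees of freedom apiece.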
Figure 15.8-1 ▪ Treatment combinations of RBCF-32 and RBPF-32 designs with blocks of
size 3. The numbers in the cells denote the levels of treatments A and B. Design (a)
confounds the (AB) component with groups; design (b) confounds the (AB) component
with blocks in Group0 and the (AB2) component with blocks in Group1.
The layout and computational procedures for an RBCF-32 design are shown in Table 15.8-1;
the analysis is summarized in Table 15.8-2. In this design, the (AB) component is confounded with groups of blocks.
The assumptions underlying this design are similar to those for an RBCF-22 design described earlier in this chapter.
Main-effects means for the data in Table 15.8-1 can be computed from summary tables based
on all three groups because the treatments are not confounded with groups. A reader who is
interested in examining simple main-effects means should consult Federer (1955).
The layout and computational procedures for an RBPF-32 design, where (AB) and (AB2) are
confounded with the blocks in groups 0 and 1, respectively, are shown in Table 15.8-3.
The analysis is summarized in Table 15.8-4. The summary in Table 15.8-4 should be compared
with that in Table 15.8-2, where the (AB) component is completely confounded with groups.
The advantage of the partially confounded design is that within-blocks information is available
for both components of the AB interaction.
The computational procedures for SSAB′(between) and SSAB′(within) shown in Table 15.8-3
represent relatively simple analysis procedures, but they throw little light on the nature of these
terms. Actually, both sums of squares are pooled terms; for example,
where Σ(AB)z and Σ(AB2)z are the sums of the observations in the treatment combinations that
satisfy the relations aj + bk = z (mod 3) and aj + 2bk = z (mod 3) (z = 0, 1, 2), respectively, and are confounded with blocks within a group. These treatment
combinations and associated sums are as follows:
The between-blocks components of the AB interaction, SS(AB)between and SS(AB2)between, are computed from the above sums.
The sum of these two components is equal to SSAB′(between). The two interaction
components are completely confounded with blocks within groups, SSBL(G). This can be
verified by comparing the sources of variation that make up SS(AB)between and SS(AB2)between with those for SSBL(G).
The within-blocks components are computed in the same manner from the groups in which Σ(AB2)z and Σ(AB)z are not confounded with blocks within a group. The treatment
combinations and associated sums are as follows:
It is customary to pool the two components, as was done in Table 15.8-4, in carrying out a test of the AB interaction.
Computation of Means
Main-effects means for the data in Table 15.8-3 can be computed from the AB summary table.
Simple main-effects means should be computed from adjusted cell totals based on the group
in which the interaction component is not confounded. The procedure is described by Federer
(1955).
The design in Table 15.8-3 provides only four degrees of freedom for testing the within-blocks
null hypotheses. The degrees of freedom for the error term can be increased by assigning n >
1 blocks to each interaction component, (AB)0, (AB)1,…, (AB2)2. The treatment combinations
in the modified design are shown in Figure 15.8-2. The computational formulas and degrees of
freedom for this design are given in Table 15.8-5. The analysis follows the pattern illustrated in
Table 15.8-4, where one block is assigned to each (AB)0, (AB)1,…, (AB2)2.
Figure 15.8-2 ▪ Treatment combinations of an RBPF-32 design with n > 1 blocks assigned
to each (AB)0, (AB)1,…, (AB2)2.
So far in this chapter, I have described designs that have 2 × 2, 2 × 2 × 2, and 3 × 3 treatments.
The principles discussed in connection with these designs can be readily extended to any
design of the form pk, where p is a prime number. Experiments involving mixed primes, such as
3 × 2 × 2 and 3 × 3 × 2, are more difficult to lay out and analyze. A computational example for a
3 × 3 × 2 design is given in Section 15.11. The purpose of this section is to extend the
principles described earlier in connection with designs having 2 × 2, 2 × 2 × 2, and 3 × 3
treatments to other unmixed designs. In the process, I show how to achieve further reductions
in block size by confounding two or more interactions with groups of blocks.
An RBCF-24 design in which the ABCD interaction is confounded with groups can be
constructed using the relation aj + bk + cl + dm = z (mod 2), where dm denotes the mth level of treatment D.
A complete block design, such as an RBF-2222 design, requires blocks of size 16. By
confounding the ABCD interaction with groups of blocks, the block size can be reduced to 8.
A researcher can choose to confound an interaction other than ABCD with groups if this seems
desirable. Generally, the highest-order interaction is confounded because it usually is of less
interest and less likely to be significant.
The block size of an RBCF-24 design can be further reduced by using two confounding
contrasts—say, ACD and AB—instead of one to assign treatment combinations to blocks. For
example, the two relations are aj + cl + dm = z (mod 2) for the (ACD) contrast and aj + bk = z (mod 2) for the (AB) contrast.
If we examine the 16 treatment combinations closely, we find that four of them satisfy both (ACD)0 and
(AB)0. These combinations are 0000, 0011, 1101, and 1110.
A different set of four treatment combinations satisfies (ACD)0 and (AB)1 and so on. Thus, we
can assign four treatment combinations to blocks so that they satisfy, simultaneously, two
relations. This procedure produces the RBCF-24 design with blocks of size 4 shown in Figure
15.9-1.
Figure 15.9-1 ▪ Treatment combinations of an RBCF-24 design with blocks of size 4. The
treatment combinations in each block satisfy two confounding contrasts: (ACD)z and
(AB)z.
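The double confounding can be sketched as follows (illustrative Python; the relations aj + cl + dm = z mod 2 and aj + bk = z mod 2 for the (ACD) and (AB) contrasts are assumed):

```python
from itertools import product

# Blocks of size 4 for an RBCF-2^4 design: each block holds the
# combinations that simultaneously satisfy (ACD)_z and (AB)_z', i.e.,
# a + c + d = z (mod 2) and a + b = z' (mod 2).
blocks = {}
for a, b, c, d in product((0, 1), repeat=4):
    key = ((a + c + d) % 2, (a + b) % 2)
    blocks.setdefault(key, []).append((a, b, c, d))

# The block satisfying both (ACD)_0 and (AB)_0:
print(blocks[(0, 0)])
```

The 16 combinations fall into four blocks of four, one block for each pair of z values.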
When two interactions are simultaneously confounded with groups of blocks, their generalized
interaction(s) or generalized treatment(s) is also confounded. The generalized interaction(s)
for any two confounding contrasts, symbolized by (X) and (Y), is given by the product (X)(Y)m − z reduced modulo m, where m is the modulus and z assumes the values m − 1, m − 2,…, m − (m − 1). For designs in which each treatment has two levels, m = 2, so m − z = 2 − 1 = 1 and (X)(Y)2 − 1 = (X)(Y). If (ACD) is substituted for (X) and (AB) for (Y), the
generalized interaction is given by (ACD)(AB) = (A2BCD) = (BCD), where the exponent of A is reduced modulo 2.
Hence, if (ACD) and (AB) are simultaneously confounded with groups of blocks, another
interaction, (BCD), also is confounded with groups of blocks. Care must be used in choosing
the confounding contrasts so that no main effects are confounded with groups of blocks. Such
confounding would have occurred if I had chosen (ABCD) and (ACD) as the confounding
contrasts because then treatment B would have been the generalized treatment: (ABCD)(ACD) = (A2BC2D2) = (B).
For any confounded design of the form pk, i interactions can be completely confounded in pi
blocks, where each block contains pk−i treatment combinations. The resulting design will have
[pi – (p − 1)i − 1]/(p − 1) generalized interactions. The computational formulas for an RBCF-24
design with (ACD) and (AB) as the confounding contrasts are given in Table 15.9-1.
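For two-level designs, the generalized-interaction rule reduces to adding exponent vectors modulo 2, which can be sketched as follows (illustrative Python):

```python
# Generalized interaction of two confounding contrasts in a 2^k design:
# represent each contrast by its exponent vector over treatments
# (A, B, C, D) and add the vectors modulo 2.
def generalized(x, y, m=2):
    return tuple((xi + yi) % m for xi, yi in zip(x, y))

ACD = (1, 0, 1, 1)
AB = (1, 1, 0, 0)
ABCD = (1, 1, 1, 1)

print(generalized(ACD, AB))    # exponent vector of (BCD)
print(generalized(ABCD, ACD))  # exponent vector of (B): a main effect is confounded
```

The second call reproduces the warning in the text: choosing (ABCD) and (ACD) as confounding contrasts would confound treatment B with groups of blocks.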
An RBPF-33 design can be laid out in blocks of size 9 by confounding one interaction
component with each group of blocks. The design also can be laid out in blocks of size 3 by
confounding two interaction components with each group of blocks. I describe the layout with
blocks of size 9 first.
I showed in Section 15.8 that the AB interaction in an RBPF-32 design can be partitioned into
two orthogonal components as follows:
SS df
AB (3 − 1)(3 − 1) = 4
(AB) (3 − 1) = 2
(AB2) (3 − 1) = 2
Similarly, the ABC interaction in an RBPF-33 design can be partitioned into four orthogonal
components as follows:
SS df
ABC (3 − 1)(3 − 1)(3 − 1) = 8
(ABC) (3 − 1) = 2
(ABC2) (3 − 1) = 2
(AB2C) (3 − 1) = 2
(AB2C2) (3 − 1) = 2
The treatment combinations of the design are shown in Figure 15.9-2. This design provides
within-blocks information on treatments A, B, and C and the AB, AC, and BC interactions from
all four groups. It provides three-fourths relative information on each component of the ABC
interaction.
The design shown in Figure 15.9-2 requires blocks of size 9. If this block size is considered too
large, further confounding can be used to reduce the block size to 3. This is accomplished by confounding two interaction components simultaneously with each group of blocks.
If (ABC) and (AB2) are confounded with groups of blocks, two generalized interactions also are
confounded: (ABC)(AB2) = (A2B3C) and (ABC)(AB2)2 = (A3B5C).
Actually, (A3B5C) = (B2C) and (A2B3C) = (A2C) when reduced modulo 3. As discussed earlier,
the power of the first term should always equal 1 to uniquely define the interaction
components. This is achieved by squaring the terms (B2C) and (A2C) and reducing them
modulo 3. For example, (B2C)2 = (B4C2) = (BC2).
It can be shown that (B2C) and (BC2) define the same treatment combinations; the same can
be said for (A2C) and (AC2).
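The two generalized components can be reproduced by adding exponent vectors modulo 3 and rescaling so that the leading exponent equals 1 (illustrative Python sketch):

```python
def normalize(vec, p=3):
    """Rescale so the first nonzero exponent equals 1 (mod p)."""
    lead = next(v for v in vec if v % p)
    inv = next(i for i in range(1, p) if (i * lead) % p == 1)
    return tuple((inv * v) % p for v in vec)

def generalized(x, y, z, p=3):
    """(X)(Y)^z with exponent vectors over (A, B, C) added modulo p."""
    return normalize(tuple((xi + z * yi) % p for xi, yi in zip(x, y)), p)

ABC = (1, 1, 1)
AB2 = (1, 2, 0)
print(generalized(ABC, AB2, 1))  # (A2C) normalized to (AC2): (1, 0, 2)
print(generalized(ABC, AB2, 2))  # (B2C) normalized to (BC2): (0, 1, 2)
```

The normalization step mirrors the squaring-and-reducing convention described in the text.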
As we have seen, if (ABC) and (AB2) are simultaneously confounded with groups of blocks,
two other interaction components also are confounded: (BC2) and (AC2). Care must be
exercised in choosing confounding contrasts so that main effects are not confounded with
groups of blocks. Kempthorne (1952, p. 299) lists 15 confounding schemes for a 3 × 3 × 3
design with blocks of size 3; only four of these do not confound a treatment with groups of
blocks. One layout for an RBPF-33 design with blocks of size 3 that provides within-blocks
information on all main effects and interaction components is shown in Figure 15.9-3. A
summary of the computational formulas for this design is given in Table 15.9-2.
The (ABC2) component is computed from groups g0, g2, and g3, the groups in which it is not confounded.
The procedures that I have described for RBCF-pk and RBPF-pk designs, where p is equal to 2
or 3, are applicable to designs in which p is any prime number. If p is equal to 5, the
AB interaction degrees of freedom for RBCF-52 and RBPF-52 designs can be partitioned as
follows:
SS df
AB (5 − 1)(5 − 1) = 16
(AB) (5 − 1) = 4
(AB2) (5 − 1) = 4
(AB3) (5 − 1) = 4
(AB4) (5 − 1) = 4
A balanced, partially confounded design can be laid out in four groups or a multiple of four
groups. Combinations of treatments A and B are assigned to blocks within each group by
means of the relations aj + bk = z (mod 5), aj + 2bk = z (mod 5), aj + 3bk = z (mod 5), and aj + 4bk = z (mod 5), which correspond to the (AB), (AB2), (AB3), and (AB4) components, respectively.
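Assuming the standard relations aj + k·bk = z (mod 5) for the coefficients k = 1, 2, 3, 4, the four components of the AB interaction can be sketched as follows (illustrative Python):

```python
from itertools import product

# Partition the 25 combinations of a 5 x 5 design into the four
# interaction components (AB), (AB2), (AB3), (AB4): component k groups
# combinations by a + k*b = z (mod 5).
components = {}
for k in (1, 2, 3, 4):
    classes = {z: [] for z in range(5)}
    for a, b in product(range(5), repeat=2):
        classes[(a + k * b) % 5].append((a, b))
    components[k] = classes

# Each component has 5 classes of 5 combinations, accounting for
# 4 degrees of freedom apiece (4 components x 4 df = 16 df for AB).
assert all(len(c) == 5 for cls in components.values() for c in cls.values())
```

This confirms the degrees-of-freedom partition shown above: four components, each with 5 − 1 = 4 degrees of freedom.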
It is not possible to describe in this chapter all or even a majority of the confounded factorial
designs. The approach that I have adopted is to present selected examples illustrating basic
principles and procedures applicable to a broad range of confounded designs. A listing of
confounded designs, together with pertinent information, is given in Table 15.9-3. The table
includes references that describe the layout and main features of each design. Many of the designs are discussed in greater detail in those references.
There are almost as many notation systems in experimental design as there are books on the
subject. Fortunately, a common thread runs through most of the systems. The reader can, with
a little effort, learn to be comfortable with each system. One specialized notation system,
however, is difficult to follow without a brief note of explanation. This system is widely used in
designating treatment combinations when each treatment has two levels. In this system, the
symbol for a treatment combination contains only those letters for which the treatment is at the
higher level. The absence of a letter indicates that the treatment is at the lower level. If all
treatments are at the lower level, the symbol (1) is used. The terms lower and higher may refer
to positions on a scale or may be an arbitrary distinction
between levels. A comparison of this specialized notation for a three-treatment experiment with
that used in this chapter is as follows: (1) = 000, a = 100, b = 010, ab = 110, c = 001, ac = 101, bc = 011, and abc = 111, where the digits denote the levels of treatments A, B, and C.
Although this notation system has certain advantages over the notation in this chapter, it is not
used here because it is appropriate only for treatments having two levels. The notation in this
chapter is more general in that it is appropriate for treatments with any number of levels.
Furthermore, the notation in this chapter lends itself to the application of modular arithmetic.
This alternative scheme is widely used but is less descriptive than that used in this chapter.
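Translating between the two notation systems is mechanical, as this small Python sketch illustrates (the function name is hypothetical):

```python
# Convert the specialized two-level notation, in which a symbol lists
# only the treatments at the higher level and "(1)" means all treatments
# at the lower level, into the 0/1 level tuples used in this chapter.
def to_levels(symbol, treatments="abc"):
    if symbol == "(1)":
        return tuple(0 for _ in treatments)
    return tuple(int(t in symbol) for t in treatments)

print(to_levels("(1)"))  # (0, 0, 0)
print(to_levels("ac"))   # (1, 0, 1)
print(to_levels("abc"))  # (1, 1, 1)
```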
Confounded factorial designs in which the numbers of levels of each treatment are not equal
are called mixed designs. Examples of mixed designs are those with 3 × 2 × 2, 3 × 3 × 2, and
4 × 3 × 2 treatments, where the numbers denote, respectively, the number of levels of
treatments A, B, and C. These designs generally entail a more complex analysis than unmixed
designs. And the choice of block size is much more restricted for the mixed designs. To not
confound main effects, the block size must be an integral multiple of the number of levels of
each treatment. Thus, for a 3 × 2 × 2 design, the block size must equal 6. This permits a0, a1,
and a2 to each occur twice in a block and b0, b1 and c0, c1 to occur three times. All 12
treatment combinations of a 3 × 2 × 2 design can be assigned to two blocks, as shown in the
following illustration.
A tabulation of the occurrence of aj, bk, cl, ajbk, ajcl, and bkcl, in the two blocks is as follows.
Each level of aj, bk, and cl as well as each combination of ajbk and ajcl occurs equally often in
blocks 0 and 1. This is not true for bkcl. As a result, the BC interaction is partially confounded
with blocks. Furthermore, the ABC interaction also is confounded with blocks because different
combinations of ajbkcl occur in the blocks. On reflection, it should be evident that the BC
interaction, which involves four treatment combinations, must be confounded in a 3 × 2 × 2
design because the block size is not a multiple of 4. Another mixed design, an RBPF-322
design, also is laid out in blocks of size 6. In this design, it is the AB and ABC interactions that
are confounded. The block size of 6 allows all ajcl and bkcl treatment combinations to occur
within a block but only six of the nine ajbk combinations.
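The tabulation argument can be checked computationally. The sketch below assumes one particular assignment, blocks formed from the relation bk + cl = z (mod 2), which reproduces the pattern described in the text: every level of A, B, and C and every ajbk and ajcl combination is balanced across blocks, while BC is confounded (illustrative Python):

```python
from itertools import product

# Split the 12 combinations of a 3 x 2 x 2 design into two blocks of
# size 6: block z receives the combinations with b + c = z (mod 2).
blocks = {0: [], 1: []}
for a, b, c in product(range(3), range(2), range(2)):
    blocks[(b + c) % 2].append((a, b, c))

# The b_k c_l pairs (0,0) and (1,1) appear only in block 0, and (0,1)
# and (1,0) only in block 1, so the BC interaction is confounded with
# blocks even though each a_j occurs twice in every block.
for z, combos in blocks.items():
    print(z, sorted({(b, c) for _, b, c in combos}))
```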
A balanced RBPF-322 design can be laid out in three groups of two blocks each. The layout is
shown in Table 15.11-1. The symbol a0(BC)0 stands for treatment combinations 000 and 011, where a0 is equal to 0 and (BC)0 satisfies the relation bk + cl = 0 (mod 2). Treatment combinations denoted by a1(BC)1 are 101 and 110, where a1 is equal to 1 and (BC)1 satisfies the relation bk + cl = 1 (mod 2).
Analysis procedures for an RBPF-322 design are illustrated in Table 15.11-2. The procedures
for adjusting SSBC and SSABC require special comment. Earlier I showed (Table 15.11-1) that
the effects of blocks 0 and 1, 2 and 3, and so on are confounded with (BC)0 and (BC)1 as well
as the ABC interaction. Consequently, the BC and ABC interactions must be adjusted for block
effects. The rationale underlying these adjustments has been discussed by Federer (1955, p.
255), Kempthorne (1952, p. 348), Li (1944), Nair (1938), and Yates (1937). I simply illustrate the
procedure here. If the BC effects were not confounded with blocks, they could be estimated by
taking the difference between two components, Σ(BC)0 and Σ(BC)1: SSBC = [Σ(BC)0 − Σ(BC)1]2/nvw,
where Σ(BC)0 is equal to the sum of the treatment combinations that satisfy bk + cl = 0(mod 2),
Σ(BC)1 is equal to the sum of the treatment combinations that satisfy bk + cl = 1(mod 2), and
nvw = (2)(6)(3) = 36. These sums can be obtained from the BC summary table.
The reader can verify that SSBC = 6.25 is the same value that is obtained using the more
familiar formula
where ΣSi denotes the sum of all observations in block i. The adjusted BC effects are based on
32 effective replications as opposed to 36 for the unadjusted BC effects. Federer (1955, p. 260)
illustrates the calculation of the number of effective replications. The adjusted sum of squares
for the BC interaction is given by
The same general procedure is used to adjust the ABC interaction for block effects. If the ABC
interaction is not confounded, it can be computed as follows:
Computation of the ABC interaction by this formula is analogous to summing the BC interaction
over the p levels of treatment A. The numbers required for this computation are obtained from
the ABC summary table:
The SSABC adj can be obtained by adjusting each [Σaj(BC)0 − Σaj(BC)1] for block effects and
subtracting SSBC adj.
The results of the analysis in Table 15.11-2 and the adjustment just described are summarized
in Table 15.11-3. It is apparent that the null hypotheses for treatments B and C and the BC
interaction can be rejected. In view of the significant BC interaction, a researcher might want to
test hypotheses regarding simple-effects contrasts or treatment-contrast interactions. These
tests should be carried out with adjusted bkcl cell means to remove the confounding effects of
the blocks. Procedures for adjusting the bkcl cell means are described next.
In an RBPF-322 design, treatments A, B, and C as well as the AB and AC interactions are not
confounded with blocks. Comparisons among means for main effects and simple main effects
involving AB and AC follow the procedures described previously for factorial designs. However,
because the BC interaction is partially confounded with blocks, cell means must be adjusted
before simple-effects contrasts or treatment-contrast interactions are tested. The required
adjustment for ΣYij00z and ΣYij11z is given by
The adjustment for ΣYij01z and ΣYij10z is numerically equal to the adjustment for ΣYij00z and ΣYij11z but
opposite in sign.
A t statistic for adjusted bkcl cell means that takes into account the effective number of
observations is
For a description of the adjustment procedure for simple simple-effects contrasts, see Cochran
and Cox (1957, p. 210).
Mixed confounded factorial designs present many special computational problems. A complete
presentation of these designs is beyond the scope of this book. General references include the
classic monograph by Yates (1937) and the work of Li (1944). Additional references are cited in
Table 15.11-4, which presents relevant information for a variety of mixed confounded factorial
designs.
A restricted cell means model can be used to represent data for a confounded factorial design.
The analysis procedures are similar to those for a split-plot factorial design that are described in
Section 12.14.
The restricted cell means model for the RBCF-22 design in Section 15.3 is
where Yijkz denotes a score in block i, treatment combination ajbk, and group z. εijkz is NID(0, σε2), and μijkz is subject to the restrictions that all AB × BL(G) population effects equal zero.
The analysis procedures for the restricted cell means model are illustrated for the computer-
aided instruction data in Table 15.3-1. For the purpose of forming coefficient matrices, C′, the
following hypotheses are expressed with vectors and matrices:
The hypothesis for the blocks within the g0th level of groups requires a
word of explanation. Because the blocks are nested in the groups, the hypothesis must involve
only those blocks that are nested in one level of groups—in this example, g0. A similar
modification is discussed in Sections 11.8 and 12.14.
The C′ coefficient matrices for computing the sums of squares are given by the following
Kronecker products:
where
The coefficient vectors are used to partition SSAB × BL(G) into SSAB × BL(g0) and
SSAB × BL(g1).
The J matrix in SSTO is obtained by computing the product of h × 1 and 1 × h sum vectors: J =
11′. The sums of squares and degrees of freedom are identical to those in Table 15.3-2
where the classical sums-of-squares approach was used. It can be shown, following the
procedures in Section 8.8, that the sum-of-squares formulas for treatments A and B provide
tests that are subject to the restriction that all AB × BL(G) population effects equal zero.
RBCF-pk designs sometimes have a different number of blocks in the w levels of groups and
missing observations. Procedures for dealing with the two problems are described in Section
12.14 for a split-plot factorial design. The procedures there can be used with confounded
factorial designs.
LSCF-p2 Design
An alternative confounding scheme that is much simpler than that already described uses a
Latin square as the building block design. Confounded factorial designs that are based on a
Latin square are not limited to experiments in which the number of treatment levels is a prime
number or a power of a prime number. This confounding scheme does have other restrictions
that are described in subsequent paragraphs.
Consider the Latin square in Figure 15.13-1. A comparison of the treatment combinations in this
square with those in Group1 of Figure 15.8-2, which forms an RBCF-32 design, reveals that
they are identical if the columns are reordered. These treatment combinations are based on the
relation aj + 2bk = z (mod 3) or (AB2). From this it is apparent that a confounded factorial
design based on a Latin square also confounds a component of the AB interaction with groups
of blocks. Although both designs use the same treatment combinations and reduce the block
size from 9 to 3, they involve different construction procedures and assumptions.
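The correspondence between the Latin square and the (AB2) component can be verified directly (illustrative Python; the square with cell entry a = (z + b) mod 3 is one standard square satisfying aj + 2bk = z mod 3):

```python
# A 3 x 3 Latin square whose cell (row z, column b) contains level
# a = (z + b) mod 3; then a + 2b = z + 3b = z (mod 3), so each row is
# one class of the (AB2) interaction component.
p = 3
square = [[(z + b) % p for b in range(p)] for z in range(p)]

# Latin property: every row and every column is a permutation of 0..p-1.
assert all(sorted(row) == list(range(p)) for row in square)
assert all(sorted(col) == list(range(p)) for col in zip(*square))

# Each cell satisfies the (AB2) relation a + 2b = z (mod 3).
assert all((square[z][b] + 2 * b) % p == z for z in range(p) for b in range(p))
```

The rows of the square thus play the role of the groups of blocks in the RBCF-32 layout.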
A confounded factorial design that is based on a Latin square is denoted by LSCF-p2. The
design in Figure 15.13-1 has two treatments and one nuisance variable. A design based on a
Latin square is classified as a confounded factorial design if the letter in the square, aj, and the
column variable, bk, represent treatments and the row (group) variable represents a nuisance
variable. This point is developed further in Section 16.8.
where
The reader has probably noted an unusual feature of this model: the presence of an AB′
interaction component in a design developed from a Latin square building-block design. This reflects the fact that a component of the AB interaction is confounded with the groups (rows) of the square.
An LSCF-p2 design is restricted to experiments with two treatments and one nuisance variable.
Any number of levels of the two treatments and the nuisance variable can be used in the
design as long as they all have the same number of levels. The designation LSCF-p2 instead of
LSCF-pq, for example, conveys this.
If repeated measurements are obtained on the subjects, nw subjects are randomly assigned to
the w groups (nuisance variable) with n in each group. The sequence of administration of the p
treatment combinations within a block is randomized independently for each block. For the
nonrepeated measurements case, nw blocks, each containing p matched subjects, are
randomly assigned to the w groups. The p matched subjects in each block then are randomly
assigned to the p treatment combinations within a block.
For purposes of comparison, the data from Table 15.8-1 are analyzed by means of an LSCF-32
design. This example provides only six degrees of freedom for the error term, which is
inadequate. The design should have at least five blocks per group. This would provide 24
degrees of freedom for the error term. It is assumed in Latin square confounded factorial
designs that treatments represent fixed effects and blocks are random. The layout and
computational procedures are presented in Table 15.13-1. The results are summarized in Table
15.13-2.
According to this summary, the null hypothesis can be rejected for treatment B. The same
decision was reached earlier using RBCF-32 and RBPF-32 designs; see the ANOVA summaries
in Tables 15.8-2 and 15.8-4, respectively. All three designs require the same number of
observations. Is one design a better choice than the others? The choice is really between (1) an
RBPF-32 design and (2) RBCF-32 and LSCF-32 designs because the latter designs are
essentially equivalent. Consider the interaction and error degrees of freedom for the three
designs summarized as follows:
The RBCF-32 and LSCF-32 designs provide more degrees of freedom for estimating the error
term but do not provide within-blocks information on both components of the AB interaction. A
researcher who is more interested in the AB interaction than in one of the treatments should
consider using an SPF-3.3 design (see Figure 15.13-2). This design provides within-blocks
information on the AB interaction but confounds treatment A with groups of blocks. It should be
apparent that subtle variations in the assignment of subjects to treatment combinations can
markedly affect the nature and power of tests of significance.
Figure 15.13-2 ▪ Treatment combinations of an SPF-3.3 design. This design involves the
same number of observations as the RBCF-32, RBPF-32, and LSCF-32 designs discussed
in the text.
LSCF-r.p2 Design
The LSCF-32 design can be extended to include a third treatment, C, whose r levels correspond to r Latin squares. The design is denoted by LSCF-r.p2. Assumptions of the LSCF-32 design also
apply to this design. There are no restrictions on the number of levels of treatment C, and
interactions of C with A and B are permissible. Computational formulas for this design are given
in Table 15.13-3. The E(MS)s in the table are appropriate for an experiment in which treatments
A, B, and C are fixed effects. Blocks are assumed to be random. The components of the AB
interaction can be tested separately for the various squares or pooled prior to the test. If the
components are pooled, the resulting sum of squares represents a component of the ABC
interaction.
Researchers today are well aware of the benefits of using a randomized block factorial design
to reduce residual error variation by isolating variation associated with blocks (subjects).
Unfortunately, even a relatively small randomized block factorial design may require a
prohibitively large block size. For example, an RBF-33 design requires blocks of size 9. A split-
plot factorial design provides one solution to this problem by confounding treatment A with
between-blocks variation, thereby reducing the block size to 3. An alternative solution is
provided by RBCF-32, RBPF-32, and LSCF-32 designs. These designs confound the AB
interaction or interaction component with between-blocks variation and reduce the block size
from 9 to 3. However, the advantage of reduced block size in split-plot and confounded factorial
designs is achieved at a price. In the case of an SPF-3.3 design, the price a researcher pays is
loss of power in evaluating a treatment. In the cases of RBCF-32, RBPF-32, and LSCF-32
designs, the price is loss of power in evaluating an interaction. The particular research question
determines which form of confounding is appropriate and, indeed, if either form is appropriate.
An examination of the current literature in the behavioral and social sciences and education
reveals an almost total absence of experiments using confounded factorial designs. This
observation is in sharp contrast to the observation that a split-plot factorial design, which
involves a different form of confounding, is one of the most popular designs. The difference in
the popularity of the two designs, both of which accomplish the same objective of reducing
block size, can be attributed to two factors: (1) Researchers are familiar with computer
programs for split-plot factorial designs but less familiar with programs for confounded factorial
designs, and (2) many researchers are not familiar with the advantages of confounded factorial
designs. It should be abundantly clear that the selection of an analysis of variance design
requires an intimate knowledge of a research area as well as knowledge of the advantages and
disadvantages of alternative designs.
1.The block size can be reduced by a factor of 1/2, 1/3, and so on, thereby achieving more
homogeneous blocks.
2.These designs can be used when it is not possible to administer all treatment combinations
within each block. This condition may exist because of a lack of sufficient homogeneous
subjects to form complete blocks or because only a portion of the treatment combinations
can be administered in a reasonable time interval. If repeated observations are obtained on
subjects, the sheer number of treatment combinations may preclude administering all
combinations to each subject.
3.All main effects can be tested using a within-blocks error term. The use of a within-blocks
error term instead of a between-blocks error term usually results in a more powerful test of
a false null hypothesis.
4.The experimental design can be tailored so that only interactions that are believed to be
insignificant are partially or completely confounded.
5.For designs in which each treatment has two levels, the block size always can be reduced
to 2 by confounding a sufficient number of interactions.
4.Because of their greater complexity of layout and analysis, randomized block factorial
designs may actually involve more work for the researcher than less powerful designs that
use more subjects, such as a completely randomized factorial design. The availability of
subjects, cost of assigning homogeneous subjects to blocks, and cost of running subjects
must be taken into account in the choice of a design.
1.Terms to remember:
a.group-interaction confounding (15.1)
b.congruent integers (15.2)
c. prime number (15.2)
d.modular addition (15.2)
e.modular multiplication (15.2)
f. confounding contrast (15.2)
g.complete confounding (15.2)
h.partial confounding (15.6)
i. balanced partial confounding (15.6)
j. unbalanced partial confounding (15.6)
k. relative information (15.7)
l. interaction component (15.8)
m.generalized interaction or treatment (15.9)
n.mixed design (15.11)
*2.[15.2] Are the following pairs of numbers congruent?
*a.12 and 5 reduced modulo 2
*b.6 and 10 reduced modulo 2
*c.6 and 12 reduced modulo 3
*d.9 and 24 reduced modulo 5
e.4 and 10 reduced modulo 3
f. 7 and 9 reduced modulo 2
g.6 and 12 reduced modulo 5
h.11 and 26 reduced modulo 5
*3.[15.2] Fill in the blank.
*a.2 + 4 = ____ (mod 5)
*b.(2)(3) = ____ (mod 3)
*c.2 + 3 = ___ (mod 3)
*d.(1)(4) = ____ (mod 5)
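The congruence and modular arithmetic reviewed in Section 15.2 are easy to check by machine; a short Python sketch (mine, not from the text) that can be used to verify answers to exercises like these:

```python
def congruent(a, b, m):
    """a and b are congruent modulo m when m divides a - b."""
    return (a - b) % m == 0

congruent(12, 5, 2)   # False: 12 - 5 = 7 is not divisible by 2
congruent(6, 10, 2)   # True: the difference, 4, is divisible by 2
(2 + 4) % 5           # modular addition, as in Exercise 3a
(2 * 3) % 3           # modular multiplication, as in Exercise 3b
```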
data (see next page) for this RBCF-22 design were obtained.
*a.[15.3] Test the null hypotheses for treatments A and B; let α = .05.
*b.[15.3] Does this confounded factorial design appear to be a good choice for the
experiment? Why?
*c.[15.4] Test the homogeneity of variance hypothesis. Use the Fmax statistic; let α = .05.
*d.[15.4] Test the hypothesis that the population variances estimated by MSAB × BL(g0)
and MSAB × BL(g1) are homogeneous. Use the Fmax statistic; let α = .05.
e.[15.5] Prepare a “results and discussion section” for the Journal of Verbal Learning and
Verbal Behavior.
6.[15.5]
a.Construct a diagram for an RBCF-23 design in which the BC interaction is confounded with groups.
overall intensities: 88 dBA, treatment level b0; 85 dBA, treatment level b1; and 82 dBA,
treatment level b2. Six male observers with normal audiograms were randomly assigned to
two groups, with three observers in each group. The order of administration of the three
treatment combinations within a block was randomized independently for each block. The
following data were obtained:
9.[15.9]
a.Determine the treatment combinations assigned to groups for an RBCF-25 design
based on the relations aj + bk + cl = z (mod 2) and cl + dm + eo = z (mod 2).
b.What is the generalized interaction for this design?
*10.
[15.9] Determine the treatment combinations assigned to groups for an RBCF-52 design
*b.Specify, using scalar notation, the hypotheses that are used to compute the coefficient
matrices, C′, and list the coefficient matrices.
*c.Specify μ′.
*d.Perform the analysis using the cell means model approach; let α = .05. This problem
requires a computer with a matrix package.
*14.
[15.12] Assume that the following data are obtained in Exercise 5:
15.
[15.12] Exercise 8 describes an experiment to determine the annoyance values of three
aircraft noise spectra.
a.Write the cell means model for this RBPF-32 design.
b.Specify the hypotheses that are used to compute the coefficient matrices C′, and list the
coefficient matrices. Some of the coefficient matrices cannot be obtained from
Kronecker products. Obtain them by determining the treatment combinations that
provide a test of H0: μ..0. – μ..1. = 0 and μ..1. – μ..2. = 0, and obtain the others similarly.
The (AB) and (AB2) components are computed from the groups in which they are not
confounded with blocks. The first two rows of the null hypothesis are from group 0; the
second two rows are from group 1 (see Section 15.8).
c. Specify μ′.
d.Perform the analysis using the cell means model approach; let α = .05.
e.Use the Fisher-Hayter statistic to determine which population means differ.
*16.
[15.13] The data of Exercise 8 can be analyzed as an LSCF-32 design. The layout is as
follows.
*a.Compare the randomization procedure for this design assuming the use of repeated measurements.
17.
[15.13] Twelve elementary school teachers viewed a videotape of three fourth-grade boys
and then completed a referral form containing items from the Children's Personality
Questionnaire, Form A, and the California Test of Personality, Elementary Series. The
teachers were randomly assigned to 1 of 12 blocks. Prior to viewing the videotape, teachers
in three of the groups were told that the three boys were learning disabled (treatment level
c1). Teachers in the other three groups were told that the boys were normal (treatment level
c0). Additional fictitious information concerning the boys—occupation of father and
classroom behavior—was also provided. The categories professional (level a0), blue collar
(level a1), and unemployed (level a2) were combined in two 3 × 3 Latin squares with the
categories “no disruptive behavior” (b0), “disruptive behavior” (b1), and “disruptive, acting
out behavior” (b2). The order of administration of the ajbk treatment combinations in each
block was randomized independently for each teacher. The dependent variable in this
LSCF-2.32 design was the number of referral items checked by the teachers. The following
data were obtained.
1A prime number is any number that is divisible by no number smaller than itself other than 1.
2Kirk (1993) illustrates the analysis of an RBCF-22 design with a missing observation using a
cell means model and a regression model.
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
http://dx.doi.org/10.4135/9781483384733.n15
I have described a variety of factorial designs in Chapters 9, 10, 12, and 15. These designs
share a common characteristic: All of the treatment combinations appear in the design.
Fractional factorial designs do not share this characteristic. These designs include only a
fraction—for example, one half, one third, one fourth, and so on—of the treatment combinations of a complete
factorial design. Fractional factorial designs significantly reduce the number of treatment
combinations in an experiment. For example, a CRF-3333 design has 81 treatment
combinations. By using a one-third fractional factorial design, only 81/3 = 27 of these
combinations need to be included in the experiment.
Twenty-five years after Fisher's original work on analysis of variance, Finney (1945, 1946)
developed the theory for 2k and 3k fractional factorial designs. Kempthorne (1947) extended
Finney's work to designs of the type pk, where p is any prime number. In the pk designation, p
denotes the number of levels of each treatment, and k denotes the number of treatments.
Fractional factorial designs have found their greatest use in industrial research. Only limited
applications of these designs have been made in agricultural research, which provided the
impetus for the development of most designs in use today.
Fractional factorial designs have much in common with the confounded factorial designs in
Chapter 15. The latter designs achieve a reduction in the number of treatment combinations
that must be included in a block. Fractional factorial designs achieve a reduction in the number
of treatment combinations in an experiment. The reduction in the size of an experiment,
however, comes at a price. Considerable ambiguity may exist in interpreting the results of an
experiment when the design includes only one half or one third of the treatment combinations.
Ambiguity occurs because two or more names can be given to each sum of squares. For
example, a sum of squares might be attributed to the effects of treatment A and the BCDE
interaction. The two or more names given to the same sum of squares are called aliases. In a
one-half fraction of a 2k factorial design, all sums of squares have two aliases. In a one-fourth
fraction of a 2k factorial design, all sums of squares have four aliases. Careful attention must
be given to the alias pattern of a proposed design to minimize confusion in interpreting tests of
significance. Treatments are customarily aliased with higher-order interactions that are
assumed to equal zero. This helps to minimize but does not eliminate ambiguity in interpreting
the outcome of an experiment.
A fractional factorial design is appropriate for experiments that meet, in addition to the
assumptions of analysis of variance, the following conditions:
1.The experiment contains a large number of treatments and a prohibitively large number of
treatment combinations. Fractional factorial designs are rarely used for experiments with
less than four or five treatments.
2.The number of treatment levels should, if possible, be equal for all treatments. With the
exception of fractional factorial experiments using a Latin square building block design,
analysis procedures for designs with mixed numbers of treatment levels are relatively
complex.
3.A researcher should have some a priori reason for believing that a number of higher-order
interactions are equal to zero or small relative to treatments. In practice, fractional factorial
designs, with the exception of those based on a Latin square, are most often used with
treatments having either two or three levels. The use of a restricted number of levels
increases the likelihood that interactions are insignificant.
In this chapter, I describe fractional factorial designs that are constructed from three building
block designs: a completely randomized design, a randomized block design, and a Latin
square design. The fractional factorial designs are designated by the letters CRFF, RBFF, and
LSFF, respectively.
Fractional factorial designs achieve a reduction in the number of treatment combinations that
must be included in an experiment by confounding main effects with higher-order interactions.
This form of confounding completes a cycle that I began in Chapter 12. There I described split-
plot factorial designs that use group-treatment confounding. In Chapter 15, I described
confounded factorial designs that use group-interaction confounding. Here I describe
treatment-interaction confounding. Note that whenever confounding is used, some
information is lost. However, if information concerning certain interactions in a fractional factorial
design is of negligible interest, a researcher can employ confounding so as to sacrifice only this
information. The advantage of confounding in terms of a reduction in experimental resources
and time may more than compensate for the lost information.
16.2 General Procedures for Constructing Completely Randomized Fractional Factorial Designs
Procedures for constructing fractional factorial designs are closely related to those for
confounded factorial designs. Recall from Chapter 15 that the confounding contrast (ABCD)z, z = 0 or 1, divides the treatment combinations of a 2k factorial design into two sets. A design that includes only one of the two sets is a one-half fractional factorial design. The design is denoted by CRFF-2k–1, where 2k indicates that each of the k treatments has p = 2 levels, and the negative exponent, −1, indicates that the design is a one-half fraction of a complete 2k factorial design. The rationale for the 2k–1 designation is not immediately obvious. On reflection, the designation makes sense because a one-half fraction of a 2k factorial design can be written as (1/2)(2k) = (2−1)(2k) = 2k−1.
All sums of squares in a fractional factorial design have two or more aliases. The alias pattern
for a one-half fractional factorial design can be determined by multiplying the letters that
denote each sum of squares by the defining contrast and expressing the product modulo m.
Computation of the alias pattern for a CRFF-24–1 design where (ABCD) is the defining contrast
is shown in Table 16.2-1. It is apparent from the table that all treatments have three-treatment
interactions as aliases. For example, treatment A is aliased with the BCD interaction. All two-
treatment interactions have other two-treatment interactions as aliases. This alias pattern is not
entirely satisfactory because of the ambiguity associated with interpreting significant two-
treatment interactions. It is not possible to determine from this design whether a significant two-
treatment interaction is associated with, for example, the AB or the CD interaction. The question
about which name or label to assign to a significant two-treatment interaction can be answered
by conducting a second experiment that includes the treatment combinations not included in
the first experiment. For example, suppose that the treatment combinations for the defining
contrast (ABCD)0 are run and the AB, alias CD, interaction is significant. A second experiment
can be run using (ABCD)1 as the defining contrast. If the two experiments are combined, a
complete factorial design is obtained in which the ABCD interaction is confounded with the two
halves of the combined experiment. This sequential approach clarifies the interpretation
because the AB and CD interactions are no longer confounded.
Table 16.2-1 ▪ Alias Pattern for CRFF-24–1 Design With (ABCD) as the Defining Contrast
(One-Half Fractional Factorial Design)
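The alias computation just described, multiplying an effect by the defining contrast and reducing exponents modulo 2, can be sketched in Python. With two-level treatments, reduction modulo 2 amounts to a symmetric difference of letter sets (the function name is my own):

```python
def alias(effect, defining, letters="ABCD"):
    """Multiply an effect by the defining contrast and reduce exponents
    modulo 2; with 0/1 exponents this is the symmetric difference of the
    two letter sets."""
    prod = set(effect) ^ set(defining)
    return "".join(l for l in letters if l in prod)

alias("A", "ABCD")    # treatment A is aliased with the BCD interaction
alias("AB", "ABCD")   # the AB interaction is aliased with CD
alias("ABC", "ABCD")  # ABC is aliased with treatment D
```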
Often, it is not necessary to carry out the second fraction of an experiment. The first fraction
provides enough information about the treatments to move on to the next stage of
experimentation such as adding or removing treatments or changing the levels of the
treatments.
A one-fourth fraction of a 2k factorial design is denoted by CRFF-2k−2, where 2k indicates that each of the k treatments has p = 2 levels, and
the negative exponent, −2, indicates that the design is a one-fourth fraction of a complete 2k
factorial design. The rationale for the 2k–2 designation is as follows: a one-fourth fraction of a
2k design can be written as (1/4)(2k) = (2−2)(2k) = 2k−2. In general, a fractional factorial design is denoted by CRFF-2k−i, where i is the key to identifying the design fraction. As you
will see, i also represents the number of defining contrasts that are required to construct a
design. A one-half fractional factorial design, CRFF-2k–1, requires one defining contrast. A one-
fourth design, CRFF-2k–2, requires two defining contrasts. A one-eighth design, CRFF-2k–3,
requires three defining contrasts, and so on.
As I noted above, a CRFF-2k–2 design can be constructed by using two defining contrasts to
select the treatment combinations that are included in an experiment. When two defining
contrasts are used, all sums of squares have four aliases. Two of the aliases are given by the
product of a treatment or interaction with each of the two defining contrasts. The third alias is
given by the product of a treatment or interaction with the generalized interaction of the two
defining contrasts.1 The fourth alias is the treatment or interaction itself. For example, assume
that the two defining contrasts for a CRFF-24–2 design are (ABC) and (ACD). The generalized
interaction of (ABC) and (ACD), following procedures described in Section 15.9, is (A2BC2D) = (BD), reduced modulo 2.
The aliases for each source of variation are shown in Table 16.2-2.
Table 16.2-2 ▪ Alias Pattern for CRFF-24–2 Design With (ABC) and (ACD) as the Defining
Contrasts (One-Fourth Fractional Factorial Design)
A CRFF-24−2 design with (ABC) and (ACD) as the defining contrasts is not a satisfactory
design because some treatments are aliased with other treatments. For example, treatment B is
aliased with treatment D. If an experiment contains a large number of treatments, a one-fourth
fraction of a 2k design can be a practical design. For example, an experiment with eight
treatments can be designed so that no treatment or two-treatment interaction is aliased with
other treatments or two-treatment interactions. The number of treatment combinations can be
reduced from 256 to 64 by this procedure. Care must be used in the selection of defining
contrasts so as to avoid aliasing treatments with other treatments.
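The four-alias pattern for two defining contrasts can be sketched by extending the modulo 2 multiplication to each defining contrast and to their generalized interaction (a sketch under the conventions above; function names are mine):

```python
def product_mod2(x, y):
    """Multiply two effects and reduce exponents modulo 2
    (symmetric difference of letter sets)."""
    return set(x) ^ set(y)

def aliases(effect, defining_contrasts, letters="ABCD"):
    """Aliases of an effect: its products with each defining contrast
    and with their generalized interaction."""
    d1, d2 = defining_contrasts
    words = [set(d1), set(d2), product_mod2(d1, d2)]  # generalized interaction last
    return ["".join(l for l in letters if l in product_mod2(effect, w))
            for w in words]

aliases("B", ("ABC", "ACD"))  # treatment B is aliased with treatment D
```

The last alias in the returned list shows why this CRFF-24−2 design is unsatisfactory: the generalized interaction (BD) aliases treatment B with treatment D.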
The nature of the alias pattern for a fractional factorial design is an important characteristic of
the design. The letter R for resolution followed by a Roman numeral is used to denote
different alias patterns. The alias pattern for resolution II through resolution VI designs is
summarized in Table 16.2-3. According to the table, a resolution II design, denoted by RII, is
one for which a treatment, say A, is aliased with another treatment, say B, or an interaction.
This alias pattern describes the one-fourth fractional factorial design in Table 16.2-2. A
resolution III design, denoted by RIII, is one for which no treatment is aliased with other
treatments, but treatments are aliased with two-treatment or higher-order interactions. Two-
treatment interactions are aliased with treatments, other two-treatment interactions, or higher-
order interactions. This alias pattern describes the following one-half fractional factorial design
with (ABC) as the defining contrast.
Resolution (R) s R − s
II  1 II − 1 = 1
III 1 III − 1 = 2
III 2 III − 2 = 1
IV  1 IV − 1 = 3
IV  2 IV − 2 = 2
IV  3 IV − 3 = 1
V   1 V − 1 = 4
V   2 V − 2 = 3
V   3 V − 3 = 2
V   4 V − 4 = 1
VI  1 VI − 1 = 5
VI  2 VI − 2 = 4
VI  3 VI − 3 = 3
VI  4 VI − 4 = 2
VI  5 VI − 5 = 1
A resolution IV design, denoted by RIV, is one for which no treatment is aliased with other
treatments or two-treatment interactions, but two-treatment interactions are aliased with other
two-treatment or higher-order interactions. This alias pattern describes the one-half fractional
factorial design in Table 16.2-1. A resolution V design, denoted by RV, is one for which no
treatment or two-treatment interaction is aliased with other treatments or two-treatment
interactions, but two-treatment interactions are aliased with three-treatment or higher-order
interactions. For designs in which each treatment has two levels, the resolution is equal to the
smaller of the number of letters in the defining contrast and generalized interaction.
In a resolution IV design, for example, no source of variation denoted by s = 1 letter is aliased with another treatment or interaction that contains fewer than 4 − 1 = 3 letters, and no source of variation denoted by s = 2 letters is aliased with another treatment or interaction that contains fewer than 4 − 2 = 2 letters.
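The rule that resolution equals the smallest number of letters in any defining contrast or generalized interaction can be sketched as follows (my own helper, not from the text):

```python
from itertools import combinations
from functools import reduce

def resolution(defining_contrasts):
    """Smallest number of letters in any word of the defining relation:
    the defining contrasts plus all generalized interactions (products of
    subsets of the contrasts, exponents reduced modulo 2)."""
    contrasts = [frozenset(c) for c in defining_contrasts]
    words = [reduce(lambda x, y: x ^ y, combo)
             for r in range(1, len(contrasts) + 1)
             for combo in combinations(contrasts, r)]
    return min(len(w) for w in words)

resolution(["ABCD"])        # one-half fraction with (ABCD): resolution IV
resolution(["ABC", "ACD"])  # generalized interaction (BD): resolution II
```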
A careful examination of the alias pattern in Table 16.2-1 reveals an interesting feature of this
one-half fractional factorial design. The incomplete four-treatment design contains all of the
treatment combinations of a complete three-treatment design. That is, if any one of the four
treatment labels is ignored, the experiment becomes a complete three-treatment factorial
design. This point is more obvious if the treatments and interactions in Table 16.2-1 are
rearranged. For purposes of illustration, treatment label D is ignored in the following source
column.
Source Alias
A BCD
B ACD
C ABD
AB CD
AC BD
BC AD
ABC D
An examination of the source column reveals that all treatments and interactions of a complete
CRF-222 design are present. The significance of this arrangement becomes apparent in the
following section where I describe the analysis procedures for fractional factorial designs.
The computational procedures for this one-half fractional factorial design are the same as those for the complete 23 design, CRF-222, in Section 10.2. This means that if one of the
treatments is ignored, the analysis of an incomplete design can be carried out as if it were a
complete design. The strategy of ignoring a treatment generalizes to other fractional factorial
designs. For example, the analysis of a one-fourth fraction of a 24 design, CRFF-24−2, is the
same as the analysis of a complete 22 design, CRF-22. Similarly, the analysis of a one-eighth
fraction of a 25 design, CRFF-25−3, is the same as the analysis of a complete 22 design, CRF-
22.
The first step in laying out a CRFF-24−1 design is to determine the treatment combinations to
include in the experiment. This is facilitated by the use of modular arithmetic. In this chapter, as
in Chapter 15, I follow the convention of using the subscript zero to designate the first level of a
treatment. In designing a one-half fractional factorial design, it is customary to use the highest-
order interaction as the defining contrast because it is less likely to be significant than lower-
order interactions. For a CRFF-24−1 design, treatment combinations that satisfy either (ABCD)0
or (ABCD)1 can be used as the defining contrast. Assume that (ABCD)0 is selected by the toss
of a coin. The treatment combinations that are based on this defining contrast are 0000, 0011, 0101, 0110, 1001, 1010, 1100, and 1111.
The numbers 0000 and so on refer to the level of treatments A, B, C, and D, respectively. Thus,
1010 denotes treatment combination a1b0c1d0.
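The selection of combinations satisfying (ABCD)0, that is, aj + bk + cl + dm = 0 (mod 2), can be verified with a short enumeration (a sketch, not from the text):

```python
from itertools import product

# keep the combinations for which a_j + b_k + c_l + d_m = 0 (mod 2)
half = ["".join(map(str, t)) for t in product((0, 1), repeat=4)
        if sum(t) % 2 == 0]
```

The list `half` contains the eight combinations 0000, 0011, 0101, 0110, 1001, 1010, 1100, and 1111; the remaining eight combinations satisfy (ABCD)1.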
Procedures for analyzing a fractional factorial design are illustrated in Table 16.3-1. For
computational purposes, treatment D is ignored. The choice of which treatment to ignore is
arbitrary. This example has the virtue of simplicity, but, as noted previously, a CRFF-24−1
design is not entirely satisfactory because of ambiguity in interpreting two-treatment
interactions. Recall that this design is of resolution IV, which means that two-treatment
interactions are aliased with other two-treatment or higher-order interactions. It is assumed that
the following experimental design model equation is appropriate for the design:
The notation [(αβ)jk (γδ)lm] indicates that the two interactions are aliases and cannot be
differentiated in the design. A fixed-effects model is assumed to be appropriate for the CRFF-24−1 design.
The analysis of the CRFF-24−1 design is summarized in Table 16.3-2. The null hypothesis for
treatments B and C can be rejected. The ANOVA table illustrates a major problem inherent in
fractional designs—how to interpret significant mean squares that are aliased with other mean
squares of the same order. There is no way to determine from the analysis just performed
whether, for example, the AB or CD interaction is responsible for the significant F statistic or
whether both are responsible. The experiment provides only one mean square that is a function
of AB and CD. Confusion with respect to interpreting AB and CD can be resolved by carrying
out the other half of the experiment and combining it with the first half. G. E. P. Box et al.
(2005) discuss the general problem of carrying out follow-up experiments to clarify the
interpretation of fractional factorial experiments. Bennett and Franklin (1954, p. 597) interject a
word of caution concerning combined experiments. They point out that some bias is introduced
in the significance test for combined experiments if a decision to carry out the second half of an
experiment is based on results obtained in the first half. The discussion in Section 9.11
concerning preliminary tests on the model and pooling procedures is relevant to this issue. In
actual practice, it is customary to combine fractional factorial experiments as if no test of
significance had preceded the joint analysis. Finney (1960) states, “Undoubtedly the danger of
bias arises, although the nature of this cannot be made clear without much fuller consideration
of sequential experimentation, but many researchers will, probably rightly, regard the risk as a
reasonable price to pay for the advantages and economics gained” (p. 139). Onyiah (2009, pp.
554–555) echoes Finney's position.
It is easy to show that the sums of squares for the AB and ABC interactions, for example, in
Table 16.3-2 are indistinguishable from the sums of squares, respectively, for the CD interaction
and treatment D. The information contained in the ABC summary table in Table 16.3-1 can be
used to construct the following CD summary table where treatment B has been ignored.
A comparison of SSD with SSABC and SSCD with SSAB shows that they are equal. This
example can be extended to all aliases in Table 16.3-2.
The example in Table 16.3-1 contains a total of 16 subjects with two observations in each cell.
This example provides a within-cell estimate of the error variance. In larger fractional factorial
designs, it is customary to assign only one observation to the treatment combinations. The
higher-order interactions are pooled to obtain an estimate of the error variance. Extensive tables
of plans (Giesbrecht & Gumpertz, 2004; Gunst & Mason, 1991; Montgomery, 2009) are
available that minimize undesirable alias patterns for 1/2, 1/4, 1/8, and so on fractional factorial
designs. Of particular interest are the extensive tables published by the National Bureau of
Standards (1957), which provide plans for fractional factorial designs with up to 16
treatments.
If each treatment has three instead of two levels, the experiment can be designed so that only
1/3 or 1/9, and so on, of the treatment combinations must be included in the experiment.
The computational procedures described in previous sections of this chapter are applicable to
these experiments. However, the analysis and interpretation of one-third fractional factorial
designs, CRFF-3k–1, are more complicated than those for one-half fractional factorial designs.
Effects have three aliases in a one-third fraction of a 3k design, instead of two as in a one-half
fraction of the 2k design. If a one-ninth fraction of a 3k design is used, effects have nine
aliases, whereas in a one-fourth fraction of the 2k design, effects have only four aliases. In
general, CRFF-3k–1 designs are less satisfactory than CRFF-2k–1 designs. One reason, in
addition to greater complexity, is that CRFF-3k–1 designs must have a relatively large number
of treatments (k > 5) to provide useful estimates of two-treatment interactions. Another problem
concerns the interpretation of interactions. In Chapter 15 on confounded factorial designs, I
showed that the AB interaction in a complete 3k design can be partitioned into two orthogonal
components (AB) and (AB2). It is customary in 3k confounded factorial designs to pool these
two estimates. This is rarely possible in a one-third fractional factorial design because the
interaction components have different aliases. Connor and Zelen (1959) have prepared
extensive tables of fractional factorial designs.
Assume that a researcher wants to lay out a CRFF-34−1 design using one third of the
treatment combinations. One of the components of the ABCD interaction can be selected as
the defining contrast. This interaction, which has 16 degrees of freedom, can be partitioned
according to procedures described in Section 15.9 as follows:
Interaction df
ABCD 16
(ABCD) 2
(ABCD2) 2
(ABC2D) 2
(AB2CD) 2
(ABC2D2) 2
(AB2CD2) 2
(AB2C2D) 2
(AB2C2D2) 2
Any one of the eight components of the interaction can be used to partition the treatment
combinations into three sets. If the (ABCD2) component is selected as the defining contrast, the three sets contain the treatment combinations that satisfy aj + bk + cl + 2dm = z (mod 3) for z = 0, 1, and 2.
Only one of the sets of defining contrasts, (ABCD2)0, (ABCD2)1, or (ABCD2)2, is used in the
experiment. A table of random numbers can be employed to determine which of these three
sets is used. Assume that the defining contrast (ABCD2)0 has been selected. A CRFF-34−1
design has 81 treatment combinations, but this number can be reduced to 81/3 = 27 by the use
of (ABCD2)0.
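The reduction from 81 to 27 combinations can be checked by enumerating the combinations that satisfy (ABCD2)0, that is, aj + bk + cl + 2dm = 0 (mod 3) (a sketch, not from the text):

```python
from itertools import product

# keep the combinations for which a_j + b_k + c_l + 2*d_m = 0 (mod 3)
kept = [t for t in product(range(3), repeat=4)
        if (t[0] + t[1] + t[2] + 2 * t[3]) % 3 == 0]
len(kept)  # 27, one third of the 81 combinations
```

Each choice of levels for A, B, and C determines exactly one level of D that satisfies the relation, which is why one third of the combinations remain.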
The pattern of aliases for treatments that have three levels can be determined by procedures
similar to those discussed in Sections 15.9 and 16.2. If X and Y symbolize the defining contrast
and treatment (interaction), respectively, the alias pattern is given by (X) (Y) and (X)(Y)2,
reduced modulo 3. For example, the aliases of treatment A are (X)(Y) = (ABCD2)(A) = (A2BCD2) and (X)(Y)2 = (ABCD2)(A)2 = (A3BCD2) = (BCD2), reduced modulo 3.
Recall from Section 15.9 that the power of the first term in (A2BCD2) must equal 1 to uniquely
define the interaction component. This is achieved by squaring (A2BCD2) and reducing it
modulo 3. The alias pattern for this CRFF-34−1 design in which (ABCD2) is the defining
contrast is given in Table 16.4-1. It is apparent that no treatment is aliased with other treatments
or two-treatment interaction components, but two-treatment interaction components are aliased
with other two- or three-treatment interaction components. Hence, the design is of resolution IV,
RIV. Treatments and two-treatment interaction components that are not aliased with other
treatments or two-treatment interaction components are considered measurable. An asterisk
designates the measurable effects in Table 16.4-1.
Table 16.4-1 ▪ Alias Pattern for CRFF-34−1 Design With (ABCD2) as the Defining Contrast
The computational procedures for the design are simplified by ignoring one of the treatments—
say, treatment D. It should be apparent from Table 16.4-1 that all main effects and interaction
components can be obtained from summary tables that involve only treatments A, B, and C.
The procedures described in Section 15.8 must be used to compute the interaction-component
sums of squares. The data in Table 16.4-2 that I used to illustrate the computations have one
observation in each of the 27 treatment combinations. The levels of treatment D that occur in
each cell of the ABC summary table are included for illustrative purposes only; this information
is not used in the analysis. The reader can verify that the treatment combinations in the 27 cells
satisfy the relation x1 + x2 + x3 + 2x4 = 0 (mod 3), where x1, x2, x3, and x4 denote the levels of treatments A, B, C, and D, respectively.
This example does not provide a within-cell error term. If higher-order interaction components
can be assumed to be insignificant, they can be pooled to form a residual error term.
The results of the analysis are summarized in Table 16.4-3. The residual error term is based on
only six degrees of freedom. If a larger number of degrees of freedom is desired, all of the remaining two- and three-treatment interaction components, with the exception of those aliased with treatments, can be pooled with the residual error term.
If the (CD) component, for example, had been significant, a researcher would be faced with the
problem of how to interpret this result. Several courses of action can be pursued in carrying out
follow-up experiments designed to clarify the interpretation of aliased effects. The researcher
could choose to complete the experiment by running the remaining two-thirds of the treatment
combinations. Another alternative would be to carry out a complete factorial experiment
involving only treatments B, C, and D under the assumption, supported by this analysis, that
treatment A is of no consequence.
Comparisons among the means for treatments B and C can be carried out following the
procedures in Section 9.5 for a completely randomized factorial design. An alternative to this
CRFF-3k–1 design uses a Latin square as the building block design. This design, which is
described by Winer et al. (1991, chap. 9), uses three balanced Latin squares, with n subjects
assigned to each cell. I describe fractional factorial designs that are based on a Latin square in
Sections 16.9 through 16.13.
The cell means model can be used to describe the data for a completely randomized fractional
factorial design. For example, the model for the four-treatment, completely randomized fractional factorial design is Yijklm = μjklm + εi(jklm),
where Yijklm denotes an observation for the ith experimental unit in treatment combination
ajbkcldm, μjklm is the population mean for the jklmth treatment combination, and εi(jklm) is the
error effect that is NID(0, σε2). When there are no missing observations or empty cells, the
design has N = npqr observations.
In Table 16.3-1, I illustrate the classical sum-of-squares computational procedures for the
CRFF-24−1 design. Here I use the same data to illustrate the cell means model computational
procedures. The procedures are almost identical to those described in Section 9.13 for a
completely randomized factorial design. As in the classical sum-of-squares computational
approach, I ignore treatment D and treat the design as if it has three treatments. The null
hypotheses for treatments A, B, and C in Table 16.3-1 are expressed with hypothesis matrices,
H′, and vectors as follows:
Coefficient matrices, C′, for computing the treatment and interaction sums of squares are
obtained from Kronecker products, ⊗, of the hypotheses matrices and sum vectors, 1′, as
follows:
where
N = (n)(p)(q)(r) = (2)(2)(2)(2) = 16 and (X′X)−1 is an h × h diagonal matrix with diagonal elements that
are equal to the reciprocal of the number of observations in each treatment combination. The
order of the means in μ′ determines the order of the Kronecker products. The cell means in μ′
are ordered first by the two levels of treatment C, then by the levels of treatment B, and finally
by the levels of treatment A. The Kronecker products appear in the reverse order: first
treatment A, then B, and finally C.
where J is an N × N matrix of 1s. These sums of squares are the same as those in Table 16.3-1
where the classical sum-of-squares approach is used.
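The Kronecker-product construction can be sketched as follows for the two-level case with n = 2 observations per cell and the means ordered with treatment C varying fastest; the cell means passed to `ss` are placeholder values of my own, not the data of Table 16.3-1:

```python
import numpy as np

h = np.array([[1, -1]])    # hypothesis matrix H' for a two-level treatment
one = np.array([[1, 1]])   # sum vector 1'

# Kronecker products in the order A, then B, then C (C varies fastest in mu')
C_A  = np.kron(np.kron(h, one), one)
C_B  = np.kron(np.kron(one, h), one)
C_AB = np.kron(np.kron(h, h), one)

n = 2
XtX_inv = np.eye(8) / n    # (X'X)^-1: reciprocals of the cell n's on the diagonal

def ss(Cp, mu_hat):
    """SS = (C'mu)' [C'(X'X)^-1 C]^-1 (C'mu)."""
    Cmu = Cp @ mu_hat
    return float(Cmu @ np.linalg.inv(Cp @ XtX_inv @ Cp.T) @ Cmu)

# Illustration with hypothetical cell means: a pure A effect of +/-1 about 1
mu_hat = np.array([2.0, 2, 2, 2, 0, 0, 0, 0])
print(ss(C_A, mu_hat))     # 16.0
```

Because these hypothetical means differ only across the levels of A, the B and AB sums of squares computed the same way are zero, which is a convenient check on the ordering of the Kronecker products.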
The cell means model computational procedures for a fractional factorial design are essentially
the same as those for the completely randomized factorial design in Section 9.13. An
advantage of the cell means model computational approach over the classical sum-of-squares
approach is that it can be used when the cell ns are unequal or when one or more cells are
empty. When the cell ns are unequal, a researcher also has a choice, as discussed in Section
9.14, of testing hypotheses for unweighted means or for weighted means. The computational
procedures for the case in which the cell ns are unequal or one or more cells are empty are
described in Section 9.14.
The building block for the fractional factorial designs described in the previous sections is a
completely randomized design. The designs described in this section use a randomized block
design as the building block. A randomized block fractional factorial design, RBFF-pk–i, is
constructed by first using a defining contrast to determine which treatment combinations are
included in the design. Then a confounding contrast is used to assign the treatment
combinations to groups (blocks). A randomized block fractional factorial design can be laid out
in two groups, four groups, and so on by means of the confounding procedures described in
Chapter 15. No new principles are involved in these designs. The assumptions underlying an
RBF-pq design (see Sections 10.5 and 10.7) also are required for an RBFF-pk–i design.
I use an RBFF-25−1 design with one defining contrast to illustrate the general procedures for
assigning treatment combinations to groups. The first step is to choose the defining contrast.
As is often done, I let the highest-order interaction, (ABCDE)z, where z = 0 or 1, be the defining
contrast. The 32 treatment combinations of a complete factorial design can be reduced to 16 by
using either defining contrast (ABCDE)0 or (ABCDE)1. To assign the 16 treatment combinations
to two groups of eight combinations each, it is necessary to confound an interaction other than
the defining contrast with between-groups variation. The interaction selected as the
confounding contrast should be one that is thought to be insignificant. Procedures for
confounding an interaction with groups are described in Section 15.2. If (ABCDE)0 is used as
the defining contrast and the AB interaction is confounded with between-groups variation, I
obtain the design shown in Table 16.6-1. In this design, the treatment combinations in group 0
satisfy two relations: x1 + x2 + x3 + x4 + x5 = 0 (mod 2) and x1 + x2 = 0 (mod 2).
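The construction can be checked by enumeration (a sketch with levels coded 0 and 1; the selection rule follows from (ABCDE)0 and the grouping rule from the confounded AB interaction):

```python
from itertools import product

# (ABCDE)0: keep the 16 combinations with x1+x2+x3+x4+x5 = 0 (mod 2);
# confounding contrast AB: x1+x2 (mod 2) assigns combinations to groups.
fraction = [c for c in product(range(2), repeat=5) if sum(c) % 2 == 0]
group0 = [c for c in fraction if (c[0] + c[1]) % 2 == 0]
group1 = [c for c in fraction if (c[0] + c[1]) % 2 == 1]

print(len(fraction), len(group0), len(group1))   # 16 8 8
```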
Table 16.6-1 ▪ Layout of RBFF-25−1 Design in Two Groups With (ABCDE)0 as the Defining
All treatments and interactions except (AB), its alias (CDE), and the defining contrast
(ABCDE)0 are within-groups effects. The analysis of the five-treatment experiment is carried out
as if the experiment were a complete four-treatment experiment. The alias pattern and
computational formulas for the design are shown in Table 16.6-2. An examination of the alias
pattern reveals that the design is of resolution V, RV.
Table 16.6-2 ▪ Alias Pattern and Computational Formulas for RBFF-25−1 Design
For computational purposes, treatment E can be ignored. Only summary tables involving
treatments A, B, C, and D are required for the analysis. All main effects are aliased with four-
treatment interactions. If it can be assumed that the two- and three-treatment interactions are
insignificant, they can be pooled to form a residual error term with nine degrees of freedom.
This design is not satisfactory if a researcher is interested in evaluating two- and three-
treatment interactions.
If an experiment contains six treatments, a one-half fractional factorial design can be laid out in
blocks of size 8. In this design, only 32 of the 64 treatment combinations are included in the
experiment. To lay out the 32 treatment combinations in blocks of size 8, it is necessary to
confound two interactions with between-groups variation as described in Section 15.9. This
design permits a researcher to evaluate all two-treatment interactions except one. Higher-order
interactions are pooled to form a residual error term.
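The same enumeration idea extends to the six-treatment case. In the sketch below, the particular pair of confounded interactions, ABC and CDE, is my illustrative choice (their generalized interaction ABDE is then also confounded with groups); Section 15.9 governs the actual selection:

```python
from itertools import product
from collections import Counter

# One-half fraction of a 2^6 design via (ABCDEF)0 (even-sum combinations).
fraction = [c for c in product(range(2), repeat=6) if sum(c) % 2 == 0]

# Hypothetical confounding contrasts ABC and CDE split the 32
# combinations into four blocks of size 8.
def group(c):
    return ((c[0] + c[1] + c[2]) % 2, (c[2] + c[3] + c[4]) % 2)

sizes = Counter(group(c) for c in fraction)
print(sorted(sizes.values()))   # [8, 8, 8, 8]
```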
A restricted cell means model can be used to represent data for a randomized block factorial
design. The computational procedures described in Section 10.8 for a randomized block
factorial design generalize to randomized block fractional factorial designs. No new procedures
are required.
Fractional factorial designs in which all treatments have four levels can be easily constructed
from 2k designs. Cochran and Cox (1957, pp. 273–274) and Montgomery (2009, pp. 382–385)
describe procedures for laying out these designs. Johnson and Leone (1964, p. 216) describe a
Designs with mixed treatments at two and three levels present special problems with respect to
layout and analysis. Montgomery (2009, pp. 378–385) and Kempthorne (1952, p. 419) discuss
some of the problems inherent in these designs. Connor and Young (1961) present plans
based on 2k–1 × 3l–1 designs, with k and l equal to one through nine treatments. Addelman
(1963) describes general procedures for constructing complex fractional factorial designs.
Other useful information sources for fractional factorial designs are G. E. P. Box et al. (2005),
Giesbrecht and Gumpertz (2004), Montgomery (2009), and T. P. Ryan (2007).
The classic application of a Latin square design in agricultural research involves one treatment
with nuisance variables assigned to the rows and columns of the square. If the Latin square
contains two treatments and one nuisance variable, it is called a Latin square confounded
factorial design. If the variables assigned to rows and columns represent treatments instead of
nuisance variables, the design is called a Latin square fractional factorial design (LSFF design).
Recall that the term factorial experiment refers to the simultaneous evaluation of two or more
crossed treatments. Thus, the Latin square in Figure 16.8-1 may be called a Latin square
design or a Latin square confounded factorial design or a Latin square fractional factorial
design, depending on the nature of the variables assigned to the rows and columns.
If A, B, and C represent three treatments, I show here that a standard 3 × 3 Latin square corresponds to a one-third fraction of a 33 factorial design with (AB2C2)0 as the defining contrast. The treatment combinations of the fractional factorial design are as follows:
where the numbers refer to the level of treatments A, B, and C, respectively. A comparison of
these treatment combinations with those in Figure 16.8-1 reveals that they are identical. Thus,
a standard 3 × 3 Latin square corresponds to the treatment combinations in a one-third fraction
of a 33 factorial design with (AB2C2)0 as the defining contrast. A complete 33 factorial design
contains 27 treatment combinations. The Latin square, which is a one-third fractional factorial
design, contains only 9 of these combinations.
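The correspondence can be verified by enumeration (a sketch; the particular square produced here may be a row or column rearrangement of the one in Figure 16.8-1):

```python
from itertools import product

# (AB2C2)0: keep combinations with x1 + 2*x2 + 2*x3 = 0 (mod 3)
cells = [c for c in product(range(3), repeat=3)
         if (c[0] + 2 * c[1] + 2 * c[2]) % 3 == 0]

# Arrange as a square: rows = levels of A, columns = levels of B,
# entries = the level of C paired with that (a, b) cell.
square = [[next(c3 for a, b, c3 in cells if a == i and b == j)
           for j in range(3)] for i in range(3)]

# Latin property: each C level occurs once per row and once per column
rows_ok = all(sorted(r) == [0, 1, 2] for r in square)
cols_ok = all(sorted(col) == [0, 1, 2] for col in zip(*square))
print(len(cells), rows_ok, cols_ok)   # 9 True True
```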
In Section 8.2, I noted that a 3 × 3 Latin square has 12 arrangements. These 12 arrangements
can be generated by the family of 12 defining contrasts:
A 2 × 2 Latin square has two arrangements. These arrangements can be obtained from the
defining contrasts (ABC)0 and (ABC)1. A 2 × 2 Latin square corresponds to a one-half fractional
factorial design. The use of hyper-squares, where the maximum number of orthogonal squares
is superimposed on a Latin square, is an extreme form of fractionation. It can be shown that
these designs represent the smallest fractional factorial design in which treatments are not
aliased with other treatments.
The negative 1 in the designation LSFF-33−1 is redundant because by its nature, a 3 × 3 Latin
square is a one-third fractional factorial design. I denote this design by LSFF-33.
To draw valid inferences from experimental designs constructed from Latin squares, interactions
among row, column, and square variables must equal zero. This requirement is much more
likely to be met if the row and column variables represent nuisance or classification variables
than if they represent additional treatments as in the case of a fractional factorial design. In
Chapter 14, which deals with Latin square designs, this additivity requirement was stated
without explanation. I can use the alias concept to explain the problem posed by nonadditivity.
Consider the 3 × 3 Latin square in Figure 16.8-1. This square can be constructed by using the
(AB2C2)0 component of the three-treatment interaction as the defining contrast. The aliases
associated with treatment A are
Thus, treatment A is aliased with the (BC) and (ABC) components of the two- and three-treatment interactions, respectively. Similarly, treatment B is aliased with the (AC2) and (ABC2) components, and treatment C is aliased with the (AB2) and (AB2C) components.
A general principle can be stated with respect to the use of Latin square designs or Latin
squares as building blocks for more complex designs. Treatments are always aliased with
interaction components. Thus, to interpret F statistics for treatments that are significant, it is
necessary to assume that the aliased interactions (interaction components) are equal to zero.
In large Latin squares, a relatively small portion of all two-treatment interaction components is
aliased with treatments. Therefore, significant two-treatment interaction effects bias tests of
treatments less in large Latin squares than in small Latin squares. The situation is more
complex if a within-cell error term is not available. Under these conditions, a residual error term
is sometimes obtained by pooling all interaction mean squares not aliased with treatments. If
the pooled two- and three-treatment interaction components are not zero, they negatively bias
tests of treatments.
Subsequent sections of this chapter describe a number of fractional factorial designs that use a
Latin square as the building block design. Additional examples of Latin square fractional
factorial designs can be found in the thorough coverage by Winer et al. (1991, chap. 9).
The layout of Latin square fractional factorial designs is much simpler than the layout of
fractional factorial designs based on completely randomized and randomized block designs.
The latter designs are generally restricted to experiments that have two or three levels of each
treatment; LSFF designs do not have this restriction. In addition, LSFF designs are not limited
to experiments in which the number of levels of each treatment is a prime number. The use of
mixed levels for treatments poses no computational problem if, of course, at least three of the
treatment or classification variables have the same number of levels. The number of levels of
rows, columns, and Latin letters that make up the square must be equal. I want to emphasize
again that the use of a Latin square as a building block design requires a highly restrictive set
of assumptions with respect to interactions: All interaction effects are assumed to equal zero.
If interactions among variables that make up the Latin square are not zero, F statistics are
positively biased by the interactions.
If the design illustrated in Figure 16.8-1 of the previous section contains three treatments, it is
classified as a fractional factorial design. It is assumed that the p2 = 9 cells of the square in
Figure 16.8-1 contain nine random samples of n subjects (n > 1) from a common population.
Computational procedures for this LSFF-p3 design are identical to those for a regular Latin square
design and are described in Section 14.3.
A LSFF-p3 design can be easily modified for research situations in which it is possible to use
matched subjects or repeated measures on the same subjects. The modified design is
diagrammed in Figure 16.9-1 and is designated by the letters LSFF-p.p2. In this design,
treatment A is a between-blocks treatment, whereas B and C are within-blocks treatments. This
is indicated in the designation scheme by placing a dot after the between-blocks treatment. I
use the same designation scheme for split-plot factorial designs. The use of the letter p in the
designation for this design indicates that treatments A, B, and C must have the same number
of levels. If repeated measurements are obtained, the design requires p random samples of n
subjects from a common population. The p samples are randomly assigned to the levels of
treatment A. The sequence of administration of the bkcl treatment combinations is randomized
independently for each subject. If matched subjects are used, p random samples of n sets of p
matched subjects are randomly assigned to the levels of treatment A. Then the p subjects
within each matched set are randomly assigned to the bkcl treatment combinations.
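The repeated-measures randomization just described can be sketched as follows; the subject labels, the seed, and the use of an (AB2C2)0 square to determine which bkcl combinations accompany each level of A are my illustrative assumptions:

```python
import random

random.seed(1)       # for a reproducible illustration
p, n = 3, 2          # levels per treatment; subjects per sample

# p random samples of n subjects, randomly assigned to levels of treatment A
subjects = [f"s{i}" for i in range(p * n)]
random.shuffle(subjects)
samples = {a: subjects[a * n:(a + 1) * n] for a in range(p)}

# bkcl combinations occurring with level a of A in a square built from (AB2C2)0
bc_combos = {a: [(b, (a + 2 * b) % 3) for b in range(p)] for a in range(p)}

# randomize the order of administration independently for each subject
schedule = {}
for a, subs in samples.items():
    for s in subs:
        order = bc_combos[a][:]
        random.shuffle(order)
        schedule[s] = (a, order)
```

Each subject thus receives all p of the bkcl combinations paired with his or her level of A, in an independently randomized order.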
The layout of the LSFF-3.32 design and computational formulas is shown in Table 16.9-1.
The analysis of the data is summarized in Table 16.9-2. According to the summary, the null
hypothesis can be rejected for treatment C but not for treatments A and B. A partial check on
the assumption that all interactions among treatments are insignificant is given by
In this example, it seems safe to conclude that interactions among the treatments are zero.
Under this condition, the researcher may want to pool MSRES with MSBC × BL(A) to obtain a
better estimate of error variance. Note that the test of treatment A is less powerful than the tests
of treatments B and C. I have made the same point with respect to the between-blocks
treatments of split-plot factorial designs in Chapter 12.
In Section 14.3, an example involving a road test of four automobile tires was used to illustrate
a possible application of a Latin square design. The levels of A, B, and C corresponded,
respectively, to the rubber compounds used in the tire construction, wheel positions, and
automobiles. Two road tests using the same four automobiles were run, one after the other. In
the analysis described in Section 14.3, the data were treated as if the two replications of the
experiment were carried out under identical conditions. This is unrealistic because the two tests
were separated by an interval of time and the cars were older during the second test. The
within-cell error term included not only random error but also variation attributable to
temperature and other climatic changes as well as variation associated with mechanical wear of
the automobiles. A better alternative design that is described next treats the two replications of
the road test like two levels of a nuisance variable (D).
Figure 16.10-1 ▪ (a) Standard 4 × 4 Latin square and (b) diagram of an LSFF-432 design
based on the 4 × 4 Latin square.
The terms in the model, with the exception of δm, (αδ)jm, (βδ)km, and (γδ)lm, are defined in
Section 16.9. The term δm refers to the treatment effect that is a constant for all subjects within
population m. This design permits a researcher to evaluate the AD, BD, and CD interactions.
Each cell of the design that is diagrammed in Figure 16.10-1 contains one observation. The
design can be easily modified for the case in which each cell contains n observations. The
required modifications are described in a subsequent paragraph.
The layout of the LSFF-432 design, computational tables, and formulas is shown in Table
16.10-1. The analysis is summarized in Table 16.10-2. According to the analysis in Table 16.10-
2, the effects of rubber compounds, wheel positions, and replications are significant. The
superiority of the present analysis, in which the two road tests are treated as a fourth variable,
relative to the analysis for the LS-4 design that is summarized in Table 14.3-2 is readily
apparent. In the latter analysis, the within-cell variation includes, in addition to random error,
the effects of the replications (D).
If a researcher believes that only one treatment is likely to interact with the other treatments, an
LSFF-p3t design is a better choice than a complete factorial design. The LSFF-432 example
employed the same Latin square for both levels of treatment D. If it suits the researcher's
purpose, balanced sets of squares or independently randomized squares can be used. The
numbers of levels of treatments A, B, and C must be equal. The only restriction on the number
of levels of treatment D is that there must be at least two levels.
Comparisons among means follow the general procedures described in Sections 9.5 and 14.6.
The analysis shown in Table 16.10-1 can be modified for the case in which more than one
observation is obtained in each cell. An ABCDS summary table must be constructed. The
following modifications of the computational symbols as illustrated for [A], [CD], and [ABCD]
are required:
The LSFF-p3t design can be modified for the case in which matched subjects or repeated
measures on the same subjects are obtained. A diagram of this design is shown in Figure
16.10-2. The design requires pt random samples, each containing n blocks of subjects, if repeated measures are obtained, or pt random samples, each containing n blocks of p homogeneous subjects, if matched subjects are used. In the former case, administration of
the bkcl combinations is randomized independently for each subject.
Computational formulas and degrees of freedom for this design are given in Table 16.10-3.
The design described in this section is appropriate for experiments that have five treatments. If
four of the treatments have p levels each, the fifth treatment must have p2 levels. For example,
if the four treatments have two levels each, the fifth treatment must have 22 = 4 levels. A
diagram of an LSFF-244 design is shown in Figure 16.11-1. The design corresponds to a one-
fourth replication of a CRF-22224 design.
Figure 16.11-1 ▪ (a) Standard 4 × 4 Latin square and (b) treatment combinations of an
The Greek letter ηo denotes one of the u levels of treatment E. It is assumed that all
interactions that do not appear in the model are zero. The design requires np4 subjects who
are randomly assigned to the p4 cells. Computational formulas and degrees of freedom for this
design are shown in Table 16.11-1.
A Graeco-Latin square can be used as a building block for fractional factorial designs. The
GLSFF-33 design diagrammed in Figure 16.12-1 is appropriate for experiments that use
repeated measures or matched subjects. Group z, where z = 1,…, p, denotes groups of n
blocks of subjects that are randomly assigned to the rows of the square. There are p levels of
gz. This design requires p random samples of n subjects if repeated measures are obtained or
np sets of p matched subjects if matching is used. In the latter case, the matched subjects
within a block are randomly assigned to the ajcl treatment combinations. If repeated measures
are obtained, the sequence of administration of the ajcl combinations is randomized
independently for each block.
where ζz and πi(z) denote, respectively, the effect of group z and the effect of subject i that is
nested within group z. The design requires the same number of observations as the LSFF-p.p2
design in Table 16.9-1. The advantage of a GLSFF-p3 design relative to an LSFF-p.p2 design is
that all effects in the former design are within-subjects effects. Computational formulas for a
GLSFF-p3 design are given in Table 16.12-1. It is assumed that the treatments represent fixed
effects and the blocks are random effects.
This chapter describes fractional factorial designs that are based on three building block
designs: completely randomized, randomized block, and Latin square designs. Their chief
advantages are as follows:
1.They permit the evaluation of a large number of treatments but employ only a fraction of the
total number of treatment combinations.
2.Fractional factorial designs lend themselves to sequential research in which flexibility in the
pursuit of promising lines of investigation is desirable.
1.The interpretation of the statistical analysis is complicated because treatments are aliased
with interactions. Fractional factorial designs require the assumption of zero interactions
among some or all treatments. This assumption is often unrealistic in the behavioral, social,
and medical sciences and education.
2.The requirement that all or most treatments have the same number of levels restricts the
design's usefulness.
3.The layout and computational procedures for fractional factorial designs are more complex
than those for other factorial designs.
1.Terms to remember:
a.alias (16.1)
b.screening experiment (16.1)
c. treatment-interaction confounding (16.1)
d.defining contrast (16.2)
e.resolution (16.2)
f. measurable interaction component (16.4)
g.additivity requirement (16.8)
2.[16.1] What is the distinguishing characteristic of fractional factorial designs?
3.[16.1] Compare the confounding in split-plot factorial, confounded factorial, and fractional
factorial designs.
*4.[16.2]
*a.Determine the alias pattern for a CRFF-25−1 design with (ABCDE) as the defining
contrast.
*b.What is the resolution of this design?
*5.[16.2]
*a.Determine the alias pattern for a CRFF-25−2 design with (ABC) and (CDE) as the
defining contrasts.
*b.What is the resolution of this design?
6.[16.2]
a.Determine the alias pattern for a CRFF-25−2 design with (ABC) and (BDE) as the
defining contrasts.
b.What is the resolution of this design?
*7.[16.3] Determine the treatment combinations in a CRFF-25−1 design with (ABCDE)0 as the defining contrast.
8.[16.3] Determine the treatment combinations in a CRFF-25−2 design with (ABC)0 and (CDE)0 as the defining contrasts.
*9.[16.3] The effectiveness of a new medication for motion sickness was compared with that for
dimenhydrinate, which is widely used to prevent nausea and vomiting. The medications
represented the two levels of treatment A and were denoted, respectively, by a0 and a1.
Thirty-two male volunteers who reported that they were susceptible to motion sickness
participated in a 3-hour bus ride over hilly terrain. The subjects in treatment condition b0
had an unrestricted view of the road; those in treatment condition b1 were able to see the
road directly ahead, but their view to either side was restricted. The subjects in treatment
*10.
[16.4] For a CRFF-35−1 design, indicate the aliases for treatment A.
*a.Defining contrast is (ABCDE).
*b.Defining contrast is (ABCDE2).
*11.
[16.4] For a CRFF-35−2 design, indicate the aliases for treatment A.
*a.Defining contrasts are (ABC) and (BCD). (Hint: Don't forget the aliases of the two
generalized interactions.)
b.Defining contrasts are (ABC) and (CDE).
*12.
[16.2 and 16.4] Indicate the number of aliases of a treatment for the following designs.
*a.CRFF-24−1 with (ABC) as the defining contrast.
j. CRFF-53−2 with (AB2) and (BC3) as the defining contrasts. (Note: This design has four generalized interactions.)
17.
[16.9] Sixteen 10-year-olds read two stories over a period of 2 weeks. The main character in
one of the stories was a boy (treatment condition b0); the main character in the other story
was a girl (treatment condition b1). The age of the main character was given as either 10
years (treatment condition c0) or 14 years (treatment condition c1). Treatment A was the
gender of the participant in the experiment; boys were designated by a0 and girls by a1.
The treatment combinations of a 2 × 2 Latin square based on (ABC)0 as the defining
contrast were used in the experiment. The sequence of administration of the bkcl
combinations was randomized independently for each child. After reading a story, a child
responded to 20 questions that were designed to assess his or her identification with the
main character in the story. Two examples are, “Would you like to be (character's name)?”
and “Would you like to do the things (character's name) did?” The following data, number
of yes answers, for this LSFF-2.22 design were obtained:
*18.
[16.10] Thirty-six pilots made simulated approaches and landings using the standard
red/white two-bar Visual Approach Slope Indicator, VASI (treatment level a0); the three-bar
VASI (treatment level a1); and the Austrian T-VASIS (treatment level a2). The pilots flew
Convair 580 aircraft simulators with a computer-generated visual system simulating a
slightly reduced nighttime airport scene (treatment level c0), moderately reduced scene
(treatment level c1), and very reduced scene (treatment level c2). Treatment B was the
glide path: 2° (treatment level b0), 3° (treatment level b1), and 4° (treatment level b2). Two
580 simulators were used; they are denoted by d0 and d1. The dependent variable was a
weighted measure of accuracy; the higher the score, the more accurate an approach and
landing. The 36 pilots were randomly assigned to the treatment combinations of two 3 × 3
Latin squares based on (ABC)0 as the defining contrast. Two pilots were assigned to each
combination. The following data for this LSFF-332 design were obtained:
*a.Test the null hypotheses for treatments and interactions; let α = .05.
*b.Does the assumption of no interactions among treatments A, B, and C appear to be
tenable? Explain.
*c.Use the Fisher-Hayter statistic to test all two-mean null hypotheses for treatments A and
C.
d.Prepare a “results and discussion section” for Human Factors.
*19.
[16.5] Exercise 9 describes an experiment to evaluate a new medication for preventing
motion sickness.
*a.Write the cell means model for this design.
*b.Specify with matrix notation the hypotheses, H′μ= 0, that are used to compute the C′
coefficient matrices. Ignore treatment E when you compute the sums of squares.
*c.Specify μ′ for the cell means and the C′ coefficient matrices.
*d.Perform the analysis using the cell means model approach; let α = .05. This problem
requires a computer with a matrix package.
*The reader who is interested only in the classical sum-of-squares approach can, without loss
of continuity, omit this section.
http://dx.doi.org/10.4135/9781483384733.n16