
I.6 Statistical Models For GenotypeEnvironment

The document discusses the statistical analysis of genotype-by-environment interaction (GEI) in plant breeding, emphasizing the importance of understanding how different genotypes perform under varying environmental conditions. It outlines the need for analytical tools to model GEI effectively, using examples from maize breeding programs to illustrate the complexities involved. The document also introduces various statistical models to analyze GEI and the significance of environmental characterizations in breeding decisions.


I.6 The statistical analysis of multienvironment data: modelling genotype-by-environment interaction and its genetic basis

R Okono
Marcos Malosetti (WUR, The Netherlands), Jean-Marcel Ribaut (GCP, Mexico)
and Fred A van Eeuwijk (WUR, The Netherlands)

Introduction: phenotype, genotype and environment

The success of a plant breeding programme is reflected in its ability to provide farmers with better phenotypes under a range of environmental conditions. A good, or better phenotype can be defined in terms of higher production at harvest, or higher quality. The aim of a plant breeder is to develop genotypes with guaranteed good phenotypes. To achieve this aim, it is necessary to have an understanding of the causes behind a good phenotype.

In general, we tend to think of a phenotype from a static rather than a dynamic perspective. Thus, the phenotype is regarded as the state of a trait at a given moment in time. There are good reasons for this since, in general, we are primarily interested in phenotypes such as grain weight at maturity and not grain weight before maturity. However, it is important to consider that the final state of a trait is the cumulative result of a number of causal interactions between the genetic make-up of the plant (the genotype) and the conditions in which that plant developed (the environment). This is shown in Figure 1, with the phenotype building up in time from the close interaction between genotype and environment.

Figure 1. The phenotype develops over time as the outcome of cumulative causal interactions between genotype and environment. The arrow originating from the genotype represents genetic information, say gene-expression, while the arrow representing the environmental flux refers to resources and hazards (eg, water availability, physical and chemical soil characteristics, incoming radiation, frost, heat, etc).

Plants differ in the efficiency and adequacy with which they capture and convert the environmental inputs and stimuli into the tissues that constitute a product. A plant’s capture and conversion abilities are determined by the particular ensemble of genes present in the plant. Environments differ in the amount and quality of inputs and stimuli that they convey to plants including, eg, the amount of water, nutrients or incoming radiation. A primary objective in plant breeding is to match genotypes and environments in such a way that improved phenotypes are created. For example, a breeder might be interested in selecting genotypes that do well under water stress conditions.

While there can be genotypes that do well across a wide range of conditions (widely adapted genotypes), there are also genotypes that do relatively better than others exclusively under a restricted set of conditions (specifically adapted genotypes). Specific adaptation of genotypes is closely related to the phenomenon of genotype-by-environment interaction (GEI). GEI exists whenever the relative phenotypic performance of genotypes depends on the environment, or in other words, when the difference in performance between genotypes varies in dependence on the environment.

To illustrate the phenomenon of GEI, we can consider two different genotypes that differ in the genetic machinery involved in tolerance to water-limited conditions, while being equal for all other characteristics. If these two genotypes are exposed to a poorly watered environment, their performance will differ depending on the genetic properties related to tolerance of water-limited conditions. However, this genotypic difference will disappear in an environment that provides the right amount of water. So, the difference in performance between the two genotypes thus depends on the environment, through the amount of water that it provides.
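The water-stress illustration can be made concrete with a small numerical sketch. The yields below are invented for illustration only (they are not taken from the maize trials), and the genotype names, environment names and helper functions are hypothetical. The sketch computes the difference between the two genotypes in each environment and the change in that difference, which is zero when there is no GEI.

```python
# Hypothetical mean yields (t/ha) for two genotypes in two environments.
# All numbers are invented for illustration; they are not from the maize trials.
yields = {
    "tolerant":    {"water_limited": 3.0, "well_watered": 6.0},
    "susceptible": {"water_limited": 1.5, "well_watered": 6.0},
}


def genotype_difference(table, g1, g2, env):
    """Difference in performance between genotypes g1 and g2 in one environment."""
    return table[g1][env] - table[g2][env]


def interaction_contrast(table, g1, g2, env_a, env_b):
    """Change in the genotype difference between two environments.

    A value of zero means the two genotypes respond in parallel (no GEI).
    """
    return (genotype_difference(table, g1, g2, env_a)
            - genotype_difference(table, g1, g2, env_b))


diff_dry = genotype_difference(yields, "tolerant", "susceptible", "water_limited")
diff_wet = genotype_difference(yields, "tolerant", "susceptible", "well_watered")
contrast = interaction_contrast(yields, "tolerant", "susceptible",
                                "water_limited", "well_watered")

print("difference under water limitation:", diff_dry)
print("difference with adequate water:  ", diff_wet)
print("interaction contrast:            ", contrast)
```

With these invented numbers the genotypic difference is present under water limitation and vanishes with adequate water, so the contrast is non-zero, which is the pattern the paragraph above describes.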

Marcos Malosetti et al

Some scenarios that can occur when comparing the performances of pairs of genotypes across environments are presented in Figure 2. The function describing the phenotypic performance of a genotype in relation to an environmental characterisation is called the ‘norm of reaction’ (Griffiths et al. 1996). The upper left plot in Figure 2 shows the case where there is no GEI: the genotype and the environment behave additively (this will be developed later) and the reaction norms are parallel. The remaining plots show different situations in which GEI occurs: divergence, convergence, and the most critical one, crossover interaction. Crossover interactions are the most important for breeders as they imply that the choice of the best genotype is determined by the environment at which the genotype is targeted.

[Figure 2: four panels, each plotting phenotypic performance (axis from 0 to 8) against Env 1 and Env 2. Panel titles: No interaction = additivity of G and E; Divergence; Convergence; Crossover interaction.]
Figure 2. Genotype-by-environment interaction in terms of changing mean performances across environments.

Above, GEI was discussed in terms of the relative difference between genotypic means. From a different perspective, GEI can be regarded in terms of heterogeneity of genetic variance and covariance, or correlation. As a consequence of GEI, the magnitude of the genetic variance as observed within individual environments can change from one environment to the next. It is frequently observed that in comparatively poorer environments the genetic variance is lower than in better environments. Figure 3 illustrates the phenomenon of heterogeneity of genetic variance across environments, showing box plots for a series of maize trials, as described in greater detail below. In Figure 3, compare the smaller range of variation in the poor environments LN96a and LN96b with the larger range in the good environments HN96b and NS92a.

[Figure 3: box plots of yield (ton/ha, axis from 0 to 10) for the environments LN96a, LN96b, SS92a, SS94a, IS92a, IS94a, HN96b and NS92a.]
Figure 3. Boxplot for yields for a maize F2 population across eight environments. The box encloses observations between the 25th and 75th quantiles, with the lines extending to the minimum and maximum of observed values, except when extreme values occur, in which case outliers are indicated by crosses. Environment names are coded as: LN = low nitrogen, HN = high nitrogen, SS = severe water stress, IS = intermediate water stress, and NS = no water stress. The number indicates the year of the trial, and the letters a and b the cropping season: a = winter, b = summer.

GEI also has consequences for the correlations between genotypic performances in different environments.
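The variance and correlation view of GEI can be sketched numerically. The example below is a minimal sketch on invented data: the table, the environment names and the helper functions are hypothetical. The variance among genotype means within an environment is used only as a rough stand-in for the genetic variance. The numbers were chosen so that the two stress environments resemble each other and differ from the non-stress environment.

```python
from statistics import mean, variance

# Hypothetical genotype means (t/ha); rows = genotypes, columns = environments.
# Invented numbers for illustration; they are not from the maize trials.
environments = ["stress_1", "stress_2", "non_stress"]
table = [
    [1.0, 1.2, 6.0],
    [1.5, 1.6, 4.0],
    [2.0, 2.3, 8.0],
    [2.5, 2.4, 5.0],
]


def column(rows, j):
    """Values of all genotypes in environment j."""
    return [row[j] for row in rows]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Variance among genotype means within each environment.
variances = {env: variance(column(table, j)) for j, env in enumerate(environments)}

# Correlation between similar and between dissimilar environments.
r_similar = pearson(column(table, 0), column(table, 1))
r_dissimilar = pearson(column(table, 0), column(table, 2))

for env in environments:
    print(f"variance in {env}: {variances[env]:.2f}")
print(f"correlation stress_1 vs stress_2:   {r_similar:.2f}")
print(f"correlation stress_1 vs non_stress: {r_dissimilar:.2f}")
```

In this invented table the variance among genotypes is smallest in the stress environments, and the correlation between the two similar environments is much higher than that between a stress and a non-stress environment.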

Statistical models for GEI

When GEI is significant, the observed performance of a set of genotypes in one environment may not be very informative for the performance of the same genotypes in another environment. Environments with similar characteristics will induce similar responses which, in turn, result in higher genetic correlations. Figure 4 shows that the correlation between the similar environments IS92a and IS94a is larger than the correlation between the dissimilar environments IS92a and HN96b.

[Figure 4: scatter plot matrix of yields in pairs of environments.]
Figure 4. Scatter plot matrix for two stress environments (IS92a, and IS94a) and two non-stress environments (HN96b and NS92a). Environment names are coded as in Figure 3.

In conclusion, given the complexity of the mechanisms and processes underlying the phenotypic response across diverse and changing environmental conditions (frequently in an unpredictable way), it is necessary to develop analytical tools to help breeders understand GEI. The use of adequate strategies to analyse GEI is a first and important step towards more informed breeding decisions. Good analytical methods are a prerequisite for predicting the performance of genotypes as accurately as possible. This chapter explores several strategies to model GEI, starting with simple methods that have been historically popular within the plant breeding community. It then moves to more elaborate models in which additional information is used in the form of explicit environmental characterisation to model GEI. A final section is devoted to the integration of molecular marker information into GEI models, leading to the detection of quantitative trait loci (QTLs) and more specifically, to the modelling of QTL by environment interaction (QEI). The statistical methodology is illustrated using a maize data set obtained from a series of drought and nitrogen stress trials from the maize breeding program at CIMMYT (Centro Internacional de Mejoramiento de Maíz y Trigo; the International Maize and Wheat Improvement Center; Ribaut et al, 1996; 1997). To encourage readers to carry out these statistical analyses themselves, GenStat programs for the Discovery® version of this statistical package (Payne et al, 2003) are presented in Appendix I.

Generating data to study genotype-by-environment interaction

An obvious first step to investigate GEI is to obtain phenotypic observations on a set of genotypes exposed to a range of environmental conditions. The set of genotypes can include advanced lines of a breeding programme, and old and/or new cultivars. It can also consist of the segregating offspring generation of a specific cross such as F2, a backcross, or a recombinant inbred line (RIL) population.

Genotypes can be tested under different management regimes that represent increasing levels of a particular stress, or a combination of stresses. This type of experiment is called a ‘managed stress trial’ and is appropriate when the researcher wishes to focus on a particular type of stress. When performing managed
stress trials, it is important to control the system in such a way that all other factors influencing the phenotype are as homogeneous as possible. Stress type and level can be difficult to implement, because the relationship between the phenotype and the stresses is generally complex, with genes and environmental stresses interacting throughout the various developmental phases. A common way for plant breeders to screen for genotypic reactions to environmental factors is by ‘multienvironment trials’ (METs). In an MET, a number of genotypes are evaluated at a number of geographical locations for a number of years in the hope that the pattern of stresses that the genotypes experience is representative of the collection of future growing environments of the genotypes that are eventually selected.

A convenient way to summarise the data, either coming from managed stress trials or from METs, is to construct a two-way table of means, with genotypes in the rows and environments in the columns. Each cell of the table contains the mean of a particular genotype in a specific environment. To identify genotypes and environments unequivocally, we use indices, say the letter i for genotypes (i = 1…I), and letter j for environments (j = 1…J). The representation in Figure 5 shows a genotype-by-environment table of expected means, $\mu_{ij}$, with the dimensions of the maize example that is used in this chapter, with I = 211 genotypes, and J = 8 environments. The models that will be presented in the following sections will assume as a starting point a genotype-by-environment table of means as described above.

                        Environments j = 1 … J
Genotypes i = 1 … 211   LN96a    LN96b    SS92a    SS94a    IS92a    IS94a    HN96b    NS92a
G001                    μ1,1     μ1,2     μ1,3     μ1,4     μ1,5     μ1,6     μ1,7     μ1,8
G002                    μ2,1     μ2,2     …        …        …        …        …        μ2,8
…                       …                                                              …
G211                    μ211,1   μ211,2   μ211,3   μ211,4   μ211,5   μ211,6   μ211,7   μ211,8

Figure 5. Schematic layout of a two-way table of means, with the mean corresponding to genotype i in environment j in the cell identified by row i and column j.

CIMMYT maize drought stress trials: example data

The models to be presented in this chapter are illustrated using data produced by the maize drought stress breeding programme of CIMMYT. A brief description of the data is given here, with a more detailed description available in the original publications (Ribaut et al, 1996; 1997). A maize F2 population was generated by crossing a drought tolerant parent (P1) with a drought susceptible one (P2). Seeds harvested from each of 211 F2 plants were reserved as F3 families. The F3 families were evaluated in managed stress trials in 1992, 1994 and 1996. In the winter of 1992, a managed water stress trial was conducted in Mexico, including no stress (NS), intermediate stress (IS), and severe stress (SS). In the winter of 1994, a similar trial was conducted, but only included the IS and SS treatments. In the summer of 1996, the families were tested in a nitrogen stress trial with two levels: low (LN) and high nitrogen (HN). An extra LN trial was conducted in the winter of the same year. In total, the families were evaluated in eight different environments, each environment consisting of a year-stress treatment combination. DNA was extracted from each of the 211 F2 plants to produce a total of 132 restriction fragment length polymorphism (RFLP) markers covering the 10 maize chromosomes.

Models for genotype-by-environment interaction: modelling the mean

The additive model as a benchmark

The phenomenon of GEI is of primary interest in plant breeding, and has resulted in a large body of literature on models and strategies for analysis of GEI (see, for example, the reviews in Cooper and Hammer, 1996; Kang and Gauch, 1996; van Eeuwijk et al, 1996; van Eeuwijk, 2006). A dominant feature of strategies used to describe and understand GEI is a heavy reliance on parameters that are statistical rather than biological. This is not by coincidence, since historically, a large part of quantitative genetics has relied on simple statistical models. A notorious example is the well known model: P = G + E, where P stands for phenotype, G for genotype and E for environment (Falconer and Mackay, 1996; Lynch and Walsh, 1998). A statistical formulation of this model for a two-way table of means can be written as:

$$\underline{\mu}_{ij} = \mu + G_i + E_j + \underline{\varepsilon}_{ij} \qquad [1]$$
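As a rough illustration of how a two-way table of means is decomposed under the additive model (model 1), the sketch below fits the model to a small invented table. It assumes a balanced table, in which case the effects can be estimated from row and column averages. The table values, the genotype and environment labels and the function name `fit_additive` are hypothetical and do not come from the chapter or from GenStat.

```python
from statistics import mean

# Hypothetical two-way table of means (t/ha): rows = genotypes, columns = environments.
# Invented numbers for illustration; they are not from the maize trials.
genotypes = ["G1", "G2", "G3"]
environments = ["E1", "E2"]
table = [
    [2.0, 5.0],
    [3.0, 6.0],
    [4.0, 10.0],
]


def fit_additive(rows):
    """Fit mean = mu + G_i + E_j to a balanced table using row and column averages."""
    n_rows, n_cols = len(rows), len(rows[0])
    mu = mean(value for row in rows for value in row)
    g = [mean(row) - mu for row in rows]                      # genotypic main effects
    e = [mean(rows[i][j] for i in range(n_rows)) - mu         # environmental main effects
         for j in range(n_cols)]
    fitted = [[mu + g[i] + e[j] for j in range(n_cols)] for i in range(n_rows)]
    resid = [[rows[i][j] - fitted[i][j] for j in range(n_cols)] for i in range(n_rows)]
    return mu, g, e, fitted, resid


mu, g, e, fitted, resid = fit_additive(table)

# Under the additive model every genotype changes by the same amount, E_2 - E_1,
# when moving from the first to the second environment (parallel reaction norms).
fitted_changes = [row[1] - row[0] for row in fitted]

# What the additive model leaves unexplained (GEI plus error, not separable here).
residual_ss = sum(value ** 2 for row in resid for value in row)

print("mu:", mu)
print("genotypic effects:", dict(zip(genotypes, g)))
print("environmental effects:", dict(zip(environments, e)))
print("fitted change E1 -> E2 per genotype:", fitted_changes)
print("residual sum of squares:", residual_ss)
```

The fitted changes are identical for all genotypes, while the observed changes in the invented table are not; the difference ends up in the residual sum of squares.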


From here onwards, in the model formulations, random terms are underlined to emphasise the fact that their effects are assumed to follow a normal distribution. Model 1 describes the random mean of genotype i in environment j, $\underline{\mu}_{ij}$, as the result of the common fixed intercept term $\mu$, a fixed genotypic main effect corresponding to genotype i, $G_i$, plus a fixed environmental main effect corresponding to environment j, $E_j$, and finally the random term, $\underline{\varepsilon}_{ij}$, representing the error term, typically normally distributed, with a mean of zero and constant variance, $\sigma^2$; $\underline{\varepsilon}_{ij} \sim N(0, \sigma^2)$.

One remarkable feature of model 1 is that it predicts that for any genotype the change in phenotypic mean between two environments $j$ and $j^*$ will always be equal to $E_j - E_{j^*}$. As a consequence, the norms of reaction of different genotypes will be parallel (as in the upper left plot of Figure 2). Another important aspect is that, although the parameters in the model suggest that something intrinsically genetic and something intrinsically environmental is determining the trait, the genotypic and environmental effects purely follow from a convenient way of partitioning phenotypic variation from a statistical point of view. In a balanced data set, the genotypic main effects can be estimated simply from the average performance of the genotypes across environments. Rather than being something inherently genotypic, this is entirely dependent on the set of environments that were used in the experiment. If a few environments are dropped, the genotypic effects of a set of genotypes can be completely different. The same argument applies to the main environmental effects, which completely depend on the set of genotypes used in the experiment.

The results of the fit of an additive model to the maize data set are presented in Table 1. The results show that, according to the F test, there is a significant environmental and genotypic main effect (the F statistic for environments equals 1466.5, and for genotypes 5.3, both of which are highly significant: P < 0.001). As just mentioned, environments are characterised by the average performance of the genotypes in the particular environment, and the results indicate that the environments differ significantly in their quality. In general, differences between environmental main effects are significant, and from the breeder’s point of view this is not a major concern. Breeders want to concentrate on differences between genotypes. A significant genotypic main effect indicates that genotypes differ in their average performance across environments, something certainly more interesting to breeders. Finally, it should be mentioned that the residual in Table 1 corresponds to the discrepancy between the predicted genotype-by-environment means from an additive model and the observed means.

Table 1. ANOVA table for the additive model (model 1), as applied to CIMMYT maize stress trials. DF = degrees of freedom, SS = sum of squares, MS = mean squares, F = F statistic, P = cumulative upper probability associated to F.

Term     DF      SS      MS       F         P
E        7       5679    811.2    1466.5    < 0.001
G        210     614     2.9      5.3       < 0.001
ε        1470    813     0.6
Total    1687    7106    4.2

There are two reasons for the disagreement between the predicted values from an additive model and the observed means for environment-specific genotypic performances: (i) a specific effect related to the particular combination of genotype and environment; and (ii) experimental error. The obvious way to disentangle one cause from the other is to include an explicit term in the model for the effect of specific genotype-environment combinations. This term, which is called ‘GEI’ and is read as ‘genotype-by-environment interaction effect’, is double-indexed:

$$\underline{\mu}_{ij} = \mu + G_i + E_j + (GEI_{ij} + \underline{\varepsilon}_{ij}) \qquad [2]$$

We cannot separate GEI from error when using a two-way table of means, because for that we would need replicated observations on genotype-by-environment combinations. Therefore, both the GEI term and the error are written within brackets. When plot data are available, we can fit a full model, including replicates within trials (environments), to estimate all the parameters in model 2, that is, main effects and GEI effects. Use of model 2 implies estimation of as many parameters as there are genotype-by-environment combinations, something that is not desirable in the interest of parsimony. Another limitation of the model is that it is not possible to estimate the genotypic performance in environments that are not included in the trial. Accordingly, fitting model 2 could tell us something about the amount of variation due to genotypic main effects in relation to GEI, by comparing sums of squares or mean squares, but it does not bring much progress towards understanding GEI.

The regression on the mean model

A more attractive alternative is to extend the additive model (model 1) by incorporating terms that explain as much as possible of the GEI. A popular strategy in plant breeding is that proposed by Finlay and Wilkinson
(1963), which describes GEI as a regression line on the environmental quality. In the absence of explicit environmental information, the biological quality of an environment can be reflected in the average performance of all genotypes in that environment. Good environments will have a high average genotypic performance, and bad environments will have a low average genotypic performance. The GEI part is then described by genotype-specific regression slopes on the environmental quality, and the model can be written in the following equivalent ways:

$$\underline{\mu}_{ij} = \mu + G_i + E_j + b_i E_j + \underline{\varepsilon}^*_{ij} \qquad [3a]$$

$$\underline{\mu}_{ij} = G'_i + b'_i E_j + \underline{\varepsilon}^*_{ij} \qquad [3b]$$

Models 3a and 3b are equivalent. Model 3b follows from model 3a by taking $\mu + G_i = G'_i$ and $E_j + b_i E_j = (1 + b_i) E_j = b'_i E_j$. Model 3b is easier to interpret because it consists of a set of regression lines; each genotype has a linear reaction norm with intercept $G'_i$ and slope $b'_i$. The explanatory environmental variable in these reaction norms is simply the environmental main effect $E_j$. Model 3a shows more clearly how the model attempts to capture GEI by a regression on the environmental main effect, with the hope that the term $b_i E_j$ will contain as much as possible of the original GEI signal.

In the regression on the mean model, GEI is explained in terms of differential sensitivities to the improvement of the environment, with some genotypes (the ones with larger values of $b_i$) benefiting more than others from an increase in environmental quality. Note that in model 3a, $\sum_i b_i = 0$, so that the average slope value is zero, while in model 3b the average value of $b'_i$ is 1, meaning that $b'_i > 1$ for genotypes with a higher than average sensitivity, and $b'_i < 1$ for genotypes that are less sensitive than average.

Table 2 gives the fit of model 3a to the maize example data. The first two rows of the table, corresponding to the genotypic and environmental main effects, are identical to Table 1. The third row corresponds to the GEI effect in terms of the regression on environmental quality, where quality is represented by the environmental mean. This regression is highly significant, according to the F tests (F = 2.4, P < 0.001). The residual sum of squares in Table 1 (SS = 813) has been divided into a part explained by genotypic sensitivities to environmental quality ($SS_b$ = 230), and a residual ($SS_{\varepsilon^*}$ = 583).

Table 2. ANOVA table for the regression on the mean model (model 3), as applied to CIMMYT maize stress trials

Term     DF      SS      MS       F         P
E        7       5679    811.2    1752.3    < 0.001
G        210     614     2.9      6.3       < 0.001
B        210     230     1.1      2.4       < 0.001
ε*       1260    583     0.5
Total    1687    7106    4.2

[Figure 6: yield (axis from 0 to 8) plotted against environmental quality (axis from -3 to 4) for five genotypes, with the environments LN96b, LN96a, SS94a, SS92a, IS94a, HN96b, IS92a and NS92a marked along the environmental quality axis. Curve parameters shown next to the genotype labels:]

Genotype    G      b
G025        3.8    1.27
G045        4.0    0.99
G016        2.3    1.18
G012        2.5    1.01
G008        2.1    0.65

Figure 6. Response curves of five maize genotypes in relation to environmental quality. The vertical dashed line indicates the average environment. Next to genotype labels, the values for the two curve parameters are given, the intercept (G = the response in the average environment) and the slope (b = the genotypic sensitivity to changes in the environmental quality). Arrows point to the average yield of each environment as deviation from the overall mean.

By way of example, the fitted reaction norms of five genotypes (out of the full set of 211 genotypes) have been given in Figure 6, together with the parameters estimated according to the parameterisation in model
3b (G and b). Figure 6 shows that, in the average where the GEI is now explained by K multiplicative
environment, genotypes G025 and G045 are better than terms (k = 1…K), each multiplicative term formed by
G008, G012 and G016. The estimates of the parameters the product of a genotypic sensitivity bik (also known
G can directly be observed in the plot as the value as ‘genotypic score’) with a hypothetical environmental
corresponding to environment quality equal to 0, i.e. the characterisation, zjk (also known as ‘environmental score’).
average environment (indicated by the dashed vertical Although genotypic and environmental scores are deemed
line). Although G045 does slightly better than G025 in to represent genetic and environmental qualities, they
the average environment, G025 is superior to G045 come from a mathematical procedure that maximises the
in the high-quality environments. This is because G025 variation explained by the products of the genotypic and
has a better ability to exploit improved environmental environmental scores. The first product term is the one
conditions, which is reflected in the higher genotypic that explains most of the variation, followed by the second
sensitivity of the former (bG025 = 1.27 > bG045 = 0.99). one, and so on. This is reflected in Table 3, which shows
A similar observation can be made with respect to G008 the application of the AMMI model to the maize example
versus G012 and G016. While G008 does relatively data. In the AMMI model, GEI is explained by two axes
better in low quality environments, it is clearly surpassed (principal component 1, PCA1, and principal component
by G012 and G016 in the best environments, since it is 2, PCA2) that are highly significant (F = 2.8 and 2.0
not capable of profiting from the better environmental respectively, both with an associated P < 0.001). The first
conditions (bG008 = 0.65, which is the lowest sensitivity axis (PCA1) explains the largest part (SSPCA1 = 242), the
among the five genotypes). second one explains a little less (SSPCA2 = 173), with a total
explained sum of squares for GEI of 242+173 = 415, a
In summary, the regression on the mean model describes clear improvement over the explained sum of squares in
GEI in terms of parameters that can be given some the regression on the mean model (SSb = 230).
biological meaning. In addition, and in contrast with the A desirable property of the AMMI model is that the
full interaction model (model 2), model 3 can be used to genotypic and environmental scores can be used to
predict the performance of genotypes in environments construct powerful graphical representations called
that were not present in the experiment. This is biplots (Gabriel, 1978) that help to interpret the GEI.
provided that the environment for which predictions Figure 7 present a biplot for the maize data. A first thing
are required can reasonably be placed within the to recognise is that both genotypes and environments
range of environments used in the original experiment. are present in the same plot; genotypes are represented
Nevertheless, the regression on the mean model suffers by open circles and environments by filled rectangles.
from the fact that the environmental characterisation A second important characteristic is the presence of
is based on a single dimension. Environmental quality environmental axes that allow approximations of GEI
can be hard to summarise within a single explanatory for individual genotypes in a given environment. These
variable. Therefore, a substantial amount of GEI can environmental axes pass through the origin and point
remain unexplained. In the next section, the regression in the direction of the corresponding environment
on the mean model will be extended by including symbol. To avoid a too large number of lines in the plot,
multidimensional environmental characterisations in the environmental axes have been drawn only from the
statistical model for the genotype-by-environment data. origin to the environmental symbol, but as is shown for
environment NS92a, axes can be prolonged. To help
The additive main effects and
multiplicative interaction model
The limitation of a single dimension in environmental Table 3. ANOVA table corresponding to application of
characterisation can be eased by employing a more AMMI2 model (model 4) to CIMMYT maize stress trials.
flexible model, in which more than one environmental PCA1 and PCA2 are the principal component axes 1 and 2,
quality variable is allowed. A popular model of this type respectively.
is the so-called ‘AMMI model’ – the additive main effects Term DF SS MS F P
and multiplicative interaction model (Gabriel, 1978;
Gauch, 1988; Gollob, 1968; Mandel, 1969). To emphasise E 7 5679 811.2 1752.3 < 0.001
the parallelism with model 3a, the AMMI model can be G 210 614 2.9 6.3 < 0.001
written as: PCA1 216 242 1.1 2.8 < 0.001
PCA2 214 173 0.8 2.0 < 0.001
K * 1040 398 0.4
μij = μ + Gi + Ej + bikzjk + ij * [4] Total 1687 7106 4.2
k=1

129
Marcos Malosetti et al

approximating the GEI, a scale can be indicated on the environmental axes, as shown for the axis of NS92a (Graffelman and van Eeuwijk, 2005).

Biplots facilitate the exploration of relationships between genotypes and/or environments. Genotypes that are similar to each other are closer in the plot than genotypes that are different. Similarly, environments that are more alike tend to group together as well. The angle between the environmental axes is related to the correlation between the environments. An acute angle indicates positive correlation (eg, between LN96a and LN96b), a right angle indicates no correlation (eg, between HN96b and NS92a), and an obtuse angle indicates negative correlation (eg, between NS92a and LN96a). The projection of a genotype onto an environmental axis reflects the relative performance of the genotype in that particular environment (for GEI). For example, the projection of genotype G91 on the NS92a axis gave a value of +2, which is above 0, indicating a positive interaction with that environment (ie, a good adaptation to NS92a). Conversely, genotype G41 (on the left hand side of the plot) gave a negative value (of almost -3), which points to a negative interaction with environment NS92a (ie, not well adapted to this environment). Following a similar procedure it is possible to conclude that while genotype G91 showed a positive adaptation to environment NS92a, it is not well adapted to environments LN96a and LN96b (note that the projection of this genotype on the LN96a and LN96b axes would fall in the range of negative values). Biplots are useful tools to investigate patterns in GEI, because they can help to quickly identify interesting genotypes that are adapted to particular environments, and to classify environments in groups.

Plant breeders are interested in the whole of the genetic variation and not exclusively in the GEI part. For that reason, it is useful to have a modification of model 4 that writes the joint effects of the genotypic main effect and the GEI as a sum of multiplicative terms. Effectively, the two-way table of genotype-by-environment means is exposed to a standard principal components analysis, with genotypes as objects and environments as variables (Yan et al, 2000). For this new model, 5, closely the same estimation and interpretation procedures hold as […]

The results of model 5 fitted to the maize data are presented in the form of a biplot in Figure 8. In GGE biplots, genotypes are distributed according to the overall performance in each environment. This is in contrast to Figure 7, which concentrated exclusively on GEI. From Figure 8 we see that high yielding genotypes are concentrated on the right hand side of the biplot as their projections on the environmental axes would fall mostly on the positive range of values (see for example genotype G91 that gave a performance of 2.5 above average in NS92a, while genotype G41 gave 3 below average). On the contrary, low yielding genotypes (such as genotype G41) are concentrated on the left hand side of the biplot.

Factorial regression models

The models discussed so far assumed that we do not have explicit information about the environments. While such models can be useful to explain GEI, the biological interpretation of their results is not always obvious. What do hypothetical environmental variables, as in AMMI, mean in terms of quantifiable environmental characteristics such as temperature, water, nutrients etc? A straightforward approach is to correlate environmental scores with environmental covariables. However, if we do have explicit information about the environment, the information can be used directly in the model by including it in the form of explanatory variables. GEI is then described as differential genotypic sensitivity to explicit environmental factors such as temperature, precipitation, water availability etc. Such models are known as ‘factorial regression models’ (Denis, 1988; van Eeuwijk et al, 1996). Two examples of factorial regression models are given here. Model 6a includes a single environmental covariable, while model 6b includes multiple environmental covariables:

$$\underline{\mu}_{ij} = \mu + G_i + E_j + b_i Z_j + \underline{\varepsilon}^*_{ij} \qquad [6a]$$

$$\underline{\mu}_{ij} = \mu + G_i + E_j + \sum_{k=1}^{K} b_{ik} Z_{jk} + \underline{\varepsilon}^*_{ij} \qquad [6b]$$

Models 6a and 6b look identical to models 3a and 4, but there is a substantial difference between them. In models 6a and 6b, $Z_j$ represents an explicit environmental covariable and not a hypothetical environmental covariable
for model 4. Because genotypic scores now describe as in models 3a and 4 (it is capitalised as Z to highlight
genotypic main effects G and GEI together, this type of this difference). This distinction is critical since the
model is also known as the ‘Genotype main effects and interpretation of the GEI in models 6a and 6b is placed
GEI model’, or ‘GGE model’. The biplots are called ‘GGE into a more biological context. Instead of describing GEI
biplots’ (Yan et al, 2000). The model reads: as differential reactions to hypothetical environmental
covariables, factorial regression models help to identify
K
genotypes that are differentially sensitive to changes in
μij = μ + Ej + bikzjk + ij * [5] identified environmental quality components, for example,
k=1
in a particular nutrient, or in water availability.
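The single-covariable factorial regression of model 6a can be illustrated with a small least-squares sketch. It assumes a complete, balanced two-way table of means, in which case the sensitivities b_i can be obtained by regressing the additive-model residuals on the centred covariable. The function name, the data values and the covariable (a made-up minimum temperature) are hypothetical and only serve the illustration; the sensitivities are expressed as deviations from the average sensitivity, so they sum to zero.

```python
def factorial_regression(y, z):
    """Fit mu_ij = mu + G_i + E_j + b_i * Z_j to a complete two-way table.

    y: list of rows (genotypes), each a list of means per environment.
    z: one explicit environmental covariable value per environment.
    Returns (mu, G, E, b); b holds genotypic sensitivities as deviations
    from the average sensitivity (they sum to zero over genotypes).
    """
    n_g, n_e = len(y), len(z)
    mu = sum(sum(row) for row in y) / (n_g * n_e)
    G = [sum(row) / n_e - mu for row in y]
    E = [sum(row[j] for row in y) / n_g - mu for j in range(n_e)]
    z_mean = sum(z) / n_e
    zc = [v - z_mean for v in z]          # centred covariable
    szz = sum(v * v for v in zc)
    b = []
    for i in range(n_g):
        # residual from the additive model = GEI (plus error)
        gei = [y[i][j] - mu - G[i] - E[j] for j in range(n_e)]
        b.append(sum(r * v for r, v in zip(gei, zc)) / szz)
    return mu, G, E, b


# Hypothetical example: 3 genotypes, 4 environments, no error term.
true_G = [1.0, 0.0, -1.0]
true_E = [-2.0, -1.0, 1.0, 2.0]
true_b = [0.5, 0.0, -0.5]
z = [10.0, 12.0, 14.0, 16.0]              # assumed covariable values
z_bar = sum(z) / len(z)
y = [[5.0 + g + e + s * (zj - z_bar) for e, zj in zip(true_E, z)]
     for g, s in zip(true_G, true_b)]

mu, G, E, b = factorial_regression(y, z)
print("estimated sensitivities:", [round(v, 3) for v in b])
```

With real MET data the table is rarely this clean, and a general least-squares or mixed model routine would normally be used instead of these closed-form averages.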


Figure 7. Biplot from an AMMI model used to describe GEI in the maize example data. Axes: PCA1 (29.80%) and PCA2 (21.25%). Open circles represent genotypes, filled rectangles environments. Vectors represent environment axes, which for simplification have been drawn only between the origin and the corresponding environment symbol. The full axis is given for environment NS92a, for which a scale is also included (in ton ha⁻¹). By way of example, the projection of two genotypes (G041 and G091) on the NS92a axis is indicated with a dashed line.
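The geometric reading of a biplot (angles between environment vectors, projections of genotypes on environment axes) can be made concrete with a few lines of code. This is only a sketch: the two-dimensional scores below are invented for illustration and are not the scores underlying Figure 7, and the function names are hypothetical.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two environment vectors of a biplot."""
    cos = (u[0] * v[0] + u[1] * v[1]) / (
        math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def fitted_interaction(genotype, environment):
    """Inner product of genotype and environment scores."""
    return genotype[0] * environment[0] + genotype[1] * environment[1]


def projection_length(genotype, environment):
    """Signed length of the projection of a genotype on an environment axis."""
    return fitted_interaction(genotype, environment) / math.hypot(
        environment[0], environment[1])


# Hypothetical scores on the first two axes.
env_a = (2.0, 0.5)
env_b = (1.5, 1.0)     # acute angle with env_a: positive correlation
env_c = (-0.5, 2.0)    # right angle with env_a: no correlation
env_d = (-2.0, -0.2)   # obtuse angle with env_a: negative correlation
g_pos = (1.0, 0.2)     # projects on the positive side of env_a
g_neg = (-1.5, 0.1)    # projects on the negative side of env_a

for name, env in (("b", env_b), ("c", env_c), ("d", env_d)):
    print("angle a-%s: %.1f degrees" % (name, angle_deg(env_a, env)))
print("projection of g_pos on a:", round(projection_length(g_pos, env_a), 2))
print("projection of g_neg on a:", round(projection_length(g_neg, env_a), 2))
```

The sign of the projection is what matters for the interpretation given in the text: a positive value suggests a positive interaction with that environment, a negative value the opposite.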

Figure 8. GGE biplot produced from the fit of a GGE model to the maize example data. Axes: PCA1 (52.25%) and PCA2 (13.65%). See Figure 7 legend for description of symbols and vectors.
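The idea behind a GGE biplot, namely removing only the environmental main effect and then describing what remains (G plus GEI) by multiplicative terms, can be sketched as follows. The example extracts a single multiplicative term by power iteration, which is a simple stand-in for the full principal components analysis mentioned in the text, not the procedure of any particular package. Function names, the starting vector, the symmetric scaling of the scores and the data are all assumptions made for the illustration.

```python
import math


def environment_centred(y):
    """Subtract each environment (column) mean, leaving G + GEI."""
    n_g, n_e = len(y), len(y[0])
    col_means = [sum(row[j] for row in y) / n_g for j in range(n_e)]
    return [[row[j] - col_means[j] for j in range(n_e)] for row in y]


def first_component(x, n_iter=100):
    """First multiplicative term of x, found by power iteration on x'x.

    Returns (genotype_scores, environment_scores) such that
    genotype_scores[i] * environment_scores[j] approximates x[i][j].
    The singular value is split evenly over both sets of scores.
    """
    n_e = len(x[0])
    v = [float(j + 1) for j in range(n_e)]   # arbitrary starting vector
    for _ in range(n_iter):
        xv = [sum(r * c for r, c in zip(row, v)) for row in x]
        w = [sum(row[j] * s for row, s in zip(x, xv)) for j in range(n_e)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    xv = [sum(r * c for r, c in zip(row, v)) for row in x]
    singular_value = math.sqrt(sum(c * c for c in xv))
    scale = math.sqrt(singular_value)
    genotype_scores = [c / singular_value * scale for c in xv]
    environment_scores = [c * scale for c in v]
    return genotype_scores, environment_scores


# Hypothetical table: 3 genotypes x 3 environments, one multiplicative term.
a = [1.0, 0.0, -1.0]           # genotypic pattern (sums to zero)
c = [2.0, 1.0, 3.0]            # environmental pattern
env_effect = [0.0, 1.0, -1.0]
y = [[5.0 + e + ai * cj for e, cj in zip(env_effect, c)] for ai in a]

x = environment_centred(y)
g_scores, e_scores = first_component(x)
print("genotype scores:", [round(s, 3) for s in g_scores])
print("environment scores:", [round(s, 3) for s in e_scores])
```

Because the toy table contains exactly one multiplicative term, the product of the scores reproduces the environment-centred table; with real data two terms are usually retained and plotted against each other.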


Table 4 shows the results of a factorial regression model fitted to the maize example data, in which GEI is explained by differential genotypic sensitivities to the minimum temperature during flowering (minTF, F = 1.7, P < 0.001) and to the amount of radiation during grain filling (radiationGF, F = 1.2, P = 0.038). In many cases, different combinations of explanatory variables could produce closely similar models in terms of the amount of explained GEI. Therefore, to arrive at biologically meaningful models, it is crucial to combine statistical criteria for model selection with physiological knowledge about the trait that is involved.

Table 4. ANOVA table corresponding to application of a factorial regression model (model 6) to CIMMYT maize stress trials.

Term             DF     SS     MS       F        P
E                 7   5679   811.2   1752.3   < 0.001
G               210    614     2.9      6.3   < 0.001
G.minTF         210    172     0.8      1.7   < 0.001
G.radiationGF   210    124     0.6      1.2     0.038
ε* (residual)  1050    517     0.5
Total          1687   7106     4.2

Mixed models for genotype-by-environment interaction: modelling genetic variances and covariances

In the introduction, it is mentioned that GEI can be regarded both in terms of differential mean responses across environments and in terms of heterogeneity of genetic variation and covariation between environments. While the models presented in the previous sections focused on modelling the mean response, the models presented in this section focus on the modelling of GEI in terms of heterogeneity of variances and covariances. This section switches to the framework of so-called mixed models. Rather than going into the details of mixed model theory, it concentrates on presenting the main characteristics of a few, relatively simple yet powerful, mixed models that can be used to model GEI in terms of heterogeneity of variance and covariance. A more detailed description of mixed models can be found in the literature elsewhere (Galwey, 2006; Verbeke and Molenberghs, 2000).

The models discussed in the previous sections were all examples of fixed effects models, because all terms except the residual term were fixed. Testing of model terms in fixed models can be interpreted as comparing mean squares for genotypic and environmental effects against the error term. However, choosing all terms as fixed for more complicated data, as MET data typically are, leads to suboptimal estimation and testing procedures for genetic effects. More realistic estimation and testing procedures can be obtained by taking genotype-related model terms as random as soon as enough genotypes are involved. As a rule of thumb, genetic model terms can be taken as random as soon as more than 10 genotypes are involved. Models with at least two fixed and at least two random terms are called mixed models. A review of the use of mixed models to analyse complex data sets in plant breeding can be found in Smith et al (2005). For the maize example data set, there are 211 genotypes. When the genotypic main effects are taken as random, the following mixed model equivalent of the additive model can be defined as:

$$\mu_{ij} = \mu + \underline{G}_i + E_j + \varepsilon_{ij} \qquad [7]$$

$$\underline{G}_i \sim N(0, \sigma^2_G), \qquad \varepsilon_{ij} \sim N(0, \sigma^2_\varepsilon)$$

It should be recalled that the term $G_i$ is underlined to indicate that it is a random term; its distribution needs to be specified, and is taken by default to be normal, with zero mean and a variance specific to the term. In model 7, two variance components are needed, one corresponding to the random genotypic main effects, $\sigma^2_G$, and a second one corresponding to the residual (which includes true GEI and error). An important consequence of including genotypes as random is that a genetic variance–covariance structure is automatically imposed on the data (in our case, the genotype-by-environment means). To help visualise such a structure, Figure 9 shows a representation of the variance–covariance matrix between observations on two genotypes in eight environments (to stay close to the maize example). The diagonal of the variance–covariance matrix (dark shaded cells) contains the total variance for an individual genotypic observation in a particular environment, which is equal to the sum of the two sources of variation: $\sigma^2_G + \sigma^2_\varepsilon$. The off-diagonals of the matrix contain the covariance between genotypic observations, which depends on the particular pair of genotype-by-environment observations that is considered. It is equal to $\sigma^2_G$ when taking observations on the same genotype in different environments (dashed cells), and it is equal to 0 when taking observations on different genotypes (clear cells). This means that, in model 7, similarities (or covariation, and therefore correlation) between observations made on the same genotype in different environments are assumed, but covariation between


observations from different genotypes (regardless of whether the observations are made in the same or in different environments) is assumed to be absent. In terms of correlations, considering the general definition of a correlation:

$$r(x; y) = \frac{\mathrm{covariance}(x; y)}{\sqrt{\mathrm{var}(x)}\,\sqrt{\mathrm{var}(y)}}$$

where x and y represent genotypic observations in different environments, model 7 imposes a constant correlation between environments, with the correlation between any pair of environments j and j*, where we write $Env_j$ and $Env_{j^*}$ when referring to those environments, being equal to:

$$r(Env_j; Env_{j^*}) = \frac{\sigma^2_G}{\sqrt{\sigma^2_G + \sigma^2_\varepsilon}\,\sqrt{\sigma^2_G + \sigma^2_\varepsilon}} = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_\varepsilon}$$

Although mixed models can be fitted by standard least squares procedures in some special situations, a more general way of fitting mixed models is by the method of residual maximum likelihood, or REML (Patterson and Thompson, 1971). Results of analyses based on REML are presented in another way than the familiar ANOVA tables. As an example, Table 5 shows the results obtained by fitting model 7 to the maize example data.

Table 5 does not contain sums of squares, mean squares or F statistics. Instead, there is a table with three main sections, one with the results for testing fixed model terms, a second with the estimates for the variances of the random terms, and a third with a goodness-of-fit statistic, the deviance, that can be used to compare mixed models with equal fixed terms and differing random terms. For the fixed effects (environments in this case), Table 5 shows a Wald test statistic, the corresponding degrees of freedom (DF), and a P value. The Wald test statistic is used to assess the significance of fixed effects in the REML mixed model framework. Under the null hypothesis of no fixed effects, its distribution is approximately Chi-square with DF equal to the number of independent effects for the particular fixed term. In the maize example, the Wald test statistic for environments is 10265.3 and it has 8 – 1 = 7 degrees of freedom. This Wald statistic has a very low tail probability in the Chi-square distribution under the null hypothesis of no environmental effects (P < 0.001). So, it is concluded that there is a significant difference between environments.

Nowadays, some statistical packages can also provide an F-distributed approximation to the Wald statistic.

Table 5. REML output for a compound symmetry model (model 7), as fitted to CIMMYT maize stress trials.

Fixed terms              Wald       DF   P
E                        10265.3    7    < 0.001

Random terms             Estimate   SE
$\sigma^2_G$             0.297      0.036
$\sigma^2_\varepsilon$   0.553      0.020

Deviance (DF)            1077.9 (1678)

Figure 9. Representation of the covariance matrix for environment-specific means for two genotypes (G001 and G002) across eight environments (j = 1…8). Cells on the diagonal (dark shaded) contain the variance for the genotypic means (equal to $\sigma^2_G + \sigma^2_\varepsilon$). Means for the same genotype (i) but in different environments (j and j*) have covariance equal to $\sigma^2_G$ (cross-hatched cells), and means for different genotypes (i and i*) have covariance equal to 0 (clear cells).
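The block structure described for Figure 9 can be written out explicitly. The sketch below builds the variance–covariance matrix implied by model 7 for a small number of genotypes and environments, and derives the constant correlation between environments from it. The function names are hypothetical, and the variance components used in the example (1.0 and 3.0) are round illustrative numbers, not the maize estimates.

```python
import math


def compound_symmetry_vcov(n_genotypes, n_envs, var_g, var_e):
    """VCOV of genotype-by-environment means under model 7.

    Observations are ordered genotype by genotype, so index
    a = genotype * n_envs + environment.
    """
    n = n_genotypes * n_envs
    vcov = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                vcov[a][b] = var_g + var_e      # diagonal: total variance
            elif a // n_envs == b // n_envs:
                vcov[a][b] = var_g              # same genotype, other environment
            # different genotypes: covariance stays 0
    return vcov


def correlation(vcov, a, b):
    """Correlation between observations a and b implied by vcov."""
    return vcov[a][b] / math.sqrt(vcov[a][a] * vcov[b][b])


# Two genotypes in eight environments, as in Figure 9 (illustrative variances).
vcov = compound_symmetry_vcov(2, 8, var_g=1.0, var_e=3.0)
print("same genotype, environments 1 and 2:", correlation(vcov, 0, 1))
print("same genotype, environments 1 and 8:", correlation(vcov, 0, 7))
print("different genotypes, environment 1: ", correlation(vcov, 0, 8))
```

Whatever pair of environments is chosen, the correlation for the same genotype equals var_g / (var_g + var_e), which is the compound symmetry property discussed in the text.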


The estimates of the two parameters associated to the random terms in the model, $\sigma^2_G$ = 0.297 and $\sigma^2_\varepsilon$ = 0.553, are seen in the second part of Table 5. The magnitude of the variance components can be compared in order to have an impression of the relative importance of genotypic main effects ($\sigma^2_G$) in relation to GEI and error ($\sigma^2_\varepsilon$). In addition, the genetic correlation between any two environments is estimated as:

$$r(Env_j; Env_{j^*}) = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_\varepsilon} = \frac{0.297}{0.297 + 0.533} = 0.358$$

The last row in Table 5 presents the deviance (equal to -2 times the restricted loglikelihood), which is a measure of how well the model fitted to the data. The better the model, the lower the deviance. As will be seen later, the deviance can be used to compare different models in order to select the best model for the data, provided that the fixed part of the model remains unchanged.

Model 7 assumes a constant genetic covariance between environments and a constant genetic variance within environments, thereby determining a constant genetic correlation between environments. In the context of METs, the assumption of constant genetic variance and genetic correlation across environments is, in general, very unrealistic (see, eg, Figure 4). It has already been mentioned that, because of GEI, the genetic variance within environments can change from one environment to the next. If GEI is present, a more realistic model would account for heterogeneity of genetic variance across environments which will, in turn, cause heterogeneous genetic correlations between environments. The model can be written as:

$$\mu_{ij} = \mu + \underline{G}_i + E_j + \varepsilon_{ij} \qquad [8]$$

$$\underline{G}_i \sim N(0, \sigma^2_G), \qquad \varepsilon_{ij} \sim N(0, \sigma^2_{\varepsilon_j})$$

In model 8, there is still a single genetic variance component for genotypes, and therefore, a constant genetic covariance between environments. However, the variance for the term $\varepsilon_{ij}$ is assumed to depend on the environment (ie, the variance component $\sigma^2_{\varepsilon_j}$ is indexed by j). Table 6 presents the results of fitting model 8 to the maize data. Instead of two variance components, there are now nine, one corresponding to the variance component for genotypes ($\sigma^2_G$ = 0.125), and eight corresponding to a form of GEI for each of the eight environments. The heterogeneity of variance for $\varepsilon_{ij}$ reflects the fact that in some environments there is a larger variation (eg, in environment 8, which is the high-yielding NS92a) than in other environments (eg, in environments 1 and 2, which are low-yielding, LN96a and LN96b). In addition, the heterogeneity of variance introduces heterogeneous genetic correlations between environments. For example, the correlation between environment 1 and 2 is:

$$r(Env_1; Env_2) = \frac{0.125}{\sqrt{0.125 + 0.135}\,\sqrt{0.125 + 0.152}} = 0.466$$

and between environments 1 and 8 is:

$$r(Env_1; Env_8) = \frac{0.125}{\sqrt{0.125 + 0.135}\,\sqrt{0.125 + 1.399}} = 0.199$$

In conclusion, model 8 accommodates heterogeneity of variance between environments and, with it, allows for heterogeneous correlations between environments, which can be desirable when analysing environments that strongly differ (eg, with strong stress and without stress).

Table 6. REML output of a mixed model assuming heterogeneity of genetic variance across environments (model 8), as fitted to CIMMYT maize stress trials.

Fixed terms                  Wald       DF   P
E                            9759.4     7    < 0.001

Random terms                 Estimate   SE
$\sigma^2_G$                 0.125      0.017
$\sigma^2_{\varepsilon_1}$   0.135      0.018
$\sigma^2_{\varepsilon_2}$   0.152      0.019
$\sigma^2_{\varepsilon_3}$   0.551      0.057
$\sigma^2_{\varepsilon_4}$   0.704      0.072
$\sigma^2_{\varepsilon_5}$   0.692      0.071
$\sigma^2_{\varepsilon_6}$   0.672      0.069
$\sigma^2_{\varepsilon_7}$   0.761      0.078
$\sigma^2_{\varepsilon_8}$   1.399      0.140

Deviance (DF)                838.4 (1671)

The deviance for model 8 is 838.4 with 1671 DF, which is much lower than the one for model 7 (deviance 1077.9 with 1678 DF). The deviance has dropped, but at the expense of having to estimate more parameters (nine instead of two parameters). Is the decrease in deviance large enough to consider model 8 a significant improvement over model 7? Because model 7 and 8 are nested models (model 7 is a special case of model 8 when the $\sigma^2_{\varepsilon_j}$ are all the same), a deviance test can be used to answer this question. Under the null hypothesis


of no difference in quality of the fits, the difference in deviance between the two models is Chi-square distributed with the number of degrees of freedom equal to the difference in the number of parameters between the models. In the example, the difference in deviance is 1077.9 - 838.4 = 239.5, and the models differ by seven parameters. The P value associated to 239.5 in a Chi-square distribution with 7 DF is very small (P < 0.001), so it is concluded that model 8 provides a significant improvement in model fit.

In cases where the models are not nested, the comparison can be done by the Akaike Information Criterion (AIC) (Akaike, 1974). The AIC is calculated for each model as AIC = deviance + 2p, where p is the number of parameters in the model. For model 7, AIC = 1077.9 + 2×2 = 1081.9, and for model 8 AIC = 838.4 + 2×9 = 856.4. The model that has the lowest AIC value is the one that is chosen. Model 8 has the lowest AIC value, which confirms the conclusion based on the deviance test.

Model 8 assumes heterogeneous variances across environments, in combination with a constant covariance between environments. This latter constraint can be relaxed by also allowing the genetic covariance between environments to be heterogeneous. A possibility is to estimate a covariance parameter for each pair of environments, producing a variance-covariance model that is referred to as the ‘unstructured model’. A somewhat simpler, but often equally effective strategy consists of estimating covariances between groups of environments instead of between individual environments, in which the environments are first grouped in a number of clusters and then fitting the following model:

$$\mu_{i(c)j} = \mu + \underline{G}_{i(c)} + E_j + \varepsilon_{i(c)j} \qquad [9]$$

$$\underline{G}_{i(c)} \sim N(0, \Sigma_c), \qquad \varepsilon_{i(c)j} \sim N(0, \sigma^2_{\varepsilon_j})$$

In model 9 a random genetic main effect is fitted that changes between groups of environments and that has a covariance matrix $\Sigma_c$, with group specific genetic variances on the diagonals and genetic covariances between groups on the off-diagonals. Model 9 retains the residual heterogeneity of model 8, which means that environment specific genotypic effects are added to the group (of environments) specific effects. To illustrate model 9 using the same example, and based on Figure 8, the environments were clustered in three groups: group 1 = (LN96a, LN96b), group 2 = (SS92a, SS94a, IS92a, IS94a, HN96b), and group 3 = (NS92a). Therefore, the covariance matrix $\Sigma_c$ will contain on the diagonal the genetic variances for groups 1, 2, and 3 ($\sigma^2_{c_1}$, $\sigma^2_{c_2}$, and $\sigma^2_{c_3}$ respectively), and on the off-diagonals the covariances between the groups ($\sigma_{c_1c_2}$, $\sigma_{c_1c_3}$, and $\sigma_{c_2c_3}$). The full covariance matrix can be written as:

$$\Sigma_c = \begin{pmatrix} \sigma^2_{c_1} & & \\ \sigma_{c_1c_2} & \sigma^2_{c_2} & \\ \sigma_{c_1c_3} & \sigma_{c_2c_3} & \sigma^2_{c_3} \end{pmatrix}$$

The results of fitting model 9 to the maize data are presented in Table 7, where the estimates of the parameters in the covariance matrix $\Sigma_c$ can be found. Figure 10 shows a representation of how the VCOV matrix is structured according to model 9.

Table 7. REML output of a mixed model assuming heterogeneity of genetic covariance between groups of environments and heterogeneity of genetic variance (model 9), fitted to CIMMYT maize stress trials.

Fixed terms                  Wald       DF   P
E                            6268.8     7    < 0.001

Random terms                 Estimate   SE
$\sigma^2_{c_1}$             0.042      0.013
$\sigma^2_{c_2}$             0.439      0.053
$\sigma^2_{c_3}$             1.000      -
$\sigma_{c_1c_2}$            0.109      0.019
$\sigma_{c_1c_3}$            0.115      0.032
$\sigma_{c_2c_3}$            0.551      0.077
$\sigma^2_{\varepsilon_1}$   0.145      0.018
$\sigma^2_{\varepsilon_2}$   0.138      0.017
$\sigma^2_{\varepsilon_3}$   0.446      0.051
$\sigma^2_{\varepsilon_4}$   0.508      0.057
$\sigma^2_{\varepsilon_5}$   0.445      0.052
$\sigma^2_{\varepsilon_6}$   0.428      0.050
$\sigma^2_{\varepsilon_7}$   0.740      0.080
$\sigma^2_{\varepsilon_8}$   0.736      0.169

Deviance (DF)                619.9 (1667)

The diagonals of $\Sigma_c$ show that, on average, the genetic variation is lower in group 1 (the group of nitrogen stress environments) than in group 2. It should be noted that because group 3 is composed of a single environment, $\sigma^2_{c_3}$ and $\sigma^2_{\varepsilon_8}$ are confounded, so $\sigma^2_{c_3}$ cannot be estimated and is given the arbitrary value of 1. In other words, for group 3 the genetic variation cannot be partitioned into a component due to the group and a residual, and the total variance is estimated as 1.000 + 0.736. The total variance in each of the other environments is equal to the sum of the group’s variance plus the environment-specific residual variance. For example, the variance in environment 1 is equal to 0.187, which is the sum of the variance of group 1, ie, $\sigma^2_{c_1}$ = 0.042, and $\sigma^2_{\varepsilon_1}$ = 0.145. Recalling that the covariance


between environments within the same group is given by $\sigma^2_{c_1}$, $\sigma^2_{c_2}$ and $\sigma^2_{c_3}$, and the covariance between environments in different groups by $\sigma_{c_1c_2}$, $\sigma_{c_1c_3}$ and $\sigma_{c_2c_3}$, the correlation between any pair of environments can be estimated. For example, the correlation between environments 1 and 2 is:

$$r(Env_1; Env_2) = \frac{0.042}{\sqrt{0.042 + 0.145}\,\sqrt{0.042 + 0.138}} = 0.229$$

and between environments 1 and 8 is:

$$r(Env_1; Env_8) = \frac{0.115}{\sqrt{0.042 + 0.145}\,\sqrt{1.000 + 0.736}} = 0.202$$

Finally, the deviance can be checked to evaluate whether the allowance for heterogeneity of covariance between environments improved the quality of the model.

The deviance for model 9 is 619.9 with 1667 DF. The difference in deviance with model 8 is 218.5, with four extra parameters. The associated P value for 218.5 in a Chi-square distribution with 4 DF is very low (P < 0.001), so it can be concluded that model 9 is a significant improvement over model 8. The AIC for model 9 (AIC = 619.9 + 2×13 = 645.9) is smaller than for model 8 (AIC = 856.4), which confirms this conclusion. A summary of the comparison of models 7 to 9 is presented in Table 8.

Three different mixed models that can be used to model GEI in terms of heterogeneity of variance and covariance between environments have been formulated. First presented was the compound symmetry model, which is the commonly used default model when fitting a mixed model to a two-way table of means. Next, two alternative models were presented that accommodated heterogeneity of genetic variances across environments, and heterogeneity of genetic variances as well as covariances across environments. More VCOV models are possible (Boer et al, 2007; Malosetti et al, 2004), but their discussion is outside the scope of this chapter.

$$\Sigma_c = \begin{pmatrix} 0.042 & & \\ 0.109 & 0.439 & \\ 0.115 & 0.551 & 1.000 \end{pmatrix} \;\rightarrow\; \begin{pmatrix} 0.042 & & & & & & & \\ 0.042 & 0.042 & & & & & & \\ 0.109 & 0.109 & 0.439 & & & & & \\ 0.109 & 0.109 & 0.439 & 0.439 & & & & \\ 0.109 & 0.109 & 0.439 & 0.439 & 0.439 & & & \\ 0.109 & 0.109 & 0.439 & 0.439 & 0.439 & 0.439 & & \\ 0.109 & 0.109 & 0.439 & 0.439 & 0.439 & 0.439 & 0.439 & \\ 0.115 & 0.115 & 0.551 & 0.551 & 0.551 & 0.551 & 0.551 & 1.000 \end{pmatrix}$$

$$\mathrm{VCOV} = \begin{pmatrix} 0.042 & & & & & & & \\ 0.042 & 0.042 & & & & & & \\ 0.109 & 0.109 & 0.439 & & & & & \\ 0.109 & 0.109 & 0.439 & 0.439 & & & & \\ 0.109 & 0.109 & 0.439 & 0.439 & 0.439 & & & \\ 0.109 & 0.109 & 0.439 & 0.439 & 0.439 & 0.439 & & \\ 0.109 & 0.109 & 0.439 & 0.439 & 0.439 & 0.439 & 0.439 & \\ 0.115 & 0.115 & 0.551 & 0.551 & 0.551 & 0.551 & 0.551 & 1.000 \end{pmatrix} + \mathrm{diag}(0.145,\ 0.138,\ 0.446,\ 0.508,\ 0.445,\ 0.428,\ 0.740,\ 0.736)$$

Figure 10. The variance–covariance (VCOV) matrices resulting from the estimated parameters from model 9 (lower triangles shown). $\Sigma_c$ is the VCOV between groups and it is extended to the original eight environments (represented by the 8×8 matrix following the arrow). The total VCOV is given by the sum of two matrices: the between group VCOV and the VCOV containing the environment-specific variances.
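The construction shown in Figure 10 can be reproduced in a few lines: the between-group matrix is expanded to the eight environments and the environment-specific variances are added on the diagonal. The sketch below uses the estimates reported in Tables 6 and 7; the function and variable names are hypothetical, and model 8 is treated as the special case with a single group, which is an interpretation made for this example.

```python
import math


def grouped_vcov(group_cov, env_group, residual_var):
    """Genetic VCOV between environments: group (co)variances plus
    environment-specific variances on the diagonal."""
    n = len(env_group)
    return [[group_cov[env_group[j]][env_group[k]]
             + (residual_var[j] if j == k else 0.0)
             for k in range(n)] for j in range(n)]


def correlation(vcov, j, k):
    """Correlation between environments j and k (0-based indices)."""
    return vcov[j][k] / math.sqrt(vcov[j][j] * vcov[k][k])


# Model 9 (Table 7): three groups, environments 1-2, 3-7 and 8.
GROUP_COV_M9 = [[0.042, 0.109, 0.115],
                [0.109, 0.439, 0.551],
                [0.115, 0.551, 1.000]]
ENV_GROUP_M9 = [0, 0, 1, 1, 1, 1, 1, 2]
RESIDUAL_M9 = [0.145, 0.138, 0.446, 0.508, 0.445, 0.428, 0.740, 0.736]

# Model 8 (Table 6): one genetic variance shared by all environments.
GROUP_COV_M8 = [[0.125]]
ENV_GROUP_M8 = [0] * 8
RESIDUAL_M8 = [0.135, 0.152, 0.551, 0.704, 0.692, 0.672, 0.761, 1.399]

vcov9 = grouped_vcov(GROUP_COV_M9, ENV_GROUP_M9, RESIDUAL_M9)
vcov8 = grouped_vcov(GROUP_COV_M8, ENV_GROUP_M8, RESIDUAL_M8)

print("model 9, r(Env1; Env2) =", round(correlation(vcov9, 0, 1), 3))
print("model 9, r(Env1; Env8) =", round(correlation(vcov9, 0, 7), 3))
print("model 8, r(Env1; Env2) =", round(correlation(vcov8, 0, 1), 3))
print("model 8, r(Env1; Env8) =", round(correlation(vcov8, 0, 7), 3))
```

The four printed correlations correspond to the worked examples given in the text for models 8 and 9.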

Table 8. Comparison of the goodness of fit for three different mixed models (models 7 to 9), as fitted to CIMMYT maize stress trials. The columns ‘Δ deviance’ and ‘Δ DF’ indicate the differences in deviance and number of degrees of freedom between subsequent models. The associated P values correspond to a Chi-square distribution with Δ DF degrees of freedom.

Model     Deviance   DF     Δ deviance   Δ DF   P         AIC
Model 7   1077.9     1678                                 1081.9
Model 8   838.4      1671   239.5        7      < 0.001   856.4
Model 9   619.9      1667   218.5        4      < 0.001   645.9
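The comparisons summarised in Table 8 can be recomputed from the deviances and the numbers of variance parameters (2, 9 and 13). The sketch below is one possible way of doing so with the Python standard library only; the function names are hypothetical, and the Chi-square tail probability is obtained from a standard recurrence for integer degrees of freedom rather than from a statistics package.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a Chi-square distribution (integer df)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                  # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))       # exact tail for df = 1
    while a < df / 2.0:
        # step from df = 2a to df = 2a + 2
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def aic(deviance, n_params):
    """AIC = deviance + 2p, as defined in the text."""
    return deviance + 2 * n_params


def deviance_test(reduced, full):
    """Deviance test for nested models given as (deviance, n_params)."""
    delta_dev = reduced[0] - full[0]
    delta_df = full[1] - reduced[1]
    return delta_dev, delta_df, chi2_sf(delta_dev, delta_df)


MODELS = {"model 7": (1077.9, 2), "model 8": (838.4, 9), "model 9": (619.9, 13)}

for name, (dev, p) in MODELS.items():
    print(name, "AIC =", round(aic(dev, p), 1))

for reduced, full in (("model 7", "model 8"), ("model 8", "model 9")):
    d, df, p_value = deviance_test(MODELS[reduced], MODELS[full])
    print("%s vs %s: delta deviance = %.1f, delta DF = %d, P < 0.001: %s"
          % (reduced, full, d, df, p_value < 0.001))
```

As noted in the text, the deviance test is only meaningful when the models are nested and share the same fixed part; the AIC comparison does not require nesting.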


The analysis of a data set is an iterative process consisting of fitting and comparing alternative models to identify a good model for the data under study. That process has been illustrated with a maize data set. The next section goes one step further in the modelling process by including molecular marker information, with the ultimate objective of identifying genomic regions, QTLs, that underlie genetic variation. Within the context of METs, the use of such models is a powerful tool to identify and understand the genetic basis of GEI, that is, QTL by environment interaction (QEI).

QTL mapping in the context of multienvironment trials: modelling main effect QTLs and QTL-by-environment interaction

The initial part of this chapter presented models that use either implicit or explicit environmental characterisations to understand GEI. This section presents models that use explicit genotypic information to describe GEI. Use of such information in statistical models for GEI can help understand the basis of GEI in terms of the action of genome regions, QTLs, in their dependence on the environment, ie, QEI. Nowadays, various molecular marker systems such as RFLP, amplified fragment length polymorphism (AFLP), diversity arrays technology (DArT), microsatellites, single nucleotide polymorphism (SNP), and others, are available, providing information about the DNA composition of an individual at specific chromosome locations. That information can be employed in models for GEI. For example, within the framework of factorial regression models, markers can be used as explanatory variables, which underlies regression-based approaches for QTL mapping (Haley and Knott, 1992; Martínez and Curnow, 1992).

Elaborating upon factorial regression ideas, the following section presents mixed models that can accommodate explicit genotypic information to describe GEI in terms of QTL and QEI effects (Malosetti et al, 2004; Boer et al, 2007; van Eeuwijk et al, 2007). The genotypic information stemming from markers is introduced in the statistical models in the form of so-called genetic predictors. How to construct genetic predictors from marker information will be demonstrated. The QTL models developed in this paper are, in the first place, applicable to standard biparental populations, but can be adapted rather easily to other types of population, such as those occurring in association mapping (Malosetti et al, 2007).

Explanatory variables for differences between genotypes: genetic predictors

Most populations in QTL mapping originate from crosses between pairs of inbred lines. A segregating offspring population can be produced from an F1 hybrid after one generation of selfing (F2), after several generations of self-pollination (recombinant inbred lines or RILs), or after crossing the F1 with one of the parental lines (backcross). In addition, by chromosome doubling of F1 gametes, a population of doubled haploid lines can be generated. In all of these cases, two alleles at most will segregate at each locus. Considering a locus referred to as ‘locus M1’, individuals can have the genotypes M1M1, M1m1, or m1m1, where it is assumed that the M1 allele is the allele that comes from the paternal line, and the m1 allele comes from the maternal line. By convention the locus names are given in italics (so for example M1 refers to locus 1, and M1 and m1 refer to the paternal and maternal alleles at locus 1). The relative frequency of the genotypes in the offspring population will depend on the type of population; for example, in an F2 the expected frequencies will be ¼, ½, and ¼, respectively for M1M1, M1m1, and m1m1.

With the help of molecular markers, it can be revealed whether a particular individual is of the M1M1, M1m1, or m1m1 type. To detect QTLs and estimate their effects, it is necessary to translate the marker information into explanatory variables, the genetic predictors. A straightforward way of constructing genetic predictors is to create an explanatory variable that contains the number of copies of one of the alleles, for example the M1 allele. The genetic predictor will then take the value 2 whenever an individual is of the M1M1 genotype, ie, when the offspring individual has two alleles like those of the paternal line. The genetic predictor will further take the value 1 when the offspring individual is M1m1, and 0 when it is m1m1. Using a simple regression model, the slope for the regression of the genotypic means on a genetic predictor defined by the number of M1 alleles corresponds to the effect of a substitution of an m1 allele by an M1 allele at the given locus (Lynch and Walsh 1998; Bernardo 2002). This effect is also known as the additive genetic effect of the QTL allele. By analogy, a dominance genetic predictor can be constructed by creating an explanatory variable with


values 0 when the offspring individual is M1M1 or m1m1, The calculation of genotypic probabilities conditional
and value 1 whenever it is M1m1. Table 9 shows how to on marker information provides the basis for all QTL
convert marker information into genetic predictors. mapping strategies; QTL mapping packages calculate these
probabilities behind the scenes. In addition, there are
With complete information on the marker genotypes, packages such as Grafgen1 (Servin et al, 2002) that can be
ie, codominant markers without missing values, the used to explicitly obtain such conditional probabilities.
construction of genetic predictors at marker positions With the estimated conditional probabilities, the genetic
consists of simply counting the number of alleles predictors at positions where no or partial marker
coming from a particular parent. For genomic positions information is available can be calculated by simply
in between marker loci (putative QTL positions), for inputting the conditional probabilities in expression
dominant markers, and for markers with missing values, 10. An analogous reasoning holds for the estimation of
the construction of genetic predictors requires more dominance genetic predictors:
effort. In a general formulation, the value for the additive
genetic predictor, Xadd, for an offspring individual can be Xdom = Pr (M1M1|markers) × 0
defined as the expected number of alleles coming from + Pr (M1m1|markers) × 1
the paternal line, the number of M1 alleles:
+ Pr (m1m1|markers) × 0. [10b]
Xadd = Pr (M1M1|markers) × 2 This section ends with a small example to illustrate the
+ Pr (M1m1|markers) × 1
+ Pr (m1m1|markers) × 0, [10a]

with Pr (M1M1|markers), Pr (M1m1|markers), and Pr (m1m1|markers) the conditional probabilities of the individual being of the M1M1, M1m1, or m1m1 type, respectively, given the observed marker information. In the case of complete information, the individual's genotype is known, so one of Pr (M1M1|markers), Pr (M1m1|markers) or Pr (m1m1|markers) will be equal to 1. For example, if the individual is of type M1M1, Pr (M1M1|markers) = 1, and Pr (M1m1|markers) = Pr (m1m1|markers) = 0, leading to Xadd = 2.

In the case of incomplete information, although the genotype of an individual may not be known with certainty, information can be obtained from nearby markers to estimate the probability of the offspring individual being of a particular genotype. This probability is a function of the observed genotypes at neighbouring markers and the expected recombination occurring between those marker loci and the locus under evaluation (Lynch and Walsh, 1998). Efficient methods to calculate conditional genetic probabilities for the different types of population commonly used for plants have been proposed in the literature; see Jiang and Zeng (1997) for an exhaustive overview.

Table 9. Conversion table of marker genotypes to additive genetic predictors (Xadd) and dominance genetic predictors (Xdom) for the locus M1.

Marker genotype   Xadd   Xdom
M1M1              2      0
M1m1              1      1
m1m1              0      0

Consider the construction of genetic predictors when there is missing or incomplete marker information. Figure 11 presents a hypothetical case in which six loci, M1 to M6, are considered in an F2 population. All loci map to the same chromosome, at the positions indicated in Figure 11. While for loci M1, M3, and M6 there is complete information, the information for loci M2, M4 and M5 is incomplete: because the marker is dominant (locus M2), because of a missing value (locus M5, individual 2), or because the locus was not observed at all (locus M4; say, a putative QTL position).

Position   Locus   Individual 1   Individual 2   Individual 3
0          M1      m1m1           M1m1           M1M1
7          M2      m2m2           M2?            M2?
15         M3      m3m3           M3m3           M3M3
30         M4      ??             ??             ??
38         M5      m5m5           ??             M5m5
50         M6      m6m6           M6m6           M6m6

Figure 11. Example data consisting of six loci, M1 to M6, located on one chromosome, with the genotypes for three F2 individuals. The question mark (?) indicates unobserved alleles at the locus.

In principle, conditional genotypic probabilities can be calculated for a few marker loci using a pocket calculator, but with many markers the work becomes complicated.

1 Grafgen is freely available at http://fhospital.free.fr/fred/work/programs/grafgen/.
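The conversion from conditional genotype probabilities to genetic predictors (expression 10a and Table 9) can be sketched in a few lines of Python. This is only an illustration: the function name `genetic_predictors` is hypothetical, the dominance predictor is assumed to equal the heterozygote probability (which matches the Xdom column of Table 9), and the probabilities in the last call are made-up values, not data from the maize example.

```python
def genetic_predictors(p_MM, p_Mm, p_mm):
    """Return (Xadd, Xdom) for one individual at one locus.

    Xadd weights each genotype probability by its number of M alleles
    (2, 1, 0), as in expression 10a. Xdom is assumed to be the probability
    of the heterozygote, consistent with the Xdom column of Table 9.
    """
    total = p_MM + p_Mm + p_mm
    if abs(total - 1.0) > 1e-3:
        raise ValueError("genotype probabilities should sum to 1")
    x_add = 2 * p_MM + 1 * p_Mm + 0 * p_mm
    x_dom = p_Mm
    return x_add, x_dom


# Complete marker information reproduces the rows of Table 9.
for genotype, probs in [("M1M1", (1, 0, 0)), ("M1m1", (0, 1, 0)), ("m1m1", (0, 0, 1))]:
    print(genotype, genetic_predictors(*probs))

# Incomplete information: hypothetical probabilities, for illustration only.
print(genetic_predictors(0.25, 0.50, 0.25))
```

With incomplete information the predictors are no longer restricted to the integer values of Table 9; they become probability-weighted averages of those values.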


Calculating conditional genotypic probabilities by hand for many markers is tedious. A more efficient way is to use computer packages such as Grafgen to perform the calculations (see footnote 1). Table 10 shows the conditional genotypic probabilities at locus positions on chromosome 1, as calculated by Grafgen; input files and required commands are given in Appendix II. At positions where there is complete information, the conditional genotypic probability is equal to 1 for one of the genotypes. For example, locus M1 in individual 1 has Pr (m1m1|markers) = 1, and Pr (M1m1|markers) = Pr (M1M1|markers) = 0, indicating that, given the marker information, we are certain that individual 1 is of the genotype m1m1 and, according to expression 10, the additive genetic predictor for individual 1 is Xadd = 0. In addition, because Pr (M1m1|markers) = 0, the dominance genetic predictor for individual 1 is equal to Xdom = 0. In the case of individual 2, Pr (M1m1|markers) = 1, and Pr (m1m1|markers) = Pr (M1M1|markers) = 0, which, in turn, gives Xadd = 1, and Xdom = 1.

Table 10. Conditional genotypic probabilities of three F2 individuals at different chromosome positions on a hypothetical chromosome.

                     Individual 1                     Individual 2                     Individual 3
Position   Marker    Pr(MiMi)   Pr(Mimi)   Pr(mimi)   Pr(MiMi)   Pr(Mimi)   Pr(mimi)   Pr(MiMi)   Pr(Mimi)   Pr(mimi)
0          M1        0          0          1          0          1          0          1          0          0
7          M2        0          0          1          0.0109     0.9891     0          0.9890     0.0110     0
15         M3        0          0          1          0          1          0          1          0          0
30         M4        0.0001     0.0232     0.9767     0.0498     0.9004     0.0498     0.3449     0.6474     0.0077
38         M5        0          0          1          0.0460     0.9080     0.0460     0          1          0
50         M6        0          0          1          0          1          0          0          1          0

A similar reasoning can be followed for the other two loci for which complete marker information is available (loci M3 and M6). Conversely, it is not possible to be sure about the genotype for locus M2 in individuals 2 and 3, because they can be either heterozygous, M2m2, or homozygous, M2M2. Intuitively, a higher chance is expected for individual 3 to be homozygous, M2M2, than for individual 2. This is because in individual 3 both neighbouring markers are homozygous for the allele coming from the first parent. The estimated conditional probabilities confirm this expectation, with Pr(M2M2|markers) = 0.0109 (very low) in individual 2, and Pr(M2M2|markers) = 0.9890 (very high) in individual 3. From the conditional probabilities in Table 10, the additive genetic predictor value for individual 2 at locus M2 can be estimated as Xadd = 0.0109 × 2 + 0.9891 × 1 = 1.0109, and the additive genetic predictor value for individual 3 as Xadd = 0.9890 × 2 + 0.0110 × 1 = 1.9890. The dominance genetic predictors of individuals 2 and 3 are equal to Xdom = 0.9891 and Xdom = 0.0110, respectively. Following the same procedure, the additive and dominance genetic predictors can be calculated at every genomic position, including those positions for which no genetic information was available, as is the case for the locus M4. The values of the additive and dominance genetic predictors for this example are presented in Figure 12.

Xadd
Position   Locus   Individual 1   Individual 2   Individual 3
0          M1      0              1              2
7          M2      0              1.0109         1.9890
15         M3      0              1              2
30         M4      0.0235         1.0000         1.3373
38         M5      0              1.0000         1
50         M6      0              1              1

Xdom
Position   Locus   Individual 1   Individual 2   Individual 3
0          M1      0              1              0
7          M2      0              0.9891         0.0110
15         M3      0              1              0
30         M4      0.0232         0.9004         0.6474
38         M5      0              0.9080         1
50         M6      0              1              1

Figure 12. Translation of the molecular marker information into additive genetic predictors (Xadd, upper table) and dominance genetic predictors (Xdom, lower table), of six loci (M1 to M6) for three hypothetical F2 individuals with information on five molecular markers.

Modelling genotype-by-environment interaction in terms of QTL effects

The inclusion of genetic predictors in a GEI model allows testing the hypothesis that the DNA at a particular genome position has an effect on a phenotypic trait, and that this effect is environment dependent, ie, that there is QEI. Taking the phenotypic model 7 as a starting point, this model is extended to accommodate two new terms.


The first new term is for the additive genetic effect of a possible QTL ($X_i^{add}\alpha_j$), and the second is for the dominance effect of the same locus ($X_i^{dom}\delta_j$):

$$\mu_{ij} = \mu + E_j + X_i^{add}\alpha_j + X_i^{dom}\delta_j + G_i^* + \varepsilon_{ij}^* \qquad [11]$$

where $X_i^{add}$ and $X_i^{dom}$ stand for the values of the additive and dominance genetic predictors of individual i at a position at which a QTL is postulated and tested for. The parameters $\alpha_j$ and $\delta_j$ represent the additive and dominance effects of this QTL. In model 11, both types of QTL effects are indexed by j, because environment-specific effects are allowed for the additive and dominance QTL effects. Residual genetic main effects (ie, genetic effects not explained by the QTL) contribute to the random genetic effect, $G_i^*$, and residual GEI (residual QEI) contributes to $\varepsilon_{ij}^*$. The test for the presence of a QTL at a particular position is based on a Wald test (Verbeke and Molenberghs, 2000) that tests for the environment-specific additive and dominance genetic effects being null across all environments: $H_0$: $\alpha_j = 0$, and $H_0$: $\delta_j = 0$, $j = 1 \ldots J$. By definition, dominance effects are deviations from additive effects. Therefore, dominance effects should be tested on the condition that the additive effects are already in the model. In practice, and to ensure that the proper test is used, it is important to include the term for additive genetic effects in the model before the term for the dominance effects, and then use the sequential Wald test (eg, in the GenStat output, this test can be found under the heading 'Sequentially adding terms to fixed model').

For the maize example, Table 11 shows an example of the application of model 11 to a particular genomic position. The table indicates that the dominance effect at this genome position was not significant when applying a test level, $\alpha$ (to be distinguished from the additive QTL effects), of 0.05 (Wald statistic = 13.5 on 8 DF, P = 0.097), and, therefore, the null hypothesis of no dominance effects is not rejected. However, the Wald statistic for the additive genetic effects was highly significant (Wald = 100.9 on 8 DF, P < 0.001), indicating the existence of additive QTL effects. It is still necessary to find out whether they are environment specific, ie, whether a QEI term is needed, or whether a model with just main effect QTL expression would suffice. To this purpose, the environment-specific QTL effects ($\alpha_j$) are partitioned into an additive main effect ($\alpha^{Q}$) and QEI effects ($\alpha_j^{QEI}$), leading to the following model:

$$\mu_{ij} = \mu + E_j + X_i^{add}\alpha^{Q} + X_i^{add}\alpha_j^{QEI} + X_i^{dom}\delta_j + G_i^* + \varepsilon_{ij}^* \qquad [12]$$

If required, a similar partitioning of the QTL effects may be carried out for the dominance effects. As a result of the partitioning of the environment-specific QTL effects, there is a Wald test for QTL main effect and a Wald test for QEI (Table 11). The QEI effects should be tested conditional on the main effect being fitted into the model, ie, the QTL main effect should always precede the term for QEI. In the example, it is observed that the QEI effect is highly significant (Wald = 88.0 on 7 DF, P < 0.001), so it is concluded that QTL effects are dependent on the environment. Since there is significant QEI, no attempt will be made to interpret the QTL main effect. When QEI is not significant, the model can be simplified by omitting the QEI term, as the QTL main effect will suffice to describe the QTL effect.

Table 11. Results of the test for fixed effects in a mixed model including a fixed environment-specific additive ($\alpha_j$) and dominance ($\delta_j$) QTL effect. The additive QTL effect is partitioned into a QTL main effect ($\alpha^{Q}$), and a QEI effect ($\alpha_j^{QEI}$).

Fixed terms                      Wald       DF   P
E                                10875.5    7    < 0.001
Additive effect ($\alpha_j$)     100.9      8    < 0.001
  $\alpha^{Q}$                   12.8       1    < 0.001
  $\alpha_j^{QEI}$               88.0       7    < 0.001
Dominance effect ($\delta_j$)    13.5       8    0.097

A QTL mapping strategy for multi-environment trials based on mixed models

The preceding section presented a number of models that can be useful in the detection of QTLs for MET data. The present section presents a strategy for a genome-wide scan for QTLs, ie, QTL mapping. This can be regarded as a model selection process which, essentially, aims to identify a model that describes the phenotypic response in terms of QTL effects. However, neither the number of QTLs actually involved nor their effects are known. Therefore, a strategy is needed to explore efficiently the vast range of possible models to arrive at a suitable one. There is no unique way of performing this search, because many strategies are eligible. An effective strategy is presented here, consisting of the following steps: (i) find a good model for the phenotypic data; (ii) perform a genome-wide scan for QTLs by simple interval mapping (SIM); (iii) perform one or more rounds of composite interval mapping (CIM) with cofactors selected from the SIM step; and (iv) fit a final multi-QTL model to estimate QTL effects. Each step is illustrated using the maize example data. In the Appendix, a sketch is provided of the code that performs the different steps in GenStat Discovery® (Payne et al, 2003), which can serve as a template for user-defined programs.
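As an aside on the Wald tests used throughout this strategy (for example those in Table 11), the sketch below shows how a Wald statistic could be converted to a P value, assuming the statistic is referred to a chi-square distribution with the listed degrees of freedom. The function `chi2_sf` is a hypothetical helper written with the Python standard library only; it is not the routine GenStat uses, and small differences from the tabulated P values are to be expected, presumably because the tabulated statistics are rounded.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x < 0 or df < 1 or int(df) != df:
        raise ValueError("x must be non-negative and df a positive integer")
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    # Recurrence for the regularised upper incomplete gamma function:
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Wald statistics and degrees of freedom copied from Table 11.
table_11 = [
    ("Additive effect", 100.9, 8),
    ("QTL main effect", 12.8, 1),
    ("QEI", 88.0, 7),
    ("Dominance effect", 13.5, 8),
]
for term, wald, df in table_11:
    print(f"{term}: Wald = {wald} on {df} DF, P = {chi2_sf(wald, df):.4f}")
```

Because the sequential Wald statistics are additive, the main effect and QEI statistics (12.8 + 88.0) add up to the statistic for the full environment-specific additive effect (100.9), up to rounding.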


Step 1: Identify the best model for the phenotypic data
A number of models can be fitted (for example models 7 to 9) and compared on the basis of their AIC values. The selected mixed model will represent the starting point from which to develop a QTL model. Table 8 gives the AIC for three candidate models for the maize example data. Model 9 had the lowest AIC and was, therefore, chosen as the basic phenotypic model.

Step 2: Genome-wide QTL scan, simple interval mapping
After choosing the phenotypic model, a genome-wide scan is performed by fitting single QTL models across the genome, at marker positions and in between marker positions, ie, SIM. To perform SIM, we need to estimate genetic predictors that cover the genome. A reasonable predictor coverage for most population types and population sizes is obtained by calculating the genetic predictors at a distance of 5 to 10 cM. The genetic predictors are used to extend the phenotypic model and to test for a QTL effect at the predictor locations. Model 9 was selected for the maize data set, and the SIM scan can then be done by fitting the following model at every genetic predictor position:

$$\mu_{ij} = \mu + E_j + X_i^{add}\alpha_j + X_i^{dom}\delta_j + G_i^* + \varepsilon_{ij}^* \qquad [13]$$

The results of a genome-wide SIM scan can be visualised as in Figure 13, where the P value of the Wald test is plotted (for convenience expressed as the -log10 of the P value) for the additive and dominance effect along one chromosome. Dominance is tested conditionally on the additive effect being in the model. The horizontal line indicates a threshold value, above which the null hypothesis of no QTL is rejected. In Figure 13 the profile for the dominance effect does not go beyond the threshold level but the profile for the additive effect does, having a maximum around 70 cM. The conclusion is that there is a QTL with a significant additive effect on yield in this part of the chromosome. Scanning the results across the full set of chromosomes produces a list of putative QTL positions that can be used in the following stage of the QTL mapping.

[Figure 13: plot of -log10(P) (y-axis, 0 to 7) against map position in cM (x-axis, 0 to 140).]

Figure 13. SIM chromosome scan for one chromosome. The profiles correspond to the -log10 of the P value associated with the null hypothesis of no additive QTL effect (solid line), and with the null hypothesis of no dominance QTL effect (dashed line). The horizontal line indicates a threshold value above which the null hypothesis is rejected.

SIM implies performing multiple tests along the genome, one test at each putative QTL position. For example, for the maize data genetic predictors were calculated at 423 chromosome positions, which means that model 13 was fitted 423 times (and, therefore, 423 tests were performed). When performing multiple tests, the frequency of false positives (ie, falsely rejecting the null hypothesis) increases dramatically. To avoid overoptimistic conclusions, it is necessary to use some kind of correction for multiple testing, such as the Bonferroni correction. If n tests are performed, the Bonferroni correction defines a comparison-wise type I error rate P* = P / n, which assures an experiment-wise type I error rate of at most P. For example, to accept a maximum of 5% of false rejections in the whole of the experiment (genome-wide), the threshold P* that needs to be used for an individual test at a particular position is equal to P* = 0.05 / n (in the maize example P* = 0.05 / 423 ≈ 0.0001). A disadvantage of the Bonferroni correction is that it is too conservative, which means that some QTLs may go undetected. It is particularly conservative here because it treats the tests as if they were independent, whereas in QTL mapping tests at nearby positions are correlated.

Modifications to the Bonferroni correction in the context of QTL mapping have been proposed by Cheverud (2001), with further modifications proposed by Li and Ji (2005). Both approaches essentially compensate for the fact that, in QTL mapping, tests are correlated by using an estimated effective number of tests (n*) instead of the actual number of tests (n) to calculate P*. For the maize example, the Li and Ji (2005) approach produced a value of n* = 98, which gave the critical P* = 0.05 / 98 ≈ 0.0005, more than four times larger than the threshold given by the Bonferroni correction (423 / 98 ≈ 4.3).
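The two thresholds quoted above can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only: the function name `per_test_threshold` is hypothetical, and the effective number of tests (98) is simply taken from the text rather than recomputed from the marker correlation matrix.

```python
import math


def per_test_threshold(experiment_wise_alpha, n_tests):
    """Comparison-wise threshold P* = P / n for n (effective) tests."""
    return experiment_wise_alpha / n_tests


alpha = 0.05
n_positions = 423   # positions tested in the maize example
n_effective = 98    # effective number of tests reported for the Li and Ji (2005) approach

for label, n in [("Bonferroni", n_positions), ("Effective number of tests", n_effective)]:
    p_star = per_test_threshold(alpha, n)
    # -log10(P*) is the height of the horizontal threshold line in a profile plot.
    print(f"{label}: n = {n}, P* = {p_star:.6f}, -log10(P*) = {-math.log10(p_star):.2f}")

print(f"Ratio of thresholds: {n_positions / n_effective:.2f}")
```

Expressing the threshold on the -log10 scale makes it directly comparable with the profiles plotted in scans such as Figure 13.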


Step 3: Composite interval mapping
The power of QTL detection can be improved by reducing the background noise caused by QTLs outside the region under test. This is the principle of the CIM approach, simultaneously proposed by Jansen and Stam (1994) and by Zeng (1994). The difference between SIM and CIM is that when performing CIM the model includes a number of cofactors (F) that correct for the effects of the genetic background:

$$\mu_{ij} = \mu + E_j + \sum_{f \in F} X_{if} c_{fj} + X_i^{add}\alpha_j + X_i^{dom}\delta_j + G_i^* + \varepsilon_{ij}^* \qquad [14]$$

In model 14 the term $\sum_{f \in F} X_{if} c_{fj}$ accounts for the effects (additive and dominance) of QTLs outside the region that is being tested ($X_i^{add}$ and $X_i^{dom}$), reducing the error variation and thereby improving the chances of detecting a significant QTL. Various strategies exist for the construction of a set of cofactors to be included in a CIM scan. A pragmatic approach is to use the results from the SIM scan, including the positions indicative of QTLs after SIM as cofactors.

Another issue that needs to be addressed is that when testing in a region close to a cofactor, it is necessary to exclude the particular cofactor from the model to avoid collinearity problems. A popular solution is to choose a window around an evaluation position and, if a cofactor falls inside that window, then the cofactor is excluded from the model. Window size affects the results of a CIM scan, and there are no clear-cut recommendations about which window size to use. For the present example, all cofactors that are on the chromosome being evaluated were excluded, a strategy known as restricted CIM.

The results of the restricted CIM scan for the maize data are presented in Figure 14. The profiles point to six regions for QTLs (all with additive effects; dominance was not significant in any of the cases). One QTL is on each of chromosomes 1, 2, 6 and 10, and two are on chromosome 3.

[Figure 14: ten panels, Chromosome 1 to Chromosome 10, each plotting -log10(P) (y-axis, 0 to 15) against map position in cM (x-axis, 0 to 240).]

Figure 14. Profiles corresponding to the restricted composite interval mapping scan for QTLs in a maize population. The profiles represent the -log10 of the P value associated with the null hypothesis of no additive QTL (solid line) and no dominance QTL (dashed line) effects present at a specific position. When the profile goes beyond the horizontal line (threshold value) there is an indication of a significant QTL. A maximum profile value beyond the threshold indicates the most plausible location of the QTL (indicated by arrows).

Step 4: Establishing a final QTL model
In a subsequent modelling step, the QTLs for all positions that were found significant in the restricted CIM scan are included simultaneously in the mixed model:

$$\mu_{ij} = \mu + E_j + \sum_{q \in Q^{add}} X_{iq}^{add}\alpha_{jq} + \sum_{q \in Q^{dom}} X_{iq}^{dom}\delta_{jq} + G_i^* + \varepsilon_{ij}^* \qquad [15]$$

Model 15 is a multi-QTL model constructed by inclusion of the full set of QTLs identified in the previous CIM scan. QTLs with non-significant effects are removed using Wald tests (conditional on all other QTLs) to arrive at a final model. The final model for our example data is shown in Table 12. No dominance effect was found significant. Table 12 shows that all six QTLs from the CIM scan retained a significant effect in the multi-QTL model. Further, the QTL effects were broken down into QTL main effects ($\alpha_q^{Q}$) and QEI effects ($\alpha_{jq}^{QEI}$).
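Returning briefly to the cofactor exclusion rule of Step 3, the sketch below shows one way the choice of cofactors for a given test position could be coded. It is an illustration under stated assumptions: the function name `select_cofactors`, the `half_window` parameter and the cofactor positions are all hypothetical, and the actual analysis in this chapter was done in GenStat.

```python
def select_cofactors(test_chrom, test_pos, cofactors, half_window=None):
    """Return the cofactors kept in the model when testing one position.

    cofactors   : list of (chromosome, position in cM) tuples
    half_window : None applies restricted CIM, dropping every cofactor on
                  the chromosome under test; a number drops only cofactors
                  within that distance (cM) of the tested position.
    """
    kept = []
    for chrom, pos in cofactors:
        if chrom != test_chrom:
            kept.append((chrom, pos))      # other chromosomes always stay in
        elif half_window is not None and abs(pos - test_pos) > half_window:
            kept.append((chrom, pos))      # same chromosome, outside the window
    return kept


# Hypothetical cofactor positions selected after a SIM scan.
cofactors = [(1, 70.0), (3, 40.0), (3, 150.0), (6, 95.0)]

print(select_cofactors(3, 50.0, cofactors))                    # restricted CIM
print(select_cofactors(3, 50.0, cofactors, half_window=20.0))  # window-based CIM
```

Restricted CIM is the most cautious choice with respect to collinearity, at the cost of not correcting for other QTLs on the chromosome being scanned.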


This breakdown made it possible to investigate whether QTL effects were consistent across environments. All QTLs had significant QEI, so environment-specific QTL effects were estimated.

Table 12. Effects in a multiple QTL mixed model with six QTLs with environment-specific effects. For each QTL, a separate test of the QTL main effect and QEI effect is presented. Wald tests for individual QTLs are conditional on the presence of all other QTLs.

Fixed terms            Wald      DF   P
E                      7265.2    7    < 0.001
$\alpha_1$             95.8      8    < 0.001
  $\alpha_1^{Q}$       0.5       1    0.480
  $\alpha_1^{QEI}$     95.3      7    < 0.001
$\alpha_2$             39.8      8    < 0.001
  $\alpha_2^{Q}$       2.3       1    0.129
  $\alpha_2^{QEI}$     37.5      7    < 0.001
$\alpha_3$             31.7      8    < 0.001
  $\alpha_3^{Q}$       8.3       1    0.004
  $\alpha_3^{QEI}$     23.4      7    0.001
$\alpha_4$             26.6      8    < 0.001
  $\alpha_4^{Q}$       12.0      1    < 0.001
  $\alpha_4^{QEI}$     14.6      7    0.042
$\alpha_5$             31.9      8    < 0.001
  $\alpha_5^{Q}$       0.4       1    0.527
  $\alpha_5^{QEI}$     31.5      7    < 0.001
$\alpha_6$             53.3      8    < 0.001
  $\alpha_6^{Q}$       4.3       1    0.038
  $\alpha_6^{QEI}$     49.1      7    < 0.001

The estimated QTL effects are given in Table 13, with QTL effects declared significant when zero is outside the confidence interval of the estimated QTL effect (CI = estimate ± 2se, with se the average standard error obtained from the REML analysis). Note that we now refer to the different QTLs as Q1, Q2, Q3, etc, and following the same convention the paternal and maternal alleles at QTL Q1 are indicated by Q1 and q1 respectively. The first QTL, Q1, had a significant effect of 0.551 ton·ha-1 in environment SS92a, which means that for each replacement of q1 by Q1, a yield increase of about half a ton per hectare is expected. The effect of the same QTL in environment HN96b had a negative sign (-0.263 ton·ha-1), which means that rather than an increase, a decrease in yield is expected for the same substitution of q1 by Q1. It is remarkable that the effects of Q1 ($\alpha_1$) were not only inconsistent across environments in terms of the size of the effects, but also in terms of the sign of the effect.

Inconsistency in size and sign of QTL effects underlies crossover interactions, the most important case of GEI (see bottom right panel of Figure 2). From the breeder's point of view, the crossover QEI means that, while allele q1 has to be selected when breeding for the environment HN96b, allele Q1 will be the choice when selecting for the other environments. The significant effects of Q2 to Q6 were inconsistent in size but not in sign, indicating that the favourable allele came from the second parent for Q2, Q4, and Q5 ($\alpha_2$, $\alpha_4$, and $\alpha_5$ with a negative sign), but from the first parent for Q3 and Q6 ($\alpha_3$ and $\alpha_6$ with a positive sign). This is useful information when selecting complementary lines that combine the good alleles coming from the first parent with the good alleles coming from the second parent.

Modelling QTL effects in relation to environmental information

An interesting possibility with the QTL models presented here is that they allow the inclusion of environmental information to explain QTL effects in terms of sensitivities to environmental factors. Similarly to GEI models in which environmental information can be integrated to describe GEI effects, QEI models can integrate environmental information to describe QEI effects. Expressing QTL effects in terms of sensitivities to a particular environmental factor allows prediction of the effect of the QTL under any condition within the range of the original experiments. In addition, the inclusion of environmental information can help unravel the physiological mechanisms that are behind the action of a particular QTL.

The final QTL model for the maize example data consisted of six QTLs, contained in the set $Q^{add}$, with environment-specific additive effects:

$$\mu_{ij} = \mu + E_j + \sum_{q \in Q^{add}} X_{iq}^{add}\alpha_{jq} + G_i^* + \varepsilon_{ij}^* \qquad [16]$$

Table 13. QTL effect estimates (ton ha-1) for individual environments (significant effects marked with *). A positive sign indicates that the superior allele comes from P1 (drought-susceptible genotype), and a negative sign indicates the superior allele comes from P2 (drought-tolerant genotype). Effects with an absolute value larger than 2 × avg(se) [avg(se) = average standard error] were declared significant. The average standard error is given in the last row.

Environment   $\alpha_1$   $\alpha_2$   $\alpha_3$   $\alpha_4$   $\alpha_5$   $\alpha_6$
LN96a         0.014        -0.019       0.021        -0.202*      -0.090       0.064
LN96b         0.022        0.139        0.156        -0.062       0.113        0.054
SS92a         0.551*       -0.153       0.118        -0.019       -0.002       0.098
SS94a         0.198*       -0.135       0.070        -0.233*      0.008        0.394*
IS92a         0.408*       -0.384*      0.130        -0.274*      0.044        0.369*
IS94a         0.389*       -0.063       0.183*       -0.138       0.068        0.291*
HN96b         -0.263*      0.129        0.390*       -0.040       -0.166*      0.616*
NS92a         0.448*       -0.413*      -0.036       -0.137       -0.286*      0.310*
Avg(se)       0.087        0.101        0.086        0.095        0.081        0.086
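The significance rule stated in the caption of Table 13, and the notion of crossover QEI discussed for Q1, can be checked with a short Python sketch. The estimates and average standard errors are copied from Table 13; the helper names `significant` and `has_crossover` are hypothetical, and treating a QTL as showing crossover when its significant effects differ in sign is an assumption made here for illustration.

```python
environments = ["LN96a", "LN96b", "SS92a", "SS94a", "IS92a", "IS94a", "HN96b", "NS92a"]

# QTL effect estimates (ton/ha) per environment, copied from Table 13.
effects = {
    "Q1": [0.014, 0.022, 0.551, 0.198, 0.408, 0.389, -0.263, 0.448],
    "Q2": [-0.019, 0.139, -0.153, -0.135, -0.384, -0.063, 0.129, -0.413],
    "Q3": [0.021, 0.156, 0.118, 0.070, 0.130, 0.183, 0.390, -0.036],
    "Q4": [-0.202, -0.062, -0.019, -0.233, -0.274, -0.138, -0.040, -0.137],
    "Q5": [-0.090, 0.113, -0.002, 0.008, 0.044, 0.068, -0.166, -0.286],
    "Q6": [0.064, 0.054, 0.098, 0.394, 0.369, 0.291, 0.616, 0.310],
}
avg_se = {"Q1": 0.087, "Q2": 0.101, "Q3": 0.086, "Q4": 0.095, "Q5": 0.081, "Q6": 0.086}


def significant(effect, se):
    """Zero lies outside estimate +/- 2*se, ie |effect| > 2*se."""
    return abs(effect) > 2 * se


def has_crossover(values, se):
    """True when significant effects of both signs occur (assumed definition)."""
    sig = [v for v in values if significant(v, se)]
    return any(v > 0 for v in sig) and any(v < 0 for v in sig)


for qtl, values in effects.items():
    sig_envs = [env for env, v in zip(environments, values) if significant(v, avg_se[qtl])]
    print(qtl, "significant in:", sig_envs, "| crossover:", has_crossover(values, avg_se[qtl]))
```

Under this rule only Q1 shows significant effects of opposite sign, in line with the interpretation given in the text.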


It can now be investigated whether the variation in effects of those QTLs is related to changes in one or more external environmental variables (note the analogy with the factorial regression models discussed for GEI, models 6a and 6b). Figure 15 presents a scatter plot of the Q1 effects across environments versus the minimum temperature during flowering time. The plot shows a negative relationship between the QTL effect and temperature.

[Figure 15: scatter plot of QTL effect (ton ha-1; y-axis, -0.2 to 0.6) against minimum temperature during flowering (ºC; x-axis, 10 to 22).]

Figure 15. Q1 effects (ton ha-1) in eight environments versus the minimum temperature (ºC) during flowering in those environments.

Assuming a simple linear relationship between the effect of a QTL and a given environmental covariable, it is possible to test for that relationship using the following model:

$$\mu_{ij} = \mu + E_j + \sum_{q \in Q,\; q \neq q^*} X_{iq}^{add}\alpha_{jq} + X_{iq^*}^{add}\left(\alpha_{q^*} + \beta_{q^*} Z_j + a_{jq^*}\right) + G_i^* + \varepsilon_{ij}^* \qquad [17]$$

Note that for the sake of simplicity, in model 17, the regression of environment-specific QTL effects on environmental covariables is developed for a single QTL only, $Q_{q^*}$. However, the procedure can be applied equally to other QTLs with environment-specific effects. In model 17, the effect of $Q_{q^*}$ is expressed in relation to an environmental covariable (Z), where the effect of the QTL is equal to: $\alpha_{jq^*} = \alpha_{q^*} + \beta_{q^*} Z_j + a_{jq^*}$. $Z_j$ represents the value of the covariable Z for environment j. When $Z_j$ is centred around zero, the parameters of the QTL effects can be interpreted as follows: $\alpha_{q^*}$ corresponds to the effect of $Q_{q^*}$ in the average environment (that is, when Z = 0); $\beta_{q^*}$ corresponds to the change of the QTL effect per unit of change of the covariable's value; and the random term $a_{jq^*}$ corresponds to the residual (unexplained) QTL effect, with $a_{jq^*} \sim N(0, \sigma^2_{a_{q^*}})$. Applying model 17 and choosing Q1 to represent $Q_{q^*}$, while taking minimum temperature during flowering time as covariable, leads to Table 14. Q1 reacts significantly to changes in the minimum temperature during flowering (the Wald test for $\beta$ is highly significant).

Table 14. Wald test statistic and parameter estimates of environment-specific QTL effects (Q1 of the maize example) expressed in terms of the QTL sensitivity to minimum temperature during flowering time. The QTL sensitivity to the temperature is given by the slope parameter ($\beta$), and the intercept term ($\alpha$) can be interpreted as the QTL effect in the average environment.

Fixed term   Wald   DF   P
$\alpha$     5.5    1    0.019
$\beta$      18.1   1    < 0.001

Parameter    Estimate   SE      Units
$\alpha$     0.220      0.071   ton·ha-1
$\beta$      -0.045     0.011   ton·ha-1·°C-1

The estimate for the parameter $\beta$, when the covariable is the minimum temperature during flowering, equals -0.045 ton ha-1 °C-1. In other words, the yield advantage of allele Q1 over allele q1 decreases by 0.045 ton ha-1 for each degree Celsius of increase in the minimum temperature during flowering; equivalently, replacing the allele Q1 by allele q1 gains 0.045 ton ha-1 per degree.

The example assumed a simple linear relationship between the QTL effect and a single environmental covariable, but more complex explanatory models can be constructed. For example, it is possible to include higher order terms to model the response curve (eg, a quadratic term), to use spline formulations, or to include more than one environmental covariable in the model. A close interaction with physiologists is crucial to explore and select biologically sound models.

Conclusions
This chapter has presented an inventory of statistical models that can be useful to plant breeding practitioners who are dealing with GEI. For each application, the specific breeding context determines whether such models are appropriate or inappropriate. No procedure provides a gold standard for data analysis. On the contrary, data analysis is an iterative process that requires a critical attitude towards the models that are being used. The process involves the task of choosing the most appropriate model to answer the questions that the researcher has posed. The models presented here are not intended as the ultimate set of models from which a researcher should choose. Following the editors in the introduction to the contents of the book, the purpose of this chapter is to present a set of commonly used models not as a cookbook for data analysis, but as a background that can assist each cook (= researcher) in developing his/her own recipe (= data analysis).


References
Akaike H (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control AC-19:716–723.
Bernardo R (2002). Breeding for quantitative traits in plants. Stemma Press, Woodbury, Minnesota, USA, 369 pp.
Boer MP, Wright D, Feng L, Podlich DW, Luo L, Cooper M and van Eeuwijk FA (2007). A mixed-model quantitative trait loci (QTL) analysis for multiple-environment trial data using environmental covariables for QTL-by-environment interactions, with an example in maize. Genetics 177:1801–1813.
Cheverud JM (2001). A simple correction for multiple comparisons in interval mapping genome scans. Heredity 87:52–58.
Cooper M and Hammer GL, eds (1996). Plant adaptation and crop improvement. CAB International, Wallingford, UK, 636 pp.
Denis JB (1988). Two-way analysis using covariables. Statistics 19:123–132.
Falconer DS and Mackay TFC (1996). Introduction to quantitative genetics (4th edition). Longman, Harlow, UK, 464 pp.
Finlay KW and Wilkinson GN (1963). The analysis of adaptation in a plant-breeding programme. Australian Journal of Agricultural Research 14:742–754.
Gabriel K (1978). Least squares approximation of matrices by additive and multiplicative models. Journal of the Royal Statistical Society: Series B 40:186–196.
Galwey NW (2006). Introduction to mixed modelling: beyond regression and analysis of variance. John Wiley & Sons Ltd, Chichester, West Sussex, UK, 366 pp.
Gauch HG (1988). Model selection and validation for yield trials with interaction. Biometrics 44:705–715.
Gollob H (1968). A statistical model which combines features of factor analysis and analysis of variance techniques. Psychometrika 33:73–115.
Graffelman J and van Eeuwijk FA (2005). Calibration of multivariate scatter plots for exploratory analysis of relations within and between sets of variables in genomic research. Biometrical Journal 47:863–879.
Griffiths AJF, Miller JH, Suzuki DT, Lewontin RC and Gelbart WM (1996). An introduction to Genetic Analysis. WH Freeman and Company, New York, USA, 916 pp.
Haley CS and Knott SA (1992). A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69:315–324.
Jansen RC and Stam P (1994). High resolution of quantitative traits into multiple loci via interval mapping. Genetics 136:1447–1455.
Jiang CJ and Zeng ZB (1997). Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines. Genetica 101:47–58.
Kang MS and Gauch HG, eds (1996). Genotype-by-environment interaction. CRC Press Inc, Boca Raton, Florida, USA, 416 pp.
Li J and Ji L (2005). Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 95:221–227.
Lynch M and Walsh B (1998). Genetics and analysis of quantitative traits. Sinauer Associates Inc, Sunderland, MA, USA, 980 pp.
Malosetti M, van der Linden CG, Vosman B and van Eeuwijk FA (2007). A mixed-model approach to association mapping using pedigree information with an illustration of resistance to Phytophthora infestans in potato. Genetics 175:879–889.
Malosetti M, Voltas J, Romagosa I, Ullrich SE and van Eeuwijk FA (2004). Mixed models including environmental covariables for studying QTL by environment interaction. Euphytica 137:139–145.
Mandel J (1969). The partitioning of interaction in analysis of variance. Journal of Research of the National Bureau of Standards, Mathematical Sciences 73B:309–328.
Martínez O and Curnow RN (1992). Estimating the locations and the sizes of the effects of quantitative trait loci using flanking markers. Theoretical and Applied Genetics 85:480–488.
Patterson HD and Thompson R (1971). Recovery of inter-block information when block sizes are unequal. Biometrika 58:545–554.
Payne RW, Murray DA, Harding SA, Baird DB, Soutar DM and Lane P (2003). Introduction. In: GenStat for Windows (7th Edition). VSN International, Oxford, UK.
Ribaut J-M, Hoisington DA, Deutsch JA, Jiang C and Gonzalez de Leon D (1996). Identification of quantitative trait loci under drought conditions in tropical maize. 1. Flowering parameters and the anthesis-silking interval. Theoretical and Applied Genetics 92:905–914.
Ribaut J-M, Jiang C, Gonzalez de Leon D, Edmeades GO and Hoisington DA (1997). Identification of quantitative trait loci under drought conditions in tropical maize. 2. Yield components and marker-assisted selection strategies. Theoretical and Applied Genetics 94:887–896.
Servin B, Dillmann C, Decoux G and Hospital F (2002). MDM: A program to compute fully informative genotype frequencies in complex breeding schemes. Journal of Heredity 93:227–228.
Smith AB, Cullis BR and Thompson R (2005). The analysis of crop cultivar breeding and evaluation trials: an overview of current mixed model approaches. Journal of Agricultural Science 143:449–462.
van Eeuwijk FA (2006). Genotype by environment interaction: basics and beyond. In: Plant Breeding: The Arnell Hallauer International Symposium (Lamkey K and Lee M, eds). Blackwell Publishing, Oxford, UK, pp 155–170.
van Eeuwijk FA, Denis JB and Kang MS (1996). Incorporating additional information on genotypes and environments in models for two-way genotype by environment tables. In: Genotype-by-environment interaction (Kang MS and Gauch HG, eds). CRC Press Inc, Boca Raton, Florida, USA, pp 15–50.
van Eeuwijk FA, Malosetti M and Boer MP (2007). Modelling the genetic basis of response curves underlying genotype x environment interaction. In: Scale and complexity in plant systems research. Gene-plant-crop relations (Spiertz JHJ, Struik PC and van Laar HH, eds). Springer, Dordrecht, The Netherlands, pp 115–126.
Verbeke G and Molenberghs G (2000). Linear mixed models for longitudinal data. Springer-Verlag, New York, 568 pp.
Yan W, Hunt LA, Sheng Q and Szlavnics Z (2000). Cultivar evaluation and mega-environment investigation based on the GGE biplot. Crop Science 40:597–605.
Zeng ZB (1994). Precision mapping of quantitative trait loci. Genetics 136:1457–1468.


Appendix I

Genstat® code to fit models 1 to 9

\ Open the file: MET maize.gsh


\*******************************************************
\ Model 1 (Fixed additive model)
\*******************************************************
model Y=yield
fit [print=model,summary,accumulated; fprobability=yes] E+G

\*******************************************************
\ Model 2 (Full interaction model)
\ Note: this model cannot be fitted here because the data set does not contain replicates
\*******************************************************
model Y=yield
fit [print=model,summary,accumulated; fprobability=yes] E+G+G.E

\*******************************************************
\ Model 3 (Finlay-Wilkinson model)
\*******************************************************

\ To produce the ANOVA table in TABLE 2 (model 3b)


model Y=yield
fit [print=model,summary,accumulated; fprobability=yes] E+G+G.E_index

\ To produce parameters G and b (model 3c)


model Y=yield
fit [print=model,summary,accumulated,estimates; fprobability=yes; constant=omit] G+G.E_index

\*******************************************************
\ Model 4 (AMMI model)
\*******************************************************

ammi [print=aovtable; nroots=2] data=yield; genotypes=G; environments=E

\*******************************************************
\ Model 5 (GGE model)
\*******************************************************

pointer [values=LN96a,LN96b,SS92a,SS94a,IS92a,IS94a,HN96b,NS92a] ENV


unstack [dataset=env] stackedvector=8(yield); datasetindex=1...8; unstackedvector=ENV[1...8]
biplot [print=singular,scores; standardize=centre; method=principal] ENV


\*******************************************************
\ Model 6 (Factorial regression model)
\*******************************************************

\ First center the environmental covariables around zero


calculate minTF=MINTF-mean(MINTF)
calculate radiationGF=RADG-mean(RADG)

model Y=yield
fit [print=model,summary,accumulated; fprobability=yes] E+G+G.minTF+G.radiationGF

\*******************************************************
\ Model 7 (Mixed additive model)
\*******************************************************

vcomponents [fixed=E] random=G


reml [print=model,components,wald,deviance,covariancemodels] Y=yield

\*******************************************************
\ Model 8 (Mixed additive model with environment-specific genetic variance)
\*******************************************************

vcomponents [fixed=E; experiments=E] random=G


reml [print=model,components,wald,deviance,covariancemodels] Y=yield

\*******************************************************
\ Model 9 (Mixed additive model with grouped environments)
\*******************************************************

vcomponents [fixed=E; experiments=E] random=G.group


vstructure [terms=G.group] model=unstructured; factor=group; initial=!(0.04,0.11,0.44,0.12,0.55,1)
reml [print=model,components,wald,deviance,covariancemodels] Y=yield
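As a rough illustration of what Model 3c estimates, the sketch below computes a Finlay-Wilkinson intercept and slope per genotype in plain Python. It assumes that the environmental index is the environment mean minus the grand mean (the listing above does not show how `E_index` was built), that the genotype-by-environment table is complete, and it uses made-up yields; the function name `finlay_wilkinson` is hypothetical and not part of Genstat.

```python
def finlay_wilkinson(table):
    """Per-genotype regression of yield on an environmental index.

    table: {genotype: [yield in env 1, env 2, ...]}, with no missing cells.
    Assumption: index = environment mean - grand mean.
    """
    n_env = len(next(iter(table.values())))
    env_means = [sum(row[j] for row in table.values()) / len(table)
                 for j in range(n_env)]
    grand = sum(env_means) / n_env
    index = [m - grand for m in env_means]      # centred on zero
    sxx = sum(x * x for x in index)
    fits = {}
    for genotype, row in table.items():
        # The index sums to zero, so the intercept is the genotype mean.
        intercept = sum(row) / n_env
        slope = sum(x * y for x, y in zip(index, row)) / sxx
        fits[genotype] = (intercept, slope)
    return index, fits


# Illustrative (invented) yields for two genotypes in three environments.
toy = {"g1": [2.0, 4.0, 6.0], "g2": [4.0, 4.0, 4.0]}
index, fits = finlay_wilkinson(toy)
print(index)   # environmental index per environment
print(fits)    # (intercept, slope) per genotype
```

With a complete table and this choice of index, the slopes average to one across genotypes, so a slope above one marks a genotype that responds more strongly than average to better environments.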


Appendix II

Example Grafgen input files and commands to calculate conditional genotypic probabilities

To calculate conditional genotypic probabilities, Grafgen requires two input files: (1) the genotype coding system file ‘codsys.ex’, and (2) the input file (*.inf) (see layout below). Assuming that codsys.ex is in the Grafgen working directory and that an input file called ‘inputfile.inf’ is in the root of the D drive, the following command calculates genotypic probabilities every 1 cM:

grafgen -x 1 -T0 d:\outputfile d:\inputfile.inf

Note: the command reads ‘inputfile.inf’ from the D:\ directory and writes an output file called ‘outputfile.dat’ to the same directory.

1) codsys.ex file:

Diploid species
0 = aa
1 = Aa or aA (phase unknown)
2 = AA
3 = not AA (a allele dominant)
4 = not aa (A allele dominant)
5 = unknown
6 = Aa
7 = aA
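One way to read this coding system is as a map from each marker code to the set of ordered genotypes compatible with the observation. The sketch below writes that reading down in Python; it is an interpretation of the list above, not a description of how Grafgen stores the codes internally, and the names `CODSYS` and `compatible_genotypes` are hypothetical.

```python
# Ordered diploid genotypes; "Aa" and "aA" differ only in phase.
ALL_GENOTYPES = frozenset({"aa", "Aa", "aA", "AA"})

# Marker code -> genotypes compatible with that code (as listed in codsys.ex).
CODSYS = {
    0: frozenset({"aa"}),
    1: frozenset({"Aa", "aA"}),          # heterozygote, phase unknown
    2: frozenset({"AA"}),
    3: ALL_GENOTYPES - {"AA"},           # not AA (a allele dominant)
    4: ALL_GENOTYPES - {"aa"},           # not aa (A allele dominant)
    5: ALL_GENOTYPES,                    # unknown
    6: frozenset({"Aa"}),
    7: frozenset({"aA"}),
}


def compatible_genotypes(code):
    """Return the set of ordered genotypes compatible with a marker code."""
    return CODSYS[code]


for code in sorted(CODSYS):
    print(code, sorted(compatible_genotypes(code)))
```

Codes 3, 4 and 5 carry only partial information, which is why probabilities conditional on the flanking markers are needed rather than a direct genotype read-off.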

148
Statistical models for GEI

2) *.inf file:


Appendix III

Genstat® code for performing QTL mapping in four steps


\ Open the file: MET maize.gsh
\ Open the file: X_additive.gsh
\ Open the file: X_dominance.gsh
\ Open the file: info.gsh

\******************************************************
\ STEP 1: Fitting different mixed models
\******************************************************

\ model 7
vcomponents [fixed=E] random=G
reml [print=model,components,wald,deviance,covariancemodels] Y=yield

\ model 8
vcomponents [fixed=E; experiments=E] random=G
reml [print=model,components,wald,deviance,covariancemodels] Y=yield

\ model 9
vcomponents [fixed=E; experiments=E] random=G.group
vstructure [terms=G.group] model=unstructured; factor=group; initial=!(0.04,0.11,0.44,0.12,0.55,1)
reml [print=model,components,wald,deviance,covariancemodels] Y=yield

\******************************************************
\ STEP 2: Simple interval mapping
\******************************************************

calculate nloc=nvalues(Xadd)

for i=1...nloc
print !t(position),i; decimals=0
calculate add=newlevels(G; Xadd[i])
calculate dom=newlevels(G; Xdom[i])
vcomponents [fixed=E+add.E+dom.E; experiments=E] random=G.group
vstructure [terms=G.group] model=unstructured; factor=group; initial=!(0.04,0.11,0.44,0.12,0.55,1)
reml [print=wald] Y=yield
endfor


\******************************************************
\ STEP 3: Composite interval mapping
\******************************************************

\ here indicate the positions that are selected as cofactors


delete [redefine=yes] cofset
variate [values=32,60,73,123,277,407] cofset

\ here indicate the window within which cofactors will be removed


\ for restricted CIM use window of 1000
scalar window; 1000

calculate ncof=nvalues(cofset)
variate [nvalues=ncof] cof_chr,cof_pos
calculate cof_chr$[1...ncof],cof_pos$[1...ncof]=mkchr$[#cofset],mkpos$[#cofset]

delete [redefine=yes] add,dom


delete [redefine=yes] in,gap,chr2,pos2

for i=1...nloc
print !t(position),i; decimals=0
variate [values=#cofset,i] in
variate [values=#cof_chr,mkchr$[i]] chr2
variate [values=#cof_pos,mkpos$[i]] pos2
calculate aux1=nval(in)
calculate gap=abs(pos2-pos2$[#aux1])
calculate gap$[#aux1]=1000000
subset [condition=chr2.ne.mkchr$[i].or.gap.gt.window] in
calculate add[#in]=newlevels(G; Xadd[#in])
calculate dom[#in]=newlevels(G; Xdom[#in])

vcomponents [fixed=E+add[#in].E+dom[#in].E; experiments=E] random=G.group
vstructure [terms=G.group] model=unstructured; factor=group; initial=!(0.04,0.11,0.44,0.12,0.55,1)
reml [print=wald] Y=yield
endfor

\*********************************************************************
\ STEP 4: Fit final QTL model and test for QTL and QTLxE separately
\*********************************************************************

\ following the maize example we assume 6 QTLs, all of them having only additive effects
pointer [nvalues=6] QTL
calculate QTL[1...6]=newlevels(G;Xadd[32,73,126,161,277,407])

\ considering environment-specific QTL effects


vcomponents [fixed=E+QTL[].E; experiments=E] random=G.group
vstructure [terms=G.group] model=unstructured; factor=group; initial=!(0.04,0.11,0.44,0.12,0.55,1)
reml [print=model,wald,components,covariancemodels,effects] Y=yield

\ partitioning QTL main effects and QEI effects


vcomponents [fixed=E+QTL[]+QTL[].E; experiments=E] random=G.group
vstructure [terms=G.group] model=unstructured; factor=group; initial=!(0.04,0.11,0.44,0.12,0.55,1)
reml [print=model,wald,components,covariancemodels,effects] Y=yield
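The cofactor window rule used in Step 3 can be sketched outside Genstat as follows. When a locus is tested, a cofactor is kept only if it lies on a different chromosome or further than the window from the tested position; the tested locus itself is always fitted. This is a minimal Python sketch of that selection logic only, with 0-based indices and invented marker positions; the function name `select_cofactors` is hypothetical and no mixed model is fitted here.

```python
def select_cofactors(cofset, mkchr, mkpos, i, window):
    """Return the loci to fit when testing locus i.

    cofset: indices of cofactor loci (0-based here; Genstat is 1-based)
    mkchr, mkpos: chromosome and map position of every locus
    A cofactor is dropped when it is on the same chromosome as locus i
    and within `window` map units of it; locus i is appended last.
    """
    kept = [c for c in cofset
            if mkchr[c] != mkchr[i] or abs(mkpos[c] - mkpos[i]) > window]
    return kept + [i]


# Invented map: three loci on chromosome 1, two on chromosome 2.
mkchr = [1, 1, 1, 2, 2]
mkpos = [0, 10, 50, 5, 40]
cofset = [0, 3]

for i in range(len(mkchr)):
    print(i, select_cofactors(cofset, mkchr, mkpos, i, window=20))
```

With a very large window, such as the value of 1000 used above for restricted CIM, every cofactor on the chromosome under test is dropped and only cofactors on other chromosomes remain in the model.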
