1.1 Normal Linear Model

1.1.1 Theoretical (Population) Model and Fitted (Sample) Model
When looking for a statistical model that relates a continuous response variable, y, with
a set of explanatory variables, x = (x1 , . . . , xp−1 ), the normal linear model is one of the
simplest. Even though this model might seem too naive and restrictive to be of much use,
it adapts well to a large range of modeling problems.
We devote most of our time to covering linear models instead of more sophisticated statistical
models because the model building approach used for them, and the way in which
linear models are used to predict or to understand y, are similar to the ones used for
any statistical model. Most of what we learn here applies when building and using
any type of statistical model.
A linear model splits the response, yi , into a part that can be explained through a
linear combination of the explanatory variables available, xi = (x1i , . . . , xp−1i ), which is
the signal, and a part that can not be explained by it, ϵi , which is recognized as the noise,

yi = β0 + β1 x1i + . . . + βp−1 xp−1i + ϵi . (1.1)

The noise captures measurement error together with the part of yi that could be explained
by variables, z1 , z2 , . . ., that are unknown, or known but not available for modeling. These
hidden variables can not be considered at the time of modeling, but one will need to
have them in mind when interpreting the final model.
A normal linear model is a statistical model that further assumes that the noise compo-
nent is independently distributed as:
ϵi ∼ Normal(0, σ 2 ). (1.2)
Hence, normal linear models assume that the yi ’s are conditionally independent and:

yi |xi ∼ Normal(β0 + β1 x1i + . . . + βp−1 xp−1i , σ 2 ). (1.3)
In statistics one further assumes that all the samples are generated using the same
“unique” values for β = (β0 , . . . , βp−1 ) and σ 2 that are “unknown”, but known to belong
to a parameter space usually assumed to be β ∈ Rp and σ 2 ∈ [0, ∞).
What distinguishes a statistical model from a probability model is that the statistical
model has a parameter space, while the probability model does not. One uses the statistical
model to answer questions:
1. about the value of the βj ’s, that determine the relationship between the x’s and y,
and in particular whether βj could be equal to 0, when the goal is to explain the
relation between y and x,
2. about the value of E(y|x0 ) = β0 + β1 x10 + . . . + βp−1 xp−1,0 , or the value of yf (x0 ) =
E(y|x0 ) + ϵf , at a certain point x0 , when the goal is to predict E(y|x) or y(x),
3. whether there are any observations that do not follow the pattern (model) of the
majority of the observations in the sample considered.
These questions related to the theoretical model will be answered through the fitted or
sample model, computed using a sample of observations, (yi , x1i , . . . , xp−1i ).
The fitted model splits the response yi into its predicted value (the part explained by x), ŷi ,
and its residual or prediction error (the part unexplained by x), ei = yi − ŷi ,

yi = ŷi + ei = b0 + b1 x1i + . . . + bp−1 xp−1i + ei , (1.4)
where b = (b0 , . . . , bp−1 ) are estimates of β = (β0 , . . . , βp−1 ) obtained from the sample.
Unlike the theoretical model, this fitted model is not unique, because it depends
on the sample obtained, but it is known.
One answers questions about the theoretical model, (i.e. about β, E(y|x0 ) or yf (x0 )),
based on the fitted model, (i.e. through b or ŷ(x0 )). One makes distributional assump-
tions beyond the linearity assumption, to be able to link fitted and theoretical models.
Before using a normal linear model in a given setting, one needs to check whether the
assumptions made by the model are appropriate in that setting. In particular, one needs
to check whether the relationship between yi and xi is such that:
1. the mean of yi is linear in xi , E(yi |xi ) = β0 + β1 x1i + . . . + βp−1 xp−1i , (linearity),

2. the variance is constant, V (yi |xi ) = σ 2 , (same value of σ 2 for all i),

3. the yi |xi (equivalently, the ϵi ) are normally distributed, and

4. ϵi and ϵj are independent for all i ≠ j, (i.e., corr(ϵi , ϵj ) = 0).
These assumptions are ordered from more important to less important, and they need
to be checked based on what is observed in the data available.
Example: Acid
The linearity assumption might seem too restrictive, but note that the xji ’s can be non-linear
transformations of explanatory variables, like x2i = x1i², x2i = 1/x1i , x2i = √x1i ,
or x2i = log x1i . One can also transform yi to help attain linearity. Hence, many instances
that seem to fall outside the linear model framework can be adapted to fall into it by
finding the right scales for explanatory and response variables.
In practice, often V ar(yi |xi ) is not constant because it increases with E(yi |xi ). When
that is the case, one might get around it by modeling log yi .
What are supposed to be normally distributed are the y values that share the same value
of x, and not all the values of y (irrespective of the x at which y is observed).
When one has a single explanatory variable, x1 , one can check linearity and constancy
of variance through the bivariate plot between y and x1 , even though one needs to be
careful in situations like the one of the last example. Checking normality is trickier, and
one will need the help of tools presented in Section 6.
The independence assumption is harder to check (and to understand), and one typically
only worries about it when modeling response variables ordered in time, like in the Old
Faithful example. We will come back to this assumption in Section 6 and Chapter 5.
When one has more than one explanatory variable, checking whether the normal linear model
assumptions make sense, and using the fitted model, will both be harder.
Example: Birthweight
When the response variable itself is categorical, linear model assumptions no longer make
sense, and one needs to resort to binomial logistic regression models.
Example: Kyphosis
When one has a count response with low counts, one needs to resort to log-linear Poisson
models. (For large counts, log yi is often well approximated by normal linear models.)
Example: Challenger
Building a statistical model is an iterative process, in which model checking is the critical
stage that indicates whether the current model is useful enough, and if not, it suggests
ways to improve it.
To help understand the relationship between the theoretical and the fitted models, and
how one can answer questions about the first based on the second it helps to do a
simulation experiment and a resampling experiment.
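The text does not include code, so here is a minimal Python/numpy sketch of such a simulation experiment; the particular values β = (1, 2), σ = 0.5 and n = 50 are arbitrary illustrative choices. Several samples are drawn from one fixed theoretical model and each one is fitted by least squares, so one can see the fitted coefficients vary around the fixed β.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0])      # "true" (beta0, beta1), an arbitrary illustrative choice
sigma, n, n_samples = 0.5, 50, 4

x1 = rng.uniform(0, 10, size=n)           # explanatory variable, kept fixed
X = np.column_stack([np.ones(n), x1])     # design matrix with a column of ones

for s in range(n_samples):
    eps = rng.normal(0.0, sigma, size=n)          # noise ~ Normal(0, sigma^2)
    y = X @ beta + eps                            # one sample from the theoretical model
    b, *_ = np.linalg.lstsq(X, y, rcond=None)     # fitted (sample) model coefficients
    print(f"sample {s}: b0 = {b[0]:.3f}, b1 = {b[1]:.3f}")
# Each sample gives a different fitted model; all of them scatter around beta.
```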
A compact presentation of the normal linear model that uses matrices states that a
sample (xi1 , xi2 , . . . , xip−1 , yi ), for i = 1, . . . , n, follows that model if:

Y = Xβ + ϵ, with ϵ ∼ Normal(0, σ 2 I), (1.6)
where Y and ϵ are the n × 1 vectors with the yi and the ϵi , where X is the n × p matrix
formed by a first column of ones and one column for each explanatory variable, and
where I is the n × n identity matrix.
Y = Ŷ + e = Xb + e, (1.7)
where Ŷ and e are the n × 1 vectors with the ŷi and the ei .
1.2 Model fit
The least squares method takes as the estimator of β the vector of estimates b that minimizes the distance between the vector of observed values and the vector of values predicted by the model.
1.2.1 Estimation of β by least squares
Residuals, ei = yi − ŷi , are the prediction errors, and the part of the response not captured
by the model. The smaller the |ei | = |yi − ŷi | for i = 1, . . . , n, the closer the fitted
model predictions, ŷi = b0 + b1 x1i + . . . + bp−1 xp−1i , are to the training sample observations.
There are many sensible ways of imposing that requirement, leading to as many different
model fit criteria and fitted models. Model fit criteria that make sense are,
1. to minimize Σ |ei | , (least absolute deviation),
and any other criterion that uses a distance between observed and predicted values. By
default, statistical packages fit linear models using the least squares criterion.
Under the least squares criterion, the coefficients of the fitted model, b = (b0 , . . . , bp−1 ),
are obtained by minimizing:

SSR (b) = Σ ei² = Σ (yi − b0 − b1 x1i − . . . − bp−1 xp−1i )². (1.8)
When the constant term, β0 , is not required in the model, the first column of 1’s is
excluded from X, but by default one always keeps the constant in the model.
An alternative way to see that b = (X′X)−1 X′Y is by noting that under least squares,
Ŷ is the linear combination of the columns of X that minimizes the length of the vector
of residuals, and therefore the residual vector, e, has to be orthogonal to the columns of
X: X′e = X′(Y − Xb) = 0.
The vector of fitted values is Ŷ = Xb = X(X′X)−1 X′Y = HY , where H is the n × n
matrix that projects Y onto the subspace generated by the columns of X to yield Ŷ .
The vector of residuals is e = Y − Ŷ = (I − H)Y , where I is the n × n identity matrix.
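A sketch of these matrix formulas on simulated data (the data and variable names are mine, not from the text): it computes b through the normal equations, builds the hat matrix H, and checks that Ŷ = Xb = HY and that the residuals are orthogonal to the columns of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])          # n x p design matrix (p = 3)
Y = 2 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 1, n)  # illustrative response

b = np.linalg.solve(X.T @ X, X.T @ Y)   # least squares coefficients, (X'X)^{-1} X'Y
H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat (projection) matrix
Y_hat = H @ Y                           # fitted values, equal to X @ b
e = Y - Y_hat                           # residuals, (I - H) Y

print(np.allclose(Y_hat, X @ b))        # True: HY and Xb coincide
print(np.round(X.T @ e, 10))            # ~0: residuals orthogonal to the columns of X
```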
In the case of a single explanatory variable, the least squares estimates are:
b0 = ȳ − b1 x̄, (1.9)
and

b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)². (1.10)
Among the properties of the least squares fit, one has that the fitted plane goes through the point (x̄1 , . . . , x̄p−1 , ȳ). Reasons for preferring the least squares criterion include:
1. Ease of computation, because the bj are linear combinations of the yi . Under other
criteria there is no closed form expression for bj as a function of (yi , xi ).
2. Ease of analysis, because if the model is correct the bj will be normally distributed
and centered around the “true” value of βj with the smallest possible variance
among all linear unbiased estimates of βj , and that helps learn about βj from bj .
For instance, for the model with only a constant term,

yi = β0 + ϵi , (1.11)

the least squares estimate of β0 is b0 = ȳ, while the least absolute deviation estimate
is the sample median of y.
5. Under the normal linear model the least squares estimate of β coincides with the
maximum likelihood estimate.
1.2.2 Estimation of β by maximum likelihood

The maximum likelihood method writes the likelihood as a function of β and finds the vector β̂ that maximizes it. Under the linear model assumptions, the i-th component of the response vector follows a Normal(µi , σ 2 ) distribution, where µi = [Xβ]i . Since the observations are independent, the likelihood is the product of these n normal densities.
1.2.3 Instances where alternative criteria are called for
Examples of instances where one should switch away from using least squares are:
1. When linearity holds but the variance is not constant, like when observations are
means of samples of different sizes. In that case transforming y might fix the
variance problem, but then linearity would not be in place anymore.
Instead, in that case one can switch to using weighted least squares, with weights wi
inversely proportional to the variance of yi . In that way one gives more weight to the
observations with smaller variance, which deserve more credit. These estimates are
b = (X′V −1 X)−1 X′V −1 Y , where V is the n × n diagonal matrix with vii = 1/wi ,
proportional to the variance of yi .
2. When one suspects that there might be outliers and/or groups of observations in
the training sample following different models.
In that case one can switch to criteria more robust than least squares, like the first
three in the list of criteria presented above. In particular, using the least median
squares criterion disregards up to one half of the observations and provides the
best fit for the majority group in the training sample.
Robust fits help unmask observations that do not follow the majority pattern,
and they help suggest ways of improving the model by including as explanatory
variables what distinguishes the majority group from these outliers.
Example: Hertzprung Russell
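As a sketch of the weighted least squares estimate described in item 1 above, the following simulates observations that are means of groups of different sizes mi (so that V (yi ) = σ 2 /mi and wi ∝ mi ); the data are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
m = rng.integers(1, 20, size=n)                  # group sizes; V(yi) = sigma^2 / mi
x = rng.uniform(0, 5, size=n)
y = 1 + 2 * x + rng.normal(0, 1 / np.sqrt(m))    # observed group means

X = np.column_stack([np.ones(n), x])
V_inv = np.diag(m.astype(float))                 # V has vii = 1/wi; here wi = mi

b_wls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)        # ordinary least squares, for comparison
print("WLS:", np.round(b_wls, 3), " OLS:", np.round(b_ols, 3))
```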
1.2.4 Estimation of σ 2
One needs a way to assess whether a linear model provides a very good fit like in the
acid example, a regular fit like in the weight versus height example, or a poor fit like
in the hand life line length example. In all these examples the use of a linear model is
appropriate, because the linear model assumptions are approximately correct, but the
quality of the fit is very different in each case.
With a single explanatory variable, x1 , one can assess how good the fit will be by just
looking at the bivariate plot of y and x1 , and one can quantify goodness of fit through
the correlation between x and y. With many explanatory variables, no single plot helps
do that and it is not clear how to quantify goodness of fit.
The goal is to compare the predictions about yi made through ȳ, without the help of
the x’s, and the predictions about yi made through ŷi , with the help of the best linear
combination of the x’s. The comparison is based on the ANOVA decomposition:

Σ (yi − ȳ)² = Σ (yi − ŷi )² + Σ (ŷi − ȳ)². (1.13)
The term on the left of this ANOVA decomposition is the total sum of squares,

SST = Σ (yi − ȳ)², (1.14)
which is positive and only depends on y, and not on the fitted model. It measures how
bad the predictions made without the use of explanatory variables would be.
The two terms on the right of this decomposition are also positive, and both depend on
the fitted model. The first term is the sum of the squares of the residuals, measuring
what is not explained and how bad the predictions made with the linear model are,
SSR = SSNotE = Σ (yi − ŷi )² = Σ ei², (1.15)
and the second term is the sum of the squares explained by the model,

SSE = Σ (ŷi − ȳ)², (1.16)

measuring how much better the predictions using explanatory variables are relative to the ones
made without them. (When the model compares groups, this explained variability measures
the variability between the group means; when it is small, the group means are similar.)
Given that for a given response SST is fixed, the larger SSE , the smaller SSR , and the
better the fitted model.
Example: Acid
One summary statistic of the ANOVA decomposition that measures goodness of fit, is
the determination coefficient:
R2 = 100 (SSE /SST ) = 100 (1 − SSR /SST ). (1.17)
In the single explanatory variable case, R2 /100 is also equal to the square of the cor-
relation coefficient between y and x, Corr2 (x, y), and in the general case it is equal to
the square of the correlation between y and ŷ, Corr2 (y, ŷ). Therefore, R2 /100 can also
be interpreted as the square of the largest possible correlation between y and a linear
combination of the x’s.
R2 is useful as a model fit measure but not as a model selection criterion, because when
one adds a new explanatory variable in the model, SSR always decreases and R2 always
increases, irrespective of whether that new variable is useful or not. Hence, the largest R2
will always be attained by the model that uses all the explanatory variables available. In
fact, one can always get R2 = 100 by using as many (algebraically) linearly independent
explanatory variables as observations, because that leads to ei = 0 and SSR = 0.
As a consequence, R2 is useful to compare models for y of the same size, with the same
number of explanatory variables, but it is not useful to compare models of different sizes,
because in that case R2 always prefers larger and more complicated models.
To compare models of different sizes, one needs model selection criteria that reward
goodness of fit but at the same time penalize larger models. One model selection criterion
frequently used that does that is the adjusted determination coefficient, defined as
R²adj = 100 (1 − sR²/sy²) = 100 (1 − (SSR /(n − p)) / (SST /(n − 1))). (1.18)

The model that maximizes R²adj coincides with the model that minimizes the residual
variance, sR² = sy²(1 − R²adj /100), which therefore also works as a model selection criterion.
Another summary statistic of the ANOVA decomposition is the ratio of explained and
unexplained mean squares,
F = sE²/sR² = (SSE /(p − 1)) / (SSR /(n − p)), (1.19)
which will be used to choose between the null model, y = β0 + ϵ, and the model under
consideration. The larger F , the stronger the relationship between the response and the
variables in the model, and the less likely that the null model is in place.
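A sketch, on simulated data, of how the pieces of the ANOVA decomposition and the summaries R2 , R²adj and F can be computed from a fitted model (none of the numbers are from the course examples):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])         # p = 3 coefficients
y = 1 + 0.8 * x1 - 1.2 * x2 + rng.normal(0, 1, n)

p = X.shape[1]
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

SST = np.sum((y - y.mean()) ** 2)       # total sum of squares
SSR = np.sum((y - y_hat) ** 2)          # residual (unexplained) sum of squares
SSE = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
print(np.isclose(SST, SSR + SSE))       # the ANOVA decomposition holds

R2 = 100 * SSE / SST
R2_adj = 100 * (1 - (SSR / (n - p)) / (SST / (n - 1)))
F = (SSE / (p - 1)) / (SSR / (n - p))   # compares the model against y = beta0 + eps
print(round(R2, 1), round(R2_adj, 1), round(F, 1))
```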
Under the normal linear model one assumes that the data are generated by:

yi = β0 + β1 x1i + . . . + βp−1 xp−1i + ϵi , (1.20)

where the ϵi are independent and Normal(0, σ 2 ), and where (β, σ 2 ) take the same unique but
unknown value for all the observations, only known to belong to the parameter space
Rp × [0, ∞). The corresponding fitted model,

ŷi = b0 + b1 x1i + . . . + bp−1 xp−1i , (1.21)
is obtained using the least squares estimate of β. The goal is to answer questions about
the values of the βj that generate the data, based on bj . In particular, one needs to
decide whether βj could be equal to 0, in which case one could simplify the model.
When fitting the linear model, only the linearity assumption was used. It is here that
one takes advantage of the other distributional assumptions made.
The link between βj and bj follows from the fact that if the population model is “correct”
and one fits the model by least squares, then b = (X′X)−1 X′Y and:

b|X ∼ Normal(E(b|X) = β, V (b|X) = σ 2 (X′X)−1 ). (1.22)
The least squares b has the smallest variance among all linear unbiased estimates of β.
In the case of a single explanatory variable, this means that:

b0 |x ∼ Normal(E(b0 |x) = β0 , V ar(b0 |x) = σ 2 Σxi² / (n Σ(xi − x̄)²)), (1.23)

b1 |x ∼ Normal(E(b1 |x) = β1 , V ar(b1 |x) = σ 2 / Σ(xi − x̄)²), (1.24)

and:

Cov(b0 , b1 |x) = −σ 2 x̄ / Σ(xi − x̄)². (1.25)
Exercise: Simulate samples from a simple linear model with different choices of β, of
n and of the range of values of x1i , and check that the distribution of the b = (b0 , b1 )
obtained matches this result. How do the variances and covariance of b0 and b1 change
with the choice of x1i ? How can one minimize V (b1 |x)? Why are b0 and b1 dependent?
Question: What is b1 and its distribution when one fits the model yi = β1 x1i + ϵi ?
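A minimal sketch for the exercise above (the chosen β, σ, n and range of x1 are arbitrary and meant to be varied): it keeps x1 fixed, simulates many responses, refits the model each time, and compares the empirical variance-covariance matrix of (b0 , b1 ) with σ 2 (X′X)−1 from (1.22)-(1.25).

```python
import numpy as np

rng = np.random.default_rng(4)
beta, sigma, n, n_sim = np.array([1.0, 0.5]), 1.0, 25, 5000
x1 = rng.uniform(0, 10, size=n)                  # kept fixed across simulations
X = np.column_stack([np.ones(n), x1])

bs = np.empty((n_sim, 2))
for s in range(n_sim):
    y = X @ beta + rng.normal(0, sigma, size=n)
    bs[s] = np.linalg.solve(X.T @ X, X.T @ y)

emp_cov = np.cov(bs, rowvar=False)               # empirical variance-covariance of (b0, b1)
theo_cov = sigma ** 2 * np.linalg.inv(X.T @ X)   # sigma^2 (X'X)^{-1}, as in (1.22)-(1.25)
print(np.round(emp_cov, 4))
print(np.round(theo_cov, 4))
```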
If the normal linear model is “correct” and one fits it by least squares, then:
sR²|X ∼ (σ 2 /(n − p)) χ²n−p , (1.26)
where sR² = SSR /(n − p), which means that E(sR²) = σ 2 . One typically estimates the
variance-covariance matrix of b through V̂ (b|X) = sR² (X′X)−1 .
Statistical packages provide estimates of V ar(bj |x), denoted by s2bj , and with it one can
compute 100(1 − α)% confidence intervals for βj through:
(bj − t^{α/2}_{n−p} sbj , bj + t^{α/2}_{n−p} sbj ), (1.27)

where t^{α/2}_{n−p} is the value leaving a probability of α/2 in the upper tail of the Student-t
distribution with n − p degrees of freedom (by the symmetry of that distribution, the α/2
and 1 − α/2 quantiles only differ in sign).
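A sketch of how interval (1.27) can be computed from the quantities above, on simulated data; scipy is used only to obtain the Student-t quantile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
x1 = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x1])
y = 2 + 0.7 * x1 + rng.normal(0, 1.5, size=n)
p = X.shape[1]

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2_R = e @ e / (n - p)                           # residual variance, estimates sigma^2
cov_b = s2_R * np.linalg.inv(X.T @ X)            # estimated variance-covariance of b
s_b = np.sqrt(np.diag(cov_b))                    # standard deviations s_bj

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)    # t quantile, roughly 2 for moderate n - p
for j in range(p):
    print(f"b{j} = {b[j]:.3f},  95% CI: ({b[j] - t_crit * s_b[j]:.3f}, {b[j] + t_crit * s_b[j]:.3f})")
```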
Example: Hand life line. An approximate 95% confidence interval for β1 is computed from
the fitted model in this way. If one repeated the process with 100 different samples, computing
one such interval from each sample, and the models are “correct”, then the true value for βj will
fall inside about 95 of these intervals. One never knows which are the roughly 5 intervals
that fail to enclose the true βj , but when they fail one knows that the true βj will not be
too far from the interval.
Hence, the 95% stands for the long run success rate that one should expect if one
keeps computing confidence intervals in this manner using models that are approximately
“correct”. The 95% is not the probability that any given confidence interval for βj happens
to enclose the true value of βj .
If bj is close to a then it is likely that βj could be equal to a, but if bj is far from a then
it is likely that βj is not equal to a. There are two approaches to decide whether bj is
far enough from a to reject βj = a:
1. One can compute a 95% confidence interval for βj , and reject βj = a with a 95%
confidence if a is not in that interval. The further a falls away from that interval,
the stronger the conviction that βj = a is not consistent with data.
A 95% confidence interval for βj provides the set of all βj = a that can not be
rejected with a 95% confidence, based on the evidence in the sample.
2. One can also use the fact that if βj = a is true and the model is correct, then the
number of standard deviations that separate bj from βj = a,

|tj | = |bj − a| / sbj , (1.29)

has a |tn−p | distribution, and therefore |tj | will be smaller than 2 close to 95% of
the time. One rejects βj = a when |tj | ≥ t^{α/2}_{n−p} , which for α = 0.05 roughly amounts
to tj falling outside (−2, 2), or, equivalently, when the corresponding p-value is smaller
than α. Hence, when |tj | is much larger than 2 one will reject βj = a, and when |tj | is
smaller than 2 one will not reject it.
One should always measure distance between bj and βj = a using the standard
deviation of bj as the unit of measure. Any βj = a that is more than two stan-
dard deviations away from bj can be rejected, and the more standard deviations
separating bj from a, the stronger the evidence against βj = a. In particular,
|tj | = |bj | / sbj , (1.30)
helps decide whether βj = 0. If |tj | = |bj |/sbj is much larger than 2, then bj is
much more than two standard deviations away from 0, and βj is unlikely to be 0.
One should never provide a fitted model, ŷi = b0 + b1 x1i + . . . + bp−1 xp−1i , without
providing the standard deviations of its coefficients, sbj . To decide whether one can
remove xj from the model, what matters is not whether |bj | is large or small, but how
many standard deviations separate bj from 0, |tj | = |bj |/sbj .
Example: Height vs father height with data only for boys. An approximate 95% confi-
dence interval for β1 is:
which includes neither β1 = 0, consistent with |t1 | = 5.74 > 2, nor β1 = 1.
Examples: Final and midterm exams. Father and mother height and weight.
Observing |tj | = |bj |/sbj to be much smaller than 2 entitles one to remove one variable,
xj , at a time, but not to remove several variables with a small |tj | all at once. By
removing one variable from the model everything changes and, in particular, the |tj |
statistic of a variable left in the model might change from being smaller to being larger
than 2, and xj could go from not being useful to being useful.
To check whether one can remove q variables simultaneously, or to compare two linear
models with one model nested into the other, one computes the statistic:
F = ((SSR0 − SSR )/q) / (SSR /(n − p)), (1.31)
where SSR0 is the sum of the squares of the residuals of the restricted model, (obtained
by removing q variables or imposing q linear restrictions on the coefficients of the full
model), and SSR is the sum of the squares of the full model, and where p is the number
of βj ’s in the full model. The larger F , the larger the increase in the residual sums of
squares from removing the variables, and the less desirable removing these variables is.
To know whether F is small enough to justify the removal of the q variables at once,
(or imposing the q restrictions on the coefficients), one uses the fact that if they can be
removed because their population coefficients are all simultaneously equal to 0, and if
the model is correct, this F statistic is F (q, n − p) distributed.
Statistical software provides the tail area (p-value) resulting from comparing the value
of the F statistic observed with the F (q, n − p) distribution. The larger the value of F ,
the smaller that tail area, and the less desirable it is to remove the variables.
The F = sE²/sR² statistic obtained as a byproduct of the ANOVA decomposition is the
statistic that results when one tests whether one can remove all the q = p − 1 explanatory
variables from the model at once, and be left with the null model, yi = β0 + ϵi .
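A sketch of the F statistic (1.31) on simulated data in which two of the four explanatory variables are, by construction, unrelated to y (all names and numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 80
Xvars = rng.normal(size=(n, 4))                       # x1, x2 useful; x3, x4 pure noise
y = 1 + 2 * Xvars[:, 0] - Xvars[:, 1] + rng.normal(0, 1, n)

def ssr(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

X_full = np.column_stack([np.ones(n), Xvars])          # p = 5 coefficients
X_rest = np.column_stack([np.ones(n), Xvars[:, :2]])   # restricted model: drop x3, x4 (q = 2)

p, q = X_full.shape[1], 2
SSR_full, SSR_0 = ssr(X_full, y), ssr(X_rest, y)
F = ((SSR_0 - SSR_full) / q) / (SSR_full / (n - p))
p_value = 1 - stats.f.cdf(F, q, n - p)                 # tail area of the F(q, n - p) distribution
print(round(F, 3), round(p_value, 3))                  # a large p-value suggests x3, x4 can be removed
```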
1.5 Prediction
Given a new point x0 , one may want to predict either the mean response, E(y|x0 ), or an
individual response, y(x0 ) = E(y|x0 ) + ϵ0 . A good point prediction in both instances is
ŷ(x0 ) = b0 + b1 x10 + . . . + bp−1 xp−1,0 , because a good guess for βj is bj and a good guess
for ϵ0 is 0.
Prediction intervals are intervals that contain the value to be predicted with the confidence level that has been fixed; they are intervals for specific, individual predictions.
The prediction interval for y(x0 ) will be wider than the confidence interval for E(y|x0 ),
because predicting y(x0 ) is harder and more uncertain. Most often one needs to predict
y(x0 ), and so one needs to use the wider prediction interval.
Confidence intervals, instead, are computed for parameters of the distribution; in linear models they are computed for expected values of y.
With a single explanatory variable, one can compute a 100(1 − α)% confidence interval
for E(y|x0 ) through:
ŷ(x0 ) ± t^{α/2}_{n−p} sR √( 1/n + (x0 − x̄)² / Σ(xi − x̄)² ), (1.34)

(if, say, y is weight and x is height, this interval contains, with confidence 1 − α, the
expected weight of the people who measure 1.76 m when x0 = 1.76),
while a 100(1 − α)% prediction interval for y(x0 ) can be computed through:
ŷ(x0 ) ± t^{α/2}_{n−p} sR √( 1 + 1/n + (x0 − x̄)² / Σ(xi − x̄)² ). (1.35)
The 95% confidence intervals can be approximated by using t^{.025}_{n−p} ≈ 2.
Both intervals are centered at ŷ(x0 ), they are most narrow at x0 = x̄, and they grow
wider with (x0 − x̄)2 , and so the most precise predictions are close to x0 = x̄. The
uncertainty in assuming that ϵ0 is 0 and that β0 is b0 has the same impact for all x0 , but
the uncertainty in assuming that β1 is b1 impacts predictions the least at x0 = x̄, and
the consequence of a bad guess for β1 becomes worse the further x0 is away from x̄.
In general, for more than one explanatory variable a 100(1 − α)% confidence interval for
E(y|x0 ) at x0 = (1, x01 , . . . , x0p−1 )′ can be computed as:

ŷ(x0 ) ± t^{α/2}_{n−p} sR √( x0′ (X′X)−1 x0 ), (1.36)
while a 100(1 − α)% prediction interval for y(x0 ) can be computed as:
ŷ(x0 ) ± t^{α/2}_{n−p} sR √( 1 + x0′ (X′X)−1 x0 ). (1.37)
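A sketch of intervals (1.36) and (1.37) at an arbitrarily chosen new point x0 , on simulated data with a single explanatory variable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 50
x1 = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x1])
y = 3 + 0.5 * x1 + rng.normal(0, 1, size=n)
p = X.shape[1]

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s_R = np.sqrt(e @ e / (n - p))
XtX_inv = np.linalg.inv(X.T @ X)
t_crit = stats.t.ppf(0.975, df=n - p)

x0 = np.array([1.0, 4.0])                      # new point (1, x01)
y0_hat = x0 @ b
h0 = x0 @ XtX_inv @ x0                         # x0' (X'X)^{-1} x0
ci = (y0_hat - t_crit * s_R * np.sqrt(h0), y0_hat + t_crit * s_R * np.sqrt(h0))
pi = (y0_hat - t_crit * s_R * np.sqrt(1 + h0), y0_hat + t_crit * s_R * np.sqrt(1 + h0))
print("CI for E(y|x0):", np.round(ci, 3))
print("PI for y(x0):  ", np.round(pi, 3))      # always wider than the CI
```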
To extrapolate is to predict the response outside the region where one has collected the
data used to fit the model. Sometimes one will have to do that, but in that case one
needs to be very careful. To extrapolate is dangerous because the prediction errors are
larger, but mainly because there is no way to check whether the model used is correct
in an area where one does not have any data.
To answer questions about the population model (about the βj ’s) based on the sample
model (the bj ’s), and to be able to trust prediction intervals, one needs to make sure
that the assumptions made are acceptable. That is, one needs to check whether:
ϵi = yi − (β0 + β1 x1i + . . . + βp−1 xp−1i ) = yi − E(yi |xi ), i = 1, . . . , n, (1.38)
are close to being independent and Normal(0, σ 2 ) and, in particular, whether the linearity,
the constancy of variance and the normality assumptions hold. One can not check that
directly, because the ϵi are unknown.
Instead, one checks whether that is approximately the case by analyzing the best esti-
mates that we have for ϵi , which are the residuals:
ei = yi − (b0 + b1 x1i + . . . + bp−1 xp−1i ) = yi − ŷi , i = 1, . . . , n. (1.39)
Residuals are crucial for checking whether there is something wrong with the model,
because they have all the information about the relationship between yi and xi that is
in the data but is not captured by the fitted model.
Residuals work as a sort of magnifying glass that exaggerates the shortcomings of the model
and helps discover ways to fix it. One should explore the residuals in any way that allows
them to tell you what is left to be explained by the model.
3. identify observations that are unusual, either because they are poorly explained by
the model, or because they have a lot of influence on the fitted model.
Residuals are prediction errors, and if they are predictable, one can improve the model.
Residual analysis is the engine that drives the model building process. One should never
settle with a statistical model without analyzing its residuals.
Example: Acid
2. Standardized (or internally studentized) residuals: esi = (ei − ē)/sei = ei /sei , where
sei is an estimate of the standard deviation of ei .
Standardized residuals are dimensionless. When the model is correct, esi is Student-
t distributed, and one expects about 95% of them to fall in (−2, 2), and the re-
maining 5% to fall close to that range. Any observation with |esi | much larger than
2 is poorly explained by the model, and it is considered to be an outlier.
The problem with regular and standardized residuals is that they compare yi with
ŷi , which are predicted values computed using (xi , yi ). That is a bit like cheating
in an exam, because you use yi to predict yi , and it plays in favor of the model.
Given that esi ≈ kei , regular and standardized residual plots are similar. Also, when
sample size is not too small and yi is not a strong outlier, then ŷi and sei are similar to
ŷ(i) and se(i) , and the plots with esi and with esdi are also similar.
The most efficient way to analyze residuals to help improve the model is graphically.
Residual analysis is especially helpful when one deals with many explanatory variables,
but it can also be useful when there is only a single explanatory variable.
Three types of default residual plots that are always useful are:
1. Residuals versus predicted values. When the model has a single explanatory vari-
able, x1 , plotting e against x1 and plotting e against ŷ = b0 + b1 x1 is basically the
same. With several explanatory variables, ŷ works as the best linear combination
of the explanatory variables in the model and it works as a summary of them.
Finding a relationship between what is left to be explained (residual) and what is
explained (predicted value), indicates that the linearity assumption fails. One can
try to fix that by transforming y and/or some of the x’s.
Finding that the variability of the residuals grows with the predicted value is quite
frequent, and it indicates that the variance is not constant. One can sometimes fix
that by modeling the logarithm of y instead.
When both linearity and constant variance fail, one hopes that the same transfor-
mation of y will fix both problems. This plot also helps identify outliers.
3. Residuals versus explanatory variables included and not included in the model. It
helps find what is wrong with the model, if something is wrong with it.
On top of these default plots, one should use any other graphic that might help recycle
any information left about the relationship between y and x. In particular, when data is
ordered in time (or in space), one might use time (space) plots of the residuals, or other
graphics that help check whether there is any temporal (spatial) dependency left that
can be used to improve the model.
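A sketch of the first default plot, residuals versus predicted values, on simulated data where the linearity assumption is made to fail on purpose, so that the plot shows the curvature that the straight-line fit leaves in the residuals (matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
n = 100
x1 = rng.uniform(0, 4, size=n)
y = 1 + x1 ** 2 + rng.normal(0, 0.5, size=n)   # true relation is quadratic

X = np.column_stack([np.ones(n), x1])          # but a straight line is fitted
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
e = y - y_hat

plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")
plt.xlabel("predicted values")
plt.ylabel("residuals")
plt.title("Curvature left in the residuals: linearity fails")
plt.show()
```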
Cautionary Remark : One should be careful when interpreting bivariate plots. When
projecting multivariate relationships on a bivariate subspace, a lot of information about
that relationship is lost, and outliers might be hidden. Bivariate plots are like shadows,
and one faces two dangers when using them.
1. Danger 1: The fact that the plot of yi against x2i does not show any relationship
between them, does not mean that x2 might not be a crucial variable in the model
for y when combined with another variable, x1 .
Bivariate plots between y (or e) and explanatory variables by themselves will never
be evidence enough to drop an explanatory variable from consideration.
Example: Hamilton.
2. Danger 2: The fact that the plot of yi against x2i shows a clear relationship, does
not necessarily mean that x2 has to be part of the model for y. There could be
another variable x1 , (hidden or not), that drives both y and x2 at the same time,
and y and x2 might be related only because they both change with x1 . If that x1 is
available and included in the model, x2 will not be needed in the model anymore.
Bivariate plots between y and explanatory variables by themselves will never be
evidence enough to decide that an explanatory variable is required in the model.
Observations can be unusual because they are poorly explained by the model, because
certain values of the explanatory variables are very different from the ones of the majority,
or because they have a strong influence on the fitted model.
It is customary to mark with an R all observations with |esi | > 2, even though
standardized residuals with an absolute value larger than 2 but close to 2 are not
by themselves an indication that one should worry about them.
When esi >> 2, then yi >> ŷi , while when esi << −2, then yi << ŷi .
When |esi | is much larger than two, the i-th observation is not properly explained
by the model, which might be either because there is something wrong with that
observation or because the model fails for it.
Most often, large residuals happen because of failures of the model, in which case
finding outliers is an opportunity to improve the model. Sometimes that can be
attained by identifying missing variables that distinguish the outliers from the
observations well explained by the model, and sometimes by just finding a better
transformation for y and/or for some of the explanatory variables.
One should never remove an outlier from the analysis, unless one is sure that
there is something wrong with that observation, or one clearly understands what
makes that observation different from the other observations. If one removes an
outlier without having a good reason for that, one will never know the range of
applicability of the model obtained.
2. Distance in X-space: It is measured through hii , the i-th diagonal term of the
projection or hat matrix, H = X(X′X)−1 X′, which is called the leverage or hat value.
(H is the matrix that multiplies Y to yield Ŷ , since Ŷ = Xb with b = (X′X)−1 X′Y ;
the variance-covariance matrix of the vector of predicted values is σ 2 times this hat
matrix.) When there is a single explanatory variable,

hii = 1/n + (xi − x̄)² / Σ(xi − x̄)², (1.40)
which is dimensionless and between 1/n and 1, and which measures the distance
between xi and x̄. When hii > 3p/n, the i-th observation is considered to be
unusually far from the other observations in the space of explanatory variables.
That by itself is no indication that there is anything wrong with the observation.
3. Influence on the fitted model : One measures the influence of the i-th observation
on the fitted model through Cook distance, that compares the vector of predicted
values obtained using all n observations, Ŷ , with the predicted values obtained
using all the observations except the i-th one, Ŷ(i) , and it also compares b and b(i) ,
which are the coefficients estimated with and without the i-th observation,
CDi = (Ŷ − Ŷ(i) )′(Ŷ − Ŷ(i) ) / (p sR²) = (b − b(i) )′X′X(b − b(i) ) / (p sR²). (1.41)
The Cook distance for the i-th observation indicates how much the fitted model
changes when one removes that observation; when Ŷ and Ŷ(i) are very different,
CDi is large and the i-th observation is very influential.
What relates the Cook distance of an observation with the other two measures of
the degree of unusualness of an observation is the fact that:
CDi = (esi )² (hii / (1 − hii )) (1/p). (1.42)
As a consequence of this, CDi will be large, and therefore (xi , yi ) will be very
influential, either because:
(a) (esi )2 is large, and therefore the i-th observation is an outlier, or because
(b) hii is large, and hence far in the X space, and (esi )2 is not too close to 0.
Some people plot CDi against (esi )², and CDi against hii , to find out which observations
are the most influential, and why they are influential. One should only check
these plots with the final models, once one has already checked that everything
else is fine. If one finds that there is an observation that is a lot more influential
than the others, one might want to refit the model without that observation to
check how the fitted model changes.
Another measure of the influence of the i-th observation is DFFITi = ŷi − ŷ(i) ,
and its standardized version. It compares the individual predicted value for the
i-th observation, and not the whole vector of predictions like the Cook distance does.
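A sketch that computes the leverages hii , the standardized residuals and the Cook distances through (1.42), on simulated data in which one observation is deliberately moved far away in the x-space and off the line:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
x1 = rng.uniform(0, 10, size=n)
y = 1 + 0.5 * x1 + rng.normal(0, 1, size=n)
x1[0], y[0] = 25.0, 2.0        # one observation far away in x and off the line

X = np.column_stack([np.ones(n), x1])
p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                   # leverages (hat values)
e = y - H @ y
s_R = np.sqrt(e @ e / (n - p))

e_std = e / (s_R * np.sqrt(1 - h))               # standardized (internally studentized) residuals
CD = e_std ** 2 * h / ((1 - h) * p)              # Cook distances, formula (1.42)

i = np.argmax(CD)
print(f"most influential: i = {i}, h = {h[i]:.2f}, std. residual = {e_std[i]:.2f}, CD = {CD[i]:.2f}")
```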
1.6.5 Examples
Simulated example
Example: Galapagos
One starts with a response, yi , and r explanatory variables, xi = (x1i , . . . , xri ), where
some of the variables could be functions of other variables. The model selection problem
consists of finding the best subsets of explanatory variables to build a linear model for y.
The problem is challenging because the number of possible models is 2r , which can be
very large, and because it is usually impossible to distinguish a single “best” model based
on the data available.
When selecting subsets of explanatory variables, one can end up with too many explana-
tory variables, which leads to over-fitting your training sample, and one can also end up
with too few explanatory variables, which leads to under-fitting it.
Residual analysis might warn against under-fitting, but it will only help recognize over-
fitting if the training sample has replicates. When there are replicates in the training
sample, one can test for the lack of fit of the model by comparing the variability estimated
using replicates with the variability estimated using the residuals of the model; If the
first estimate is much larger than the second, it is an indication of over-fitting.
Over-fitting the training sample is tempting, because it makes the |ei | “artificially” small.
One can actually get SSR = Σ ei² as small as wanted by including enough explanatory
variables in the model. In the limit, one might actually feel tempted to interpolate by
using n − 1 explanatory variables (so that, counting the constant term, p = n), and in
that way obtain SSR = 0. That, though, is not advisable, either because:
1. the more explanatory variables in the model, the larger the prediction variance
tends to be, and the wider the prediction intervals, or because
2. having too many explanatory variables in the model complicates the interpretation
of the model, in part because of the dependency among explanatory variables.
Including variables not needed over-explains the training sample used to fit the model,
forcing their residuals to be artificially small, but the model might not generalize well
when used to predict future (out-of-sample) observations.
Occam's razor, or the parsimony principle, states that the simpler the explanation, the
more likely it is to be the right one. Given a subset of models for y with acceptable
residuals, one should prefer the models with the smallest number of variables.
Under-fitting happens when one misses important explanatory variables in the model.
To find the consequences of that, assume that the true model is:
Y = Xβ + Zγ + ϵ, (1.43)
but one does not know or can not measure the Z’s, and one fits:
Y = Xβ + ϵ (1.44)
instead. In that case the least squares estimate for β, b = (X′X)−1 X′Y , is such that:

E(b|X, Z) = (X′X)−1 X′E(Y |X, Z) = β + (X′X)−1 X′Zγ, (1.45)

and therefore b becomes a biased estimate of β. This bias can even lead to some bj
having a sign different from the one of βj .
The role of this bias is to capture the part of the relationship between the Z and Y
that can be salvaged thanks to the relationship between the Z and the X. That is good
for prediction, but it makes life very hard when trying to interpret the fitted model to
explain the relationship between X and Y .
In practice one needs to worry about under-fitting a lot, because important variables
will be missing most of the time.
This dilemma between using too many variables, and thus increasing the prediction
variance, and missing important variables, and thus introducing bias in the predictions
and elsewhere, is recognized as the variance versus bias trade off.
Maximizing R2 or minimizing SSR , which is the same, can not be used to select models
because that always leads one to select the model using all the variables available, which
will most likely over-fit the training sample. R2 and SSR measure goodness of fit, and
they should only be used to compare models of the same size or degree of complexity.
Model selection criteria need to reward goodness of fit while at the same time penalizing
model complexity. One way to do that is by maximizing R²adj or, equivalently, minimizing sR²,
because sR² = sy²(1 − R²adj /100).
There are many other model selection criteria, seeking a compromise between maximizing
goodness of fit and minimizing complexity. Depending on the relative weight given to
each one of these two goals, one ends up with different subsets of models. Model selection
criteria assessing goodness of fit through SSR and complexity through p are:
Cp = 2p − n + SSR /sR², (1.46)
where s2R is the residual variance of the model with all the explanatory variables in it,
and

AIC = Const + 2p + n log(SSR ), (1.47)

and

BIC = Const + p log(n) + n log(SSR ), (1.48)
which are easy to compute and can be extended to other parametric models. More
general model selection criteria based on cross validation are presented next.
Assessing the performance of a model through SSR = Σ (yi − ŷi )², thus comparing yi
with a predicted value for yi computed using yi and xi , is not “honest,” and it leads to
assessments that are too optimistic about the merit of the model. To avoid that, one
needs to compare yi with predictions of yi that do not “cheat” by using yi to predict yi .
That can be done in different ways.
The one-shot cross validation approach splits the original sample into a training sub-
sample, to be used to select and fit models, and a validation or testing subsample, to be
used to check the performance of models. The idea is to compare the values yi in the
validation subsample with their predictions based on the model selected and fitted with
the training subsample. This approach is cheap computationally, but the result depends
on the way in which the original sample is split in two.
Instead, leave-one-out cross validation compares yi with the prediction ŷ(i) obtained with
the model fitted with the subsample obtained by removing that i-th observation from
the original sample. Given that one repeats that comparison for all the observations in
the sample, this approach requires one to fit the model n times. One way to summarize
that comparison is through the PRESS statistic,
PRESS = SSRl.one.out = Σ (yi − ŷ(i) )² = Σ (edi )², (1.49)
which is analogous to the SSR but using deleted instead of regular residuals. A straight-
forward generalization of the determination coefficient can be obtained by replacing SSR
by PRESS in the definition of R2 , leading to the definition of:

R²Pred = 100 (1 − PRESS /SST ). (1.50)
Selecting models that maximize R²Pred , and therefore minimize PRESS, is an appealing
alternative to the use of R²adj , AIC or BIC for that same task.
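A sketch of PRESS and R²Pred ; for a linear model the deleted residuals can be obtained without refitting n times through the identity edi = ei /(1 − hii ), which is used below (simulated data):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1 + x1 + 0.5 * x2 + rng.normal(0, 1, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y

e_del = e / (1 - h)                         # deleted residuals, without n refits
PRESS = np.sum(e_del ** 2)
SST = np.sum((y - y.mean()) ** 2)
R2_pred = 100 * (1 - PRESS / SST)
R2 = 100 * (1 - np.sum(e ** 2) / SST)
print(round(R2, 1), round(R2_pred, 1))      # R2_pred is always the more pessimistic of the two
```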
One computationally cheaper and more efficient way of implementing cross validation
is through k-fold cross validation. This approach splits the original sample into k (ap-
proximately) equal sized subsamples. Then, it uses each one of these subsamples as a
validation subsample, with the remaining k − 1 subsets playing the role of the training
subsample, used to predict that validation subsample. Then, one computes
SSRkfold = Σ (yi − ŷ(i)kfold )², (1.51)
and:

R²kfold = 100 (1 − SSRkfold /SST ), (1.52)
which can again be used as model selection criteria. The default choice of k is 5 or 10,
and like in the one-shot cross validation case, the way in which the initial sample is split
into the k subsamples affects the results obtained.
When k is n, k-fold and leave-one-out cross validation coincide, and R²Pred = R²nfold .
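A sketch of k-fold cross validation written directly from (1.51)-(1.52), with k = 5 and simulated data; a library implementation could be used instead, but plain numpy keeps the computation explicit.

```python
import numpy as np

def r2_kfold(X, y, k=5, seed=0):
    """Compute SSR_kfold and R2_kfold as in (1.51)-(1.52)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)   # random split into k folds
    folds = np.array_split(idx, k)
    y_pred = np.empty(n)
    for fold in folds:
        train = np.setdiff1d(idx, fold)                 # fit on the other k - 1 folds
        b = np.linalg.solve(X[train].T @ X[train], X[train].T @ y[train])
        y_pred[fold] = X[fold] @ b                      # predict the held-out fold
    SSR_kfold = np.sum((y - y_pred) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    return SSR_kfold, 100 * (1 - SSR_kfold / SST)

rng = np.random.default_rng(11)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1 + 2 * x1 + rng.normal(0, 1, size=n)               # x2 is actually useless
print(r2_kfold(X, y))
```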
Before one starts the selection of the best subsets of x’s, one should first fit the model of
y with all the explanatory variables, and check the residuals to find out whether there
is a need to start transforming y and/or some of the x’s.
To help identify the best transformations for y or for the x’s, one can sometimes use
common knowledge about the phenomena being modeled.
Example: Wood.
Once the right scale for y has been identified, tools that might help strike the right balance between
over-fitting and under-fitting are:
1. Fitting the full model, with all the explanatory variables available and using the
|ti | = |bi /sbi | statistics to simplify the model, by removing one variable at a time.
To try to remove q variables at once, one needs to use the F statistic instead.
The order in which one removes variables determines the model obtained in this
way. This approach checks at most r models out of all possible 2r models, thus
likely missing models providing fits as good as the fit with the model obtained here.
Building a statistical model is like building a team, in the sense that one always
looks for the variable not in the model that best complements the variables already
in the model, which will often not be the variable most correlated with y. A useful
exercise is to monitor how the bj and tj = bj /sbj of a variable xj in the model
changes depending on the variables that team with xj in the model.
Example: Infant mortality.
2. Best subsets regression. It fits all 2r possible models, and presents the best models
of each size. The goal is not finding a single best model, but a subset of models
acceptable for predicting and/or explaining y.
Given that the models being compared will be of different sizes, the comparison
will have to be made based on model selection criteria as the ones presented above.
When checking how R²adj or sR² change with model size for the best models of each
size, one typically finds that R²adj increases and sR² decreases with increasing size
until it hits a maximum (minimum), and then it levels off. One should focus on the
subset of the smallest models with R²adj (sR²) close to their maximum (minimum).
One can also select models through the minimization of AIC, BIC, PRESS or
SSRkfold , or through the maximization of R²Pred or R²kfold .
When the goal is to predict, one might settle with models a bit larger than when
the goal is to explain y.
When the goal is to explain y, what is important is to find what is common to all
models that provide good fits, which could be many.
Averaging the predictions of several models often performs better than using pre-
dictions based on a single model.
Best subsets regression can only be implemented if the number of explanatory
variables available, r, is not too large. When r = 10 there are 210 = 1024 models,
when r = 20 there are 220 ≈ 106 models, and when r = 30 there are 230 ≈ 109
models. With much more than 30 variables the number of models becomes too
large, and one can not fit and rank them all in a reasonable amount of time.
Example: Somatic type
Example: Galapagos
4. Shrinkage methods. Ridge regression estimates the coefficients through the bRdg (λ)
that minimizes:

SSR + λ (b1² + . . . + bp−1²),

Unlike the approaches described above, ridge regression yields a model that includes
all the variables that one starts with, because no bRdgj (λ) will be 0 other than for
λ = ∞, when b0 is equal to ȳ and all other bj are 0.
Instead, the Lasso estimates the coefficients through the bLasso (λ) that minimizes:

SSR + λ (|b1 | + . . . + |bp−1 |),

which, unlike ridge regression, can set some of the bLassoj (λ) exactly equal to 0, and
in that way it also works as a variable selection device.
5. Cross validation. The model selection tools described above are tailored mostly
for linear and generalized linear modeling. Instead, the cross validation approach
to model selection applies to any type of statistical model, parametric or not.
To build a linear model that includes continuous variables together with a categorical
variable with k categories, one needs to follow the three-step process below.
1. Choose the baseline category, and create k − 1 indicator variables, one for each one
of the remaining categories.
2. Include all k − 1 indicator variables in the model, together with the products of the
continuous explanatory variables with these indicator variables, called interactions.
3. Simplify the model by removing the terms that are not necessary, with the help of
tests based on |tj | and F statistics, but without removing any linear term unless
one has removed first all the product terms that involve them.
To interpret the effect of categorical variables on the response, it is useful to write the
sub-models that result for each category combination, and compare their slopes.
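A sketch of steps 1 and 2 above for one continuous variable and a categorical variable with k = 3 categories; the category labels and coefficients are invented for illustration, and the comments spell out the sub-model implied for each category.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 30
x = rng.uniform(0, 10, size=n)
group = rng.choice(["A", "B", "C"], size=n)     # categorical variable, baseline "A"

dB = (group == "B").astype(float)               # k - 1 = 2 indicator variables
dC = (group == "C").astype(float)
X = np.column_stack([np.ones(n), x, dB, dC, x * dB, x * dC])   # indicators + interactions

# Fitting y on X gives, per category, the sub-models:
#   A: b0 + b1 x
#   B: (b0 + b2) + (b1 + b4) x
#   C: (b0 + b3) + (b1 + b5) x
y = 1 + 0.5 * x + 2 * dB - 1 * dC + 0.3 * x * dC + rng.normal(0, 0.5, size=n)
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(b, 2))
```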
Coding the k-category variable with the numbers 1 to k, and including that single coded variable in
the model as if the categorical variable were quantitative, is one of the most frequent mistakes
made by unsophisticated modelers. What is wrong with doing that?
A categorical variable with k categories defines k groups, and it breaks down the initial
sample into k subsamples, one for each group. By fitting a model on the whole sample,
including all k − 1 indicator variables together with all the product terms, as described
above, the predicted values are the same as the ones obtained by splitting the sample
into these k subsamples, and fitting k linear models separately, one for each subsample.
One includes k − 1 indicator variables and not k, leaving a baseline category without
variable, because one does not need that extra one. Besides, if one includes k indicator
variables instead, then unless one removes β0 from the model one can not invert X′X, and
therefore one can neither compute b = (X′X)−1 X′Y nor V (b|X) = σ 2 (X′X)−1 .
If one does not include the product terms in the model, one assumes that the slopes of
the continuous variables are the same for the different categories (groups), and therefore
that the planes are parallel. If one can remove all these product terms because all the
corresponding |tj |’s are small, one learns that the planes for different groups are actually
parallel, but that is different from assuming parallelism from scratch.
Removing linear terms without having been able to remove first all the product terms
that involve them is not advisable, because that does not reduce the complexity of the
model and because when one does that, the predictions of the model do not coincide
anymore with the ones obtained by splitting the sample into k subsamples and fitting one
linear model to each one of them. That also explains why one rarely removes the constant
term, even though |t0 | = |b0 |/sb0 might be small, and why one rarely removes the xj term
of a model that includes the xj² term, irrespective of the value of |tj | = |bj |/sbj .
The F statistic used to check whether one can remove q variables at once is especially
useful here, because it allows one to check whether one can remove at once all the product
terms and/or all the indicator variables involving a given categorical variable.
Sometimes one does not have enough observations to fit the full model, with all the
indicator variables and all the interactions. In that case, one can guess which interaction
terms are less likely to be active, and skip using them in the model. An alternative is to
identify categories that might behave similarly and aggregate them, thus reducing the
number of categories and of indicator variables and of interaction terms in the model.
When one has categorical variables, best subsets, stepwise and shrinkage methods should
be used with care, taking advantage of the possibility of forcing linear terms to be in all
the models while one is checking which interaction terms will be needed.
The comparison of two means based on two independent samples, and the comparison
of k means based on k independent samples, can be handled by fitting a linear model
with a single explanatory variable that is categorical, and therefore a model with k − 1
indicator variables and no product terms.
By fitting the model by least squares one reproduces the standard comparison of means
analysis, and one can also use all the tools available for linear modeling.
The analysis of the data obtained from carrying out two-level factorial experiments can
also be carried out by fitting a linear model. When this analysis does not include any
variables other than the ones controlled at two levels in the experiment, by re-coding
the two levels or categories of all variables through −1 and 1 instead of 0 and 1, all the
columns of the X matrix become orthogonal, and the X′X matrix becomes diagonal,
which brings computational, analytical and interpretational advantages.
In particular, when the columns of X are orthogonal and coded with −1 and 1, the matrix
(X′X)−1 is 1/n times the identity matrix, and that means that bj depends only on xj ,
that V (bj |X) is the same for all j, and that Cov(bj , bk |X) = 0 for all j ≠ k.
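A sketch with a full 2^3 factorial design coded with −1 and 1, verifying that X′X is diagonal (in fact n times the identity), so that each bj is computed independently of the others:

```python
import numpy as np
from itertools import product

levels = np.array(list(product([-1, 1], repeat=3)))   # the 8 runs of a 2^3 design
n = levels.shape[0]
X = np.column_stack([np.ones(n), levels])             # constant plus the three coded factors

print(X.T @ X)                  # diagonal matrix equal to n * I
print(np.linalg.inv(X.T @ X))   # (1/n) * I, so V(bj|X) = sigma^2 / n for every j
```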
Linear models are built either to predict y or to understand the relationship between
y and certain variables with the goal of explaining y. The use of the fitted model for
prediction is straightforward. Instead, explaining y by interpreting fitted models requires
a lot more care. There are four reasons why that interpretation is especially difficult.
1. There is usually not a single best model that fits the available sample clearly bet-
ter than all other models. The model selection process typically identifies several
“acceptable” models, each with a different set of variables and of estimated co-
efficients, and each telling a different story about the relationship between y and
x’s. One has to create a single story based on all these different, and sometimes conflicting, stories.
2. Most of the time there is statistical dependency between the explanatory variables
in the model, (recognized as collinearity), and if one modifies the value of a variable,
xj , in the model, the value of other xk ’s also in the model will vary as well. Hence,
the estimated effect of varying xj on y will depend on the coefficient of xj , bj , and
on the coefficients bk of the other xk in the model that are correlated with xj .
Note that bj estimates how much the response would increase if xj increased by
one unit while all the other variables in the model stayed fixed.
Example: Price of apartment
Example: Salary NFL
Example: Galapagos
Example: Somatic type
Collinearity is a fact of life when one deals with observational data, and there is
nothing wrong with it. It just complicates the interpretation of the model.
If the sample is representative of the population, one can handle this issue by
fitting a model for each explanatory variable as a function of the other explanatory
variables in the model, and finding which variables are related with which.
To measure the degree of collinearity of xj with the rest of variables in the model,
one can use the determination coefficient Rj2 of the linear model of xj as a function
of the other p − 2 explanatory variables in the model. Another measure of the
degree of collinearity of xj is the variance inflation factor of xj ,

VIF(xj ) = 1/(1 − Rj²), (1.53)
which is equal to 1 when xj is not related with the other variables, and it is ∞
when xj is a linear combination of them. The larger VIF(xj ) and Rj2 , the stronger
the statistical dependency between xj and the rest, and the harder it is to interpret
the role of xj when explaining y. The name VIF comes from the fact that:
V ar(bj |X) = VIF(xj ) σ 2 / Σ(xij − x̄j )². (1.54)
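A sketch of how Rj² and VIF(xj ) can be computed for each explanatory variable, on simulated data where x2 is built to be strongly related to x1 :

```python
import numpy as np

rng = np.random.default_rng(14)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)     # strongly collinear with x1
x3 = rng.normal(size=n)                      # unrelated to the others
X = np.column_stack([np.ones(n), x1, x2, x3])

def r2_of_fit(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return 1 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)

for j, name in [(1, "x1"), (2, "x2"), (3, "x3")]:
    others = [k for k in range(X.shape[1]) if k != j]     # regress xj on the rest (plus constant)
    R2_j = r2_of_fit(X[:, others], X[:, j])
    print(f"{name}: R2_j = {100 * R2_j:.1f}%, VIF = {1 / (1 - R2_j):.1f}")
```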
3. We know that if important variables, Z, are missing in the model, the coefficients
of the model fitted without them are biased, because:
E(b|X, Z) = β + (X′X)−1 X′Zγ. (1.55)
Given that one will frequently have missing variables, one should expect that many
coefficients of the fitted models will have certain bias.
This bias appears because bj adapts itself to explain part of what is left unexplained
by the missing Z through the xj that are present in the model. That is good for
prediction, but it complicates a lot the interpretation of the fitted model.
Example: Price of apartment (bis)
Example: Cholesterol
Example: Sex and salary
This phenomenon is the reason why one should make sure that one includes in the
model all the variables that might explain the response, and not just a few, even
though you might be interested only in the role played by these few. Including
explanatory variables in the model that are not of direct interest is necessary to
control for the effect of possible confounders that might bias the coefficients (the
effects) of the variables of interest if they were missing in the model.
This problem is a lot more difficult to handle than collinearity, because one can
learn about the existence of collinearity from the data, but one usually does not
know whether there are variables missing, and which are the missing variables.
One exercise that helps understand the repercussion of missing variables is to
remove an important variable from a model and check how the coefficients of the
remaining variables change.
4. To explain y, one needs to identify variables that cause y to change when they
change. The problem is that statistical models identify relationships between y
and xj ’s, but they can not tell whether these relationships are causal or not.
Non-causal relationships can be used to predict, but they are not that useful when
trying to understand y. The fact that xj ends up in the model for y does not
necessarily imply that the relationship between y and xj is causal. Causal and non-causal
relationships are practically indistinguishable based only on the information
in the data, and yet distinguishing them is crucial for the understanding of y.
Given that “counterfactual” experiments are almost never possible, often the only way
to discard that a relationship is causal is to identify the true causal relationship behind it.
What makes it so difficult to distinguish causality from pure correlation is the fact
that real causes are often missing variables (z) that drive both y together with
some of the explanatory variables available, xj . The fact that both y and xj ’s are
driven by the hidden (lurking) z, induces relationships between y and xj that make
it look like as if xj was explaining y. As soon as z is included in the model, xj
stops being necessary there, but most often the missing z does not end up in the
model because it is unknown. This hidden variable, z, might be time.
The third and fourth difficulties in this list are essentially the same one, because non-causal
relationships appear when the real causal “explanatory” variables are missing,
and surrogate variables fill in for them to salvage part of what is missing.
Example: Lung cancer death rate
Example: Road accidents
Example: Crime rate
When y and xj are related, it could be either because xj causes (drives) y, because y
drives xj , or because a hidden variable z drives y and xj at the same time. Data
alone can not tell which one of these three explanations is the correct one.
Putting everything together, bj explains the role played by xj , together with part of the
role played by the other xk ’s that are also in the model and correlated with xj , and with
the role played by all the variables zk not in the model but driving both xj and y.
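A minimal simulation sketch in Python (with made-up coefficients and variable names; none of it comes from the examples above) illustrates the bias formula (1.55): a hidden z drives both y and x, and the coefficient of x in the fit that omits z absorbs part of the effect of z.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Hidden confounder z drives both x and y (made-up coefficients)
    z = rng.normal(size=n)
    x = 0.8 * z + rng.normal(size=n)            # x is correlated with z
    y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

    # Least squares fits with and without the "missing" variable z
    X_full = np.column_stack([np.ones(n), x, z])
    X_short = np.column_stack([np.ones(n), x])
    b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

    print("with z:   ", b_full.round(2))     # close to (1, 2, 3)
    print("without z:", b_short.round(2))    # coefficient of x biased away from 2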
Chapter 2. Normal Non-linear Model
When modeling the relationship between yi and xi = (x1i , . . . , xpi ), linear models are often
useful because of the flexibility that comes with transforming y and/or the x’s. That
allows one to handle instances where non-linearities in the mean of y can be well ap-
proximated through nonlinear transformations of yi and linear combinations of nonlinear
transformations of x’s.
But there are instances where together with a data set one gets an inherently non-
linear model and the request to fit that model and check whether the model is a good
approximation for the data. That non-linear model is often suggested by an underlying
theory put in place to explain the relationship between y and the x’s.
A response variable, yi , and a set of explanatory variables, xi = (x1i , . . . , xpi ), are said
to follow a normal non-linear model when they are conditionally independent and:
yi | xi ∼ Normal(f (xi ; θ), σ²).
The only assumption that changes with respect to the linear model is that E(yi |xi ) =
f (xi ; θ) now is an intrinsically non-linear function of xi in the sense that at least one
of its derivatives with respect to θ is a function of θ. Hence here the model can not be
posed as a linear model by transforming y and x’s.
Example: Rumford
Example: Puromycin
vi = E(vi | ci ) + ϵi = θ1 ci / (θ2 + ci ) + ϵi ,   (2.4)
which is an example of the Michaelis-Menten model. Even though in this example
1/E(vi | ci ) = 1/θ1 + (θ2 /θ1 )(1/ci ),   (2.5)
so that the mean function could be linearized by working with 1/vi and 1/ci , the model
is usually fit in its original non-linear form.
Example: Cortisol
yi = E(yi |xi ) + ϵi = θ1 + (θ2 − θ1 ) / (1 + exp ((xi − θ3 )/θ4 )) + ϵi ,   (2.6)
2.2 Model fit
Once one has gathered a sample of observations, one needs to fit the model, that is, to estimate θ from the data.
Given that the assumptions about the noise are the same as for the linear model, it
makes sense to use again the least squares criterion and estimate θ by the value θ̂ that
minimizes:
SSR (θ̂) = ∑_{i=1}^{n} e_i² = ∑_{i=1}^{n} (yi − f (xi ; θ̂))² ,   (2.8)
To help solve this minimization problem, statistical packages often require an initial guess
value for θ̂. Attaching meaning to the parameters, or approximating the non-linear model
through a linear model, often helps obtain these initial guesses.
Example: Rumford
Ti = θ̂1 − θ̂2 exp (−θ̂3 ti ) + ei , (2.9)
where (θ̂1 , θ̂2 , θ̂3 ) is found by minimizing:
SSR (θ̂1 , θ̂2 , θ̂3 ) = ∑_{i=1}^{n} (Ti − (θ̂1 − θ̂2 exp (−θ̂3 ti )))².   (2.10)
To provide initial guess values for θ̂ one can use the fact that θ1 = E(T |t = ∞) is the
horizontal asymptote of E(T |t), and it should be close to the room temperature. Also,
E(T |t = 0) is θ1 − θ2 , and therefore −θ2 should be close to E(T |t = 0) − E(T |t = ∞).
The value for θ3 is harder to guess.
Example: Puromycin
vi = θ̂1 ci / (θ̂2 + ci ) + ei ,   (2.11)
where initial guesses for θ̂1 and θ̂2 can be found using the fact that θ1 = E(v|c = ∞) is
the horizontal asymptote of E(v|c) and θ2 is the value of c such that E(v|c = θ2 ) = θ1 /2.
One can also obtain initial guess values by fitting the linear model 1/vi = b0 + b1 (1/ci ) + ei
and using that b0 estimates 1/θ1 and b1 estimates θ2 /θ1 , so that θ̂1 = 1/b0 and θ̂2 = b1 /b0 .
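As an illustration of this fitting strategy, the following Python sketch fits the Michaelis-Menten mean function by non-linear least squares with scipy.optimize.curve_fit, taking the initial guesses from the linearized fit just described; the concentrations and velocities used are made-up placeholders, not the actual Puromycin data.

    import numpy as np
    from scipy.optimize import curve_fit

    # Made-up concentrations and velocities, mimicking the Puromycin setting
    c = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.10])
    v = np.array([47.0, 97.0, 123.0, 152.0, 191.0, 200.0])

    def michaelis_menten(c, theta1, theta2):
        # E(v | c) = theta1 * c / (theta2 + c)
        return theta1 * c / (theta2 + c)

    # Initial guesses from the linearization 1/v = b0 + b1 * (1/c)
    b1, b0 = np.polyfit(1.0 / c, 1.0 / v, 1)    # (slope, intercept)
    theta0 = [1.0 / b0, b1 / b0]                # theta1 = 1/b0, theta2 = b1/b0

    # Non-linear least squares, i.e. minimization of SSR(theta) in (2.8)
    theta_hat, cov_theta = curve_fit(michaelis_menten, c, v, p0=theta0)
    print(theta_hat)                            # (theta1_hat, theta2_hat)
    print(np.sqrt(np.diag(cov_theta)))          # approximate standard errors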
Example: Cortisol
yi = θ̂1 + (θ̂2 − θ̂1 ) / (1 + exp ((xi − θ̂3 )/θ̂4 )) + ei ,   (2.12)
where θ1 and θ2 are the floor and the ceiling of E(y|x), and θ3 is such that E(y|x = θ3 ) =
(θ1 + θ2 )/2, and it is also the inflexion point for E(y|x).
To answer questions about the theoretical model, θ, based on the fitted model, θ̂, one uses
the fact that if the model is correct and one fits it by least squares, then the distribution
of θ̂j is approximately Normal(θj , V ar(θ̂j )).
Hence one can obtain approximate 100(1 − α)% confidence intervals for θj through:
(θ̂j − t^{α/2}_{n−p} s_{θ̂j } , θ̂j + t^{α/2}_{n−p} s_{θ̂j }).
To obtain approximate 95% intervals, one can use t^{.025}_{n−p} ≈ 2.
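As a small numeric illustration of this interval, assuming made-up values θ̂j = 3.2, s_{θ̂j } = 0.4, n = 20 and p = 3:

    from scipy import stats

    theta_j, se_j = 3.2, 0.4        # made-up estimate and standard error
    n, p, alpha = 20, 3, 0.05

    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)     # close to 2
    ci = (theta_j - t_crit * se_j, theta_j + t_crit * se_j)
    print(t_crit, ci)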
In the setting of this chapter, the main purpose is usually to check whether the non-
linear model is consistent with the data. Also, confidence intervals for θ and prediction
intervals will make sense only if the model is correct.
Like in linear modeling, here the model will be checked by looking at whether there is
any information left in the residuals, ei = yi − ŷi . If the model is correct, residuals should
be close to normally distributed with mean 0 and variance close to constant.
One checks whether that is the case by plotting the residuals against the fitted value
and the normal probability plot of the residuals. Plotting residuals against variables
of interest might also help, and when observations are ordered in time or in space one
should check whether there is any time or space dependency left in the residuals.
If the model is wrong, one needs to go back to the theoretical modeling stage, find out
what was wrong with the hypotheses made, fix them, and suggest a new model.
One can also look for alternative non-linear models compatible with what is observed.
One can compare the fit of different non-linear models with the same number of pa-
rameters through their SSR . To compare non-linear models with different number of
parameters, one needs to resort to cross validation or to model selection criteria combin-
ing SSR with some term that penalizes the number of parameters.
Example: Peptides
Chapter 3. Categorical Response Models
Like in linear modeling, one starts with a set of p − 1 explanatory variables, xj , but here
the response, y, is categorical. We present first the case of k = 2 categories, handled
through binary logistic models, and then the general case, handled through nominal
logistic models.
Poisson log-linear models, useful for count response data, are also presented.
When the response is categorical with two categories, like success/failure, normal linear
models do not apply because the normality assumption does not make sense anymore.
The next five examples are all of this kind. In the first two the response is presented
in the 0/1 format, in the third and fourth examples the response is in the event/trial
format, and in the last example it is in the response/frequency format.
Example: Kyphosis
Example: Birthweight
Example: Nicotine
For these problems, one resorts to the binary logistic model, which is a statistical model
that assumes that the numbers of successes, yi , for given values of xi = (x1i , . . . , xp−1i ),
are conditionally independent and distributed as:
yi | xi ∼ Binomial(ni , π(xi )),   (3.1)
where ni is the number of binary observations made at xi , and where the probability of
success or of 1’s at xi is modeled as:
π(xi ) = e^{β0 + β1 x1i + . . . + βp−1 xp−1i } / (1 + e^{β0 + β1 x1i + . . . + βp−1 xp−1i }),   (3.2)
where β = (β0 , . . . , βp−1 ) is only known to belong to Rp . The relationship between π(xi )
and xi is not linear, and such that 0 ≤ π(xi ) ≤ 1 for all xi . When this model holds:
log (π(xi )/(1 − π(xi ))) = log (Pr(1|xi )/Pr(0|xi )) = log Odds(xi ) = β0 + β1 x1i + . . . + βp−1 xp−1i ,   (3.3)
which will be useful when interpreting the coefficients of the model.
Hence, under this model, the yi | xi are conditionally independent and binomially distributed, with success probabilities π(xi ) linked to the xi ’s through (3.3).
The variance of yi is not constant, because it depends on π(xi ) and hence on xi . This
variance is largest close to π(xi ) = 1/2, and it is smallest when π(xi ) is close to 0 or 1.
The transformation log (π(xi )/(1 − π(xi ))) is known as the logit, the canonical link for binomial data.
This is the default population (theoretical) model for binary response data, and it is
recognized as the logistic model with logit link. Instead, under the probit link one models
π(xi ) = Φ(β0 + β1 x1i + . . . + βp−1 xp−1i ), where Φ(·) is the cumulative distribution
function for the Normal(0, 1), and under the Gompit (complementary log-log) link one
models π(xi ) = 1 − exp (− exp (β0 + β1 x1i + . . . + βp−1 xp−1i )), which are two functions
of xi that are also bounded between 0 and 1. An advantage of
using the logit instead of the probit or the Gompit links, is that under the logit link the
coefficients can be interpreted through log odds ratios.
3.2 Model fit
In fixed effects models one assumes that the value of β ∈ Rp that generates the data is
unique and unknown. Once one collects data, one needs to compute the fitted model,
π̂(xi ) = e^{b0 + b1 x1i + . . . + bp−1 xp−1i } / (1 + e^{b0 + b1 x1i + . . . + bp−1 xp−1i }),   (3.6)
which will be known, but will not be unique because it depends on the sample available.
Defining residuals as ei = yi − ŷi and fitting the model by minimizing the sum of the
squares of these residuals is not an efficient way to proceed here. Least squares treats
observations as if they were all equally reliable, which makes sense when fitting normal
models with constant variance, but here not all observations are equally reliable.
Since the observations with large or small π(xi ) have smaller variances, they are more
reliable and should get more weight than the observations with π(xi ) close to .5. That is
why here one usually fits models using either one of two alternative criteria that minimize
distances between observed and predicted values other than their Euclidean distance.
One of the two criteria is the Pearson statistic,
X²(b) = ∑_{i=1}^{n} (yi − ŷi )² / (ni π̂(xi )(1 − π̂(xi ))),
seen as a function of b = (b0 , . . . , bp−1 ), where ŷi = ni π̂(xi ) is the predicted value for yi ,
and where the denominator is an estimate of the variance of yi . Hence, here one
minimizes a weighted sum of the squares of the residuals, with the inverse of the
variance as the weight. The larger the variance, the less reliable the observation,
and the smaller the weight. Note that this is in fact minimizing the sum of the
squares of standardized residuals. The other criterion is the deviance, defined in
(3.12)–(3.13) below through the deviance residuals.
Solving these two minimization problems requires solving a set of p non-linear equations.
By default, statistical packages minimize the deviance.
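As an illustration of how such a fit is typically obtained in practice, here is a minimal Python sketch using statsmodels, with made-up data in the event/trial format (the values of x, the counts and the numbers of trials below are placeholders, not one of the examples above). The GLM routine maximizes the likelihood, which for a binomial model is equivalent to minimizing the deviance.

    import numpy as np
    import statsmodels.api as sm

    # Made-up data: y successes out of n trials at each value of x
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2, 5, 9, 14, 18, 19])
    n = np.array([20, 20, 20, 20, 20, 20])

    X = sm.add_constant(x)                   # design matrix with intercept
    endog = np.column_stack([y, n - y])      # successes / failures format
    fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()

    print(fit.params)          # b = (b0, b1)
    print(fit.bse)             # standard errors s_bj
    print(fit.deviance)        # residual deviance
    print(fit.pearson_chi2)    # Pearson statistic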
The deviance statistic, also called residual deviance, plays the same role played by SSR
for linear models. In particular, by adding one new variable in the model the residual
deviance always decreases, irrespective of whether that variable is useful or not, and
therefore the residual deviance is useful as model fit criteria, but not as model selection
criteria. By analogy one can also define a determination coefficient, R2 , measuring the
percentage of the total deviance captured by the model.
To answer questions about βj through bj , one uses the fact that, if the model is correct,
the distribution of the bj ’s is well approximated through bj ∼ Normal(βj , V ar(bj )).
Hence one can approximate 100(1 − α)% confidence intervals for βj through:
(bj − t^{α/2}_{n−p} s_{bj } , bj + t^{α/2}_{n−p} s_{bj } ),   (3.10)
One can decide whether βj = a by checking whether a is in the confidence interval for
βj , or by using the number of standard deviations separating bj from a, |bj − a|/sbj . The
statistic zj = bj /sbj used to decide whether βj could be 0 is labeled zj instead of tj .
One can compare one binary logistic model, M0 , nested into another one, M1 , through
the same type of F test presented for linear models, after replacing the residual sum of
squares of the two models, SSRi , by their residual deviances, ResDevi .
Another way of choosing between M0 and M1 , where M0 is the restricted and M1 the
unrestricted model, is through the likelihood ratio test, that relies on the fact that if the
simpler model, M0 , is correct, then ResDev0 − ResDev1 is χ2q distributed, where q is the
difference between the number of parameters of M1 and of M0 .
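A small sketch of this likelihood ratio comparison, with made-up residual deviances and parameter counts for two nested logistic models:

    from scipy import stats

    resdev0, p0 = 41.3, 2      # restricted model M0 (made-up values)
    resdev1, p1 = 30.1, 4      # unrestricted model M1 (made-up values)

    lr_stat = resdev0 - resdev1              # drop in residual deviance
    q = p1 - p0                              # extra parameters in M1
    p_value = stats.chi2.sf(lr_stat, df=q)   # small p-value favours M1
    print(lr_stat, q, p_value)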
One result not available for linear models, that is useful to determine whether a logistic
model is acceptable or not, states that if the model is correct, the residual deviance is
approximately χ²_{n−p} distributed, with an expected value of n − p.
Hence, when the residual deviance is much larger than what is expected from a χ²_{n−p}
distribution, the model is not correct, often because important variables are missing in
the model. Statistical packages provide the tail area (p-value) for the residual deviance
relative to the χ²_{n−p} distribution. The smaller that tail area, the less trustworthy the model.
One can check whether a logistic model is acceptable by checking whether its residual
deviance is consistent with a χ²_{n−p} distribution. If the residual deviance of a model is
not “too far” from E(χ²_{n−p}) = n − p, one takes the model to be acceptable. One checks
the residuals only when the residual deviance is “too far” from n − p for the model to
be valid. In that case, the residual analysis helps find where the model fails.
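This check can be carried out directly through the χ² tail area; a minimal sketch with made-up values for the residual deviance, n and p:

    from scipy import stats

    resdev, n, p = 52.4, 30, 3                  # made-up values

    expected = n - p                            # E(chi2_{n-p}) = n - p
    p_value = stats.chi2.sf(resdev, df=n - p)   # tail area of the residual deviance
    print(expected, p_value)                    # a small p-value signals a poor fit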
To carry out a residual analysis, one first needs to agree on a definition of residual. If
the model is fit through the Pearson statistic, it is natural to use Pearson residuals,
e^P_i = (yi − ni π̂(xi )) / √(ni π̂(xi )(1 − π̂(xi ))),   (3.11)
which makes the Pearson statistic into the sum of the squares of Pearson residuals. If
one fits the model by minimizing the deviance, then one uses deviance residuals:
e^D_i = sign(yi − ni π̂(xi )) √( 2 ( yi log (yi /(ni π̂(xi ))) + (ni − yi ) log ((ni − yi )/(ni (1 − π̂(xi )))) ) ),   (3.12)
which makes the deviance statistic into the sum of the squares of deviance residuals,
Deviance(Y, Ŷ ) = ResDev = ∑_{i=1}^{n} (e^D_i )².   (3.13)
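To make (3.11)-(3.13) concrete, the following sketch computes both types of residuals for made-up counts, numbers of trials and fitted probabilities (in practice one usually extracts them directly from the fitted model object):

    import numpy as np

    # Made-up counts, numbers of trials and fitted probabilities pi_hat(x_i)
    y = np.array([2.0, 5.0, 9.0, 14.0, 18.0])
    n = np.array([20.0, 20.0, 20.0, 20.0, 20.0])
    pi_hat = np.array([0.12, 0.28, 0.47, 0.68, 0.86])

    y_hat = n * pi_hat

    # Pearson residuals, equation (3.11)
    e_pearson = (y - y_hat) / np.sqrt(n * pi_hat * (1 - pi_hat))

    # Deviance residuals, equation (3.12); cases with y_i = 0 or y_i = n_i
    # would need the convention 0 * log(0) = 0
    e_dev = np.sign(y - y_hat) * np.sqrt(
        2 * (y * np.log(y / y_hat) + (n - y) * np.log((n - y) / (n - y_hat)))
    )

    print(np.sum(e_pearson**2))    # Pearson statistic
    print(np.sum(e_dev**2))        # residual deviance, equation (3.13)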
If the model is not correct because the deviance (Pearson) statistic is too large to be χ²_{n−p}
distributed, it is because the squares of some of its residuals are too large. By identifying
which residuals are “too large”, one finds where the model is failing and may learn ways
to improve it. If the model is correct, one expects about 95% of its residuals to fall
within (−2, 2), (here residuals and their standardized versions are similar), and hence
any observation with a residual far outside this range will be an outlier.
One can define deleted deviance and Pearson residuals by replacing π̂(xi ) by its leave-
one-out cross validated estimate, and the sum of the squares of these deleted residuals
leads to PRESS-like and R²_pred-like model selection criteria.
Finally note that the normal plot of residuals, and the plot of residuals against π̂(xi ) or
against log (π̂(xi )/(1 − π̂(xi ))), are not as useful as in linear modeling; the residuals for
all the 1’s are positive, and the residuals for all the 0’s are negative, and the aspect of
these plots when the model is correct is different from the one in linear modeling.
Example: Challenger
Example: Eagles
All the discussion around model selection for linear models applies here as well. In par-
ticular, one can use model selection criteria striking a compromise between maximizing
goodness of fit, measured through residual deviance, and minimizing complexity, measured
through p, like R²_adj , AIC or BIC. One can also resort to cross validation based
model selection criteria like PRESS, R²_pred or R²_Kfold .
Best subsets and stepwise regression methods, together with shrinkage estimation meth-
ods are also useful as model selection tools for logistic regression.
All the discussion about the difficulties of interpreting linear models applies as well when
it comes to interpreting fitted logistic models. In particular, one needs to be careful
because different models often provide reasonable fits for the same sample while each
tells a different story, because the dependency among explanatory variables needs to be
taken into account, because missing relevant variables introduce biases in the estimates
of the coefficients βj of the variables in the model, and because data alone can not
distinguish between correlation and causality.
As in linear modeling, bj captures the role played by xj , the role played by other xk in
the model correlated with xj , and the role played by variables zr not in the model but
driving both xj and y. It will always be difficult to tell all these effects apart.
Here, on top of these difficulties, one needs to face the fact that π̂(xi ) is not linear in xi .
We illustrate what bj means under the binary logistic model with the logit link through
the two explanatory variables case, where
π̂(x1 , x2 ) = e^{b0 + b1 x1 + b2 x2 } / (1 + e^{b0 + b1 x1 + b2 x2 }),   (3.14)
and so where:
e^{b2 } = [π̂(x1 , x2 + 1)/(1 − π̂(x1 , x2 + 1))] / [π̂(x1 , x2 )/(1 − π̂(x1 , x2 ))] = Ôdds(x1 , x2 + 1) / Ôdds(x1 , x2 ),   (3.15)
and:
b2 = log (Ôdds(x1 , x2 + 1) / Ôdds(x1 , x2 )).   (3.16)
So bj is the estimated log odds ratio when xj increases by one unit and the values of all
the other explanatory variables do not change.
There is a one to one relationship between probability and odds. When π(x) = 0
then Odds(x) = 0. When π(x) = 1/5 then Odds(x) = 1/4. When π(x) = 1/2 then
Odds(x) = 1. When π(x) = 4/5 then Odds(x) = 4. When π(x) = 1 then Odds(x) = ∞.
When bj > 0, by increasing the value of xj while keeping the other variables in the model
constant, one increases the Odds(x) and the probability of success, π(x). When bj < 0,
the odds and the probability of success decrease with increasing xj .
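As a small numeric illustration with a made-up coefficient, suppose b2 = 0.7 and a baseline probability of success of 0.20:

    import numpy as np

    b2 = 0.7                          # made-up coefficient for x2
    odds_ratio = np.exp(b2)           # estimated odds ratio, about 2.01

    pi = 0.20                         # baseline probability of success
    odds = pi / (1 - pi)              # 0.25
    new_odds = odds * odds_ratio      # about 0.50
    new_pi = new_odds / (1 + new_odds)
    print(odds_ratio, new_pi)         # the odds roughly double; the probability
                                      # goes from 0.20 to about 0.33, not to 0.40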
One needs training to feel comfortable interpreting odds, odds ratios and log odds ratios,
and hence to understand what a given value for bj means. In the special case where
π̂(x1 , x2 ) is very small, like when modeling the probability of a rare disease, the inter-
pretation is easier because in that case:
e^{b2 } ≈ π̂(x1 , x2 + 1) / π̂(x1 , x2 ),   (3.17)
and:
b2 ≈ log (π̂(x1 , x2 + 1) / π̂(x1 , x2 )).   (3.18)
The interpretation of the coefficients under the probit and the Gompit links is even more
complicated, because it can not be made through log odds ratios.
A J × K contingency table is a table of counts with J rows and K columns. This type of
data is extremely frequent in psychology, sociology, political science and related areas.
A useful way of looking at J ×K contingency tables is to recognize that rows and columns
represent two categorical variables with J and K categories each. The purpose of the
analysis is to find out whether these two variables are related or not, and if they are, to
learn about the kind of relationship between them.
Sometimes one has J ×K ×I tables, that is, I different J ×K contingency tables. In that
case, one has three categorical variables with J, K and I categories, and the purpose is
to learn about the relationship between one of them and the other two.
Example: Hospitals
In any such example, one just needs to recognize which categorical variable plays the role
of response, and which one or ones play the role of explanatory variables, fit a binary or
a nominal logistic model, and interpret it.
If the response is the column factor, then rows and columns are not related (not associated,
or independent) when the row category does not have any effect on the column
probabilities, and therefore when the “probability profiles” across columns are the same
for all rows.
When the response is categorical with more than two categories, one resorts to the
nominal logistic model, which is a straightforward generalization of the binary logistic
model. We illustrate it through the special case where y has three categories, A/B/C.
Under that model one assumes that the yi given xi are conditionally independent and
multinomially distributed over the three categories.
When the response, y, is a count variable supported in {0, 1, 2, . . .}, and the counts are
not very small, the normal linear model is usually a good approximation for the
relationship between the logarithm of y and the x’s.
Example: Galapagos
Under the loglinear Poisson model one assumes that the yi given xi are conditionally
independent and distributed as yi | xi ∼ Poisson(λ(xi )), with
λ(xi ) = e^{β0 + β1 x1i + . . . + βp−1 xp−1i } and β = (β0 , . . . , βp−1 ) only known to belong
to R^p . The relationship between λ(xi ) = E(yi |xi ) and xi is not linear, and such that
λ(xi ) ≥ 0 for all xi . When this model holds, log λ(xi ) = β0 + β1 x1i + . . . + βp−1 xp−1i ,
which is useful when interpreting the coefficients of the model.
The model is checked through the analysis of deviance residuals, defined as:
e^D_i = sign(yi − ŷi ) √( 2 ( yi log (yi /ŷi ) − (yi − ŷi ) ) ).   (3.26)
When most of the counts are not too small, the loglinear Poisson model fit is often similar
to the corresponding normal linear model fit for log yi . Switching from the Normal
assumption to the Poisson assumption one better matches the support of yi , but then
one assumes that V ar(yi |xi ) = E(yi |xi ), which is not the case in many examples.
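A minimal sketch of this comparison, with made-up counts and a single explanatory variable (not the Galapagos example): a loglinear Poisson GLM next to a normal linear model for log yi.

    import numpy as np
    import statsmodels.api as sm

    # Made-up counts y and explanatory variable x
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([6, 9, 14, 18, 31, 42, 63, 88])

    X = sm.add_constant(x)

    # Loglinear Poisson model: log lambda(x) = beta0 + beta1 * x
    poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Normal linear model for log y (counts are not too small here)
    ols_fit = sm.OLS(np.log(y), X).fit()

    print(poisson_fit.params)     # (b0, b1) on the log-mean scale
    print(ols_fit.params)         # usually similar when counts are not too small
    print(poisson_fit.deviance)   # residual deviance for model checking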
Loglinear Poisson models can also be used for the analysis of contingency tables, leading
to analyses equivalent to the ones based on nominal logistic models.