
Chapter 1

Normal Linear Model

When looking for a statistical model that relates a continuous response variable, y, with
a set of explanatory variables, x = (x1 , . . . , xp−1 ), the normal linear model is one of the
simplest. Even though this model might seem too naive and restrictive to be of any use,
it adapts well to a large range of modeling problems.

We devote most of our time to covering linear models instead of more sophisticated statistical
models because the model building approach used for them, and the way in which
linear models are used to predict or to understand y, are similar to the ones used for
any statistical model. Most of what we will learn here applies when building and using
any type of statistical model.

1.1 Normal linear model

1.1.1 Theoretical (Population) model and fitted (sample) model

A linear model splits the response, yi , into a part that can be explained through a
linear combination of the explanatory variables available, xi = (x1i , . . . , xp−1i ), that is
the signal, and a part that can not be explained by it, ϵi , that is recognized as the noise,

yi = E(yi |xi ) + ϵi = β0 + β1 x1i + . . . + βp−1 xp−1i + ϵi . (1.1)

The noise captures measurement error together with the part of yi that could be explained
by variables, z1 , z2 , . . ., that are unknown, or known but not available for modeling. These
hidden variables can not be considered at the time of modeling, but one will need to
have them in mind when interpreting the final model.

A normal linear model is a statistical model that further assumes that the noise compo-
nent is independently distributed as:

ϵi ∼ Normal(0, σ 2 ). (1.2)

Hence, normal linear models assume that the yi ’s are conditionally independent and:

yi |xi ∼ Normal(β0 + β1 x1i + . . . + βp−1 xp−1i , σ²). (1.3)

This is what we will call the theoretical or population model.

In statistics one further assumes that all the samples are generated using the same
“unique” values for β = (β0 , . . . , βp−1 ) and σ 2 that are “unknown”, but known to belong
to a parameter space usually assumed to be β ∈ Rp and σ 2 ∈ [0, ∞).

What distinguishes a statistical model from a probability model is that the statistical
model has a parameter space and the probability model doesn’t have it.

The questions that have to be typically answered in statistics are

1. about the value of the βj ’s, that determine the relationship between the x’s and y,
and in particular whether βj could be equal to 0, when the goal is to explain the
relation between y and x,

2. about the value of E(y|x0 ) = β0 + β1 x10 + . . . + βp−1 xp−10 , or the value of yf (x0 ) =
E(y|x0 ) + ϵf , at a certain point x0 , when the goal is to predict E(y|x) or y(x),

3. whether there are any observations that do not follow the pattern (model) of the
majority of the observations in the sample considered.

These questions related to the theoretical model will be answered through the fitted or
sample model, computed using a sample of observations, (yi , x1i , . . . , xp−1i ).

The fitted model splits the response yi into its predicted value (part explained by x), ŷi ,
and its residual or prediction error (part unexplained by x), ei = yi − ŷi ,

yi = ŷi (xi ) + ei = b0 + b1 x1i + . . . + bp−1 xp−1i + ei , (1.4)

where b = (b0 , . . . , bp−1 ) are estimates of β = (β0 , . . . , βp−1 ) obtained from the sample.


Different from the theoretical model, this fitted model is not unique, because it depends
on the sample obtained, but it is known.

One answers questions about the theoretical model, (i.e. about β, E(y|x0 ) or yf (x0 )),
based on the fitted model, (i.e. through b or ŷ(x0 )). One makes distributional assump-
tions beyond the linearity assumption, to be able to link fitted and theoretical models.

1.1.2 Examples and model assumptions

Before using a normal linear model in a given setting, one needs to check whether the
assumptions made by the model are appropriate in that setting. In particular, one needs
to check whether the relationship between yi and xi is such that:

1. it is linear, E(yi |xi ) = β0 + β1 x1i + . . . + βp−1 xp−1i ,

2. the variance is constant, V (yi |xi ) = σ 2 , (same value of σ 2 for all i),

3. all the yi observed at a given value, xi , are normally distributed,

4. ϵi and ϵj are independent for all i ≠ j, (i.e., Corr(ϵi , ϵj ) = 0). (Assumptions 2 and 3
amount to requiring that the ϵi have constant variance and are normally distributed.)
These assumptions are ordered from more important to less important, and they need
to be checked based on what is observed in the data available.

Example: Acid

Example: Life line length

Example: Weight vs height

Example: Height vs Father height, Weight vs Father weight

Example: Father vs mother height, Father vs mother weight

Example: Final vs midterm grades

The linearity assumption is crucial. It would be silly to fit a linear model to a relationship
that is clearly non-linear.

Example: Church perimeter versus area


The linearity assumption might seem too restrictive, but note that the xji ’s can be non-linear
transformations of explanatory variables, like x2i = x²1i , x2i = 1/x1i , x2i = √x1i ,
or x2i = log x1i . One can also transform yi to help attain linearity. Hence, many instances
that seem to fall outside the linear model framework can be adapted to fall into it by
finding the right scales for the explanatory and response variables.

In practice, often V ar(yi |xi ) is not constant because it increases with E(yi |xi ). When
that is the case, one might get around it by modeling log yi .

Example: Brain versus body weight

Example: Duck count

What are supposed to be normally distributed are the y values that share the same value
x, and not all the values of y (irrespective of the x at which y is observed).

Example: Atmospheric pressure versus boiling temperature

When one has a single explanatory variable, x1 , one can check linearity and constancy
of variance through the bivariate plot between y and x1 , even though one needs to be
careful in situations like the one of the last example. Checking normality is trickier, and
one will need the help of tools presented in Section 6.

Example: Life expectancy versus income

The independence assumption is harder to check (and to understand), and one typically
only worries about it when modeling response variables ordered in time, like in the Old
Faithful example. We will come back to this assumption in Section 6 and Chapter 5.

Example: Old faithful

Example: Hertzsprung-Russell

When one has more than one explanatory variable, checking whether normal linear model
assumptions make sense and using the fitted model will be harder because:

1. checking linearity and constancy of variance through bivariate plots of y and xj is
close to impossible,

2. the number of possible linear models available is a lot larger,


3. one needs to cope with the dependency between explanatory variables,

4. it is not so easy to identify unusual observations.

Example: Infant mortality

We will need to find ways to incorporate categorical explanatory variables.

Example: Birthweight

When what is categorical is the response variable, linear model assumptions do not make
sense anymore, and one needs to resort to binomial logistic regression models.

Example: Kyphosis

When one has a count response with low counts, one needs to resort to log-linear Poisson
models. (For large counts, log yi is often well approximated by normal linear models.)

Example: Challenger

Building a statistical model is an iterative process, in which model checking is the critical
stage that indicates whether the current model is useful enough, and if not, it suggests
ways to improve it.

1.1.3 Relationship between theoretical and fitted model

To help understand the relationship between the theoretical and the fitted models, and
how one can answer questions about the first based on the second it helps to do a
simulation experiment and a resampling experiment.

1. Simulation experiment: Check what is common and what is different between
models fitted to several samples simulated from a given theoretical model, like for
example y|x ∼ Normal(2 + 4x, σ² = 1). How are b0 and b1 distributed?

2. Subsampling or resampling experiment: Check what is common and what is different
between models fitted to subsamples from a given sample of (y, x). How are b0
and b1 distributed? When the subsamples are randomly created with replacement,
this is called bootstrapping. (A small numerical sketch of both experiments follows.)
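As an illustration, the following Python (numpy) sketch carries out both experiments for the theoretical model y|x ∼ Normal(2 + 4x, σ² = 1) of point 1; the sample size, the grid of x values and the number of replications are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 50, 1000
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix: intercept + x

# 1. Simulation experiment: simulate many samples from y|x ~ Normal(2 + 4x, 1),
#    refit each one by least squares and look at how (b0, b1) vary.
b_sim = np.empty((n_rep, 2))
for r in range(n_rep):
    y = 2 + 4 * x + rng.normal(0, 1, size=n)
    b_sim[r] = np.linalg.lstsq(X, y, rcond=None)[0]
print("simulation: mean of (b0, b1):", b_sim.mean(axis=0))   # close to (2, 4)
print("simulation: s.d. of (b0, b1):", b_sim.std(axis=0))

# 2. Resampling (bootstrap) experiment: resample rows of one observed sample
#    with replacement and refit; the spread of (b0, b1) mimics the one above.
y_obs = 2 + 4 * x + rng.normal(0, 1, size=n)
b_boot = np.empty((n_rep, 2))
for r in range(n_rep):
    idx = rng.integers(0, n, size=n)
    b_boot[r] = np.linalg.lstsq(X[idx], y_obs[idx], rcond=None)[0]
print("bootstrap:  s.d. of (b0, b1):", b_boot.std(axis=0))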


1.1.4 The normal linear model in matrix notation

A compact presentation of the normal linear model that uses matrices, states that a
sample (xi1 , xi2 , . . . , xip−1 , yi ), for i = 1, . . . , n, follows that model if:

Y = E(Y |X) + ϵ = Xβ + ϵ, (1.5)

where ϵ ∼ Normaln (0, σ 2 I), or what is the same:

Y |X ∼ Normaln (Xβ, σ 2 I), (1.6)

where Y and ϵ are the n × 1 vectors with the yi and the ϵi , where X is the n × p matrix
formed by a first column of ones and one column for each explanatory variable, and
where I is the n × n identity matrix.

The parameter space for (β, σ 2 ) is typically Rp × [0, ∞).

With that notation, the fitted linear model will be:

Y = Ŷ + e = Xb + e, (1.7)

where Ŷ and e are the n × 1 vectors with the ŷi and the ei .

Class 2: Model fit

1.2 Model fit
1.2.1 Estimation of β by least squares

The least squares method takes as the estimate of β the vector b that minimizes the distance
between the vector of observed values and the vector of values fitted by the model.

Residuals, ei = yi − ŷi , are the prediction errors, the part of the response not captured
by the model. The smaller the |ei | = |yi − ŷi | for i = 1, . . . , n, the closer the fitted
model predictions, ŷi = b0 + b1 x1i + . . . + bp−1 xp−1i , are to the training sample observations.

There are many sensible ways of imposing that requirement, leading to as many different
model fit criteria and fitted models. Model fit criteria that make sense are,
1. to minimize Σ_{i=1}^n |ei |, (least absolute deviation),

2. to minimize the median of the |ei |², (least median squares),

3. to minimize Σ_{i=1}^{n−k} e²(i) for a given k, where the e²(i) are the ordered squared residuals,
hence discarding the k largest e²i , (least trimmed squares),

4. to minimize Σ_{i=1}^n e²i , (least squares),

5. to minimize Σ_{i=1}^n wi e²i , where the wi are a set of weights, (weighted least squares),

6. to minimize Σ_{i=1}^n e⁴i , (least fourths),

and any other criterion that uses a distance between observed and predicted values. By
default, statistical packages fit linear models using the least squares criterion.

Under the least squares criterion, the coefficients of the fitted model, b = (b0 , . . . , bp−1 ),
are obtained by minimizing:

SSR = Σ_{i=1}^n (yi − b0 − b1 x1i − . . . − bp−1 xp−1i )², (1.8)

which requires solving a set of p linear equations, leading to b = (X′X)⁻¹X′Y , where Y
is the n × 1 vector with the values of the response, and X is the n × p matrix where the i-th
row is the value of the explanatory variables for the i-th observation, preceded by 1.

When the constant term, β0 , is not required in the model, the first column of 1’s is
excluded from X, but by default one always keeps the constant in the model.


An alternative way to see that b = (X′X)⁻¹X′Y is by noting that under least squares,
Ŷ is the linear combination of the columns of X that minimizes the length of the vector
of residuals, and therefore the residual vector, e, has to be orthogonal to the columns of
X, X′e = X′(Y − Xb) = 0.

The vector of fitted values is Ŷ = Xb = X(X′X)⁻¹X′Y = HY , where H is the n × n
matrix that projects Y onto the subspace generated by the columns of X to yield Ŷ .
The vector of residuals is e = Y − Ŷ = (I − H)Y , where I is the n × n identity matrix.

To fit a model, no explanatory variable can be a linear combination of other explanatory
variables in the model, because that would not let one invert X′X and compute b.

In the case of a single explanatory variable, the least squares estimates are:

b0 = ȳ − b1 x̄, (1.9)

and

b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)². (1.10)

When one fits a linear model by least squares,

1. the sample mean of the residuals is always equal to 0, ē = 0,

2. the vector of residuals is orthogonal to the columns of the X matrix,

3. the fitted plane goes through the point (x̄1 , . . . , x̄p−1 , ȳ).
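As an illustration, the following Python sketch, on simulated data, computes b = (X′X)⁻¹X′Y and the hat matrix H, and checks numerically the three properties just listed; the data generating model is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(1)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 - 3 * x2 + rng.normal(0, 1, size=n)   # illustrative data

X = np.column_stack([np.ones(n), x1, x2])            # first column of ones
b = np.linalg.solve(X.T @ X, X.T @ y)                # b = (X'X)^-1 X'Y
H = X @ np.linalg.inv(X.T @ X) @ X.T                 # hat (projection) matrix
y_hat = H @ y                                        # fitted values, = X b
e = y - y_hat                                        # residuals, = (I - H) y

print("b =", b)
print("mean of residuals     :", e.mean())           # 1. essentially 0
print("X'e                   :", X.T @ e)            # 2. orthogonality, ~ 0
print("ybar vs plane at means:", y.mean(),
      b @ np.array([1, x1.mean(), x2.mean()]))       # 3. identical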

1.2.2 Reasons for choosing least squares

Some reasons for choosing the least squares criterion are:

1. Ease of computation, because the bj are linear combinations of the yi . Under other
criteria there is no closed form expression for bj as a function of (yi , xi ).

2. Ease of analysis, because if the model is correct the bj will be normally distributed
and centered around the “true” value of βj with the smallest possible variance
among all linear unbiased estimates of βj , and that helps learn about βj from bj .

3. When used on the constant model,

yi = β0 + ϵ, (1.11)

the least squares estimate of β0 is b0 = ȳ, while the least absolute deviation estimate
is the sample median of y.


4. Least squares is “reasonably robust” under the presence of outliers.

5. Under the normal linear model the least squares estimate of β coincides with the
maximum likelihood estimate.
The maximum likelihood method consists of writing the likelihood as a function of β and finding the vector β̂ that
maximizes it. Under the linear model assumptions, the i-th component of the response vector follows a Normal(µi , σ²)
distribution, with µi = [Xβ]i ; since the observations are independent, the likelihood is the product of these individual densities.
1.2.3 Instances where alternative criteria are called for

Examples of instances where one should switch away from using least squares are:

1. When linearity holds but the variance is not constant, like when observations are
means of samples of different sizes. In that case transforming y might fix the
variance problem, but then linearity would not be in place anymore.
Instead, in that case one can switch to using weighted least squares with wi inversely
proportional to the variance of yi . In that way one weights more heavily the observations
with smaller variance, which deserve more credit. These estimates are
b = (X′V⁻¹X)⁻¹X′V⁻¹Y , where V is the n × n diagonal matrix with vii = 1/wi ,
proportional to the variance of yi . (A small numerical sketch of weighted least
squares is given at the end of this subsection.)

2. When one suspects that there might be outliers and/or groups of observations in
the training sample following different models.
In that case one can switch to criteria more robust than least squares, like the first
three in the list of criteria presented above. In particular, using the least median
squares criterion disregards up to one half of the observations and provides the
best fit for the majority group in the training sample.
Robust fits help unmask observations that do not follow the majority pattern,
and they help suggest ways of improving the model by including as explanatory
variables what distinguishes the majority group from these outliers.
Example: Hertzsprung-Russell
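As announced in point 1, the following Python sketch illustrates weighted least squares on simulated data in which each yi is a group mean, so that its variance is inversely proportional to a known group size mi ; the model, the weights and the group sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n = 60
x = rng.uniform(0, 10, size=n)
m = rng.integers(1, 20, size=n)              # illustrative group sizes
# each y_i is a group mean, so Var(y_i) = sigma^2 / m_i  (here sigma = 2)
y = 1 + 0.5 * x + rng.normal(0, 2 / np.sqrt(m))

X = np.column_stack([np.ones(n), x])
w = m.astype(float)                          # weights inversely prop. to Var(y_i)
W = np.diag(w)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)            # ordinary least squares
b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)    # weighted least squares
print("OLS:", b_ols, "  WLS:", b_wls)        # both unbiased; WLS has smaller variance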

1.2.4 Estimation of σ 2

One typically estimates σ² through:

σ̂² = s²R = SSR /(n − p) = Σ_{i=1}^n e²i /(n − p), (1.12)

where n is the number of observations, and p the number of coefficients βj . As will be
seen, σ̂² = s²R is an unbiased estimator of σ².


1.3 ANOVA decomposition and goodness of fit

The goal here is to decompose the total observed variability of the response.

1.3.1 ANOVA decomposition

One needs a way to assess whether a linear model provides a very good fit like in the
acid example, a regular fit like in the weight versus height example, or a poor fit like
in the hand life line length example. In all these examples the use of a linear model is
appropriate, because the linear model assumptions are approximately correct, but the
quality of the fit is very different in each case.

With a single explanatory variable, x1 , one can assess how good the fit will be by just
looking at the bivariate plot of y and x1 , and one can quantify goodness of fit through
the correlation between x and y. With many explanatory variables, no single plot helps
do that and it is not clear how to quantify goodness of fit.

The goal is to compare the predictions about yi made through ȳ, without the help of
x’s, and the predictions about yi made through ŷi , with the help of the best linear
combination of x’s.

If one fits the model by least squares, it turns out that:

Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n (yi − ŷi )² + Σ_{i=1}^n (ŷi − ȳ)², (1.13)

that is, total = not explained + explained, which is a fact that follows from e = Y − Ŷ
being orthogonal to Ŷ − ȳ1, where 1 is a vector of ones, and then using the Pythagorean theorem.

The term on the left of this ANOVA decomposition is the total sum of squares,

SST = Σ_{i=1}^n (yi − ȳ)², (1.14)

which is positive and only depends on y, and not on the fitted model. It measures how
bad the predictions made without the use of explanatory variables would be.

The two terms on the right of this decomposition are also positive, and both depend on
the fitted model. The first term is the sum of the squares of the residuals, measuring
what is not explained and how bad the predictions made with the linear model are,

SSR = SSNotE = Σ_{i=1}^n (yi − ŷi )² = Σ_{i=1}^n e²i , (1.15)

and the second term is the sum of the squares explained by the model,

SSE = Σ_{i=1}^n (ŷi − ȳ)², (1.16)

measuring how much better the predictions using the explanatory variables are relative to the ones
made without them. When the explanatory variables define groups of observations, SSR
(the unexplained or residual variability) measures the variability within groups, which is due
to experimental error, while SSE (the explained variability) measures the variability between
groups; if SSE is small, it is because the group means are similar. Given that for a given
response SST will be fixed, the larger SSE , the smaller SSR , and the better the fitted model.

Example: Acid

Example: Hand life line

Example: Weight vs height, and weight vs siblings

Example: Final vs midterm


This ANOVA decomposition is often presented in an ANOVA table, next to the mean
squares defined as s²E = SSE /(p − 1) and s²R = SSR /(n − p), where p is the number of
coefficients in the model. The residual mean square, s²R , is the average of the squared errors
(an error being the difference between an observed value and the value predicted by the model),
and it is an unbiased estimator of the model variance σ²; the denominator is n − p because
one degree of freedom is lost for each fitted coefficient.

1.3.2 Goodness of fit measures

One summary statistic of the ANOVA decomposition that measures goodness of fit is
the determination coefficient:

R² = 100 (SSE /SST ) = 100 (1 − SSR /SST ), (1.17)

which can be interpreted as the percentage of the variability of y explained by the model.

In the single explanatory variable case, R2 /100 is also equal to the square of the cor-
relation coefficient between y and x, Corr2 (x, y), and in the general case it is equal to
the square of the correlation between y and ŷ, Corr2 (y, ŷ). Therefore, R2 /100 can also
be interpreted as the square of the largest possible correlation between y and a linear
combination of the x’s.

R2 is useful as a model fit measure but not as a model selection criterion, because when
one adds a new explanatory variable in the model, SSR always decreases and R2 always
increases, irrespective of whether that new variable is useful or not. Hence, the largest R²
will always be attained by the model that uses all the explanatory variables available. In
fact, one can always get R² = 100 by using as many (algebraically) linearly independent
explanatory variables as observations, because that leads to ei = 0 and SSR = 0.

As a consequence, R2 is useful to compare models for y of the same size, with the same
number of explanatory variables, but it is not useful to compare models of different sizes,
because in that case R2 always prefers larger and more complicated models.

To compare models of different sizes, one needs model selection criteria that reward
goodness of fit but at the same time penalize larger models. One model selection criterion
frequently used that does that is the adjusted determination coefficient, defined as:

R²adj = 100 (1 − s²R /s²y ) = 100 (1 − (SSR /(n − p))/(SST /(n − 1))). (1.18)

The model that maximizes R²adj coincides with the model that minimizes the residual
variance, s²R = s²y (1 − R²adj /100), which also works as a model selection criterion. Alternative
model selection criteria will be presented in Section 7.

Another summary statistic of the ANOVA decomposition is the ratio of the explained and
unexplained mean squares,

F = s²E /s²R = (SSE /(p − 1))/(SSR /(n − p)), (1.19)
which will be used to choose between the null model, y = β0 + ϵ, and the model under
consideration. The larger F , the stronger the relationship between the response and the
variables in the model, and the less likely that the null model is in place.
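As an illustration, the following Python sketch, on simulated data, verifies the ANOVA decomposition and computes R², R²adj and the F statistic defined above; the data and the use of scipy for the F tail area are illustrative choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 50, 3                                     # p = number of coefficients
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + 1.5 * x1 + 0.0 * x2 + rng.normal(0, 1, size=n)   # illustrative data

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
e = y - y_hat

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum(e ** 2)                             # not explained
SSE = np.sum((y_hat - y.mean()) ** 2)            # explained by the model
print("SST = SSR + SSE ?", np.isclose(SST, SSR + SSE))

R2 = 100 * SSE / SST
R2_adj = 100 * (1 - (SSR / (n - p)) / (SST / (n - 1)))
F = (SSE / (p - 1)) / (SSR / (n - p))
p_val = stats.f.sf(F, p - 1, n - p)              # tail area of the F(p-1, n-p)
print(f"R2 = {R2:.1f}%,  R2_adj = {R2_adj:.1f}%,  F = {F:.2f},  p-value = {p_val:.4f}")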


1.4 Inference about the coefficients of the model

Under the normal linear model one assumes that data is generated by:

yi = β0 + β1 x1i + . . . + βp−1 xp−1i + ϵi = E(yi |xi ) + ϵi , (1.20)

where ϵi are independent and Normal(0, σ 2 ), and where (β, σ 2 ) take the same unique but
unknown value for all the observations, only known to belong to the parameter space
Rp × [0, ∞). The corresponding fitted model,

yi = b0 + b1 x1i + . . . + bp−1 xp−1i + ei = ŷi + ei , (1.21)

is obtained using the least squares estimate of β. The goal is to answer questions about
the values of the βj that generate the data, based on bj . In particular, one needs to
decide whether βj could be equal to 0, in which case one could simplify the model.

When fitting the linear model, only the linearity assumption was used. Here it is where
one takes advantage of the other distributional assumptions made.

1.4.1 Distribution of b and of s2R

The link between βj and bj follows from the fact that if the population model is “correct”
and one fits the model by least squares, then b = (X′X)⁻¹X′Y and:

b|X ∼ Normal(E(b|X) = β, V (b|X) = σ²(X′X)⁻¹). (1.22)

The least squares b has the smallest variance among all linear unbiased estimates of β.

When there is a single explanatory variable, this leads to:

b0 |x ∼ Normal(E(b0 |x) = β0 , Var(b0 |x) = σ² Σx²i / (n Σ(xi − x̄)²)), (1.23)

b1 |x ∼ Normal(E(b1 |x) = β1 , Var(b1 |x) = σ² / Σ(xi − x̄)²), (1.24)

and:

Cov(b0 , b1 |x) = −σ² x̄ / Σ(xi − x̄)². (1.25)
Exercise: Simulate samples from a simple linear model with different choices of β, of
n and of the range of values of x1i , and check that the distribution of the b = (b0 , b1 )


obtained matches this result. How do the variances and covariance of b0 and b1 change
with the choice of x1i ? How can one minimize V (b1 |x)? Why are b0 and b1 dependent?

Question: What is the distribution of b0 when one fits the model yi = β0 + ϵi ?

Question: What is b1 and its distribution when one fits the model yi = β1 x1i + ϵi ?

If the normal linear model is “correct” and one fits it by least squares, then:

s²R |X ∼ (σ²/(n − p)) χ²_{n−p} , (1.26)

where s²R = SSR /(n − p), which means that E(s²R ) = σ². One typically estimates the
variance-covariance matrix of b through V̂ (b|X) = s²R (X′X)⁻¹.

1.4.2 Confidence interval for βj

Statistical packages provide estimates of Var(bj |x), denoted by s²bj , and with them one can
compute 100(1 − α)% confidence intervals for βj through:

(bj − t^{α/2}_{n−p} sbj , bj + t^{α/2}_{n−p} sbj ), (1.27)

where t^{α/2}_{n−p} is the upper α/2 quantile of the Student-t distribution with n − p degrees
of freedom (by symmetry, the lower quantile is just its negative). Given that t^{.025}_{n−p} ≈ 2,
one can approximate 95% confidence intervals for βj through:

(bj − 2sbj , bj + 2sbj ). (1.28)
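As an illustration, the following Python sketch, on simulated data, computes the sbj from s²R (X′X)⁻¹ and the corresponding 95% confidence intervals and |tj | statistics; the data are an illustrative choice.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 40
x = rng.uniform(0, 10, size=n)
y = 2 + 4 * x + rng.normal(0, 1, size=n)         # illustrative data

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2R = e @ e / (n - p)                            # estimate of sigma^2
s_b = np.sqrt(s2R * np.diag(XtX_inv))            # standard deviations of the b_j

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)    # close to 2 when n - p is large
for j in range(p):
    lo, hi = b[j] - t_crit * s_b[j], b[j] + t_crit * s_b[j]
    print(f"b{j} = {b[j]:.3f}, s_b{j} = {s_b[j]:.3f}, "
          f"95% CI for beta{j}: ({lo:.3f}, {hi:.3f}), |t{j}| = {abs(b[j] / s_b[j]):.2f}")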

Example: Hand life line. An approximate 95% confidence interval for β1 is:

(−1.37 − 2 ∗ 1.60, −1.37 + 2 ∗ 1.60) = (−4.57, 1.83).

Example: Acid. An approximate 95% confidence interval for β1 is:

(0.322 − 2 ∗ 0.005, 0.322 + 2 ∗ 0.005) = (0.312, 0.332).

An approximate 95% confidence interval for β0 is:

(35.46 − 2 ∗ 0.63, 35.46 + 2 ∗ 0.63) = (34.20, 36.72).

Interpretation of confidence intervals: If one computes one hundred 95% confidence
intervals for βj in this manner, based on 100 independent samples of (y, x), and the
models are “correct”, then the true value of βj will fall inside about 95 of these
intervals. One never knows which are the roughly 5 intervals that fail to enclose the true
βj , but when they fail one knows that the true βj will not be too far from the interval.

Hence, the 95% stands for the long run success rate that one should expect if one
keeps computing confidence intervals in this manner using models that are approximately
“correct”. The 95% is not the probability that any given confidence interval for βj happens
to enclose the true value of βj .

1.4.3 Tests about a single coefficient, βj

Sometimes one needs to decide whether data is compatible with βj = a. In particular,
one often needs to check whether βj = 0, to try to simplify the model.

If bj is close to a then it is likely that βj could be equal to a, but if bj is far from a then
it is likely that βj is not equal to a. There are two approaches to decide whether bj is
far enough from a to reject βj = a:

1. One can compute a 95% confidence interval for βj , and reject βj = a with a 95%
confidence if a is not in that interval. The further a falls away from that interval,
the stronger the conviction that βj = a is not consistent with data.
A 95% confidence interval for βj provides the set of all βj = a that can not be
rejected with a 95% confidence, based on the evidence in the sample.

2. One can also use the fact that if βj = a is true and the model is correct, then the
number of standard deviations that separate bj from βj = a,

|tj | = |bj − a| / sbj , (1.29)

has a |tn−p | distribution, and therefore |tj | will be smaller than 2 close to 95% of
the time. When |tj | is much larger than 2 one will reject βj = a, and when |tj | is
smaller than 2 one will not reject it. (Equivalently, one rejects βj = a when |tj |
exceeds the 1 − α/2 quantile of the tn−p distribution, or when the corresponding
p-value is smaller than α.)
One should always measure distance between bj and βj = a using the standard
deviation of bj as the unit of measure. Any βj = a that is more than two stan-
dard deviations away from bj can be rejected, and the more standard deviations
separating bj from a, the stronger the evidence against βj = a. In particular,

|tj | = |bj | / sbj , (1.30)


helps decide whether βj = 0. If |tj | = |bj |/sbj is much larger than 2, then bj is
much more than two standard deviations away from 0, and βj is unlikely to be 0.

One should never provide a fitted model, ŷi = b0 + b1 x1i + . . . + bp−1 xp−1i , without
providing the standard deviations of its coefficients, sbj . To decide whether one can
remove xj from the model, what matters is not whether |bj | is large or small, but how
many standard deviations separate bj from 0, |tj | = |bj |/sbj .

Example: Height vs father height with data only for boys. An approximate 95% confi-
dence interval for β1 is:

(0.629 − 2 ∗ 0.109, 0.629 + 2 ∗ 0.109) = (0.411, 0.847),

which includes neither β1 = 0, consistent with |t1 | = 5.74 > 2, nor β1 = 1.

Examples: Final and midterm exams. Father and mother height and weight.

1.4.4 Tests about groups of coefficients

Observing |tj | = |bj |/sbj to be much smaller than 2 entitles one to remove one variable,
xj , at a time, but not to remove several variables with a small |tj | all at once. By
removing one variable from the model everything changes and, in particular, the |tj |
statistic of a variable left in the model might change from being smaller to being larger
than 2, and xj could go from not being useful to being useful.

To check whether one can remove q variables simultaneously, or to compare two linear
models with one model nested into the other, one computes the statistic:

F = ((SSR0 − SSR )/q) / (SSR /(n − p)), (1.31)

where SSR0 is the sum of the squares of the residuals of the restricted model, (obtained
by removing q variables or imposing q linear restrictions on the coefficients of the full
model), and SSR is the sum of the squares of the full model, and where p is the number
of βj ’s in the full model. The larger F , the larger the increase in the residual sums of
squares from removing the variables, and the less desirable removing these variables is.

To know whether F is small enough to justify the removal of the q variables at once,
(or imposing the q restrictions on the coefficients), one uses the fact that if they can be
removed because their population coefficients are all simultaneously equal to 0, and if
the model is correct, this F statistic is F (q, n − p) distributed.

Statistical software provides the tail area (p-value) resulting from comparing the value
of the F statistic observed with the F (q, n − p) distribution. The larger the value of F ,
the smaller that tail area, and the less desirable it is to remove the variables.

The F = s²E /s²R statistic obtained as a by-product of the ANOVA decomposition is the
statistic that results when one tests whether one can remove all the q = p−1 explanatory
variables from the model at once, and be left with the null model, yi = β0 + ϵi .
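As an illustration, the following Python sketch, on simulated data, computes the F statistic (1.31) and its tail area for the removal of q = 2 variables at once; the data, in which those two variables are truly irrelevant, are an illustrative choice.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 60
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 + rng.normal(0, 1, size=n)        # x2, x3 truly irrelevant here

def ssr(X, y):
    """Sum of squared residuals of the least squares fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

X_full = np.column_stack([np.ones(n), x1, x2, x3])   # p = 4 coefficients
X_rest = np.column_stack([np.ones(n), x1])           # drops q = 2 variables
p, q = X_full.shape[1], X_full.shape[1] - X_rest.shape[1]

SSR_full, SSR_rest = ssr(X_full, y), ssr(X_rest, y)
F = ((SSR_rest - SSR_full) / q) / (SSR_full / (n - p))
p_value = stats.f.sf(F, q, n - p)                    # tail area under F(q, n-p)
print(f"F = {F:.2f}, p-value = {p_value:.3f}")       # large p-value: ok to drop x2, x3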

Examples: Grades exam versus order, Weight versus several variables

1.5 Prediction

Examples: Acid, Hand life line, Weight and height.

One might want to predict either the expected value of y at x0 ,


E(y|x0 ) = β0 + β1 x10 + . . . + βp−1 xp−10 , (1.32)
or the observed value of y at x0 ,
y(x0 ) = β0 + β1 x10 + . . . + βp−1 xp−10 + ϵ0 = E(y|x0 ) + ϵ0 . (1.33)
In the first instance one needs to guess β0 , . . . , βp−1 , while in the second instance one
needs to guess β0 , . . . , βp−1 together with ϵ0 . The best way to understand the difference
between these two predictions is through specific examples.

A good point prediction in both instances is ŷ(x0 ) = b0 + b1 x10 + . . . + bp−1 xp−10 because
a good guess for βj is bj and a good guess for ϵ0 is 0.
Confidence intervals are computed for parameters of the distribution; in a linear model, they
are computed for expected values of y, such as E(y|x0 ). Prediction intervals, instead, are
intervals for specific predictions, such as y(x0 ), and they contain the value to be predicted
with the confidence level that has been fixed.

The prediction interval for y(x0 ) will be wider than the confidence interval for E(y|x0 ),
because predicting y(x0 ) is harder and more uncertain. Most often one needs to predict
y(x0 ), and so one needs to use the wider prediction interval.
With a single explanatory variable, one can compute a 100(1 − α)% confidence interval
for E(y|x0 ) through:

ŷ(x0 ) ± t^{α/2}_{n−p} sR √( 1/n + (x0 − x̄)²/Σ(xi − x̄)² ), (1.34)



while a 100(1 − α)% prediction interval for y(x0 ) can be computed through:

ŷ(x0 ) ± t^{α/2}_{n−p} sR √( 1 + 1/n + (x0 − x̄)²/Σ(xi − x̄)² ). (1.35)

The 95% intervals can be approximated by using t^{.025}_{n−p} ≈ 2.

In the weight versus height example, if one fixes a height of 1.76, the prediction interval
contains, with confidence 1 − α, the weight of a person of height 1.76 chosen at random,
while the confidence interval contains, with confidence 1 − α, the expected weight of the
people who are 1.76 tall.

Both intervals are centered at ŷ(x0 ), they are most narrow at x0 = x̄, and they grow
wider with (x0 − x̄)2 , and so the most precise predictions are close to x0 = x̄. The
uncertainty in assuming that ϵ0 is 0 and that β0 is b0 has the same impact for all x0 , but
the uncertainty in assuming that β1 is b1 impacts predictions the least at x0 = x̄, and
the consequence of a bad guess for β1 becomes worse the further x0 is away from x̄.

In general, for more than one explanatory variable, a 100(1 − α)% confidence interval for
E(y|x0 ) at x0 = (1, x01 , . . . , x0p−1 ) can be computed as:

ŷ(x0 ) ± t^{α/2}_{n−p} sR √( x0′ (X′X)⁻¹ x0 ), (1.36)

while a 100(1 − α)% prediction interval for y(x0 ) can be computed as:

ŷ(x0 ) ± t^{α/2}_{n−p} sR √( 1 + x0′ (X′X)⁻¹ x0 ). (1.37)
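As an illustration, the following Python sketch, on simulated data loosely mimicking the weight versus height example, computes the point prediction at a given x0 together with the confidence interval (1.36) and the prediction interval (1.37); all numbers are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 40
x = rng.uniform(150, 200, size=n)                    # illustrative "height" values
y = -100 + 1.0 * x + rng.normal(0, 8, size=n)        # illustrative "weight" model

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
sR = np.sqrt(e @ e / (n - p))
t_crit = stats.t.ppf(0.975, df=n - p)

x0 = np.array([1.0, 176.0])                          # prediction point (1, x0)
y0_hat = x0 @ b
half_ci = t_crit * sR * np.sqrt(x0 @ XtX_inv @ x0)         # for E(y | x0)
half_pi = t_crit * sR * np.sqrt(1 + x0 @ XtX_inv @ x0)     # for y(x0), wider
print(f"prediction at x0: {y0_hat:.1f}")
print(f"95% CI for E(y|x0): ({y0_hat - half_ci:.1f}, {y0_hat + half_ci:.1f})")
print(f"95% PI for y(x0)  : ({y0_hat - half_pi:.1f}, {y0_hat + half_pi:.1f})")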

Examples: Weight and height, Final and midterm exam.

To extrapolate is to predict the response outside the region where one has collected the
data used to fit the model. Sometimes one will have to do that, but in that case one
needs to be very careful. To extrapolate is dangerous because the prediction errors are
larger, but mainly because there is no way to check whether the model used is correct
in an area where one does not have any data.


1.6 Model checking through residual analysis

1.6.1 Goal of residual analysis

To answer questions about the population model (about the βj ’s) based on the sample
model (the bj ’s), and to be able to trust prediction intervals, one needs to make sure
that the assumptions made are acceptable. That is, one needs to check whether:
ϵi = yi − (β0 + β1 x1i + . . . + βp−1 xp−1i ) = yi − E(yi |xi ), i = 1, . . . , n, (1.38)
are close to being independent and Normal(0, σ 2 ) and, in particular, whether the linearity,
the constancy of variance and the normality assumptions hold. One can not check that
directly, because the ϵi are unknown.

Instead, one checks whether that is approximately the case by analyzing the best esti-
mates that we have for ϵi , which are the residuals:
ei = yi − (b0 + b1 x1i + . . . + bp−1 xp−1i ) = yi − ŷi , i = 1, . . . , n. (1.39)
Residuals are crucial for checking whether there is something wrong with the model,
because they have all the information about the relationship between yi and xi that is
in the data but is not captured by the fitted model.

Residuals work as a sort of magnifying glass that exaggerates the shortcomings of the model
and helps discover ways to fix it. One should explore the residuals in any way that lets
them tell you what is left for the model to explain.

One should recycle any information left in the residuals.

The main goals of residual analysis are to:

1. check the validity of the assumptions made,

2. suggest ways to improve the model,

3. identify observations that are unusual, either because they are poorly explained by
the model, or because they have a lot of influence on the fitted model.

Residuals are prediction errors, and if they are predictable, one can improve the model.

Residual analysis is the engine that drives the model building process. One should never
settle with a statistical model without analyzing its residuals.


Example: Acid

Example: Hand life line

Example: Church perimeter

1.6.2 Types of residuals

There are three types of residuals:

1. Regular residuals: ei = yi − ŷi .


If the model is correct, the distribution of ϵ is Normal(0, σ²I) and the distribution
of the vector e is Normal(0, σ²(I − H)), where H = X(X′X)⁻¹X′. Hence, the
individual ei are Normal(0, V (ei ) = σ²(1 − hii )), where V (ei ) is not constant but
it is not far from constant, because typically hii << 1.
Regular residuals can be made larger by changing the unit of measure of yi , and
one does not know whether |ei | is too large for the model to be acceptable for the
i-th observation, other than by comparing |ei | with the other |ej |.

2. Standardized (or internally studentized) residuals: esi = (ei − ē)/sei = ei /sei , where
sei is an estimate of the standard deviation of ei .
Standardized residuals are dimensionless. When the model is correct, esi is Student-
t distributed, and one expects about 95% of them to fall in (−2, 2), and the re-
maining 5% to fall close to that range. Any observation with |esi | much larger than
2 is poorly explained by the model, and it is considered to be an outlier.
The problem with regular and standardized residuals is that they compare yi with
ŷi , which are predicted values computed using (xi , yi ). That is a bit like cheating
in an exam, because you use yi to predict yi , and it plays in favor of the model.

3. Deleted (or externally studentized) residuals: esd_i = (yi − ŷ(i) )/se(i) , where ŷ(i) and
se(i) are computed using all observations except (xi , yi ).
Deleted residuals compare yi with ŷ(i) , which is a prediction that does not use
(xi , yi ), and one standardizes their difference without using (xi , yi ) either. In this
way, they provide a more honest assessment about the merits of the model.
If the model is correct, these standardized deleted residuals are also Student-t
distributed, and thus one expects about 95% of them to fall in (−2, 2).


Given that esi ≈ k ei , regular and standardized residual plots are similar. Also, when the
sample size is not too small and yi is not a strong outlier, then ŷi and sei are similar to
ŷ(i) and se(i) , and the plots with esi and with esd_i are also similar.
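As an illustration, the following Python sketch, on simulated data with one planted outlier, computes the three types of residuals; the closed-form expression used for s(i), which avoids refitting the model n times, is the standard leave-one-out identity and is stated here without proof.

import numpy as np

rng = np.random.default_rng(7)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2 + 4 * x + rng.normal(0, 1, size=n)
y[0] += 6                                        # plant one outlier for illustration

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                   # leverages h_ii
e = (np.eye(n) - H) @ y                          # 1. regular residuals
s2R = e @ e / (n - p)

e_std = e / np.sqrt(s2R * (1 - h))               # 2. standardized residuals

# 3. deleted (externally studentized) residuals, via the usual leave-one-out
#    identity so that no refitting is needed (assumed here, not derived):
s2_del = ((n - p) * s2R - e**2 / (1 - h)) / (n - p - 1)
e_del = e / np.sqrt(s2_del * (1 - h))

print("largest |standardized| :", np.abs(e_std).max())   # flags the planted outlier
print("largest |deleted|      :", np.abs(e_del).max())   # even more extreme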

1.6.3 Graphical analysis of residuals

The most efficient way to analyze residuals to help improve the model is graphically.
Residual analysis is especially helpful when one deals with many explanatory variables,
but it can also be useful when there is only a single explanatory variable.

Three types of default residual plots that are always useful are:

1. Residuals versus predicted values. When the model has a single explanatory vari-
able, x1 , plotting e against x1 and plotting e against ŷ = b0 + b1 x1 is basically the
same. With several explanatory variables, ŷ works as the best linear combination
of the explanatory variables in the model and it works as a summary of them.
Finding a relationship between what is left to be explained (residual) and what is
explained (predicted value), indicates that the linearity assumption fails. One can
try to fix that by transforming y and/or some of the x’s.
Finding that the variability of the residuals grows with the predicted value is quite
frequent, and it indicates that the variance is not constant. One can sometimes fix
that by modeling the logarithm of y instead.
When both linearity and constant variance fail, one hopes that the same transfor-
mation of y will fix both problems. This plot also helps identify outliers.

2. Normal probability plot of residuals. It is useful to assess normality, and to identify
outliers. If the normality assumption holds, the residuals should be placed more or less
along a line. The histogram of the residuals is less useful to check normality.

3. Residuals versus explanatory variables included and not included in the model. It
helps find what is wrong with the model, if something is wrong with it.
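As an illustration, the following Python sketch, on simulated data, draws the first two default plots with matplotlib (residuals versus predicted values, and the normal probability plot of the residuals via scipy); it is only meant as a template.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(8)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 4 * x + rng.normal(0, 1, size=n)         # illustrative data

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b
e = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(y_hat, e)                            # 1. residuals vs predicted values
ax1.axhline(0, color="grey")
ax1.set(xlabel="predicted values", ylabel="residuals")
stats.probplot(e, dist="norm", plot=ax2)         # 2. normal probability plot
plt.tight_layout()
plt.show()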

Examples: Weight and height,

Example: Brain weight, body weight

Example: Atmospheric pressure versus boiling temperature of water

Example: Infant mortality


On top of these default plots, one should use any other graphic that might help recycle
any information left about the relationship between y and x. In particular, when data is
ordered in time (or in space), one might use time (space) plots of the residuals, or other
graphics that help check whether there is any temporal (spatial) dependency left that
can be used to improve the model.

Example: Old faithful

Cautionary Remark : One should be careful when interpreting bivariate plots. When
projecting multivariate relationships on a bivariate subspace, a lot of information about
that relationship is lost, and outliers might be hidden. Bivariate plots are like shadows,
and one faces two dangers when using them.

1. Danger 1: The fact that the plot of yi against x2i does not show any relationship
between them, does not mean that x2 might not be a crucial variable in the model
for y when combined with another variable, x1 .
Bivariate plots between y (or e) and explanatory variables by themselves will never
be evidence enough to drop an explanatory variable from consideration.
Example: Hamilton.

2. Danger 2: The fact that the plot of yi against x2i shows a clear relationship, does
not necessarily mean that x2 has to be part of the model for y. There could be
another variable x1 , (hidden or not), that drives both y and x2 at the same time,
and y and x2 might be related only because they both change with x1 . If that x1 is
available and included in the model, x2 will not be needed in the model anymore.
Bivariate plots between y and explanatory variables by themselves will never be
evidence enough to decide that an explanatory variable is required in the model.

1.6.4 Quantitative analysis

Observations can be unusual because they are poorly explained by the model, because
certain values of the explanatory variables are very different from the ones of the majority,
or because they have a strong influence on the fitted model.

Next we describe how these three degrees of unusualness can be measured.

1. Degree of outlierness: It is measured through |esi |. If the model is acceptable, about
95% of these |esi | fall within (0, 2) and the remaining 5% fall close to that range.


It is customary to mark with an R all observations with |esi | > 2, even though
standardized residuals with an absolute value larger than 2 but close to 2 are by
themselves no indication that one should worry about them.
When esi >> 2, then yi >> ŷi , while when esi << −2, then yi << ŷi .
When |esi | is much larger than two, the i-th observation is not properly explained
by the model, which might be because:

(a) there is something “wrong” with the model, or


(b) there is something “wrong” with the observation.

Most often, large residuals happen because of failures of the model, in which case
finding outliers is an opportunity to improve the model. Sometimes that can be
attained by identifying missing variables that distinguish the outliers from the
observations well explained by the model, and sometimes by just finding a better
transformation for y and/or for some of the explanatory variables.
One should never remove an outlier from the analysis, unless one is sure that
there is something wrong with that observation, or one clearly understands what
makes that observation different from the other observations. If one removes an
outlier without having a good reason for that, one will never know the range of
applicability of the model obtained.
2. Distance in X-space: It is measured through hii , the i-th diagonal term of the
projection matrix, H = X(X′X)⁻¹X′, which is called the leverage or hat value. H is the
hat matrix by which one multiplies the vector Y to obtain Ŷ : since Ŷ = Xb and
b = (X′X)⁻¹X′Y , it follows that Ŷ = X(X′X)⁻¹X′Y = HY , and the variance-covariance
matrix of the vector of predicted values is σ² times the hat matrix. When
there is a single explanatory variable,

hii = 1/n + (xi − x̄)²/Σ_{i=1}^n (xi − x̄)², (1.40)

which is dimensionless and between 1/n and 1, and which measures the distance
between xi and x̄. When hii > 3p/n, the i-th observation is considered to be
unusually far from the other observations in the space of explanatory variables.
That by itself is no indication that there is anything wrong with the observation.

3. Influence on the fitted model : One measures the influence of the i-th observation
on the fitted model through the Cook distance, which compares the vector of predicted
values obtained using all n observations, Ŷ , with the predicted values obtained
using all the observations except the i-th one, Ŷ(i) , and it also compares b and b(i) ,
which are the coefficients estimated with and without the i-th observation,

CDi = (Ŷ − Ŷ(i) )′(Ŷ − Ŷ(i) ) / (p s²R ) = (b − b(i) )′ X′X (b − b(i) ) / (p s²R ). (1.41)


The Cook distance for the i-th observation indicates how much the fitted model
changes when one removes that observation; when Ŷ and Ŷ(i) are very different,
CDi is large and the i-th observation is very influential.
What relates the Cook distance of an observation with the other two measures of
the degree of unusualness of an observation is the fact that:

CDi = (esi )² hii / ((1 − hii ) p). (1.42)

As a consequence of this, CDi will be large, and therefore (xi , yi ) will be very
influential, either because:

(a) (esi )2 is large, and therefore the i-th observation is an outlier, or because
(b) hii is large, and hence the observation is far in the X-space, and (esi )² is not too close to 0.

Some people plot CDi against (esi )², and CDi against hii , to find out which observations
are the most influential, and why they are influential. One should only check
these plots with the final models, once one has already checked that everything
else is fine. If one finds that there is an observation that is a lot more influential
than the others, one might want to refit the model without that observation to
check how the fitted model changes.
Another measure of the influence of the i-th observation is DFFITi = ŷi − ŷ(i) ,
and its standardized version. That compares the individual predicted value for the
i-th observation, and not the whole vector of predictions like the Cook distance does.
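As an illustration, the following Python sketch, on simulated data with one observation planted far in the X-space, computes the leverages and the Cook distances both from identity (1.42) and from definition (1.41) by refitting without each observation, and checks that the two agree; all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(9)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2 + 4 * x + rng.normal(0, 1, size=n)
x[0], y[0] = 25.0, 40.0                          # far in X-space and off the line

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
b = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b
e = y - y_hat
s2R = e @ e / (n - p)
e_std = e / np.sqrt(s2R * (1 - h))

# Cook distance from the identity CD_i = (e_std_i)^2 * h_ii / ((1 - h_ii) * p)
CD = e_std**2 * h / ((1 - h) * p)

# same thing from the definition: refit without observation i and compare Yhat's
CD_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    CD_def[i] = np.sum((y_hat - X @ b_i) ** 2) / (p * s2R)

print("identity and definition agree:", np.allclose(CD, CD_def))
print("most influential observation :", CD.argmax(), " CD =", CD.max().round(2))
print("its leverage h_ii =", h[CD.argmax()].round(2), " (threshold 3p/n =", 3 * p / n, ")")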

1.6.5 Examples

Simulated example

Example: Infant Mortality

Example: Price of an Apartment in Barcelona

Example: NFL Salary

Example: Galapagos

Class 5: Model selection, Cross Validation

1.7 Model selection

One starts with a response, yi , and r explanatory variables, xi = (x1i , . . . , xri ), where
some of the variables could be functions of other variables. The model selection problem
consists of finding the best subsets of explanatory variables to build a linear model for y.

The problem is challenging because the number of possible models is 2^r , which can be
very large, and because it is usually impossible to distinguish a single “best” model based
on the data available.

1.7.1 Variance versus bias trade off

When selecting subsets of explanatory variables, one can end up with too many explana-
tory variables, which leads to over-fitting your training sample, and one can also end up
with too few explanatory variables, which leads to under-fitting it.

Residual analysis might warn against under-fitting, but it will only help recognize over-
fitting if the training sample has replicates. When there are replicates in the training
sample, one can test for the lack of fit of the model by comparing the variability estimated
using replicates with the variability estimated using the residuals of the model; If the
first estimate is much larger than the second, it is an indication of over-fitting.

Over-fitting the training sample is tempting, because it makes the |ei | “artificially” small.
One can actually get SSR = Σ e²i as small as wanted by including enough explanatory
variables in the model. In the limit, one might actually feel tempted to interpolate by
using n − 1 explanatory variables (plus the constant), and in that way obtain SSR = 0.

Nevertheless, over-fitting is a bad idea because:

1. the more explanatory variables in the model, the larger the prediction variance
tends to be, and the wider the prediction intervals, or because

2. having too many explanatory variables in the model complicates the interpretation
of the model, in part because of the dependency among explanatory variables.

Including variables not needed over-explains the training sample used to fit the model,
forcing their residuals to be artificially small, but the model might not generalize well
when used to predict future (out-of-sample) observations.


Occam’s razor, or the parsimony principle, states that the simpler the explanation, the
more likely it is to be the right one. Given a subset of models for y with acceptable
residuals, one should prefer the models with the smallest number of variables.

Example: Barella and Sust.

Under-fitting happens when one misses important explanatory variables in the model.
To find the consequences of that, assume that the true model is:

Y = Xβ + Zγ + ϵ, (1.43)

but one does not know or can not measure the Z’s, and one fits:

Y = Xβ + ϵ (1.44)

instead. In that case the least squares estimate of β, b = (X′X)⁻¹X′Y , is such that:

E(b|X, Z) = (X′X)⁻¹X′E(Y |X, Z) = β + (X′X)⁻¹X′Zγ, (1.45)

and therefore b becomes a biased estimate for β. This bias can even lead to some bj
having a sign different from βj .
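As an illustration, the following Python simulation shows the bias in (1.45): samples are generated with a variable z that is correlated with x but left out of the fitted model, and the average fitted b1 is pulled away from β1 ; all the numbers are illustrative choices.

import numpy as np

rng = np.random.default_rng(10)
n, n_rep = 200, 2000
beta, gamma = np.array([1.0, 2.0]), 3.0          # true coefficients (illustrative)

b1_fitted = np.empty(n_rep)
for r in range(n_rep):
    x = rng.normal(size=n)
    z = 0.8 * x + rng.normal(0, 0.6, size=n)     # z correlated with x, but omitted
    y = beta[0] + beta[1] * x + gamma * z + rng.normal(0, 1, size=n)
    X = np.column_stack([np.ones(n), x])         # fitted model leaves z out
    b1_fitted[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

print("true beta1        :", beta[1])
print("average fitted b1 :", b1_fitted.mean().round(2))   # biased towards beta1 + 0.8*gamma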

The role of this bias is to capture the part of the relationship between the Z and Y
that can be salvaged thanks to the relationship between the Z and the X. That is good
for prediction, but it makes life very hard when trying to interpret the fitted model to
explain the relationship between X and Y .

In practice one needs to worry about under-fitting a lot, because important variables
will be missing most of the time.

This dilemma between using too many variables, and thus increasing the prediction
variance, and missing important variables, and thus introducing bias in the predictions
and elsewhere, is recognized as the variance versus bias trade off.

1.7.2 Model selection criteria

Maximizing R2 or minimizing SSR , which is the same, can not be used to select models
because that always leads one to select the model using all the variables available, which
will most likely over-fit the training sample. R2 and SSR measure goodness of fit, and
they should only be used to compare models of the same size or degree of complexity.


Model selection criteria need to reward goodness of fit while at the same time penalizing
model complexity. One way to do that is by maximizing R²adj or minimizing s²R , which
is equivalent because s²R = s²y (1 − R²adj /100).

There are many other model selection criteria, seeking a compromise between maximizing
goodness of fit and minimizing complexity. Depending on the relative weight given to
each one of these two goals, one ends up with different subsets of models. Model selection
criteria assessing goodness of fit through SSR and complexity through p are:

Cp = 2p − n + SSR /s²R , (1.46)

where s2R is the residual variance of the model with all the explanatory variables in it,

AIC = Const + 2p + n log(SSR ), (1.47)

and
BIC = Const + p log(n) + n log(SSR ), (1.48)
which are easy to compute and can be extended to other parametric models. More
general model selection criteria based on cross validation are presented next.
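Before moving on to cross validation, here is a small Python sketch that computes R²adj, Cp, AIC and BIC (with the additive constants dropped) for every subset of a small illustrative set of r = 4 candidate variables, in the spirit of best subsets regression; the data are an illustrative assumption.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(11)
n, r = 80, 4
Xvars = rng.normal(size=(n, r))
y = 1 + 2 * Xvars[:, 0] - 1 * Xvars[:, 1] + rng.normal(0, 1, size=n)   # x3, x4 irrelevant

def fit(cols):
    X = np.column_stack([np.ones(n)] + [Xvars[:, j] for j in cols])
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return X.shape[1], e @ e                      # p (nr. of coefficients) and SSR

SST = np.sum((y - y.mean()) ** 2)
p_full, SSR_full = fit(range(r))
s2R_full = SSR_full / (n - p_full)                # residual variance of the full model

print(" variables        R2adj      Cp     AIC     BIC")
for k in range(r + 1):
    for cols in combinations(range(r), k):
        p, SSR = fit(cols)
        R2adj = 100 * (1 - (SSR / (n - p)) / (SST / (n - 1)))
        Cp = 2 * p - n + SSR / s2R_full
        AIC = 2 * p + n * np.log(SSR)             # additive constant dropped
        BIC = p * np.log(n) + n * np.log(SSR)
        print(f" {str(cols):15s} {R2adj:7.1f} {Cp:7.1f} {AIC:7.1f} {BIC:7.1f}")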

1.7.3 Cross validation


Assessing the performance of a model through SSR = (yi − ŷi )2 , thus comparing yi
with a predicted value for yi computed using yi and xi is not “honest,” and it leads to
assessments that are too optimistic about the merit of the model. To avoid that, one
needs to compare yi with predictions of yi that do not “cheat” by using yi to predict yi .
That can be done in different ways.

The one-shot cross validation approach splits the original sample into a training sub-
sample, to be used to select and fit models, and a validation or testing subsample, to be
used to check the performance of models. The idea is to compare the values yi in the
validation subsample with their predictions based on the model selected and fitted with
the training subsample. This approach is cheap computationally, but the result depends
on the way in which the original sample is split in two.

Instead, leave-one-out cross validation compares yi with the prediction ŷ(i) obtained with
the model fitted with the subsample obtained by removing that i-th observation from
the original sample. Given that one repeats that comparison for all the observations in
the sample, this approach requires one to fit the model n times. One way to summarize
that comparison is through the PRESS statistic,

PRESS = SSR_{l.one.out} = Σ_{i=1}^n (yi − ŷ(i) )² = Σ_{i=1}^n (e^d_i )², (1.49)

which is analogous to the SSR but using deleted instead of regular residuals. A straightforward
generalization of the determination coefficient can be obtained by replacing SSR
by PRESS in the definition of R², leading to the definition of:

R²_Pred = 100 (1 − PRESS /SST ). (1.50)

Selecting models that maximize R²_Pred , and therefore minimize PRESS, is an appealing
alternative to the use of R²adj , AIC or BIC for that same task.
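As an illustration, the following Python sketch computes PRESS both by refitting the model n times and through the leave-one-out shortcut ei /(1 − hii ) for least squares (stated here without proof), and then R²_Pred; the simulated data and the inclusion of one irrelevant variable are illustrative choices.

import numpy as np

rng = np.random.default_rng(12)
n = 50
x1, x2 = rng.normal(size=(2, n))
y = 1 + 2 * x1 + rng.normal(0, 1, size=n)
X = np.column_stack([np.ones(n), x1, x2])

# explicit leave-one-out: refit n times, each time predicting the left-out y_i
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press += (y[i] - X[i] @ b_i) ** 2

# shortcut for least squares (assumed identity): y_i - yhat_(i) = e_i / (1 - h_ii)
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = (np.eye(n) - H) @ y
press_short = np.sum((e / (1 - np.diag(H))) ** 2)

SST = np.sum((y - y.mean()) ** 2)
print("PRESS (loop) =", round(press, 2), " PRESS (shortcut) =", round(press_short, 2))
print("R2_pred =", round(100 * (1 - press / SST), 1), "%  vs  R2 =",
      round(100 * (1 - e @ e / SST), 1), "%")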

Leave-p-out cross validation is a generalization that compares p of the n observed values
of y (the validation subsample) with the predicted values obtained with the model fitted
with the remaining n − p observations (the training subsample). Ideally this would be
repeated as many times as there are ways of splitting a sample of n into subsamples of n − p
and of p observations, and one would average the results through statistics that mimic
PRESS and R²_Pred . This is computationally prohibitive, and instead one uses a randomly
selected subset of the ways of splitting a sample of n into subsamples of p and of n − p.

One computationally cheaper and more efficient way of implementing cross validation
is through k-fold cross validation. This approach splits the original sample into k (ap-
proximately) equal sized subsamples. Then, it uses each one of these subsamples as a
validation subsample, with the remaining k − 1 subsets playing the role of the training
subsample, used to predict that validation subsample. Then, one computes:

SSR_kfold = Σ_{i=1}^n (yi − ŷ^{kfold}_(i) )², (1.51)

and:

R²_kfold = 100 (1 − SSR_kfold /SST ), (1.52)

which can again be used as model selection criteria. The default choice of k is 5 or 10,
and like in the one-shot cross validation case, the way in which the initial sample is split
into the k subsamples affects the results obtained.

When k is n, k-fold and leave-one-out cross validation coincide, and R²_Pred = R²_nfold .
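As an illustration, the following Python sketch implements k-fold cross validation, with k = 5, on the same kind of illustrative simulated data as before.

import numpy as np

rng = np.random.default_rng(13)
n, k = 100, 5
x1, x2 = rng.normal(size=(2, n))
y = 1 + 2 * x1 + rng.normal(0, 1, size=n)
X = np.column_stack([np.ones(n), x1, x2])

folds = np.array_split(rng.permutation(n), k)      # random split into k subsamples
ss_kfold = 0.0
for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)       # the other k-1 folds
    b = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
    ss_kfold += np.sum((y[fold] - X[fold] @ b) ** 2)

SST = np.sum((y - y.mean()) ** 2)
print("SSR_kfold =", round(ss_kfold, 2),
      "  R2_kfold =", round(100 * (1 - ss_kfold / SST), 1), "%")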


1.7.4 Tools for model selection

Before one starts the selection of the best subsets of x’s, one should first fit the model of
y with all the explanatory variables, and check the residuals to find out whether there
is a need to start transforming y and/or some of the x’s.

To help identify the best transformations for y or for the x’s, one can sometimes use
common knowledge about the phenomena being modeled.

Example: Wood.

Once the right scale for y has been identified, the tools that might help strike the right balance
between over-fitting and under-fitting are:

1. Fitting the full model, with all the explanatory variables available and using the
|ti | = |bi /sbi | statistics to simplify the model, by removing one variable at a time.
To try to remove q variables at once, one needs to use the F statistic instead.
The order in which one removes variables determines the model obtained in this
way. This approach checks at most r models out of all possible 2^r models, thus
likely missing models providing fits as good as the fit with the model obtained here.
Building a statistical model is like building a team, in the sense that one always
looks for the variable not in the model that best complements the variables already
in the model, which will often not be the variable most correlated with y. A useful
exercise is to monitor how the bj and tj = bj /sbj of a variable xj in the model
changes depending on the variables that team with xj in the model.
Example: Infant mortality.

2. Best subsets regression. It fits all 2^r possible models, and presents the best models
of each size. The goal is not finding a single best model, but a subset of models
acceptable for predicting and/or explaining y.
Given that the models being compared will be of different sizes, the comparison
will have to be made based on model selection criteria such as the ones presented above.
When checking how R^2_adj or s^2_R change with model size for the best models of each
size, one typically finds that R^2_adj increases and s^2_R decreases with increasing size
until it hits a maximum (minimum), and then it levels off. One should focus on the
subset of the smallest models with R^2_adj (s^2_R) close to their maximum (minimum).
One can also select models through the minimization of AIC, BIC, PRESS or
SSR_kfold, or through the maximization of R^2_Pred or R^2_kfold.


When the goal is to predict, one might settle with models a bit larger than when
the goal is to explain y.
When the goal is to explain y, what is important is to find what is common to all
models that provide good fits, which could be many.
Averaging the predictions of several models often performs better than using pre-
dictions based on a single model.
Best subsets regression can only be implemented if the number of explanatory
variables available, r, is not too large. When r = 10 there are 2^10 = 1024 models,
when r = 20 there are 2^20 ≈ 10^6 models, and when r = 30 there are 2^30 ≈ 10^9
models. With many more than 30 variables the number of models becomes too
large, and one can not fit and rank them all in a reasonable amount of time
(a small sketch of this enumeration, for small r, is given after this list).
Example: Somatic type
Example: Galapagos

3. Stepwise regression. It is an algorithm implementing a process similar to the one


described in point 1. One starts with a given subset of explanatory variables in the
model, and at each step one introduces (removes) variables, one at a time, based
on the |tj | statistic that they will have (had).
For that, one selects a threshold value for |t| above which one includes variables, and
a threshold value for |t| below which one removes variables. One includes (removes)
the xj that will have the largest (had the smallest) |tj | among all variables outside
(inside) the model, as long as these |tj |’s are larger (smaller) than the thresholds.
Most often one starts at step 0 either with the null model and goes forward, or
with the model with all r variables and goes backward.
Some implementations drive the stepwise algorithm using model selection criteria
like AIC, BIC or R^2_adj instead of using the |tj | statistics as described.
Unlike best subsets regression, there is no limit on the number of variables
that can be handled with stepwise regression. The problem with stepwise regression
is that it only checks a small subset of all 2^r possible models.

4. Shrinkage estimators. Ridge regression estimates the coefficients of the model


through the b^Rdg(λ) that minimizes:

SSR + λ Σ_{j=1}^{p−1} bj^2 ,

where λ ≥ 0 is a tuning parameter. When λ = 0 one obtains the least squares
estimates. The larger λ, the closer the b^Rdg_j(λ) shrink towards 0, the smaller the
variances of the bj and of the predictions, and the larger their biases.


Unlike the approaches described above, ridge regression yields a model that includes
all the variables that one starts with, because no b^Rdg_j(λ) will be 0 other than for
λ = ∞, when b0 is equal to ȳ and all the other bj are 0.
Instead, the lasso estimates the coefficients through the b^Lasso(λ) that minimizes:

SSR + λ Σ_{j=1}^{p−1} |bj |,

where λ ≥ 0 is again a tuning parameter. Different from what happens with
ridge regression, when λ is sufficiently large the penalty here forces some of the
coefficient estimates, b^Lasso_j(λ), to be equal to 0. Thus, the lasso can be used to do
variable selection in a way analogous to the three approaches described above.
Both for ridge and for the lasso, λ is usually chosen by cross validation, minimizing
PRESS(λ) as a function of λ (see the sketch after this list).

5. Cross validation. The model selection tools described above are tailored mostly
for linear and generalized linear modeling. Instead, the cross validation approach
to model selection applies to any type of statistical model, parametric or not.

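And as an illustration of point 4, a minimal sketch of ridge and lasso with scikit-learn, where the tuning parameter (called alpha in that library, and not necessarily on exactly the same scale as λ above) is chosen by cross validation; X and y are assumed to be already loaded, and the x's are standardized first because the penalties are not scale invariant.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

Xs = StandardScaler().fit_transform(X)                     # standardize the x's
lasso = LassoCV(cv=5).fit(Xs, y)                           # alpha chosen by 5-fold CV
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(Xs, y)
print("lasso alpha:", lasso.alpha_)
print("columns kept by the lasso:", np.flatnonzero(lasso.coef_))
print("ridge alpha:", ridge.alpha_, "ridge coefficients:", ridge.coef_)
```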

1.8 Use of categorical explanatory variables

1.8.1 Mix of continuous and categorical explanatory variables

In linear modeling, y needs to be continuous or well approximated by a model for a


continuous variable, but the x’s can be of any type. In fact, often one has explanatory
variables that are categorical.
Example: Somatic type

Example: Birth weight

Example: Forbes and Hooker, Weight and Height

Example: Vuelta ciclista a España

To build a linear model that includes continuous variables together with a categorical
variable with k categories, one can follow the three-step process below.

1. Choose the baseline category, and create k − 1 indicator variables, one for each one
of the remaining categories.

2. Include all k − 1 indicator variables in the model, together with the products of the
continuous explanatory variables with these indicator variables, called interactions.

3. Simplify the model by removing the terms that are not necessary, with the help of
tests based on the |tj | and F statistics, but without removing any linear term unless
one has first removed all the product terms that involve it.

To interpret the effect of categorical variables on the response, it is useful to write the
sub-models that result for each category combination, and compare their slopes.

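A minimal sketch of steps 1 and 2 for one continuous variable x and one categorical variable group with k categories, using hypothetical column names; step 3 would then proceed on the fitted model with the |tj| and F tests.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"y": y, "x": x, "group": group})                # hypothetical data
ind = pd.get_dummies(df["group"], prefix="g", drop_first=True,
                     dtype=float)                                  # the k-1 indicators
inter = ind.mul(df["x"], axis=0).add_suffix(":x")                  # product terms (interactions)
X = pd.concat([df[["x"]], ind, inter], axis=1)
fit = sm.OLS(df["y"], sm.add_constant(X)).fit()                    # full model of step 2
print(fit.summary())
# The formula interface gives the same model: smf.ols("y ~ x * C(group)", data=df).fit()
```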
Coding the k-category variable with 1 to k, and including that single coded variable in
the model as if the categorical variable were discrete, is one of the most frequent mistakes
made by unsophisticated modelers. What is wrong with doing that?

A categorical variable with k categories defines k groups, and it breaks down the initial
sample into k subsamples, one for each group. By fitting a model on the whole sample,
including all k − 1 indicator variables together with all the product terms, as described


above, the predicted values are the same as the ones obtained by splitting the sample
into these k subsamples, and fitting k linear models separately, one for each subsample.

One includes k − 1 indicator variables and not k, leaving the baseline category without
an indicator variable, because one does not need that extra one. Besides, if one includes k indicator
variables instead, then unless one removes β0 from the model one can not invert X′X, and
therefore one can neither compute b = (X′X)^{-1}X′Y nor V (b|X) = σ^2 (X′X)^{-1}.
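A toy numerical check of this last point: with the intercept column plus all k indicators, the columns of X are linearly dependent, so X′X is singular.

```python
import numpy as np

group = np.array([0, 0, 1, 1, 2, 2])                      # k = 3 categories (toy data)
D = np.eye(3)[group]                                      # all k indicator columns
X_bad = np.column_stack([np.ones(len(group)), D])         # intercept + k indicators
X_ok = np.column_stack([np.ones(len(group)), D[:, 1:]])   # intercept + k-1 indicators
print(np.linalg.matrix_rank(X_bad.T @ X_bad))             # 3 < 4 columns: singular
print(np.linalg.matrix_rank(X_ok.T @ X_ok))               # 3 = 3 columns: invertible
```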

If one does not include the product terms in the model, one assumes that the slopes of
the continuous variables are the same for the different categories (groups), and therefore
that the planes are parallel. If one can remove all these product terms because all the
corresponding |tj |’s are small, one learns that the planes for different groups are actually
parallel, but that is different from assuming parallelism from scratch.

Removing linear terms without having been able to remove first all the product terms
that involve them is not advisable, because that does not reduce the complexity of the
model and because when one does that, the predictions of the model do not coincide
anymore with the ones obtained by splitting the sample into k subsamples and fitting one
linear model to each one of them. That also explains why one rarely removes the constant
term, even though |t0 | = |b0 |/sb0 might be small, and why one rarely removes the xj term
of a model that includes the xj^2 term, irrespective of the value of |tj | = |bj /sbj |.

The F statistic used to check whether one can remove q variables at once is especially
useful here, because it allows one to check whether one can remove at once all the product
terms and/or all the indicator variables involving a given categorical variable.

Sometimes one does not have enough observations to fit the full model, with all the
indicator variables and all the interactions. In that case, one can guess which interaction
terms are less likely to be active, and skip using them in the model. An alternative is to
identify categories that might behave similarly and aggregate them, thus reducing the
number of categories and of indicator variables and of interaction terms in the model.

When one has categorical variables, best subsets, stepwise and shrinkage methods should
be used with care, taking advantage of the possibility of forcing linear terms to be in all
the models while one is checking which interaction terms will be needed.


1.8.2 Comparison of means and factorial experiments

The comparison of two means based on two independent samples, and the comparison
of k means based on k independent samples, can be handled by fitting a linear model
with a single explanatory variable that is categorical, and therefore a model with k − 1
indicator variables and no product terms.

By fitting the model by least squares one reproduces the standard comparison of means
analysis, and one can also use all the tools available for linear modeling.

Example: Fertilizer for tomatoes

Example: Car speed

The analysis of the data obtained from carrying out two-level factorial experiments can
also be carried out by fitting a linear model. When this analysis does not include any
variables other than the ones controlled at two levels in the experiment, by re-coding
the two levels or categories of all variables through −1 and 1 instead of 0 and 1, all the
columns of the X matrix become orthogonal, and the X′X matrix becomes diagonal,
which brings computational, analytical and interpretational advantages.

In particular, when the columns of X are orthogonal and coded through −1 and 1, the matrix
(X′X)^{-1} is 1/n times the identity matrix, and that means that bj depends only on xj , that V (bj |X) is the
same for all j, and that Cov(bj , bk |X) = 0 for all j, k.
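A small sketch of this orthogonality for a 2^3 factorial experiment coded with −1 and 1, with the intercept plus the three main effect columns:

```python
import numpy as np
from itertools import product

runs = np.array(list(product([-1, 1], repeat=3)))     # the 8 runs of a 2^3 design
X = np.column_stack([np.ones(len(runs)), runs])       # intercept + three main effects
print(X.T @ X)                                        # 8 times the identity matrix
print(np.linalg.inv(X.T @ X))                         # (1/8) times the identity matrix
```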

1.9 Model interpretation

Linear models are built either to predict y or to understand the relationship between
y and certain variables with the goal of explaining y. The use of the fitted model for
prediction is straightforward. Instead, explaining y by interpreting fitted models requires
a lot more care. There are four reasons why that interpretation is especially difficult.

1. There is usually not a single best model that fits the available sample clearly bet-
ter than all other models. The model selection process typically identifies several
“acceptable” models, each with a different set of variables and of estimated co-
efficients, and each telling a different story about the relationship between y and
x’s. One has to create a single story based on all these different and sometimes


contradictory partial stories about y, focusing on what is common to most of them.

2. Most of the time there is statistical dependency between the explanatory variables
in the model (recognized as collinearity), and if one modifies the value of a variable,
xj , in the model, the value of other xk 's also in the model will vary as well. Hence,
the estimated effect of varying xj on y will depend on the coefficient of xj , bj , and
on the coefficients bk of the other xk in the model that are correlated with xj .
Note that bj estimates how much the response would increase if xj increased by
one unit and all the other variables in the model stayed fixed.
Example: Price of apartment
Example: Salary NFL
Example: Galapagos
Example: Somatic type
Collinearity is a fact of life when one deals with observational data, and there is
nothing wrong with it. It just complicates the interpretation of the model.
If the sample is representative of the population, one can handle this issue by
fitting a model for each explanatory variable as a function of the other explanatory
variables in the model, and finding which variables are related with which.
To measure the degree of collinearity of xj with the rest of the variables in the model,
one can use the determination coefficient R^2_j of the linear model of xj as a function
of the other p − 2 explanatory variables in the model. Another measure of the
degree of collinearity of xj is the variance inflation factor of xj ,

VIF(xj ) = 1/(1 − R^2_j),   (1.53)

which is equal to 1 when xj is not related with the other variables, and is ∞
when xj is a linear combination of them. The larger VIF(xj ) and R^2_j, the stronger
the statistical dependency between xj and the rest, and the harder it is to interpret
the role of xj when explaining y (a small sketch computing VIF(xj ) through these
auxiliary regressions is given at the end of this section). The name VIF comes from the fact that:

Var(bj |X) = VIF(xj ) σ^2 / Σ_{i=1}^{n} (xij − x̄j )^2 .   (1.54)

3. We know that if important variables, Z, are missing in the model, the coefficients
of the model fitted without them are biased, because:

E(b|X, Z) = β + (X′X)^{-1}X′Zγ.   (1.55)

Given that one will frequently have missing variables, one should expect that many
coefficients of the fitted models will have a certain bias.


This bias appears because bj adapts itself to explain part of what is left unexplained
by the missing Z through the xj that are present in the model. That is good for
prediction, but it greatly complicates the interpretation of the fitted model.
Example: Price of apartment (bis)
Example: Cholesterol
Example: Sex and salary
This phenomenon is the reason why one should make sure that one includes in the
model all the variables that might explain the response, and not just a few, even
though you might be interested only in the role played by these few. Including
explanatory variables in the model that are not of direct interest is necessary to
control for the effect of possible confounders that might bias the coefficients (the
effects) of the variables of interest if they were missing in the model.
This problem is a lot more difficult to handle than collinearity, because one can
learn about the existence of collinearity from the data, but one usually does not
know whether there are variables missing, and which are the missing variables.
One exercise that helps understand the repercussion of missing variables is to
remove an important variable from a model and check how the coefficients of the
remaining variables change.

4. To explain y, one needs to identify variables that cause y to change when they
change. The problem is that statistical models identify relationships between y
and xj ’s, but they can not tell whether these relationships are causal or not.
Non-causal relationships can be used to predict, but they are not that useful when
trying to understand y. The fact that xj ends up in the model for y does not
necessarily imply that the relationship between y and xj is causal. Causal and non-
causal relationships are practically undistinguishable based only on the information
in the data, and yet distinguishing them is crucial for the understanding of y.
Given that “counterfactual” experiments are almost never possible, the only way
to rule out that a given relationship is causal is to identify the true causal relationship.
What makes it so difficult to distinguish causality from pure correlation is the fact
that real causes are often missing variables (z) that drive both y together with
some of the explanatory variables available, xj . The fact that both y and xj ’s are
driven by the hidden (lurking) z, induces relationships between y and xj that make
it look like as if xj was explaining y. As soon as z is included in the model, xj
stops being necessary there, but most often the missing z does not end up in the
model because it is unknown. This hidden variable, z, might be time.


The third and fourth difficulties in this list are essentially the same one, because non-
causal relationships appear when real causal “explanatory” variables are missing,
and surrogate variables fill in for them to salvage part of what is missing.
Example: Lung cancer death rate
Example: Road accidents
Example: Crime rate
When y and xj are related, it could be either because xj causes (drives) y, because y
drives xj , or because a hidden variable z drives y and xj at the same time. Data
alone can not tell which one of these three explanations is the correct one.

Putting everything together, bj explains the role played by xj together with part of the
role played by other xk 's that are also in the model and correlated with xj , and together with
the role played by all the variables zk not in the model but driving both xj and y.
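As announced above, a small sketch computing VIF(xj ) through the auxiliary regression of each xj on the remaining explanatory variables; X is assumed to be the n × (p − 1) numpy array of x's.

```python
import numpy as np

def vif(X):
    """VIF(x_j) = 1/(1 - R_j^2) from the regression of x_j on the other columns."""
    n, r = X.shape
    out = []
    for j in range(r):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        r2_j = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2_j))            # (1.53)
    return np.array(out)
```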

Chapter 2

Normal Non-linear Model

When modeling the relationship between yi and xi = (x1i , . . . , xpi ), linear models are of-
ten useful because of the flexibility that comes with transforming y and/or the x’s. That
allows one to handle instances where non-linearities in the mean of y can be well ap-
proximated through nonlinear transformations of yi and linear combinations of nonlinear
transformations of x’s.

But there are instances where together with a data set one gets an inherently non-
linear model and the request to fit that model and check whether the model is a good
approximation for the data. That non-linear model is often suggested by an underlying
theory put in place to explain the relationship between y and the x's.

2.1 Normal non-linear model

A response variable, yi , and a set of explanatory variables, xi = (x1i , . . . , xpi ), are said
to follow a normal non-linear model when they are conditionally independent and:

yi |xi ∼ Normal(E(yi |xi ) = f (xi ; θ), V ar(yi |xi ) = σ 2 ), (2.1)

where f (xi ; θ) is a given non-linear function of xi that depends on a set of unknown


parameters, θ = (θ1 , . . . , θk ). That can also be stated as saying that:

yi = E(yi |xi ) + ϵi = f (xi ; θ) + ϵi , (2.2)

where ϵi are independent Normal(0, σ 2 ) distributed.


The only assumption that changes with respect to the linear model is that E(yi |xi ) =
f (xi ; θ) now is an intrinsically non-linear function of xi in the sense that at least one
of its derivatives with respect to θ is a function of θ. Hence here the model can not be
posed as a linear model by transforming y and x’s.

Example: Rumford

Ti = E(Ti |ti ) + ϵi = θ1 − θ2 exp (−θ3 ti ) + ϵi , (2.3)

which is an example of the asymptotic regression (convex) model.

Example: Puromycin

vi = E(vi |ci ) + ϵi = θ1 ci /(θ2 + ci ) + ϵi ,   (2.4)

which is an example of the Michaelis-Menten model. Even though in this example

1/E(vi |ci ) = 1/θ1 + (θ2 /θ1 )(1/ci ),   (2.5)

1/vi can not be written as a linear model of 1/ci .

Example: Cortisol

yi = E(yi |xi ) + ϵi = θ1 + (θ2 − θ1 )/(1 + exp ((xi − θ3 )/θ4 )) + ϵi ,   (2.6)

which is an example of the logistic growth model.

2.2 Model fit

Once one has gathered a sample of observations, one needs to fit the model:

yi = f (xi ; θ̂) + ei = ŷi + ei , (2.7)

where ei = yi − ŷi is the residual or prediction error.

Given that the assumptions about the noise are the same as for the linear model, it
makes sense to use again the least squares criterion and estimate θ̂ by minimizing:

SSR (θ̂) = Σ_{i=1}^{n} ei^2 = Σ_{i=1}^{n} (yi − f (xi ; θ̂))^2 ,   (2.8)


which requires numerically solving a system of as many non-linear equations as there are parameters in θ.

To help solve this minimization problem, statistical packages often require an initial guess
value for θ̂. Attaching meaning to the parameters, or approximating the non-linear model
through a linear model, often helps obtain these initial guesses.

Example: Rumford
Ti = θ̂1 − θ̂2 exp (−θ̂3 ti ) + ei , (2.9)
where (θ̂1 , θ̂2 , θ̂3 ) is found by minimizing:

SSR (θ̂1 , θ̂2 , θ̂3 ) = Σ_{i=1}^{n} (Ti − (θ̂1 − θ̂2 exp (−θ̂3 ti )))^2 .   (2.10)

To provide initial guess values for θ̂ one can use the fact that θ1 = E(T |t = ∞) is the
horizontal asymptote of E(T |t), and it should be close to the room temperature. Also,
E(T |t = 0) is θ1 − θ2 , and therefore −θ2 should be close to E(T |t = 0) − E(T |t = ∞).
The value for θ3 is harder to guess.

Example: Puromycin

vi = θ̂1 ci /(θ̂2 + ci ) + ei ,   (2.11)
where initial guesses for θ̂1 and θ̂2 can be found using the fact that θ1 = E(v|c = ∞) is
the horizontal asymptote of E(v|c) and θ2 is the value of c such that E(v|c = θ2 ) = θ1 /2.
One can also obtain initial guess values by fitting the linear model 1/vi = b0 +b1 (1/ci )+ei
and solving for b0 = 1/θ̂1 and b1 = θ̂2 /θ̂1 .

Example: Cortisol

yi = θ̂1 + (θ̂2 − θ̂1 )/(1 + exp ((xi − θ̂3 )/θ̂4 )) + ei ,   (2.12)

where θ1 and θ2 are the floor and the ceiling of E(y|x), and θ3 is such that E(y|x = θ3 ) =
(θ1 + θ2 )/2, and it is also the inflexion point for E(y|x).

2.3 Confidence intervals and tests for coefficients

To answer questions about the theoretical model, θ, based on the fitted model, θ̂, one uses
the fact that if the model is correct and one fits it by least squares, then the distribution
of θ̂j is approximately Normal(θj , V ar(θ̂j )).


Hence one can obtain approximate 100(1 − α)% confidence intervals for θj through:

(θ̂j − t^{α/2}_{n−p} sθ̂j , θ̂j + t^{α/2}_{n−p} sθ̂j ).

To obtain approximate 95% intervals, one can use t^{.025}_{n−p} ≈ 2.

One can choose between θj = a and θj ̸= a by checking whether θj = a is inside the


95% confidence interval for θj . One can also do it by checking whether the number of
standard deviations between θ̂j and a, |tj | = |θ̂j − a|/sθ̂j , is much larger than 2 or not.
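A sketch of this fitting and inference process for the Puromycin (Michaelis-Menten) example, assuming the concentrations c and velocities v are already loaded as numpy arrays; the initial guesses come from the linearization 1/v ≈ b0 + b1 (1/c) discussed above.

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(c, theta1, theta2):
    return theta1 * c / (theta2 + c)

b1, b0 = np.polyfit(1 / c, 1 / v, 1)                 # slope and intercept of 1/v on 1/c
theta0 = [1 / b0, b1 / b0]                           # theta1 = 1/b0, theta2 = b1/b0

theta_hat, cov = curve_fit(michaelis_menten, c, v, p0=theta0)
se = np.sqrt(np.diag(cov))                           # the s_theta_hat_j
print(theta_hat)                                     # least squares estimates
print(theta_hat - 2 * se, theta_hat + 2 * se)        # approximate 95% intervals
```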

2.4 Model checking

In the setting of this chapter, the main purpose is usually to check whether the non-
linear model is consistent with the data. Also, confidence intervals for θ and prediction
intervals will make sense only if the model is correct.

Like in linear modeling, here the model will be checked by looking at whether there is
any information left in the residuals, ei = yi − ŷi . If the model is correct, residuals should
be close to normally distributed with mean 0 and variance close to constant.

One checks whether that is the case by plotting the residuals against the fitted value
and the normal probability plot of the residuals. Plotting residuals against variables
of interest might also help, and when observations are ordered in time or in space one
should check whether there is any time or space dependency left in the residuals.

If the model is wrong, one needs to go back to the theoretical modeling stage, find out
what was wrong with the hypotheses made, fix them, and suggest a new model.

One can also look for alternative non-linear models compatible with what is observed.
One can compare the fit of different non-linear models with the same number of pa-
rameters through their SSR . To compare non-linear models with a different number of
parameters, one needs to resort to cross validation or to model selection criteria combin-
ing SSR with some term that penalizes the number of parameters.

Example: Length and age

Example: Peptides

Chapter 3

Categorical Response Models

Like in linear modeling, one starts with a set of p − 1 explanatory variables, xj , but here
the response, y, is categorical. We first present the case of k = 2 categories, handled with
binary logistic models, and then the general case, handled with nominal logistic models.

Poisson log-linear models, useful for count response data, are also presented.

3.1 Binary logistic models

When the response is categorical with two categories, like success/failure, normal linear
models do not apply because the normality assumption does not make sense anymore.

The next five examples are all of this kind. In the first two the response is presented
in the 0/1 format, in the third and fourth examples the response is in the event/trial
format, and in the last example it is in the response/frequency format.

Example: Kyphosis

Example: Birthweight

Example: Nicotine

Example: University Plans


Example: Death Penalty

For these problems, one resorts to the binary logistic model, which is a statistical model
that assumes that the number of successes, yi , for a given value of xi = (x1i , . . . , xp−1i ),
are conditionally independent and distributed as:

yi |xi ∼ binomial(ni , π(xi )), (3.1)

where ni is the number of binary observations made at xi , and where the probability of
success or of 1’s at xi is modeled as:
π(xi ) = e^{β0 +β1 x1i +...+βp−1 xp−1i} / (1 + e^{β0 +β1 x1i +...+βp−1 xp−1i}),   (3.2)

where β = (β0 , . . . , βp−1 ) is only known to belong to R^p . The relationship between π(xi )
and xi is not linear, and such that 0 ≤ π(xi ) ≤ 1 for all xi . When this model holds:

log (π(xi )/(1 − π(xi ))) = log (Pr(1|xi )/ Pr(0|xi )) = log Odds(xi ) = β0 + β1 x1i + . . . + βp−1 xp−1i ,   (3.3)
which will be useful when interpreting the coefficients of the model.

By using this statistical model one assumes that:

1. E(yi |xi ) = ni π(xi ),

2. V (yi |xi ) = ni π(xi )(1 − π(xi )),

3. yi |xi ∼ binomial,

4. yi |xi are independent.

The variance of yi is not constant, because it depends on π(xi ) and hence on xi . This
variance is largest close to π(xi ) = 1/2, and it is smallest when π(xi ) is close to 0 or 1.
This is the default population (theoretical) model for binary response data, and it is
recognized as the logistic model with the logit (canonical) link. Instead, under the probit link,

π(xi ) = Φ(β0 + β1 x1i + . . . + βp−1 xp−1i ), (3.4)

where Φ(·) is the cumulative distribution function for the Normal(0, 1), and under the
Gompit (complementary log-log) link,

π(xi ) = 1 − exp (− exp (β0 + β1 x1i + . . . + βp−1 xp−1i )), (3.5)

which are two functions of xi that are also bounded between 0 and 1. An advantage of
using the logit instead of the probit or the Gompit links is that under the logit link the
coefficients can be interpreted through log odds ratios.


3.2 Model fit

In fixed effects models one assumes that the value of β ∈ R^p that generates the data is
unique and unknown. Once one collects data, one needs to compute the fitted model,

π̂(xi ) = e^{b0 +b1 x1i +...+bp−1 xp−1i} / (1 + e^{b0 +b1 x1i +...+bp−1 xp−1i}),   (3.6)

which will be known, but will not be unique because it depends on the sample available.

Defining residuals as ei = yi − ŷi and fitting the model by minimizing the sum of the
squares of these residuals is not an efficient way to proceed here. Least squares treats
observations as if they were all equally reliable, which makes sense when fitting normal
models with constant variance, but here not all observations are equally reliable.

Since the observations with large or small π(xi ) have smaller variances, they are more
reliable and should get more weight than the observations with π(xi ) close to .5. That is
why here one usually fits models using either one of two alternative criteria that minimize
distances between observed and predicted values other than their Euclidean distance.

1. Minimize the Pearson statistic,

X^2 (Y, Ŷ ) = Σ_{i=1}^{n} (yi − ni π̂(xi ))^2 / (ni π̂(xi )(1 − π̂(xi ))) = Σ_{i=1}^{n} (e^P_i )^2 ,   (3.7)

as a function of b = (b0 , . . . , bp−1 ), where ŷi = ni π̂(xi ) is the predicted value for yi ,
and where the denominator is an estimate of the variance of yi . Hence, here one
minimizes a weighted sum of the squares of the residuals, with the inverse of the
variance as the weight. The larger the variance, the less reliable the observation,
and the smaller the weight. Note that this is in fact minimizing the sum of the
square of standardized residuals.

2. Minimize the (residual) deviance statistic:

Deviance(Y, Ŷ ) = ResDev = 2 Σ_{i=1}^{n} ( yi log (yi /(ni π̂(xi ))) + (ni − yi ) log ((ni − yi )/(ni (1 − π̂(xi )))) ),   (3.8)
as a function of b = (b0 , b1 , . . . , bp−1 ), where yi and ni − yi are the number of 1’s
and of 0’s, while ni π̂(xi ) and ni (1 − π̂(xi )) are the predicted number of 1’s and of
0’s. Hence, here one measures distance between observed and predicted values by
assessing how far their ratio is from 1. The estimates obtained by minimizing the
deviance, coincide with the maximum likelihood estimates of β.


Solving these two minimization problems requires solving a set of p non-linear equations.
By default, statistical packages minimize the deviance.

The deviance statistic, also called residual deviance, plays the same role played by SSR
for linear models. In particular, by adding one new variable in the model the residual
deviance always decreases, irrespective of whether that variable is useful or not, and
therefore the residual deviance is useful as a model fit criterion, but not as a model selection
criterion. By analogy one can also define a determination coefficient, R^2 , measuring the
percentage of the total deviance captured by the model.
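A minimal sketch of fitting a binary logistic model by maximum likelihood (equivalently, minimizing the deviance) with statsmodels, for data in the 0/1 format; X and y are assumed to be already loaded. For the event/trial format one can instead pass the two columns (yi , ni − yi ) as the response.

```python
import statsmodels.api as sm

X1 = sm.add_constant(X)                              # add the intercept column
fit = sm.GLM(y, X1, family=sm.families.Binomial()).fit()
print(fit.params)                                    # the b_j
print(fit.bse)                                       # the s_bj
print(fit.deviance)                                  # the residual deviance
pi_hat = fit.predict(X1)                             # the estimated pi_hat(x_i)
```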

3.3 Inference about the coefficients of the model

To answer questions about βi through bi , one uses the fact that, if the model is correct,
the distribution of the bj ’s is well approximated through:

bj |x ∼ Normal(βj , V (bj )). (3.9)

Hence one can approximate 100(1 − α)% confidence intervals for βj through:

(bj − t^{α/2}_{n−p} sbj , bj + t^{α/2}_{n−p} sbj ),   (3.10)

where s^2_bj is an estimate of V (bj ) and where t^{.025}_{n−p} can be approximated by 2 when
computing 95% intervals.

One can decide whether βj = a by checking whether a is in the confidence interval for
βj , or using the number of standard deviations separating bj from a, |bj − a|/sbj . The
statistic zj = bj /sbj used to decide whether βj could be 0 is labeled zj instead of tj .

One can compare one binary logistic model, M0 , nested into another one, M1 , through
the same type of F test presented for linear models, after replacing the residual sum of
squares of the two models, SSRi , by their residual deviances, ResDevi .

Another way of choosing between M0 and M1 , where M0 is the restricted and M1 the
unrestricted model, is through the likelihood ratio test, which relies on the fact that if the
simpler model, M0 , is correct, then ResDev0 − ResDev1 is χ^2_q distributed, where q is the
difference between the number of parameters of M1 and of M0 .

One result not available for linear models, which is useful to determine whether a logistic
model is acceptable or not, states that if the model is correct, the residual deviance is
approximately χ^2_{n−p} distributed, with an expected value of n − p.


Hence, when the residual deviance is much larger than what is expected from a χ^2_{n−p}
distribution, the model is not correct, often because important variables are missing in
the model. Statistical packages provide the tail area (p-value) for the residual deviance
relative to the χ^2_{n−p} distribution. The smaller that tail area, the less trustworthy the model.

3.4 Model checking

One can check whether a logistic model is acceptable by checking whether its residual
deviance is consistent with a χ^2_{n−p} distribution. If the residual deviance of a model is not
“too far” from E(χ^2_{n−p}) = n − p, one knows that the model is acceptable. One checks
the residuals only when the residual deviance is “too far” from n − p for the model to
be valid. In that case, the residual analysis helps find where the model fails.

To carry out a residual analysis, one first needs to agree on a definition of residual. If
the model is fit through the Pearson statistic, it is natural to use Pearson residuals,

e^P_i = (yi − ni π̂(xi )) / √(ni π̂(xi )(1 − π̂(xi ))),   (3.11)

which makes the Pearson statistic into the sum of the squares of Pearson residuals. If
one fits the model by minimizing the deviance, then one uses deviance residuals:

e^D_i = sign(yi − ni π̂(xi )) √( 2( yi log (yi /(ni π̂(xi ))) + (ni − yi ) log ((ni − yi )/(ni (1 − π̂(xi )))) ) ),   (3.12)

which makes the deviance statistic into the sum of the squares of deviance residuals,

Deviance(Y, Ŷ ) = ResDev = Σ_{i=1}^{n} (e^D_i )^2 .   (3.13)

If the model is not correct because the deviance (Pearson) statistic is too large to be χ^2_{n−p}
distributed, it is because the squares of some of its residuals are too large. By identifying
which residuals are “too large”, one finds where the model is failing and one may
learn ways to improve it. If the model is correct, one expects about 95% of its residuals
to fall within (−2, 2) (here residuals and their standardized version are similar), and
hence any observation with a residual far from this range will be an outlier.

One can define deleted deviance and Pearson residuals by replacing π̂(xi ) by the leave-
one-out cross validated estimate of it, and the sum of the squares of these deleted
residuals leads to PRESS-like and R^2_Pred-like model selection criteria.


Finally note that the normal plot of residuals, and the plots of residuals against π̂(xi ) or
against log (π̂(xi )/(1 − π̂(xi ))), are not as useful as in linear modeling; the residuals for
all the 1's are positive, the residuals for all the 0's are negative, and the aspect of
these plots when the model is correct is different from the one in linear modeling.

Example: Challenger

Example: Eagles

3.5 Model selection

All the discussion around model selection for linear models applies here as well. In par-
ticular, one can use model selection criteria striking a compromise between maximizing
goodness of fit, measured through the residual deviance, and minimizing complexity, mea-
sured through p, like R^2_adj, AIC or BIC. One can also resort to cross validation based
model selection criteria like PRESS, R^2_Pred or R^2_kfold.

Best subsets and stepwise regression methods, together with shrinkage estimation meth-
ods are also useful as model selection tools for logistic regression.

3.6 Model interpretation

All the discussion about the difficulties of interpreting linear models applies as well when
it comes to interpreting fitted logistic models. In particular, one needs to be careful
because different models often provide reasonable fits for your sample and each model
tells a different story, because the dependency among explanatory variables needs to be
taken into account, because missing relevant variables introduce biases in the estimates
of the coefficients βj ’s of the variables in the model, and because data alone can not
distinguish between correlation and causality.

As in linear modeling, bj captures the role played by xj , the role played by other xk in
the model correlated with xj , and the role played by variables zr not in the model but
driving both xj and y. It will always be difficult to tell all these effects apart.

Here, on top of these difficulties, one needs to face the fact that π̂(xi ) is not linear in xi .


We illustrate what bj means under the binary logistic model with the logit link through
the two explanatory variables case, where

π̂(x1 , x2 ) = e^{b0 +b1 x1 +b2 x2} / (1 + e^{b0 +b1 x1 +b2 x2}),   (3.14)

and so where:

e^{b2} = [π̂(x1 , x2 + 1)/(1 − π̂(x1 , x2 + 1))] / [π̂(x1 , x2 )/(1 − π̂(x1 , x2 ))] = Ôdds(x1 , x2 + 1) / Ôdds(x1 , x2 ),   (3.15)

and:

b2 = log ( Ôdds(x1 , x2 + 1) / Ôdds(x1 , x2 ) ).   (3.16)

So bj is the estimated log odds ratio when xj increases by one unit and the values of all
the other explanatory variables do not change.

There is a one to one relationship between probability and odds. When π(x) = 0
then Odds(x) = 0. When π(x) = 1/5 then Odds(x) = 1/4. When π(x) = 1/2 then
Odds(x) = 1. When π(x) = 4/5 then Odds(x) = 4. When π(x) = 1 then Odds(x) = ∞.

When bj > 0, by increasing the value of xj while keeping the other variables in the model
constant, one increases the Odds(x) and the probability of success, π(x). When bj < 0,
the odds and the probability of success decrease with increasing xj .
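A small worked illustration with a hypothetical coefficient b2 = 0.7 and a hypothetical starting probability of 0.30:

```python
import numpy as np

b2 = 0.7
print(np.exp(b2))                  # odds ratio, about 2.01
pi = 0.30
odds = pi / (1 - pi)               # about 0.43
new_odds = odds * np.exp(b2)       # about 0.86 after increasing x2 by one unit
print(new_odds / (1 + new_odds))   # new probability, about 0.46
```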

One needs training to feel comfortable interpreting odds, odds ratios and log odds ratios,
and hence to understand what a given value for bj means. In the special case where
π̂(x1 , x2 ) is very small, like when modeling the probability of a rare disease, the inter-
pretation is easier because in that case:

e^{b2} ≈ π̂(x1 , x2 + 1) / π̂(x1 , x2 ),   (3.17)

and:

b2 ≈ log ( π̂(x1 , x2 + 1) / π̂(x1 , x2 ) ).   (3.18)

The interpretation of the coefficients under the probit and the Gompit links is even more
complicated, because it can not be made through log odds ratios.


3.7 Contingency tables and logistic model

A J × K contingency table is a table of counts with J rows and K columns. This type of
data is extremely frequent in psychology, sociology, political science and related areas.

A useful way of looking at J ×K contingency tables is to recognize that rows and columns
represent two categorical variables with J and K categories each. The purpose of the
analysis is to find out whether these two variables are related or not, and if they are, to
learn about the kind of relationship between them.

Sometimes one has J ×K ×I tables, that is, I different J ×K contingency tables. In that
case, one has three categorical variables with J, K and I categories, and the purpose is
to learn about the relationship between one of them and the other two.

Example: Death Penalty

Example: Hospitals

Example: Position and kind of tumor

In any such example, one just needs to recognize which categorical variable plays the role
of response, and which one or ones play the role of explanatory variables, fit a binary or
a nominal logistic model, and interpret it.

If the response is the column factor, then rows and columns are not related, not asso-
ciated, or independent, when the row category does not have any effect on the column
probabilities, and therefore when the row “probability profiles” for all rows are the same.

3.8 Nominal logistic model

When the response is categorical with more than two categories, one resorts to the
nominal logistic model, which is a straightforward generalization of the binary logistic
model. We illustrate it through the special case where y has three categories, A/B/C.

Under that model one assumes that the yi given xi are conditionally independent and
multinomially distributed, where the probabilities of A, B and C given xi = (x1i , . . . , xp−1i ) are:


Pr(y = A|xi ) = e^{β^A_0 +β^A_1 x1i +...+β^A_{p−1} xp−1i} / (1 + e^{β^A_0 +β^A_1 x1i +...+β^A_{p−1} xp−1i} + e^{β^B_0 +β^B_1 x1i +...+β^B_{p−1} xp−1i}),   (3.19)

Pr(y = B|xi ) = e^{β^B_0 +β^B_1 x1i +...+β^B_{p−1} xp−1i} / (1 + e^{β^A_0 +β^A_1 x1i +...+β^A_{p−1} xp−1i} + e^{β^B_0 +β^B_1 x1i +...+β^B_{p−1} xp−1i}),   (3.20)

Pr(y = C|xi ) = 1 / (1 + e^{β^A_0 +β^A_1 x1i +...+β^A_{p−1} xp−1i} + e^{β^B_0 +β^B_1 x1i +...+β^B_{p−1} xp−1i}),   (3.21)
with Pr(y = A|xi ) + Pr(y = B|xi ) + Pr(y = C|xi ) = 1.

3.9 Count response model

When the response, y, is a count variable supported in {0, 1, 2, . . .}, and the counts are
not very small, the normal linear model is usually a good approximation for the
relationship between the logarithm of y and the x's.

Example: Galapagos

As an alternative, one can resort to loglinear Poisson models, assuming that:

yi |xi ∼ Poisson(λ(xi ) = E(yi |xi )), (3.22)

where xi = (x1i , . . . , xp−1i ) and where:

λ(xi ) = E(yi |xi ) = e^{β0 +β1 x1i +...+βp−1 xp−1i},   (3.23)

with β = (β0 , . . . , βp−1 ) only known to belong to R^p . The relationship between λ(xi ) =
E(yi |xi ) and xi is such that λ(xi ) ≥ 0 for all xi . When this model holds,

log E(yi |xi ) = β0 + β1 x1i + . . . + βp−1 xp−1i , (3.24)

and V (yi |xi ) = E(yi |xi ).

This model is typically fit by minimizing the deviance:

Deviance(Y, Ŷ ) = 2 Σ_{i=1}^{n} yi log (yi /Ê(yi |xi )),   (3.25)

as a function of b = (b0 , . . . , bp−1 ), which is equivalent to maximizing the likelihood


function.


The model is checked through the analysis of deviance residuals, defined as:

e^D_i = sign(yi − ŷi ) √( 2( yi log (yi /ŷi ) − (yi − ŷi ) ) ).   (3.26)

When most of the counts are not too small, the loglinear Poisson model fit is often similar
to the corresponding normal linear model fit for log yi . By switching from the normal
assumption to the Poisson assumption one better matches the support of yi , but then
one assumes that V ar(yi |xi ) = E(yi |xi ), which is not the case in many examples.

Loglinear Poisson models can also be used for the analysis of contingency tables, leading
to analyses equivalent to the ones based on the nominal logistic model.
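A minimal sketch of fitting a loglinear Poisson model with statsmodels, together with the normal linear model for log y it is often compared with; X and the count response y are assumed to be already loaded.

```python
import numpy as np
import statsmodels.api as sm

X1 = sm.add_constant(X)
poisson_fit = sm.GLM(y, X1, family=sm.families.Poisson()).fit()
print(poisson_fit.params, poisson_fit.deviance)
lam_hat = poisson_fit.predict(X1)            # lambda_hat(x_i) = exp(b0 + b1 x_1i + ...)

normal_fit = sm.OLS(np.log(y), X1).fit()     # counts assumed not too small (no zeros)
print(normal_fit.params)
```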
