
Review of Linear Models I

Presidency University

February, 2025
Guessing the value of a variable

I Suppose we need to guess a single value for a quantitative
  random variable Y. What is the best value to guess?
I To answer this question, we need to pick a function to be
  optimized, which should measure how good or bad our guesses
  are: how big an error we're making. A reasonable, traditional
  starting point is the mean squared error:

      MSE(c) = E(Y − c)^2

  So we'd like to find the value c where MSE(c) is smallest.
I Thus the optimal choice is given by c = E(Y). Hence the best
  guess we can make about Y with respect to mean squared error
  is E(Y). (A one-line derivation follows below.)
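The optimality of c = E(Y) follows from a standard one-line expansion (the intermediate step is not spelled out on the slide):

      MSE(c) = E(Y − c)^2 = E[(Y − E(Y)) + (E(Y) − c)]^2 = Var(Y) + (E(Y) − c)^2,

since the cross term 2 E[(Y − E(Y))(E(Y) − c)] vanishes. The first term does not involve c, so MSE(c) is minimized exactly at c = E(Y).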
Guessing Y from knowledge of another variable

I Now suppose we have another, auxiliary variable X and we
  make a guess of Y by some function of X, say g(X).
I As before, if we take MSE as the optimality criterion, then we
  seek to minimize E(Y − g(X))^2 with respect to g(X). We
  find that the optimal function is f(x) = E(Y | X = x).
I This function f(X) is called the regression function, which we
  would like to know when we try to predict Y based on X.
  The regression of Y on X is the locus of the conditional mean
  E(Y | X).
I Problem: This function f(X) is generally unknown unless we
  assume some completely known probability distribution of
  (X, Y).
Regression Analysis

I What we have at hand is a random sample
  (x1, y1), ..., (xn, yn) from that distribution. This is often called
  the learning set or the training set.
I Regression analysis is all about constructing a suitable
  approximation f̂ of f based on the training data set.
I As we shall see, constructing a suitable approximation f̂ is a
  two-step process:
  I Step 1: Restrict to a class of functions F and find the best
    approximation fF of f within that class.
  I Step 2: Estimate (or learn) fF based on the data
    (x1, y1), ..., (xn, yn) by f̂.
More about Step 1

I The choice of the class F is a trade-off:
  I If F contains very complicated functions, the fit will
    capture the variation in the training data too closely and will lead
    to what we call over-fitting or undersmoothing.
  I On the contrary, if F contains only very simple functions, then
    the fit will fail to capture any variation in the training data: this is
    called underfitting or oversmoothing.
I Neither scenario is desirable: both have their own problems,
  which we shall see.
Two Perspectives
I In general there are two objectives of any regression analysis:
  I Given a new data point x_new, we want to predict the value of
    the response variable y; that is, we are interested only in
    getting the fitted value ŷ at some new data x_new - the problem of
    prediction.
  I We want to know the functional relationship between y and x;
    that is, we want an approximation of the true regression of y
    on x - the problem of curve estimation or problem of inference.
I The problem of curve estimation is a much wider problem than
  the problem of prediction, because solving the former problem
  will solve the latter problem as a consequence.
I A natural question then is: why consider the two problems
  separately? This is because the problem of prediction is much
  simpler than the problem of curve estimation, and hence we
  can devise many simple regression procedures for that purpose.
Problem of inference
I Here our objective starts with understanding how the covariate X affects
  the response Y.
I We want to estimate f, but not for predicting Y. Now f̂ cannot be
  treated as a black box; we need to know its exact mathematical form.
I In this setting, one may be interested in answering the following
  questions:
  I Which predictors are associated with the response? It is often the
    case that only some of the available predictors are substantially
    associated with Y. Identifying the few important predictors among a
    large set of possible variables can be extremely useful.
  I What is the relationship between the response and each predictor?
    Some predictors may have a positive relationship with Y while
    others may have the opposite relationship.
  I Can the relationship between Y and each predictor be adequately
    summarized using a linear equation, or is the relationship more
    complicated? In some situations a linear form is reasonable or
    desirable. But often the true relationship is more complicated, in
    which case a linear model may not provide an accurate
    representation of the relationship between the input and output
    variables.
Problem of prediction

I In many situations, a set of inputs X is readily available, but
  the output Y cannot be easily obtained.
I In this setting, since the error term averages to zero, we can
  predict Y using

      Ŷ = f̂(X)

  where f̂ is an estimate of f and Ŷ is the resulting prediction for Y.
I We note that f̂ acts like a black box in the sense that we need
  not know the exact mathematical form of f̂, so long as it yields
  accurate predictions Ŷ.
Measurement of accuracy
I In the problem of prediction the main objective is to find an
  accurate estimate f̂.
I But how do we measure accuracy here? The answer is: using
  mean squared error. We need to choose f̂ for which the MSE is
  minimum.
I For any approximation f̂_n(x) based on a sample of size n, we
  can write

      MSE(f̂_n(x)) = σ_x^2 + Bias^2(f̂_n(x)) + Var(f̂_n(x))

  where
  I σ_x^2 = Var(Y | X = x) is the variance which is uncontrollable
    (variance due to random causes),
  I the second term Bias^2(f̂_n(x)) = [f(x) − E(f̂_n(x))]^2 is the
    squared approximation bias (or error), which is incurred because we
    use f̂_n(x) instead of f(x), and
  I the third term Var(f̂_n(x)) = E[f̂_n(x) − E(f̂_n(x))]^2 is the
    variance of our estimated regression function.
  (A small simulation sketch of this decomposition follows below.)
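The decomposition can be checked numerically. Below is a minimal simulation sketch; the true function, noise level, evaluation point, and the choice of a constant-fit estimator are all illustrative assumptions.

```python
import numpy as np

# Check MSE(f_hat_n(x0)) = sigma^2 + Bias^2 + Var for a simple estimator.
# Assumptions (illustrative only): f(x) = sin(2*pi*x), sigma = 0.3,
# and f_hat_n(x) = y_bar, i.e. a constant fit to n training points.
rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)

n, sigma, reps = 20, 0.3, 20000
x0 = 0.25                                   # point at which we evaluate the MSE
x_train = np.linspace(0, 1, n)

fits = np.empty(reps)
for r in range(reps):
    y_train = f(x_train) + rng.normal(0, sigma, n)
    fits[r] = y_train.mean()                # the constant estimator

bias2 = (f(x0) - fits.mean()) ** 2          # squared approximation bias at x0
var = fits.var()                            # variance of the estimated function at x0
y_new = f(x0) + rng.normal(0, sigma, reps)  # fresh responses at x0
mse = np.mean((y_new - fits) ** 2)          # simulated E(Y - f_hat_n(x0))^2

print(sigma**2 + bias2 + var, mse)          # the two numbers should roughly agree
```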
Bias-variance Trade-off
I Although σ_x^2 is beyond our control, Bias^2(f̂_n(x)) and Var(f̂_n(x)) depend on
  the choice of f̂_n(x), which makes the situation interesting.
I If we choose f̂_n(x) to be a complicated function, then Bias^2(f̂_n(x)) is small
  but Var(f̂_n(x)) increases considerably.
I We note that even an unbiased estimator f̂_n(x) may not be admissible
  because of large variance.
I On the other hand, if we choose f̂_n(x) to be a simple function, then
  Var(f̂_n(x)) is close to zero but Bias^2(f̂_n(x)) increases considerably.
I The catch is that, at least past a certain point, decreasing the
  approximation bias can only come through increasing the estimation
  variance.
I This is the bias-variance trade-off.
I This trade-off is exactly the one we discussed earlier: overfitting
  (undersmoothing) versus underfitting (oversmoothing).
Methods of finding f̂

I We shall explore many linear and non-linear approaches for
  estimating f.
I Most statistical estimation methods for this task can be
  characterized as either parametric or non-parametric. Both
  approaches have their relative merits and demerits.
Parametric methods
I Parametric methods use a two-step approach to build
  models:
  1. Model Assumption: First we assume a specific functional form of f.
     For example, with p predictors X = (X1, X2, ..., Xp) and a response
     Y, one may assume a linear form of f(X) as

         f(X) = β0 + β1 X1 + .... + βp Xp.

     We note that this assumption of linearity makes our search for f
     simple. We now need not search among the set of all
     p-dimensional functions; rather, we only need to estimate the p + 1
     coefficients βi to get the desired model.
  2. Fitting the assumed model: After a model has been selected, we
     need a procedure that uses the training data to fit or train the
     model. For example, in the case of the linear model, for fitting we
     need to estimate the parameters β0, β1, ..., βp. That is, we want to
     find estimates β̂0, β̂1, ..., β̂p of these parameters such that

         Y ≈ β̂0 + β̂1 X1 + ... + β̂p Xp.

I One of the many available techniques to fit such a model is
  the least squares approach.
Issues with parametric approach
I This model-based approach is called parametric because
  estimating the model essentially reduces to estimating a number
  of parameters.
I Although assuming a parametric model simplifies the task,
  this approach has its own limitations.
I First of all, we need to make a specific assumption regarding
  the model, and this choice usually does not match the true
  form. Further, if this choice is too bad, then the estimates will
  be poor.
I We can make our model more flexible by including more
  parameters in the model, but that will lead to overfitting.
I This is a trade-off: if we choose too few parameters we may
  have oversmoothing, which means ignoring many potential
  causes, and if we include many parameters we suffer from
  overfitting, or what we call undersmoothing.
Non-parametric approach

I Non-parametric methods do not make explicit assumptions
  about the functional form of f.
I Instead, here we seek an estimate of f that gets as close to the
  data points as possible without being too rough.
I An example of such a non-parametric method is spline regression.


I The major advantage of such approaches is that we avoid the
  assumption of a particular functional form for f, and hence
  they have the potential to accurately fit a wider range of
  possible shapes for f.
I But a major disadvantage of this approach is that, since they do
  not reduce the problem of estimating f to a small number of
  parameters, a very large number of observations (far more than
  is typically needed for a parametric approach) is required in
  order to obtain an accurate estimate of f.
Step 1: F contains constant functions

I We choose the restricted class F to be the class of constant functions f(x) = f0.
I This indicates oversmoothing.
I But at times this may produce appropriate results:
  I the true regression f(x) is really a constant;
  I f(x) varies rapidly but within narrow limits.
I In such situations we can actually do better by fitting a
  constant than by matching the correct functional form.
I In the second situation f̂(x) = f0 will be biased but can
  have smaller MSE than an unbiased estimator.
Example: Bias-variance tradeoff in action

I For example, suppose our f(x) is of the form

      f(x) = α + β sin(γx)

I Further, we assume β ≪ 1 and γ ≫ 1, so that f varies rapidly
  but stays within narrow limits.
I Here estimating a constant regression function does better
  than the estimated regression with the correct functional form.
I In fact, here the MSE of the model f̂1(x) = f0 is less than the
  MSE of the unbiased model f̂2(x) = α̂ + β̂ sin(γx) (assuming
  γ to be known).
Example (contd.)

[Figure: scatter of the simulated data, x on the horizontal axis (about 0.2 to 1.0) and y on the vertical axis (about 0.4 to 1.6), with the constant fit and the fitted sine curve overlaid; see the description on the next slide.]
Example (Contd.)

I A rapidly-varying but nearly-constant regression function:

      y = 1 + 0.02 sin(200x) + ε, where ε ∼ N(0, 0.5).

I The red dotted line is the constant line indicating the sample
  mean of the response.
I The blue dot-dashed curve is the estimated function of the
  form α̂ + β̂ sin(200x).
I With just a few observations, the constant actually predicts
  better on new data (MSE 0.53) than does the estimated sine
  function (MSE 0.59). (A simulation sketch follows below.)
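A rough simulation sketch of this comparison; the sample size, seed, and the reading of N(0, 0.5) as variance 0.5 are assumptions, so the exact MSE values will differ from the 0.53 and 0.59 quoted above.

```python
import numpy as np

# Nearly-constant truth: y = 1 + 0.02*sin(200 x) + eps, eps ~ N(0, 0.5) (variance 0.5 assumed).
rng = np.random.default_rng(1)

def truth(x):
    return 1 + 0.02 * np.sin(200 * x)

n = 10                                              # "just a few observations"
x_tr = rng.uniform(0, 1, n)
y_tr = truth(x_tr) + rng.normal(0, np.sqrt(0.5), n)

const_hat = y_tr.mean()                             # constant fit: sample mean

A = np.column_stack([np.ones(n), np.sin(200 * x_tr)])    # fit alpha + beta*sin(200x), gamma known
alpha_hat, beta_hat = np.linalg.lstsq(A, y_tr, rcond=None)[0]

x_te = rng.uniform(0, 1, 100000)                    # fresh data for prediction MSE
y_te = truth(x_te) + rng.normal(0, np.sqrt(0.5), 100000)
mse_const = np.mean((y_te - const_hat) ** 2)
mse_sine = np.mean((y_te - (alpha_hat + beta_hat * np.sin(200 * x_te))) ** 2)
print(mse_const, mse_sine)   # with so few training points the constant typically wins
```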
What does this example tell us?

I The "optimum" choice does not necessarily mean the "truth".
I In this example, the truth was a sine curve whereas the
  constant function is optimum.
I Optimum means a "reasonably good" approximation of the
  truth that serves our purpose.
I We shall always remember that we are searching for the
  optimum, not the truth.
I This is the motivation behind fixing the class F.


General Linear model
I Consider a setup where we have a single response variable y which is
  quantitative, and p covariates x1, x2, ..., xp which can be either
  quantitative or qualitative or both.
I Suppose we have n observations on each of these p variables. That is,
  suppose we have observations y1, y2, ..., yn on the response y and
  x1i, x2i, ..., xni on the i-th covariate xi, i = 1, 2, ..., p.
I Then the general linear model can be written as

      y = Xβ + ε

  where y = (y1, y2, ..., yn)' is the response vector, β = (β1, ..., βp)' is the
  vector of parameters and

      X = \begin{pmatrix}
            x_{11} & x_{12} & \cdots & x_{1p} \\
            x_{21} & x_{22} & \cdots & x_{2p} \\
            \vdots & \vdots & \ddots & \vdots \\
            x_{n1} & x_{n2} & \cdots & x_{np}
          \end{pmatrix}

  is the design matrix. (A small least-squares sketch follows below.)
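As a concrete, purely illustrative sketch of this setup, the following simulates y = Xβ + ε and recovers β by ordinary least squares; the dimensions and the true coefficient values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                  # n x p design matrix
beta_true = np.array([1.5, -2.0, 0.5])       # illustrative values only
y = X @ beta_true + rng.normal(0, 1.0, n)    # errors: mean 0, common variance, uncorrelated

# OLS: beta_hat minimizes ||y - X b||^2, computed with a stable least-squares routine
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                              # close to beta_true for moderate n
```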


Linear Model (contd.)

I Further, ε = (ε1, ε2, ..., εn)' is the vector of random errors, where
  we assume
  I E(εi | X) = 0 for all i,
  I Var(εi | X) = σ^2 for all i,
  I Cov(εi, εj | X) = 0 for all i ≠ j.
  I These assumptions can alternatively be stated as E(ε | X) = 0
    and D(ε | X) = σ^2 In.
I More specifically, we assume a single quantitative response
  variable y and p covariates such that

      E(y | X) = Xβ and Var(y | X) = σ^2 In.
Example: Simple Linear Regression

I Suppose we restrict our class to the class of all linear functions

      F = {f(x) : f(x) = α + βx, α, β ∈ R}.

I For n observations, we can write this as

      y = Xθ + ε

  where E(ε) = 0 and Var(ε) = σ^2 In.
I Here we have

      X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
      \quad \text{and} \quad
      \theta = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}.
Example: Polynomial Regression

I An immediate extension of this can be done by expanding the
  class to incorporate polynomials in x as

      F = {f : f(x) = β0 + β1 x + .... + βp x^p for some p}.

I For polynomial regression, we have

      X = \begin{pmatrix}
            1 & x_1 & x_1^2 & \cdots & x_1^p \\
            1 & x_2 & x_2^2 & \cdots & x_2^p \\
            \vdots & \vdots & \vdots & \ddots & \vdots \\
            1 & x_n & x_n^2 & \cdots & x_n^p
          \end{pmatrix}
      \quad \text{and} \quad
      \theta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}.
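A small sketch of how this design matrix can be built for a toy x vector; the degree p = 3 and the x values are arbitrary.

```python
import numpy as np

x = np.array([0.1, 0.4, 0.7, 1.0, 1.3])
p = 3
X = np.vander(x, N=p + 1, increasing=True)   # columns: x^0, x^1, ..., x^p
print(X)
# Fitting then proceeds as in any linear model: minimize ||y - X theta||^2 over theta.
```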
Example: Multiple Regression

I Hence we can consider the class of functions

      F = {f : f(x) = β0 + β1 x1 + .... + βp xp}

  where we have a single response variable y and p quantitative
  predictor variables x1, x2, ..., xp.
I For multiple linear regression we have

      X = \begin{pmatrix}
            1 & x_{11} & x_{12} & \cdots & x_{1p} \\
            1 & x_{21} & x_{22} & \cdots & x_{2p} \\
            \vdots & \vdots & \vdots & \ddots & \vdots \\
            1 & x_{n1} & x_{n2} & \cdots & x_{np}
          \end{pmatrix}
      \quad \text{and} \quad
      \theta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}.
Classification

I If all the columns of X (except the first column) contain
  values of continuous variables, then the linear model is called a
  regression model.
I If all the columns of X contain values of discrete variables
  (more specifically, if all the columns contain the values 0 or 1),
  then the linear model is called an ANOVA model.
I If some columns of X contain values of continuous variables
  and some columns contain values of discrete variables, then
  the linear model is called an ANCOVA (or ANOCOVA) model.
Dealing with Factors
I In linear models we need to deal with what we call factor
  variables or factors, which are categorical variables with
  different categories. The different categories of a factor are
  called the factor levels.
I Suppose we have a single factor A with k levels A1, A2, ..., Ak
  having a potential effect on the response y. A natural question is:
  how do we model the effects of all these levels in a single
  linear model?
I The answer is to use indicator variables or dummy
  variables x1, x2, ..., x_{k−1}, where

      xi = 1 if the observation receives the i-th level, and 0 otherwise.

I We can write a linear model as

      y = α + β1 x1 + .... + β_{k−1} x_{k−1} + ε

  where βi is the effect of the i-th level of A. (A dummy-coding sketch follows below.)
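A quick sketch of k − 1 dummy coding using pandas; the factor levels and responses below are made up, and pandas drops the first level, so here A1 (rather than Ak) plays the role of the baseline.

```python
import pandas as pd

df = pd.DataFrame({"A": ["A1", "A2", "A3", "A1", "A3"],
                   "y": [2.1, 3.4, 5.0, 1.9, 4.7]})

# drop_first=True keeps k - 1 = 2 indicator columns; the dropped level is the baseline.
dummies = pd.get_dummies(df["A"], drop_first=True).astype(int)
print(dummies)   # columns A2 and A3; a row of all zeros corresponds to the baseline A1
```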
Using dummy variables
I So why did we use k − 1 dummy variables when we had k
  levels? Where is the effect of Ak modeled in the linear model?
I The answer is: if the observation receives the k-th level of the
  factor A, then all xi = 0, i = 1, 2, ..., k − 1, and as such α
  represents the expected value of y when the observation
  receives Ak.
I When an observation receives the level Ai, i = 1, 2, ..., k − 1,
  then the expected value of y is α + βi. As such,
  βi, i = 1, 2, ..., k − 1, represents the change in the expected
  value of y due to Ai as compared to Ak.
I That means each βi represents the expected difference in y
  when the observation belongs to Ai rather than Ak. For this reason
  the βi's are sometimes called contrasts between two classes.
I Here we compare the effects of the other levels with that of Ak. In
  such a case we call Ak the baseline level.
I Obviously we can take any level Ai (not necessarily Ak) to be
  the baseline level.
I A general rule is thus: if we are working with a factor with k levels, then
  we need to introduce k − 1 dummy variables.
Using dummy for all levels
I Now suppose that in the same situation we use k dummy variables x1, x2, ..., xk
  instead of the k − 1 variables x1, x2, ..., x_{k−1} and fit a model as

      y = α + β1 x1 + β2 x2 + ... + βk xk + ε

I Then we note that the variables x1, x2, ..., xk are not independent: they satisfy
  the constraint Σ xi = 1, that is, any observation must receive exactly one of the levels
  Ai.
I Here the design matrix is

      X = \begin{pmatrix}
            1 & x_{11} & x_{12} & \cdots & x_{1k} \\
            1 & x_{21} & x_{22} & \cdots & x_{2k} \\
            \vdots & \vdots & \vdots & & \vdots \\
            1 & x_{n1} & x_{n2} & \cdots & x_{nk}
          \end{pmatrix}

  but it is not of full column rank. (A small numerical check follows below.)
I Statistical lesson: there can be alternative parametrizations for the same model.
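The rank deficiency is easy to verify numerically; a minimal sketch with made-up group sizes:

```python
import numpy as np

k, n_per = 3, 4
groups = np.repeat(np.arange(k), n_per)                    # group label of each observation
D = np.eye(k)[groups]                                      # all k dummy columns (one-hot rows)
X_full = np.column_stack([np.ones(k * n_per), D])          # intercept + k dummies: k+1 columns
X_drop = np.column_stack([np.ones(k * n_per), D[:, :-1]])  # intercept + k-1 dummies

print(np.linalg.matrix_rank(X_full))   # k, not k+1: the dummy columns sum to the intercept column
print(np.linalg.matrix_rank(X_drop))   # k: full column rank
```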
Example: ANOVA model (One way layout)
I Suppose we have a factor A and let A1, A2, ..., Ak be the levels of A,
  which constitute the population of interest.
I Further assume there are ni observations receiving the level Ai, and let yij be
  the j-th observation receiving the i-th level Ai.
I The model we consider is

      yij = µi + eij , j = 1, 2, ..., ni , i = 1, 2, ..., k,

  where
      µi = fixed effect due to Ai and eij = random error.
I We assume that
      eij ∼ N(0, σ^2)
  and the eij's are independent.
I This implies that E(yij) = µi and Var(yij) = σ^2 for all j = 1, ..., ni, which
  means the µi's are the factor level means and σ^2 is the common
  variability among observations belonging to each group.
One way ANOVA as linear model

I Here we have introduced k dummy variables for the k levels, but
  without any intercept.
I In terms of dummy variables we can write

      y = µ1 x1 + µ2 x2 + .... + µk xk + ε

  where xi = 1 or 0 according as the observation receives Ai or
  not.
One way ANOVA as linear model
I Suppose we denote

      y = \begin{pmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{1n_1} \\ y_{21} \\ \vdots \\ y_{2n_2} \\ \vdots \\ y_{k1} \\ \vdots \\ y_{kn_k} \end{pmatrix}, \quad
      \beta = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}, \quad
      X_{n \times k} = \begin{pmatrix}
                         \mathbf{1}_{n_1} & 0 & \cdots & 0 \\
                         0 & \mathbf{1}_{n_2} & \cdots & 0 \\
                         \vdots & \vdots & \ddots & \vdots \\
                         0 & 0 & \cdots & \mathbf{1}_{n_k}
                       \end{pmatrix}
      \quad \text{and} \quad
      \epsilon = \begin{pmatrix} e_{11} \\ e_{12} \\ \vdots \\ e_{kn_k} \end{pmatrix}.

I Then the above model can be written as

      y = Xβ + ε

  where ε ∼ Nn(0, σ^2 In). (A small construction sketch follows below.)
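A small construction sketch; the group sizes and responses are made up, and scipy's block_diag is used only for convenience. OLS applied to this design reproduces the group means.

```python
import numpy as np
from scipy.linalg import block_diag

n_i = [2, 3, 2]                                   # k = 3 groups with these sizes
X = block_diag(*[np.ones((n, 1)) for n in n_i])   # block matrix of 1-vectors, shape (7, 3)
y = np.array([5.1, 4.9, 7.2, 6.8, 7.0, 3.1, 2.9])

mu_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(mu_hat)                                     # equals the three group means of y
```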
Reparametrization
I At times, an alternative but completely equivalent formulation of the
  single-factor ANOVA model is used. This alternative formulation is called
  the factor effects model.
I Let us write

      µi = µ̄ + (µi − µ̄) = µ + αi

  where µ = µ̄ = (Σ ni µi) / n and αi = µi − µ̄.
I Then we note that Σi ni αi = 0.
I Now our linear model of interest becomes

      yij = µ + αi + eij , i = 1, 2, ..., k, j = 1, 2, ..., ni ,

  where µ denotes the general effect or the average effect, αi denotes
  the additional (fixed) effect due to Ai subject to the restriction
  Σi ni αi = 0, and eij denotes the random error.
I We assume that for all i, j, the eij are independent N(0, σ^2) variables.


Reparametrized form as linear Model

I Now, in terms of dummy variables, we have included k dummy
  variables for the k levels along with an intercept.
I In this case the linear model becomes

      y = Xβ + ε

  where β = (µ, α1, α2, ..., αk)^T and

      X_{n \times (k+1)} = \begin{pmatrix}
                             \mathbf{1}_{n_1} & \mathbf{1}_{n_1} & 0 & \cdots & 0 \\
                             \mathbf{1}_{n_2} & 0 & \mathbf{1}_{n_2} & \cdots & 0 \\
                             \vdots & \vdots & \vdots & \ddots & \vdots \\
                             \mathbf{1}_{n_k} & 0 & 0 & \cdots & \mathbf{1}_{n_k}
                           \end{pmatrix}.
Example: More use of dummy variables
I Consider a setup where we need to judge the effectiveness of a treatment (or
  perhaps compare the effectiveness of two treatments, in which case the control
  group may be thought of as getting some treatment). This may be a controlled
  experiment or an observational study.
I Suppose the data are obtained in the form

      Control:    y11, y12, ..., y1n1
      Treatment:  y21, y22, ..., y2n2

I Note that we allow the number of observations in the two groups to be different,
  and the y's represent the values of the response.
I This situation can also be tackled with a linear model with the use of dummy
  variables.
More use of dummy (contd.)
I Let us define a dummy variable as

      x = 1 or 0 according as the observation receives the treatment or not.

I Then the linear model can be written as

      z = α + βx + ε

  or, more precisely,

      zi = α + βxi + εi , i = 1, 2, ..., n (= n1 + n2).

I Here
      zi = y1i for i = 1, 2, ..., n1 and zi = y2(i−n1) for i = n1 + 1, ..., n1 + n2,
  and ε1, ε2, ..., εn are the random errors.
More use (Contd.)
I Suppose we write the above linear model as

      z = Xθ + ε.

I Then θ = (α, β)' is the vector of parameters, z is the response vector and ε is
  the random error vector.
I It is instructive to have a look at the structure of the design matrix

      X_{n \times 2} = \begin{pmatrix}
                         1 & 0 \\
                         \vdots & \vdots \\
                         1 & 0 \\
                         1 & 1 \\
                         \vdots & \vdots \\
                         1 & 1
                       \end{pmatrix}

  where the upper submatrix consists of n1 rows and the lower one contains
  n2 rows.
I Note that in the above formulation the effect of the treatment is α + β and the
  effect of the control is α, so the change in effect due to the treatment is β.
  (A small numerical check follows below.)
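A small numerical check of this interpretation with toy, made-up response values: with the single treatment dummy, OLS gives α̂ = control-group mean and β̂ = treatment mean minus control mean.

```python
import numpy as np

y_control = np.array([4.8, 5.2, 5.0, 4.9])
y_treatment = np.array([6.1, 5.9, 6.3])

z = np.concatenate([y_control, y_treatment])
x = np.concatenate([np.zeros(len(y_control)), np.ones(len(y_treatment))])  # treatment dummy
X = np.column_stack([np.ones(len(z)), x])                                  # intercept + dummy

(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, z, rcond=None)
print(alpha_hat, y_control.mean())                        # alpha_hat = control mean
print(beta_hat, y_treatment.mean() - y_control.mean())    # beta_hat = mean difference
```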
More than one categorical predictor

I We can include as many factor covariates as we wish in our
  linear model.
I All the factors need not have the same number of levels.
I But then we have a separate set of dummy variables for each
  factor.
I The only wrinkle with having multiple factors is that α, the
  overall intercept, is now the expected value of y for
  individuals where all categorical variables are at their respective
  baseline levels.
I With multiple factors we shall have a new issue in our model,
  called the interaction effect.
Interaction

I The interaction effect is the joint effect of two or more factors.
I Suppose we measure the effect of fertilizers and soil quality on
  the yield of plots.
I Here the responses are the yields on different plots and there
  are two factors: fertilizer brand and soil type.
I It may happen that a particular soil type behaves exceptionally
  in the presence of a particular fertilizer.
I This is the interaction effect, and it should be included in the
  model explicitly.
Interaction effect
I Formally, when we say that there are no interactions between
  two variables xi and xj, we mean that

      ∂E[y | x] / ∂xi

  is not a function of xj.
I This means there are no interactions if and only if

      E[y | x] = α + Σ_{i=1}^{p} fi(xi)

  so that each coordinate of x makes its own separate, additive
  contribution to y.
I The standard multiple linear regression model of course
  includes no interactions between any of the predictor variables.
I But general considerations of statistical modeling give us no
  reason whatsoever to anticipate that interactions are rare, or
  that when they exist they are small.
Interaction (Contd.)

I Conventionally, interactions are included in a linear model by
  adding a product term.
I For example, suppose we are dealing with two factor covariates,
  each with two levels, so that we include their effects in the linear
  model by two dummy variables x1 and x2.
I Then the interaction effect will be modeled as

      y = α + β1 x1 + β2 x2 + β3 x1 x2 + ε

I It is no longer correct to interpret β1 as
      E[y | X1 = x1 + 1, X2 = x2] − E[y | X1 = x1, X2 = x2].
Interaction (Contd.)
I That difference is, rather, β1 + β3 x2.
I Similarly, β2 is no longer the expected difference in y between
  two otherwise-identical cases where x2 differs by 1.
I The fact that we can't give one answer to "how much does the
  response change when we change this variable?", that the
  correct answer to that question always involves the other
  variable, is what interaction means.
I What we can say is that β1 is the slope with regard to x1 when
  x2 = 0, and likewise β2 is how much we expect y to change for
  a one-unit change in x2 when x1 = 0.
I β3 is the rate at which the slope on x1 changes as x2 changes,
  and likewise the rate at which the slope on x2 changes with x1.
  (A small numerical illustration follows below.)
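A minimal numerical illustration of these statements; the coefficient values below are made up. The model is fitted by adding a product column, and the fitted slope on x1 depends on x2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.integers(0, 2, n)                 # two binary dummies
x2 = rng.integers(0, 2, n)
y = 1.0 + 0.5 * x1 + 2.0 * x2 + 1.5 * x1 * x2 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])   # the product column carries the interaction
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                                # roughly (1.0, 0.5, 2.0, 1.5)
# The fitted slope on x1 equals coef[1] only when x2 = 0; in general it is coef[1] + coef[3] * x2.
```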
Why Product Interactions?
I Conventionally, linear models use interaction terms that are products
  of the indicator variables, e.g. x1 x2.
I Interactions could alternatively have been introduced by using
  terms like x1 x2 / (1 + |x1 x2|) or x1 H(x2 − c), where H is the step function
      H(x) = 1 for x ≥ 0 and 0 for x < 0.
I A natural question is: is there any special reason to use
  product interactions?
I Suppose that the real regression function µ(x) = E(Y | x) is a
  smooth function of all the coordinates of x.
I Because it is smooth, we should be able to do a Taylor
  expansion around any particular point, say x*, as

      \mu(x) \approx \mu(x^*) + \sum_{i=1}^{p} (x_i - x_i^*) \left.\frac{\partial \mu}{\partial x_i}\right|_{x=x^*} + \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} (x_i - x_i^*)(x_j - x_j^*) \left.\frac{\partial^2 \mu}{\partial x_i \partial x_j}\right|_{x=x^*}
Product interactions (Contd.)
I The first term, µ(x ∗ ), is a constant. The next sum will give us
linear terms in all the xi (plus more constants). The double
sum after that will give us terms for each product xi xj , plus
all the squares xi² , plus more constants.

I Thus, if the true regression function is smooth, and we only


see a small range of values for each predictor variable, using
product terms is reasonable — provided we also include
quadratic terms for each variable.

I Further we note that if xi ’s are indicators the quadratic terms


are same as linear terms.

I Obviously we can include other types of interaction, such as
x1 x2 /(1 + |x1 x2 |), but then we need to form a new column of predictors
in the design matrix.

I Also, such terms may then be more difficult to interpret.
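For instance, a nonstandard interaction column such as x1 x2 /(1 + |x1 x2 |) enters the fit just like any other predictor: we compute it by hand and append it to the design matrix. A small sketch in Python (NumPy assumed, simulated data):

import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rng.normal(scale=0.1, size=n)

# Standard product interaction: just another column of the design matrix.
X_prod = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# A nonstandard interaction term is constructed by hand in the same way;
# it is fit identically, but its coefficient is harder to interpret.
z = (x1 * x2) / (1 + np.abs(x1 * x2))
X_alt = np.column_stack([np.ones(n), x1, x2, z])

for name, X in [("product term", X_prod), ("bounded term", X_alt)]:
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(name, np.round(beta, 3))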


Example: Two way ANOVA

I Let there be two factors A and B. Suppose p levels of A


namely A1 , A2 , ..., Ap and q levels of B namely B1 , B2 , ..., Bq
constitute the entire population.

I Therefore we have pq level combinations (Ai , Bj ).

I Further suppose µij is the fixed effect due to (Ai , Bj ).

I Thus µij is the mean response of observations receiving


treatment combination (Ai , Bj ).
Interpretation

I The interpretation of a treatment mean µij depends on


whether the study is observational, experimental, or a mixture
of the two.

I In an observational study, the treatment mean µij corresponds


to the population mean for the elements having the
characteristics of the i th level of factor A and the j th level of
factor B.

I In an experimental study, the treatment mean µij stands for


the mean response that would be obtained if the treatment
consisting of the i th level of factor A and the j th level of factor
B were applied to all units in the population of experimental
units about which inferences are to be drawn.
Reparametrization

I For all i, j, rewrite µij as


µij = µ̄00 + (µ̄i0 − µ̄00 ) + (µ̄0j − µ̄00 ) + (µij − µ̄i0 − µ̄0j + µ̄00 )

I Here µ̄00 = (1/pq) Σi Σj µij is the general effect (say µ) as it is
obtained by averaging over the effects of all possible level
combinations.
I Further
µ̄i0 = (1/q) Σj µij = the fixed effect due to Ai
⇒ αi = µ̄i0 − µ̄00 = fixed additional effect (main) due to Ai with Σi αi = 0.
I And
µ̄0j = (1/p) Σi µij = the fixed effect due to Bj
⇒ βj = µ̄0j − µ̄00 = fixed additional effect (main) due to Bj with Σj βj = 0.

I Also µij − µ̄i0 is the additional effect due to Bj when A is held constant at the
i th level Ai .

I Averaging out over those effects for varying i, we get µ̄0j − µ̄00 .
I Thus
γij = (µij − µ̄i0 ) − (µ̄0j − µ̄00 ) = fixed interaction effect due to (Ai , Bj )
with Σi γij = 0 for all j and Σj γij = 0 for all i.
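This decomposition is easy to compute from a table of cell means. A small sketch in Python with NumPy, using a hypothetical 2 × 3 table of µij:

import numpy as np

# Hypothetical p x q table of cell means mu_ij (rows: levels of A, columns: levels of B).
mu = np.array([[10., 12., 14.],
               [11., 15., 16.]])

mu_bar_00 = mu.mean()                      # general effect
alpha = mu.mean(axis=1) - mu_bar_00        # main effects of A
beta  = mu.mean(axis=0) - mu_bar_00        # main effects of B
gamma = mu - mu.mean(axis=1, keepdims=True) - mu.mean(axis=0, keepdims=True) + mu_bar_00

print("alpha sums to", alpha.sum())                    # ≈ 0 up to rounding
print("beta sums to", beta.sum())                      # ≈ 0 up to rounding
print("gamma column/row sums:", gamma.sum(axis=0), gamma.sum(axis=1))
# The decomposition reproduces the cell means exactly:
print(np.allclose(mu, mu_bar_00 + alpha[:, None] + beta[None, :] + gamma))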
Interaction or no interaction?
I One potential question of interest is: when should we include an
interaction term in our model?
I The fact is that there cannot be any objective answer to this
question.
I Rather let us understand the difference between including or
not including the interaction term in the model.
I For illustration let us consider an example of a simple
two-factor study in which the effects of gender (male and
female) and age (young, middle and old) on learning of a task
are of interest.
I When we assume no interaction effects, we say the factor
effects are additive, that is,
µij = µ + αi + βj

I This can mean two things.


No interaction

I The figure shows that Age has some effect (due to difference
in height) whereas gender has no effect (since lines have zero
slope) on the mean response.
I Also the lines do not intersect meaning that there is no
interaction effect.
No interaction

I Here both age and gender have effects on the mean response
but still there is no interaction effect because the lines do not
intersect.
I Thus it is entirely possible that factors are additive (that is
factors have main effects but they do not interact).
Interaction

I There are main effects of both the factors along with the
interaction effect.
I Is it possible that factors have interaction effects but no main
effects? (Can some parallel lines intersect ? )
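Numerically, the answer is yes: one can write down cell means whose main effects are all zero while the interaction effects are not, in which case the profile lines cross but the marginal factor level means coincide. A quick check in Python with NumPy, using a made-up 2 × 2 table:

import numpy as np

# Hypothetical 2 x 2 cell means: a pure cross-over pattern around a common level 10.
mu = np.array([[11.,  9.],
               [ 9., 11.]])

grand = mu.mean()
alpha = mu.mean(axis=1) - grand          # main effects of A
beta  = mu.mean(axis=0) - grand          # main effects of B
gamma = mu - mu.mean(axis=1, keepdims=True) - mu.mean(axis=0, keepdims=True) + grand

print("alpha:", alpha)    # [0. 0.]
print("beta :", beta)     # [0. 0.]
print("gamma:", gamma)    # [[ 1. -1.] [-1.  1.]] : interaction without main effects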
Notes on interactions
I In case of multifactor studies some interactions may be zero
even though the factors are interacting. All interactions must
equal zero in order for the two factors to be additive.

I When two factors interact, the question arises whether the


factor level means are still meaningful measures.

I For instance, suppose in our example the gender factor level
means come out to be 13 and 11. It may be argued that
these are misleading measures.

I They indicate that some difference exists in learning time for


men and women, but that this difference is not too great.

I These factor level means hide the fact that there is no


difference in mean learning time between genders for young
persons, but there is a relatively large difference for old
persons.
I In such a case we say the interactions are important, implying
that one should not ordinarily examine the effects of each factor
separately in terms of the factor level means.
I Sometimes when two factors interact, the interaction effects
are so small that they are considered to be unimportant
interactions (the curves get almost parallel).
I In the case of unimportant interactions, the analysis of factor
effects can proceed as for the case of no interactions.
I The determination of whether interactions are important or
unimportant is admittedly sometimes difficult because it
depends on the context of the application.
I The subject area specialist (researcher) needs to play a
prominent role in deciding whether an interaction is important
or unimportant. The advantage of unimportant (or no)
interactions, namely that one is then able to analyze the
factor effects separately, is especially great when the study
contains more than two factors.
I Occasionally, it is meaningful to consider the effects of each factor in
terms of the factor level means even when important interactions are
present.

I For example, two methods of teaching Linear Models (hard: using


projections and standard: using usual sampling distributions) were used in
teaching students of excellent, good, and medium quantitative ability.

I Important interactions between teaching method and student’s


quantitative ability were found to be present.

I Students with excellent quantitative ability tended to perform equally
well with the two teaching methods.

I Whereas students of moderate or good quantitative ability tended to


perform better when taught by the standard method.

I If equal numbers of students with moderate, good, and excellent
quantitative ability are to be taught by one of the teaching methods,
then the method that produces the best average result for all students
might be of interest even in the presence of important interactions.

I A comparison of the teaching method factor level means would then be


relevant, even though important interactions are present.
Two way layout with one observation per cell
I In many studies we have constraints on cost, time, and
materials that limit the number of observations that can be
obtained.

I For example, a process engineer in a manufacturing company


may have only a limited time to experiment with the
production line.

I If the line is available for one day and only eight batches of
product can be produced in a day, the experiment may have to
be limited to eight observations.

I If the study involves one factor at four levels and a second


factor at two levels so that there are eight factor level
combinations, only one replication of the experiment is then
possible for each treatment.
I Another reason why some studies contain only one case per
treatment is that the response of interest is a single aggregate
measure of performance.

I For example, in a marketing research study of alternative


package designs, evaluation of each alternative may require a
separate market test.

I The response of interest is the observed market share, and this


results in a single response for each treatment combination.

I Special attention is required for the analysis of two-factor


studies containing only one replication per treatment because
no degrees of freedom are available for estimation of the
experimental error with the standard two-factor ANOVA
model.
No interaction model

I When there is only one case for each treatment, we can no longer
work with the two-factor ANOVA model that includes an interaction
effect.

I This is because no estimate of the error variance σ 2 will be


available.

I Recall that SSE is a sum of squares made up of components


measuring the variability within each treatment.

I With only one case per treatment, there is no variability within


a treatment, and SSE will then always be zero.
I A way out of this difficulty is to change the model.

I We shall see later that if the two factors do not interact so


that γij = 0, the interaction mean square MSAB has
expectation σ 2 .

I Thus, if it is possible to assume that the two factors do not


interact, we may use MSAB as the estimator of the error
variance σ 2 and proceed with the analysis of factor effects as
usual.

I If it is unreasonable to assume that the two factors do not


interact, transformations may be tried to remove the
interaction effects.
Model

I We assume that we have a single observation yij corresponding


to each level combination.

I Hence the model we consider here is

yij = µij + eij , i = 1, 2, ..., p, j = 1, 2, ..., q

where µij is fixed effect due to (Ai , Bj ) and eij is random error.

I With the reparametrized version the model reduces to

yij = µ + αi + βj + eij .
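Under this additive model the interaction sum of squares plays the role of the error sum of squares, so the usual F tests for the two factors can still be carried out. A hand-computed sketch in Python (NumPy and SciPy assumed; the data are invented for illustration):

import numpy as np
from scipy import stats

# Hypothetical single observation y_ij for each of the p x q treatment combinations.
y = np.array([[64., 66., 70.],
              [59., 61., 63.],
              [58., 62., 67.],
              [63., 65., 69.]])
p, q = y.shape

grand   = y.mean()
a_means = y.mean(axis=1)                  # factor A (row) means
b_means = y.mean(axis=0)                  # factor B (column) means

ss_a  = q * np.sum((a_means - grand) ** 2)
ss_b  = p * np.sum((b_means - grand) ** 2)
ss_ab = np.sum((y - a_means[:, None] - b_means[None, :] + grand) ** 2)

# With one observation per cell, MSAB acts as the error mean square under additivity.
ms_a  = ss_a / (p - 1)
ms_b  = ss_b / (q - 1)
ms_ab = ss_ab / ((p - 1) * (q - 1))

F_a, F_b = ms_a / ms_ab, ms_b / ms_ab
df_err = (p - 1) * (q - 1)
print("F_A =", round(F_a, 2), " p-value =", round(stats.f.sf(F_a, p - 1, df_err), 4))
print("F_B =", round(F_b, 2), " p-value =", round(stats.f.sf(F_b, q - 1, df_err), 4))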
Example: Two way layout with more than one observation
per cell
I Let there be two factors A and B such that A has p levels
A1 , A2 , ..., Ap and B has q levels B1 , B2 , ..., Bq . These pq level
combinations (Ai , Bj ) constitute the entire population of
interest.
I Further we assume that we have m observations corresponding
to each level combination.
I Suppose yijk be the k th observation receiving the treatment
combination (Ai , Bj ).
I Then the model we consider here is

yijk = µ + αi + βj + γij + eijk , k = 1, 2, ..., m, i = 1, 2, ..., p, j = 1, 2, ..., q


where eijk is the random error which we assume to be
I independent (over i, j and k)
I having N(0, σ 2 ) for all i, j, k.
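With replication, the general effect, the main effects, the interactions and σ² are all estimable, and the model can be fit directly. A short sketch using pandas and statsmodels (assuming both are installed; the data are simulated, and the formula interface is just one convenient way of specifying the model):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
p, q, m = 3, 2, 4
rows = []
for i in range(1, p + 1):
    for j in range(1, q + 1):
        # Hypothetical cell mean with main effects and a little interaction.
        cell_mean = 10 + i - 2 * j + (1.5 if (i == 3 and j == 2) else 0.0)
        for _ in range(m):
            rows.append({"A": f"A{i}", "B": f"B{j}",
                         "y": cell_mean + rng.normal(scale=1.0)})
df = pd.DataFrame(rows)

# Main effects plus interaction: y ~ A + B + A:B.
fit = smf.ols("y ~ C(A) * C(B)", data=df).fit()
print(anova_lm(fit, typ=2))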
Ordinal Factors
I In case of ordinal variables the levels can be put in a sensible order, but
there’s no implication that the distance from one level to the next is
constant.
I We have basically two ways to handle them:
I Ignoring the ordering and treat them like nominal categorical
variables.
I Ignoring the fact that they’re only ordinal and not metric, assign

them numerical codes (say 1, 2, 3, . . . ) and treat them like


ordinary numerical variables.
I The first procedure is unbiased, but can end up dealing with a lot of
distinct coefficients.
I It also has the drawback that if the relationship between Y and the
categorical variable is monotone, that may not be respected by the
coefficients we estimate.
I The second procedure is very easy, but usually without any substantive or
logical basis. It implies that each step up in the ordinal variable will
predict exactly the same difference in y , and why should that be the
case?
I If, after treating an ordinal variable like a nominal one, we get contrasts
which are all (approximately) equally spaced, we might then try the
numerical coding instead.
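The trade-off between the two codings is easy to see numerically. A small sketch in Python with NumPy, on simulated data in which the steps between adjacent ordinal levels are deliberately unequal:

import numpy as np

rng = np.random.default_rng(3)
levels = rng.integers(1, 4, size=300)            # ordinal levels coded 1, 2, 3
y = np.where(levels == 1, 5, np.where(levels == 2, 9, 10)) + rng.normal(size=300)

# (a) Nominal treatment: one dummy per non-baseline level.
X_nom = np.column_stack([np.ones_like(y), levels == 2, levels == 3]).astype(float)
# (b) Numeric treatment: a single linear-score column.
X_num = np.column_stack([np.ones_like(y), levels]).astype(float)

for name, X in [("nominal dummies", X_nom), ("numeric score  ", X_num)]:
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(name, np.round(beta, 2))
# The dummy contrasts (roughly 4 and 5 here) are not equally spaced,
# while the numeric coding forces the same step (about 2.5) at both gaps.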
Factors along with quantitative covariates
I It is perfectly possible that in our linear model some covariates
are factors and others are numeric variables.
I To illustrate things let us assume that we are dealing with a
factor having two levels and there are other p numeric
covariates x1 , x2 , ..., xp .
I Thus introducing a single dummy variable xb we can write the
linear model as
y = α + βb xb + Σ_{i=1}^p βi xi + ε.

I Geometrically, if we plot the expected value of y against


x1 , ...xp , we will now get two regression surfaces.
I They will be parallel to each other, and offset by βb .
I We thus have a model where each category gets its own
intercept: α for the baseline level and α + βb for the other
class.
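A minimal sketch of this dummy-intercept model in Python (NumPy assumed, simulated data with a single numeric covariate): the two groups share the slope on x1 and differ only by the offset βb.

import numpy as np

rng = np.random.default_rng(4)
n = 400
x1 = rng.uniform(0, 10, n)
xb = rng.integers(0, 2, n)                     # dummy variable for the two-level factor
y = 1.0 + 3.0 * xb + 0.5 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), xb, x1])
alpha, beta_b, beta_1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Two parallel regression lines: same slope beta_1, intercepts alpha and alpha + beta_b.
print("baseline intercept   :", round(alpha, 2))
print("other-level intercept:", round(alpha + beta_b, 2))
print("common slope         :", round(beta_1, 2))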
Why not just split the data?
I If we want to give each class its own intercept, why not just
split the data and estimate two models, one for each class?

I The answer is that sometimes we’ll do just this, especially if


there’s a lot of data for each class.

I However, if the regression surfaces for the two categories really


are parallel to each other, by splitting the data we’re losing
some precision in our estimate of the common slopes, without
gaining anything.

I In fact, if the two surfaces are nearly parallel, for moderate


sample sizes the small bias that comes from pretending the
slopes are all equal can be overwhelmed by the reduction in
variance, so that the resulting MSE of the parameter estimates is smaller.
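This bias-variance point can be checked by simulation. The sketch below (Python/NumPy, with made-up settings) repeatedly generates data from two classes whose regression lines are truly parallel and compares the mean squared error of the common-slope (dummy variable) estimate with the slope estimated from one class alone; the pooled estimate typically comes out with the smaller MSE here.

import numpy as np

rng = np.random.default_rng(5)
n_per_class, n_sim = 15, 2000
true_slope = 1.0                                   # identical in both classes
err_pooled, err_split = [], []

for _ in range(n_sim):
    x0 = rng.uniform(0, 1, n_per_class)
    x1 = rng.uniform(0, 1, n_per_class)
    y0 = 0.0 + true_slope * x0 + rng.normal(scale=1.0, size=n_per_class)
    y1 = 2.0 + true_slope * x1 + rng.normal(scale=1.0, size=n_per_class)

    # Pooled fit: common slope, separate intercepts (dummy variable model).
    x = np.concatenate([x0, x1])
    d = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    y = np.concatenate([y0, y1])
    X = np.column_stack([np.ones_like(x), d, x])
    slope_pooled = np.linalg.lstsq(X, y, rcond=None)[0][2]

    # Split fit: estimate the slope from class 0 alone.
    X0 = np.column_stack([np.ones_like(x0), x0])
    slope_split = np.linalg.lstsq(X0, y0, rcond=None)[0][1]

    err_pooled.append((slope_pooled - true_slope) ** 2)
    err_split.append((slope_split - true_slope) ** 2)

print("MSE of pooled slope estimate:", round(float(np.mean(err_pooled)), 4))
print("MSE of split slope estimate :", round(float(np.mean(err_split)), 4))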
Interaction of Categorical and Numerical Variables

I If we multiply the indicator variable, say xb for a binary
category, with an ordinary numerical variable, say x1 , we get a
different slope on x1 for each category:

y = α + β1 x1 + β1b xb x1 + ε

I When xb = 0, the slope on x1 is β1 , but when xb = 1, the


slope on x1 is β1 + β1b

I The coefficient for the interaction is the difference in slopes


between the two categories.

I It says that the categories share a common intercept, but their


regression lines are not parallel (unless β1b = 0).
Interaction (Contd.)

I We could expand the model by letting each category have its


own slope and its own intercept:

y = α + βb xb + β1 x1 + β1b xb x1 + ε

I This model, where “everything is interacted with the category”,


is very close to just running two separate regressions, one per
category.

I It does, however, insist on a single noise variance σ 2 , which two
separate regressions would not impose.
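A quick numerical check of this equivalence (Python/NumPy, simulated data): the fully interacted model recovers exactly the same intercepts and slopes as two separate simple regressions; the only difference is that it pools the residuals into a single estimate of σ².

import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.uniform(0, 5, n)
xb = rng.integers(0, 2, n)
y = 1 + 2 * xb + 0.5 * x1 + 1.5 * xb * x1 + rng.normal(scale=1.0, size=n)

# Fully interacted model: separate intercept and slope for each category.
X = np.column_stack([np.ones(n), xb, x1, xb * x1])
a, bb, b1, b1b = np.linalg.lstsq(X, y, rcond=None)[0]

# Two separate simple regressions, one per category.
def simple_fit(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

a0, s0 = simple_fit(x1[xb == 0], y[xb == 0])
a1, s1 = simple_fit(x1[xb == 1], y[xb == 1])

# The interacted model reproduces the per-category intercepts and slopes ...
print(np.allclose([a, a + bb, b1, b1 + b1b], [a0, a1, s0, s1]))
# ... but it pools the residuals into one estimate of sigma^2, which the
# two separate regressions would not do.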
