Lecture 19: Interactions
Let
m(x) = E[Y | X = x]
where x = (x1, ..., xp). We say that there is no interaction between Xj and Xk if
∂m(x)/∂xj
does not depend on xk.
Consider the linear model
m(x) = β0 + β1 x1 + β2 x2 .
Then ∂m(x)/∂x1 = β1 and ∂m(x)/∂x2 = β2. There are no interactions.
Now suppose that
m(x) = β0 + β1 x1 + β2 x2 + β3 x1 x2 .
Then ∂m(x)/∂x1 = β1 + β3 x2 and ∂m(x)/∂x2 = β2 + β3 x1. So we say there is an interaction between x1 and x2.
If your model does not fit well, then adding interactions is yet another way to improve the fit
of the model. You could plot the residuals versus X1 X2, or just add the interaction to the model:
Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε. (1)
Once we add such a term, we estimate β3 in exactly the same way we’d estimate any other coefficient.
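Here is a quick sketch of both options (the data frame df and the variables y, x1, x2 are hypothetical placeholders):
fit.add = lm(y ~ x1 + x2, data = df)           # additive model, no interaction
plot(df$x1 * df$x2, residuals(fit.add),
     xlab = "x1 * x2", ylab = "residuals")      # a systematic trend here suggests a missing interaction
fit.int = lm(y ~ x1 + x2 + x1:x2, data = df)    # model (1), with the interaction term
summary(fit.int)                                # beta3 is the coefficient on x1:x2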
People often call β1 and β2 the main effects and they call β3 the interaction effect. This is not
the greatest terminology but it is pretty standard. Usually people don’t add interactions into a
model without adding the main effects. So it’s rare to see a model of the form Y = β0 + β3 X1 X2 + ε.
Adding in the main effects gives a model with more flexibility and generality.
2 Interactions With Categorical Variables
Now suppose that XB is a binary (0/1) variable and consider the model
Y = β0 + β1 X1 + β1B XB X1 + ε. (2)
When XB = 0, the slope on X1 is β1 , but when XB = 1, the slope on X1 is β1 + β1B ; the coefficient
for the interaction is the difference in slopes between the two categories.
In fact, look closely at Eq. 2. It says that the categories share a common intercept, but their
regression lines are not parallel (unless β1B = 0). We could expand the model by letting each
category have its own slope and its own intercept:
Y = β0 + βB XB + β1 X1 + β1B XB X1 + ε.
This model is similar to running two separate regressions, one per category. It does, however, insist
on a single noise variance σ^2, which separate regressions would not impose (a short sketch of this
comparison appears below). Also, if there were additional predictors in the model which were not
interacted with the category, e.g.,
Y = β0 + βB XB + β1 X1 + β1B XB X1 + β2 X2 + ε,
then this would definitely not be the same as running two separate regressions. We can also add
categorical variables and interactions with categorical variables. Just remember that a categorical
variable with k levels requires adding only k − 1 indicator variables.
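Here is a minimal sketch of that comparison between one interacted model and two per-group fits (df, y, x1, and the 0/1 variable xb are hypothetical):
fit.both = lm(y ~ x1 + xb + x1:xb, data = df)    # one model, one common sigma^2
fit.0 = lm(y ~ x1, data = subset(df, xb == 0))   # separate fit for the xb = 0 group
fit.1 = lm(y ~ x1, data = subset(df, xb == 1))   # separate fit for the xb = 1 group
coefficients(fit.both)   # intercept and slope match fit.0; adding the xb terms recovers fit.1
The point estimates agree, but the single model pools the residuals to estimate one common σ^2, so the reported standard errors differ.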
For example, with two binary variables XB and XC, consider the model
Y = β0 + β1 XB + β2 XC + β3 XB XC + ε,
so that the four conditional (cell) means are
E [Y |XB = 0, XC = 0] = β0 (3)
E [Y |XB = 1, XC = 0] = β0 + β1 (4)
E [Y |XB = 0, XC = 1] = β0 + β2 (5)
E [Y |XB = 1, XC = 1] = β0 + β1 + β2 + β3 (6)
Conversely, these give us four equations in four unknowns, so if we know the group or conditional
means on the left-hand sides, we could solve these equations for the β’s.
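Explicitly, subtracting Eq. 3 from Eqs. 4 and 5, and then combining all four equations, gives
β0 = E[Y | XB = 0, XC = 0]
β1 = E[Y | XB = 1, XC = 0] − E[Y | XB = 0, XC = 0]
β2 = E[Y | XB = 0, XC = 1] − E[Y | XB = 0, XC = 0]
β3 = E[Y | XB = 1, XC = 1] − E[Y | XB = 1, XC = 0] − E[Y | XB = 0, XC = 1] + E[Y | XB = 0, XC = 0]
so the interaction coefficient β3 is the “difference in differences” between the cell means.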
3 Higher-Order Interactions
Nothing stops us from considering interactions among three or more variables, rather than just
two. For example,
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X1 X2 + β5 X1 X3 + β6 X2 X3 + β7 X1 X2 X3 + ε.
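In R’s formula notation (covered in the next section), this full three-way model can be written compactly, since x1*x2*x3 expands to all main effects, all two-way products, and the three-way product:
lm(y ~ x1*x2*x3)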
As you can see, these models get complicated very quickly. Also, we have to ask ourselves: which
interactions should I add? For example, I could have added X1^2 X2 into the model as well as other
terms. We are now entering the realm of model-building and model-selection that we will discuss
in a future lecture. For now, we will try to keep our models fairly simple.
4 Interactions in R
The lm function is set up to comprehend multiplicative or product interactions in model formulas.
Pure product interactions are denoted by :, so the formula
lm(y ~ x1:x2)
tells R to fit the model Y = β0 + β X1 X2 + ε. (Intercepts are included by default in R.) Since it is
relatively rare to include just a product term without linear terms, it’s more common to use the
symbol *, which expands out to both sets of terms. That is,
lm(y ~ x1*x2)
is the same as
lm(y ~ x1 + x2 + x1:x2)
Both : and * distribute over + in formulas, so
(x1+x2):(x3+x4)
is the same as
x1:x3 + x1:x4 + x2:x3 + x2:x4
Also,
(x1+x2)*(x3+x4)
is the same as
x1 + x2 + x3 + x4 + x1:x3 + x1:x4 + x2:x3 + x2:x4
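As a small check that two such formulas describe the same model (df, y, x1, x2 hypothetical):
fit.star = lm(y ~ x1*x2, data = df)
fit.plus = lm(y ~ x1 + x2 + x1:x2, data = df)
all.equal(coefficients(fit.star), coefficients(fit.plus))   # TRUE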
The reason you can’t just write x1^2 in your model formula is that the power operator also has
a special meaning in formulas, of repeatedly *-ing its argument with itself. That is,
(x1+x2+x3)^2
is the same as
(x1+x2+x3)*(x1+x2+x3)
which is
x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3
since a product like x1:x1 collapses to just x1 (which is also why x1^2 on its own never gives you a quadratic term).
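You can verify expansions like these by inspecting the design matrix R builds from a formula; for numeric predictors x1, x2, x3 in a hypothetical data frame df,
colnames(model.matrix(~ (x1+x2+x3)^2, data = df))
## "(Intercept)" "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"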
poly and interactions. If you want to use poly to do polynomial regression and interactions,
do this:
lm(y ~ poly(x1,x2,degree=2))
This fits the model
Y = β0 + β1 X1 + β2 X1^2 + β3 X2 + β4 X2^2 + β5 X1 X2 + ε.
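One caveat, easy to check in your own session: by default poly() builds orthogonal polynomials, so the printed coefficients are not the raw β’s in the display above, even though the fitted values are identical; setting raw = TRUE gives the raw-power parameterization:
lm(y ~ poly(x1, x2, degree = 2, raw = TRUE))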
4.1 Example
Let’s continue with the mobility data. First, here is a useful trick:
x = c("a","b","c","d","e","f")
y = c("a","b")
z = x %in% y
print(z)
## [1] TRUE TRUE FALSE FALSE FALSE FALSE
The command
%in%
is a matching operator.
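For the record, x %in% y is just a convenient wrapper around match(); the line below is how base R defines it:
match(x, y, nomatch = 0) > 0   # same logical vector as x %in% y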
Let’s use this to create a binary variable indicating whether a state was or was not part of the
Confederacy in the Civil War.
Confederacy = c("AR","AL","FL","GA","LA","MS","NC","SC","TN","TX","VA")
mobility$Dixie = mobility$State %in% Confederacy
out = lm(Mobility ~ Commute*Dixie,data=mobility)
summary(out)
The coefficient for the interaction is negative, suggesting that increasing the fraction of workers
with short commutes predicts a smaller increase in the rate of mobility in the South than it does in the
rest of the country. This coefficient is not significantly different from zero, but, more importantly,
we can be confident it is small compared to the baseline value of the slope on Commute:
confint(out)
## 2.5 % 97.5 %
## (Intercept) 0.00543 0.03220
## Commute 0.16900 0.22200
## DixieTRUE -0.04470 0.00225
## Commute:DixieTRUE -0.05680 0.05420
Thus, even if the South does have a different slope than the rest of the country, it is not a very
different slope.
The difference in the intercept, however, is more substantial. It, too, is not significant at the
5% level, but that is because (as we see from the confidence interval) it might be quite large and
negative or perhaps just barely positive — it’s not so precisely measured, but it’s either lowering the
expected rate of mobility or adding to it trivially. Of course, we should really do all our diagnostics
here before paying much attention to these inferential statistics.
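Finally, here is a sketch of how we might plot the two fitted lines over the data, using the coefficient names from the output above:
plot(Mobility ~ Commute, data = mobility, pch = 19,
     col = ifelse(mobility$Dixie, "blue", "black"))
cf = coefficients(out)
abline(a = cf["(Intercept)"], b = cf["Commute"])                   # rest of the country
abline(a = cf["(Intercept)"] + cf["DixieTRUE"],
       b = cf["Commute"] + cf["Commute:DixieTRUE"], col = "blue")  # former Confederacy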