Lecture7 - Regression Extensions
Chu Junhong
[email protected]
HKU Business School
Road Map
Regression on categorical variables
Interactions
Different slopes
Different intercepts
Seasonality
Day of week effect
Month of year effect
Hour of day effect
Time trend
Categorical Variables
A categorical variable, also called a qualitative variable, takes one of a
fixed, countable number of distinct groups (attribute levels). Each
individual is assigned to exactly one group on the basis of some
qualitative property.
Examples: religion, gender, undergraduate university, college major,
product brand, distribution channel, season, day of the week, …
The groups have no natural order, i.e., you cannot say that winter is
better/worse than summer.
Categorical Variables
In the regressions covered in this lecture, categorical variables are
not used as the Y variable; they enter only as X variables
However, statistical models can only work on numerical data
If you have categorical variables (qualitative data), what should you do?
But the numbers 1, …, 7 are just labels. They carry no numerical
value (e.g., 7 is not larger than 6 by 1) and can be relabeled as,
say, A, …, G without any loss of information.
We cannot simply put these numbers into a regression equation
Solution: Create Dummy Variables
A dummy variable is an indicator that takes only two values, 0 and 1,
and is used to represent one response or attribute level of a
categorical variable
# of dummy variables needed = # of levels - 1
Examples
Gender: 1) male, 2) female => need 1 dummy variable
Education: 1) no schooling, 2) primary, 3) middle
school, 4) high school, 5) college+ => need 4 dummy
variables
Religion: 1) Catholic, 2) Christian, 3) Buddhist, 4)
Muslim, 5) Free thinker, 6) others => need 5 dummy
variables
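The dummy coding above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's own code: the helper `make_dummies` and the five sample responses are hypothetical, and "no schooling" is chosen as the base level purely for the example.

```python
def make_dummies(values, levels, base):
    """Return one 0/1 column per non-base level (K-1 columns for K levels)."""
    kept = [lev for lev in levels if lev != base]
    return {lev: [1 if v == lev else 0 for v in values] for lev in kept}


# Hypothetical sample: education level of five respondents.
levels = ["no schooling", "primary", "middle school", "high school", "college+"]
education = ["primary", "college+", "high school", "no schooling", "college+"]

# 5 levels -> 4 dummy columns; the base level gets no column of its own.
dummies = make_dummies(education, levels, base="no schooling")
for level, column in dummies.items():
    print(level, column)
```

A respondent in the base level has 0 in every dummy column, which is how the base is identified without a column of its own.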
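Why Only K-1 Dummies for K Attributes?
What is the corresponding x for the intercept $\beta_0$?
$x_0 = 1$ (a vector of 1's)
The K dummies of a categorical variable add up to this same vector of 1's,
so including all K of them together with the intercept makes the X
columns perfectly collinear. Dropping one dummy (the base) removes the problem.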
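A small numerical check of why only K-1 dummies are used, written as a sketch with NumPy. The six observations and the 3-level variable are hypothetical, and NumPy is an assumption here, not the software used in the lecture.

```python
import numpy as np

# Hypothetical data: six observations of a 3-level categorical variable.
levels = np.array([0, 1, 2, 0, 1, 2])
all_dummies = (levels[:, None] == np.arange(3)).astype(float)  # K = 3 columns
intercept = np.ones((len(levels), 1))

# The K dummies sum to the vector of 1's in every row.
row_sums = all_dummies.sum(axis=1)

# Intercept + K dummies: 4 columns but only rank 3 (perfect collinearity).
X_full = np.hstack([intercept, all_dummies])
rank_full = np.linalg.matrix_rank(X_full)

# Intercept + K-1 dummies: 3 columns with rank 3, so OLS has a unique solution.
X_reduced = np.hstack([intercept, all_dummies[:, 1:]])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(rank_full, X_full.shape[1])
print(rank_reduced, X_reduced.shape[1])
```

When the rank is smaller than the number of columns, the regression coefficients are not uniquely determined, which is why one level is held out as the base.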
Treat Numeric Var as Categories
We can also treat numeric variables (e.g., income bracket, distance
segment) as categorical
To examine the effect of each group separately
To allow for non-linear effects (will use examples)
Linear effect: the effect of increasing from 1 to 2 is the same as
increasing from 2 to 3, or from 3 to 4, …
Nonlinear effect: the effect of increasing from 1 to 2 ≠ from 2 to 3
≠ from 3 to 4, …
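The difference between a linear effect and a categorical (non-linear) treatment of a numeric variable can be sketched as follows. The income-bracket data are made up for illustration, with an outcome that first rises and then falls, and NumPy's least-squares solver stands in for whatever regression software the course uses.

```python
import numpy as np

# Hypothetical data: income bracket (1-4) and an outcome that rises, then falls.
bracket = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
y = np.array([10, 12, 20, 22, 21, 23, 11, 13], dtype=float)
ones = np.ones_like(bracket)

# (a) Linear effect: a single slope for bracket.
X_lin = np.column_stack([ones, bracket])
b_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# (b) Categorical: bracket 1 is the base, one dummy each for brackets 2, 3, 4.
X_cat = np.column_stack([ones] + [(bracket == k).astype(float) for k in (2, 3, 4)])
b_cat, *_ = np.linalg.lstsq(X_cat, y, rcond=None)

# Sum of squared residuals under each specification.
sse_lin = float(np.sum((y - X_lin @ b_lin) ** 2))
sse_cat = float(np.sum((y - X_cat @ b_cat) ** 2))

print("dummy coefficients (relative to bracket 1):", b_cat[1:])
print("SSE linear:", sse_lin, "SSE categorical:", sse_cat)
```

With dummies, each bracket gets its own step relative to the base, so the fitted effect of moving from 1 to 2 need not equal the effect of moving from 3 to 4.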
How to Interpret Dummy Variables?
For k levels (attributes, responses), we can only have k-1 dummy
variables in a regression as X variables if the variable enters alone.
One level (attribute / response) is reserved as the “base”; all
interpretation is relative to the base/reference, whose coefficient is 0
If a coefficient is positive, that level is higher/larger than the base
If a coefficient is negative, that level is lower/smaller than the base
If female is the base:
When the male coefficient is positive, it means that
“compared to females, males are on average taller by…”
When the male coefficient is negative, it means that
“compared to females, males on average have lower xxx by…”
The dummy coefficients change with the choice of reference group, but
the substantive interpretation (the estimated differences between
groups) stays the same
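The effect of switching the reference group can be illustrated with a tiny example. The six heights below are hypothetical and NumPy is an assumption; the point is only that the coefficients change with the base while the implied group difference does not.

```python
import numpy as np

# Hypothetical data: heights (cm) of three females and three males.
is_male = np.array([0, 0, 0, 1, 1, 1], dtype=float)
height = np.array([160, 162, 164, 172, 174, 176], dtype=float)
ones = np.ones_like(height)

# Female as the base: the regressor is the male dummy.
b_f, *_ = np.linalg.lstsq(np.column_stack([ones, is_male]), height, rcond=None)

# Male as the base: the regressor is the female dummy.
b_m, *_ = np.linalg.lstsq(np.column_stack([ones, 1 - is_male]), height, rcond=None)

print("female base: intercept, male coefficient =", b_f)
print("male base:   intercept, female coefficient =", b_m)
```

In each regression the intercept is the mean of the base group and the dummy coefficient is the other group's mean minus the base mean, so the two coefficients have opposite signs and equal size.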
Check whether WTP depends on your undergraduate major
Merge the undergraduate major IDs back to the data
Generate dummies for undergraduate majors
Different Intercepts
Using “array” to generate dummies for major IDs
Run regressions on dummies
Use MajorID=1 as the reference
Use MajorID=4 as the reference
[Figure: WTP for wine (vertical axis) against Height (horizontal axis), one line per major (maj)]
Interactions (1): categorical*continuous
In data analysis, we often interact two independent variables.

If $X_1$ is categorical (undergraduate major) and $X_2$ is continuous
(height, price), then $X_1 * X_2$ means that you will have one slope
for each level of $X_1$. You will have K slopes.
The different effects of your height on your WTP by undergraduate major
The different effects of father’s height on your height by
undergraduate major
3 outlet types in the IRI data: grocery stores, mass merchandisers,
and drug stores; you will have a slope for each of them
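A categorical*continuous interaction with a common intercept can be sketched as a design matrix with one "dummy times x" column per level. The data below are synthetic (three groups with slopes 1, 2, 3 and no noise), and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical data: K = 3 outlet types, common intercept 5, slopes 1, 2, 3.
group = np.repeat([0, 1, 2], 4)
x = np.tile([1.0, 2.0, 3.0, 4.0], 3)
true_slopes = np.array([1.0, 2.0, 3.0])
y = 5.0 + true_slopes[group] * x

# Interaction columns: dummy_k * x, one column per level (K slope columns).
dummies = (group[:, None] == np.arange(3)).astype(float)
X = np.column_stack([np.ones_like(x), dummies * x[:, None]])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slopes = coef[0], coef[1:]

print("common intercept:", intercept)
print("one slope per level:", slopes)
```

All K interaction columns can be kept here because they do not add up to the intercept column; that differs from plain dummies, where one level must be dropped.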
Different Slopes for each Major
[Figure: WTP for wine (vertical axis) against Height (horizontal axis), one line per major (maj), each with its own slope]
Interactions (2): categorical + categorical*continuous
In data analysis, we often interact two independent variables.

If $X_1$ is categorical (undergraduate major) and $X_2$ is continuous
(height, price), then $X_1$ means that you will have one intercept
for each level of $X_1$ (K-1 dummies in total, plus the base);
$X_1 * X_2$ means that you will have one slope for each level of
$X_1$. You will have K slopes.
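The "different intercepts + different slopes" specification can be sketched by combining K-1 intercept dummies with K interaction columns. The data are synthetic (three majors, noise-free) and every name below is a hypothetical choice for the illustration.

```python
import numpy as np

# Hypothetical data: K = 3 majors with intercepts 5, 7, 4 and slopes 1, 2, 3.
group = np.repeat([0, 1, 2], 4)
x = np.tile([1.0, 2.0, 3.0, 4.0], 3)
true_intercepts = np.array([5.0, 7.0, 4.0])
true_slopes = np.array([1.0, 2.0, 3.0])
y = true_intercepts[group] + true_slopes[group] * x

dummies = (group[:, None] == np.arange(3)).astype(float)
X = np.column_stack([
    np.ones_like(x),        # base intercept (major 0)
    dummies[:, 1:],         # K-1 intercept shifts relative to the base
    dummies * x[:, None],   # K slope columns, one per major
])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
base_intercept, shifts, slopes = coef[0], coef[1:3], coef[3:]

print("base intercept:", base_intercept)
print("intercept shifts vs. base:", shifts)
print("slopes:", slopes)
```

The intercept of a non-base major is the base intercept plus its shift, while each slope is read directly from its own interaction column.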
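Different intercepts + different slopes
[Figure: WTP for wine (vertical axis) against Height (horizontal axis), one line per major (maj), each with its own intercept and slope]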
Interactions: Categorical*Categorical
4 majors, 2 genders
One intercept for each combination
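"One intercept for each combination" can be sketched by coding each major-gender cell as its own indicator column. With 4 majors and 2 genders there are 8 cells, and the fitted coefficient for each cell equals that cell's mean. The data and names below are hypothetical.

```python
import numpy as np

# Hypothetical data: 4 majors x 2 genders, two observations per combination.
major = np.repeat([0, 1, 2, 3], 4)
gender = np.tile([0, 0, 1, 1], 4)
cell = major * 2 + gender  # index 0..7, one per major-gender combination

# Outcome with a major effect, a gender effect, an interaction, and small noise.
y = 10.0 * major + 3.0 * gender + 1.0 * major * gender + np.tile([-0.5, 0.5], 8)

# One indicator column per combination (no separate intercept column).
X = (cell[:, None] == np.arange(8)).astype(float)
cell_intercepts, *_ = np.linalg.lstsq(X, y, rcond=None)

# Direct cell means for comparison.
cell_means = np.array([y[cell == c].mean() for c in range(8)])

print(cell_intercepts)
```

This cell-indicator coding is equivalent to an intercept plus main-effect dummies plus interaction dummies; it is used here only because it makes the eight separate intercepts easy to read.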
Interactions: Continuous*Continuous
R2 = 0.1476 R2 = 0.4378
Whether Purchase Q and Price
Sensitivity vary by outlet type
R2 = 0.4427
Compare the Results with
Different Bases (References)