Lecture7 - Regression Extensions

The document discusses regression analysis techniques for categorical variables, including the creation of dummy variables, interactions between variables, and the effects of seasonality and time trends. It explains how to interpret coefficients for categorical variables and the importance of reference groups in regression models. Additionally, it provides practical examples and applications of these concepts in data analysis.

Regression Extensions

Chu Junhong
HKU Business School
Road Map
Regression on categorical variables
Interactions
  Different slopes
  Different intercepts
Seasonality
  Day of week effect
  Month of year effect
  Hour of day effect
Time trend
Categorical Variables
A categorical variable, also called a qualitative variable, takes one of a countable number of distinct (and fixed) groups (attribute levels) and assigns each individual to a particular group on the basis of some qualitative property.
Examples: religion, gender, undergraduate university, college major, product brand, distribution channel, season, day of the week, …
The groups have no natural order, i.e., you cannot say that winter is better/worse than summer.
Categorical Variables
In the regressions covered here, categorical variables are not used as the Y variable; they enter only as X variables
However, statistical models can only work on numerical data
If you have categorical variables (qualitative data), what should you do?

But the numbers 1, …, 7 are just labels. They carry no numerical meaning (e.g., 7 is not larger than 6 by 1) and can easily be relabeled with, say, A, …, G, without any loss of information.
We cannot simply put these numbers into a regression equation.
Solution: Create Dummy Variables
A dummy variable is an indicator that
takes only two values: 0 and 1
is used to represent each response or attribute level of a categorical variable
# of dummy variables needed = # of levels – 1
Examples
Gender: 1) male, 2) female => need 1 dummy variable
Education: 1) no schooling, 2) primary, 3) middle school, 4) high school, 5) college+ => need 4 dummy variables
Religion: 1) Catholic, 2) Christian, 3) Buddhist, 4) Muslim, 5) Free thinker, 6) others => need 5 dummy variables
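The counting rule above can be sketched in a few lines of Python. This is only an illustration: the function name `make_dummies` and the education data are made up, and the slides themselves build dummies in SAS.

```python
def make_dummies(values, levels, reference):
    """Return one 0/1 column per non-reference level (K-1 columns for K levels)."""
    kept = [level for level in levels if level != reference]
    return {level: [int(v == level) for v in values] for level in kept}

# Made-up education data with K = 5 levels; "no schooling" is the base.
levels = ["no schooling", "primary", "middle school", "high school", "college+"]
education = ["primary", "college+", "high school", "no schooling", "college+"]
dummies = make_dummies(education, levels, reference="no schooling")

# Adding a dummy for the base as well would make the K columns sum to 1 in
# every row, which duplicates the intercept's column of 1's.
base_column = [int(v == "no schooling") for v in education]
row_sums = [base_column[i] + sum(col[i] for col in dummies.values())
            for i in range(len(education))]
```

Here `dummies` holds 4 columns for 5 levels, and `row_sums` is 1 in every row, which is why the last dummy adds no new information once an intercept is in the model.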
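Why Only K-1 Dummies for K Attributes?

What is the corresponding $x$ for the intercept $\beta_0$?

$x_0 = 1$ (a vector of 1’s)
The K dummies of a categorical variable add up to this vector of 1’s, so a regression with an intercept can include only K-1 of them; otherwise the X variables are perfectly collinear.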
Treat Numeric Var as
Categories
We can also treat numeric variables (e.g.,
income bracket, distance segment) as
categorical
To examine the effect of each group
separately
To allow for non-linear effects (will use
examples)
Linear effect: the effect of increasing from 1 to 2 is the same as increasing from 2 to 3, or from 3 to 4, …
Nonlinear effect: the effect of increasing from 1 to 2 ≠ from 2 to 3 ≠ from 3 to 4, …
How to Interpret Dummy Variables?
For k levels (attributes, responses), we can only have k-1 dummy variables in a regression as X variables if the variable enters alone.
One level (attribute / response) is reserved as the “base”; all interpretation is relative to the base/reference, whose coefficient is 0
If the coefficient is positive, the level is higher/larger than the base
If the coefficient is negative, the level is lower/smaller than the base
If female is the base:
When the male coefficient is positive, it means that “compared to females, males are on average taller by…”
When the male coefficient is negative, it means that “compared to females, males on average have lower xxx by…”
Whichever group is used as the reference, the coefficients for the dummies will be different, but the implied differences between groups, and hence the interpretations, will be the same
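The point about reference groups can be checked with a small sketch. It relies on a standard result: when the only X variables are the dummies of one categorical variable, the OLS intercept is the base group's mean and each dummy coefficient is that group's mean minus the base mean. The function name and the height data are made up for illustration.

```python
from statistics import mean

def dummy_regression(y, groups, reference):
    """Closed-form OLS of y on an intercept plus K-1 group dummies.

    Valid only when the dummies of a single categorical variable are the
    sole regressors: intercept = mean of the reference group, and each
    coefficient = group mean minus reference mean.
    """
    group_means = {g: mean(v for v, gv in zip(y, groups) if gv == g)
                   for g in set(groups)}
    intercept = group_means[reference]
    coefs = {g: m - intercept for g, m in group_means.items() if g != reference}
    return intercept, coefs

# Made-up heights in cm.
heights = [178.0, 182.0, 165.0, 161.0, 169.0]
gender = ["male", "male", "female", "female", "female"]

b0_female_base, coef_female_base = dummy_regression(heights, gender, "female")
b0_male_base, coef_male_base = dummy_regression(heights, gender, "male")
```

With female as the base the male coefficient is positive; with male as the base the female coefficient is the same number with the opposite sign. Both say the same thing about the gap between the two groups.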
Check whether WTP depends on your undergraduate major
Count the frequency of each major and sort in descending order
Create IDs for undergraduate majors
Merge the undergraduate major IDs back to the data
Generate dummies for undergraduate majors
Different Intercepts
Using “array” to generate dummies for major IDs
Run regressions on dummies
Use MajorID=1 as the reference
Use MajorID=4 as the reference
It’s quite cumbersome to create dummy variables, esp. when you have a large number of attributes. You can use Proc GLM to treat each level of an attribute as a dummy.
Change the default reference group
Different Intercepts, Same Slope for each Undergraduate Major: Parallel Lines
[Figure: WTP for wine (vertical axis) against height (horizontal axis), with one parallel line per major]
Interactions (1): categorical*continuous
In data analysis, we often interact two independent variables.

If $x_1$ is categorical (undergraduate major) and $x_2$ is continuous (height, price), then $x_1 * x_2$ means that you will have one slope for each level of $x_1$. You will have K slopes.
The different effects of your height on your WTP by undergraduate major
The different effects of father’s height on your height by undergraduate major
With 3 outlet types in the IRI data (grocery store, mass merchandiser, and drug store), you will have a slope for each of them
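One way to picture "one slope per level" is to look at the X columns that a categorical*continuous interaction generates. The sketch below builds them by hand; the function name, the outlet codes used as labels, and the price values are illustrative assumptions, not taken from the IRI data.

```python
def slope_columns(categories, x, levels):
    """One column per level: x where the row belongs to that level, else 0."""
    return {level: [xi if c == level else 0.0 for c, xi in zip(categories, x)]
            for level in levels}

# Made-up rows: outlet type (GR = grocery, MA = mass merchandiser, DR = drug
# store) and a continuous variable such as log price.
outlet = ["GR", "MA", "DR", "GR", "DR"]
log_price = [0.5, 0.7, 0.9, 0.4, 1.1]

columns = slope_columns(outlet, log_price, ["GR", "MA", "DR"])
```

Each of the K = 3 columns gets its own coefficient in the regression, so each outlet type has its own slope. The columns add up to the original continuous variable, since every row belongs to exactly one level.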
Different Slopes for each Major
[Figure: WTP for wine (vertical axis) against height (horizontal axis), with one line per major, each with its own slope]
Interactions (2): categorical + categorical*continuous
In data analysis, we often interact two independent variables.

If $x_1$ is categorical (undergraduate major) and $x_2$ is continuous (height, price), then $x_1$ means that you will have one intercept shift for each non-base level of $x_1$ (K-1 in total); $x_1 * x_2$ means that you will have one slope for each level of $x_1$. You will have K slopes.
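When every level gets both its own intercept and its own slope, the fitted lines are the same as those from a separate simple regression for each level. The sketch below uses that equivalence to avoid building the full design matrix; the majors, heights, and WTP values are made up, and the helper names are hypothetical.

```python
from statistics import mean

def simple_ols(x, y):
    """Intercept and slope of a one-regressor least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def lines_by_group(groups, x, y):
    """Fit one line per group: different intercepts and different slopes."""
    fits = {}
    for g in sorted(set(groups)):
        xs = [xi for gi, xi in zip(groups, x) if gi == g]
        ys = [yi for gi, yi in zip(groups, y) if gi == g]
        fits[g] = simple_ols(xs, ys)
    return fits

# Made-up data: WTP for wine against height for two hypothetical majors.
major = ["econ", "econ", "econ", "eng", "eng", "eng"]
height = [160.0, 170.0, 180.0, 160.0, 170.0, 180.0]
wtp = [20.0, 25.0, 30.0, 40.0, 41.0, 42.0]

fits = lines_by_group(major, height, wtp)  # {major: (intercept, slope)}
```

In a single regression with a base major, the base major's line is the intercept and the height slope; every other major's line is recovered by adding its dummy coefficient and its interaction coefficient to those.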
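Different intercepts + different slopes
[Figure: WTP for wine (vertical axis) against height (horizontal axis), with one line per major, each with its own intercept and slope]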
Interactions: Categorical*Categorical

In data analysis, we often interact two independent variables.

If $x_1$ and $x_2$ are both categorical, then $x_1 * x_2$ means that you will have one intercept for each combination of these two categorical variables.
3 outlet types (grocery store, drug store, mass merchandiser) and 2 markets (Eau Claire and Pittsfield): 6 combinations
4 majors and 2 genders: 8 combinations
One intercept for each combination
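The combinations can be enumerated directly. This is a minimal sketch: the function name `cell_dummies` is hypothetical, and it shows an indicator for every cell, whereas a regression with an intercept would drop one cell as the base.

```python
from itertools import product

outlets = ["grocery store", "drug store", "mass merchandiser"]
markets = ["Eau Claire", "Pittsfield"]
cells = list(product(outlets, markets))  # every outlet-by-market combination

def cell_dummies(outlet, market):
    """0/1 indicator for each outlet-by-market cell, for one observation."""
    return {cell: int(cell == (outlet, market)) for cell in cells}

row = cell_dummies("drug store", "Pittsfield")

# Same counting for 4 majors and 2 genders.
n_major_gender_cells = len(list(product(range(4), range(2))))
```

Each observation falls in exactly one cell, so exactly one indicator in `row` equals 1.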
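Interactions: Continuous*Continuous

In data analysis, we often interact two independent variables.

If $x_1$ and $x_2$ are both continuous (e.g., advertising and discount), then $x_1 * x_2$ means that the marginal effect of $x_1$ on $y$ depends on the value of $x_2$ and vice versa.
With $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$, the marginal effect of $x_1$ is $\beta_1 + \beta_3 x_2$.
$x_1$, $x_2$: main effects
$x_1 * x_2$: interaction effects
Main effects + Interaction Effects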
Seasonality and Time Trend
Seasonality means $y$ differs by season
Season can be year, quarter, month, day of week, hour of day, etc.
Need to create dummies to check seasonality
Time trend means a “long-term” increase or decrease (can be nonlinear) in $y$
Need to create a continuous variable and include it in the regression model
Seasonality and time trend can both be present in the same data
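The two ingredients, seasonal dummies and a continuous time variable, can both be derived from the date of each observation. The sketch below is one possible construction; the start date, the choice of Monday and January as bases, and the feature names are assumptions for illustration.

```python
from datetime import date, timedelta

def seasonal_features(day, start):
    """Day-of-week dummies, month dummies, and a linear time index for one date."""
    # weekday() is 0 for Monday, which serves as the base and gets no dummy.
    features = {f"dow_{d}": int(day.weekday() == d) for d in range(1, 7)}
    # January is the base month and gets no dummy.
    features.update({f"month_{m}": int(day.month == m) for m in range(2, 13)})
    # Continuous time trend: days elapsed since the first observation.
    features["time"] = (day - start).days
    return features

start = date(2024, 1, 1)  # hypothetical first day of the sample (a Monday)
rows = [seasonal_features(start + timedelta(days=k), start) for k in range(60)]
```

A nonlinear trend could be allowed by also including a transformation of `time`, such as its square, as an additional X variable.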
Seasonality with no time trend
Upward time trend with no seasonality
Downward time trend with no seasonality
Nonlinear time trend, no seasonality
Upward time trend with seasonality
Downward time trend with seasonality
Use Ride Data to Practice Seasonality + Time Trend
Seasonality – Day of Week Effect
Seasonality – Month of Year Effect
Seasonality – Day of Week & Month of Year Effect
Time Trend
Seasonality (DoW+Month) + Time Trend

After we control for the month effect and the day of week effect, there is no more time trend.
If we examine the month effect, October is the highest and August is the lowest, which likely captures the time trend.
To check on this, let’s drop the month effect from our model.
Drop Month Effect

The time trend becomes significant: the calls grow over time.
This confirms our guess that the month effect absorbs the time trend.
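One way to see why month dummies can absorb a time trend is to ask how much of a linear time index the month dummies already explain. The sketch below computes that R-squared as 1 minus the within-month share of the variation. The date range is purely an assumption for illustration (three months of daily data, 1 August to 31 October of an arbitrary year); the slides do not state the actual span of the Ride data.

```python
from datetime import date, timedelta
from statistics import mean

def share_explained_by_month(days):
    """R-squared from regressing a linear time index on month dummies."""
    t = list(range(len(days)))
    overall = mean(t)
    total_ss = sum((ti - overall) ** 2 for ti in t)
    within_ss = 0.0
    for m in {d.month for d in days}:
        group = [ti for ti, d in zip(t, days) if d.month == m]
        g_mean = mean(group)
        within_ss += sum((ti - g_mean) ** 2 for ti in group)
    return 1 - within_ss / total_ss

# Assumed sample window, for illustration only.
start = date(2023, 8, 1)
days = [start + timedelta(days=k) for k in range(92)]
r2 = share_explained_by_month(days)
```

Under this assumed window the month dummies explain most of the variation in the time index, so little independent variation is left to identify a separate trend once they are in the model.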
Let’s also practice interactions with the Ride data
Price sensitivity by distance

The longer the distance, the more price sensitive the demand
Price sensitivity by distance

Main effect of distance: when distance increases by 1km, the demand increases by exp(0.05458)-1 = 5.61%
Main effect of price: when price increases by 1%, the demand will decrease by 0.0167%
Interaction effect of price*distance: when distance increases by 1km, price sensitivity increases by 0.0744 (the price elasticity becomes more negative by 0.0744).
When distance = 1km, the total price elasticity = -0.0166-0.0744 = -0.0910
When distance = 10km, the total price elasticity = -0.0166-0.0744*10 = -0.7606
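The arithmetic behind these interpretations can be reproduced with the coefficients quoted on the slide. The constant and function names below are made up for illustration; the numbers are the ones in the text.

```python
import math

# Coefficients as quoted in the text.
B_DISTANCE = 0.05458     # main effect of distance (per km) on log demand
B_LOG_PRICE = -0.0166    # main effect of log price
B_INTERACTION = -0.0744  # log price x distance

def price_elasticity(distance_km):
    """Total price elasticity at a given distance: main effect plus interaction."""
    return B_LOG_PRICE + B_INTERACTION * distance_km

# Percent change in demand for one extra km, from the log-linear main effect.
distance_effect_pct = (math.exp(B_DISTANCE) - 1) * 100

elasticity_1km = price_elasticity(1)
elasticity_10km = price_elasticity(10)
```

Calling `price_elasticity` at other distances shows how quickly demand becomes more price sensitive as trips get longer.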
Demand levels + price sensitivity by distance segment
Optional: I also use the IRI data to practice seasonality, time trend, and interactions. For those who are interested in learning more, please take a look by yourself.
IRI Coffee Purchases: Prepare Data
Import panelists’ coffee purchases in 2004
Import panelists’ coffee purchases in 2005
Combine two years’ data
Convert package sizes into equivalent units
Import demographics
Merge demographics with purchase data
Convert IRI week into calendar week for seasonality check
Demand Analysis: log-log regression

[Two regression outputs: R2 = 0.1476 and R2 = 0.4378]
Whether Purchase Q and Price Sensitivity Vary by Outlet Type

Base: “MA”
The average purchase Q in “GR” is exp(-0.436) = 64.7% of that in “MA”; the average purchase Q in “DR” is exp(-1.047724) = 35.1% of that in “MA”.
The price elasticity is -0.41 in “DR”, -0.93 in “GR”, and -1.11 in “MA”.

R2 = 0.4427
Whether Purchase Q and Price Sensitivity Vary by Outlet Type

Base: “DR”
The average purchase Q in “GR” is exp(0.6116) = 1.84 times that in “DR”; the average purchase Q in “MA” is exp(1.047724) = 2.85 times that in “DR”.
The price elasticity is -0.41 in “DR”, -0.93 in “GR”, and -1.11 in “MA”.

R2 = 0.4427
Compare the Results with Different Bases (References)

The slopes are identical
The intercepts: add the intercept to the coefficients of the 3 outlets to see whether they are identical.
Conclusion: whichever group is used as the base, it will not affect the interpretation.
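Changing the base only shifts every dummy coefficient by the same constant, which a short sketch can confirm using the outlet coefficients quoted earlier with “MA” as the base. The function name `rebase` is hypothetical. Because -0.436 is a rounded value, the re-based “GR” coefficient differs slightly from the 0.6116 reported in the other output.

```python
import math

# Outlet coefficients with "MA" as the base (the base's coefficient is 0).
coef_base_ma = {"MA": 0.0, "GR": -0.436, "DR": -1.047724}

def rebase(coefs, new_base):
    """Re-express dummy coefficients relative to a different reference group."""
    shift = coefs[new_base]
    return {group: b - shift for group, b in coefs.items()}

coef_base_dr = rebase(coef_base_ma, "DR")

# In a log-quantity model, exp(coefficient) is the ratio of average purchase Q
# to that of the base group.
ratio_vs_dr = {group: math.exp(b) for group, b in coef_base_dr.items()}
```

The differences between any two outlets are the same under either base, which is why the interpretation does not change.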
“noint” option allows all groups to have coefficients
Seasonality, No Time Trend
Seasonality and Time Trend
Check Seasonality in Price Sensitivity and Purchase Quantity

Parameter    Est. of mean Q    S.E.      Elasticity    S.E.
Intercept     1.3024           0.0191
month 1      -0.1026           0.0264    -0.9280       0.0147
month 2      -0.0709           0.0289    -0.9497       0.0164
month 3       0.0400           0.0264    -0.9842       0.0148
month 4      -0.0233           0.0297    -0.9584       0.0172
month 5      -0.0099           0.0286    -0.9653       0.0159
month 6       0.0723           0.0325    -1.0006       0.0191
month 7      -0.0925           0.0311    -0.9230       0.0183
month 8      -0.0078           0.0286    -0.9973       0.0158
month 9       0.0034           0.0307    -0.9887       0.0177
month 10     -0.0865           0.0292    -0.9513       0.0164
month 11     -0.2438           0.0249    -0.8485       0.0128
month 12      0.0000           .         -1.0078       0.0140
time          0.0025           0.0001
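The table can be read off programmatically. The sketch below assumes, in line with the log-log demand analysis, that the dependent variable is log quantity, so exp(coefficient) - 1 is the percent difference in mean Q relative to December (the base month). The variable names are made up; the numbers are copied from the table.

```python
import math

# (month, coefficient on mean Q relative to December, price elasticity)
MONTHS = [
    (1, -0.1026, -0.9280), (2, -0.0709, -0.9497), (3, 0.0400, -0.9842),
    (4, -0.0233, -0.9584), (5, -0.0099, -0.9653), (6, 0.0723, -1.0006),
    (7, -0.0925, -0.9230), (8, -0.0078, -0.9973), (9, 0.0034, -0.9887),
    (10, -0.0865, -0.9513), (11, -0.2438, -0.8485), (12, 0.0000, -1.0078),
]

def pct_vs_december(coef):
    """Percent difference in mean Q relative to December, assuming log Q as Y."""
    return (math.exp(coef) - 1) * 100

lowest_q_month = min(MONTHS, key=lambda row: row[1])[0]
highest_q_month = max(MONTHS, key=lambda row: row[1])[0]
most_elastic_month = min(MONTHS, key=lambda row: row[2])[0]
least_elastic_month = max(MONTHS, key=lambda row: row[2])[0]
november_gap_pct = pct_vs_december(-0.2438)
```

These are point estimates only; whether two months really differ should be judged against the standard errors in the table.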
