Multiple Regression Analysis: Further Issues


Department of Finance & Banking, University of Malaya

Multiple Regression
Analysis: Further Issues

Dr. Aidil Rizal Shahrin


[email protected]

October 10, 2020


Contents

1 Effects of Data Scaling on OLS Statistics


1.1 Standardized Coefficients

2 More on Functional Form


2.1 Logarithmic Functional Forms
2.1.1 Log-Linear Model
2.2 Model with Quadratics
2.3 Model with Interaction Terms
2.4 Computing Average Partial Effects

3 Goodness-of-Fit and Selection of Regressors


3.1 Adjusted R2
3.2 Choosing Nonnested Models

2/40 Aidil Rizal Shahrin University of Malaya Unofficial Beamer Theme


Effects of Data Scaling on OLS Statistics

i. Below is an equation relating infant birth weight in ounces
(bwght) to the number of cigarettes smoked per day (cigs)
and annual family income in thousands of dollars (faminc):

b̂wght = β̂0 + β̂1 cigs + β̂2 faminc   (1)



Effects of Data Scaling on OLS Statistics

Figure 1: Effects of Data Scaling

ii. The first column of Fig.1 reports the result of Eq.1.
Remember: bwght is in ounces, cigs is the number of
cigarettes per day, and faminc is in thousands of dollars.
Effects of Data Scaling on OLS Statistics

iii. Now change the unit of measurement of birth weight to
pounds (1 lb = 16 oz, so bwght/16 converts ounces to
pounds). Eq.1 with bwght converted to pounds is:

b̂wght/16 = β̂0 /16 + (β̂1 /16) cigs + (β̂2 /16) faminc   (2)

as reported in column 2 of Fig.1.



Effects of Data Scaling on OLS Statistics
iv. The difference between Eq.1 and Eq.2 in terms of
interpretation (one example only):

∆b̂wght = −.4634 ∆cigs   vs.   ∆b̂wghtlbs = −.0289 ∆cigs

A one-unit increase in cigs reduces b̂wght by 0.464 ounces
in the first equation, while in the second it reduces
b̂wghtlbs by 0.0289 pounds (0.0289 lbs × 16 = 0.464 oz,
the same answer).
v. How about statistical significance? Not affected. The
standard error for β̂1 in Eq.2 is also divided by 16, so the t
statistic for cigs is the same in Eq.1 and Eq.2: t = −5.058.
vi. The same goes for the CI in Eq.2: for cigs, the lower and
upper bounds are 16 times smaller than in the CI for cigs
in Eq.1 (just divide by 16).
vii. R2 is the same for Eq.1 and Eq.2; did you notice that?
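The scaling facts in points iii–vii can be sketched numerically. The snippet below uses simulated data (the variable names bwght, cigs, and faminc follow the slides, but the numbers are synthetic, not the textbook dataset): rescaling the dependent variable by 1/16 divides every coefficient by 16, divides SSR by 16² = 256, and leaves R² unchanged.

```python
# Sketch of data-scaling effects on OLS, with simulated data (assumed
# names bwght, cigs, faminc mirror the slides; values are synthetic).
import numpy as np

rng = np.random.default_rng(0)
n = 200
cigs = rng.integers(0, 20, n).astype(float)
faminc = rng.uniform(10, 60, n)
bwght = 116 - 0.46 * cigs + 0.09 * faminc + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), cigs, faminc])

def ols(y, X):
    """Return OLS coefficients, SSR, and R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    return beta, ssr, 1 - ssr / sst

b_oz, ssr_oz, r2_oz = ols(bwght, X)        # Eq.1: bwght in ounces
b_lb, ssr_lb, r2_lb = ols(bwght / 16, X)   # Eq.2: bwght in pounds

assert np.allclose(b_lb, b_oz / 16)        # every coefficient divided by 16
assert np.isclose(ssr_lb, ssr_oz / 256)    # SSR divided by 16^2
assert np.isclose(r2_lb, r2_oz)            # R^2 unchanged
```

The same logic explains point ix: SER is the square root of SSR (over fixed degrees of freedom), so it is divided by 16, not 256.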
Effects of Data Scaling on OLS Statistics

viii. Why do SSR and SER differ? Focus first on SSR. Let ûi be
the i-th residual from Eq.1. The corresponding residual
from Eq.2 is ûi /16, so the squared residual in Eq.2 is
(ûi /16)2 = û2i /256. That is why SSREq.2 = SSREq.1 /256.
ix. Since SER = σ̂ = √(SSR/(n − k − 1)) = √(SSR/1,385), we
have SEREq.2 = SEREq.1 /16.
x. Now we change the unit of the independent variable cigs to
packs, where 1 pack = 20 cigs (so packs = cigs/20). Eq.1
under this transformation is:

b̂wght = β̂0 + (20β̂1 )(cigs/20) + β̂2 faminc
       = β̂0 + (20β̂1 ) packs + β̂2 faminc   (3)

xi. The intercept and the slope coefficient on faminc are
unchanged. The results are in column 3 of Fig.1.
Effects of Data Scaling on OLS Statistics

xii. Why do we drop cigs in column 3? To avoid perfect
multicollinearity: packs is an exact linear function of cigs.
xiii. se(20β̂1 ) is 20 × se(β̂1 ) from Eq.1, so
tcigs = tpacks = −5.059; there is no effect on statistical
significance.
xiv. Do you notice that the SSR and SER for Eq.1 and Eq.3 are
the same? Why? Remember ûi = yi − ŷi : the residuals
involve only y, which is unchanged.



Standardized Coefficients

i. If the variables are measured on arbitrary scales, the best
way to interpret the regression is with standardized
coefficients.
ii. This means all variables have been standardized in the
sample by subtracting off their means and dividing by their
standard deviations.
iii. Let k = 3; then we have

yi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂3 xi3 + ûi   (4)

Taking the average of Eq.4 over the sample, we have

ȳ = β̂0 + β̂1 x̄1 + β̂2 x̄2 + β̂3 x̄3   (5)



Standardized Coefficients

iv. Eq.5 holds because the difference between ȳ and all the
terms on the RHS is the average residual, (Σni=1 ûi )/n = 0.
Now subtract Eq.5 from Eq.4 to get:

yi − ȳ = β̂1 (xi1 − x̄1 ) + β̂2 (xi2 − x̄2 ) + β̂3 (xi3 − x̄3 ) + ûi   (6)

v. Let σ̂y be the sample s.d. of y, σ̂1 the sample s.d. of x1 , and
so on. Then, with a little algebra, we have

(yi − ȳ)/σ̂y = (σ̂1 /σ̂y ) β̂1 [(xi1 − x̄1 )/σ̂1 ] + (σ̂2 /σ̂y ) β̂2 [(xi2 − x̄2 )/σ̂2 ]
            + (σ̂3 /σ̂y ) β̂3 [(xi3 − x̄3 )/σ̂3 ] + ûi /σ̂y   (7)

Each variable in Eq.7 has been standardized by replacing it
with its z-score.
Standardized Coefficients

vi. Rewriting Eq.7 and dropping the i subscript, we have:

zy = b̂1 z1 + b̂2 z2 + b̂3 z3 + error   (8)

where zy denotes the z-score of y, z1 the z-score of x1 , and
so on.
vii. The new coefficients in Eq.8,

b̂j = (σ̂j /σ̂y ) β̂j   for j = 1, 2, 3   (9)

are called standardized coefficients or beta coefficients.
viii. Interpretation of Eq.8: if x1 increases by one standard
deviation, then ŷ changes by b̂1 standard deviations.
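A small numerical check of Eq.9, on simulated data (the data-generating numbers here are illustrative assumptions): regressing the z-score of y on the z-scores of the x's reproduces b̂j = (σ̂j /σ̂y ) β̂j exactly.

```python
# Verify Eq.9 on synthetic data: z-score regression slopes equal the
# level-regression slopes rescaled by sigma_j / sigma_y.
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Three mildly correlated regressors (illustrative construction):
X = rng.normal(size=(n, 3)) @ np.array([[1, .3, 0], [0, 1, .2], [0, 0, 1.]])
y = 2 + X @ np.array([1.5, -0.7, 0.4]) + rng.normal(size=n)

def fit(y, X):
    """OLS with intercept; returns all coefficients."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

beta = fit(y, X)[1:]                      # slopes from the level regression
z = lambda a: (a - a.mean(0)) / a.std(0, ddof=1)
b = fit(z(y), z(X))[1:]                   # slopes from the z-score regression

sig = X.std(0, ddof=1)
assert np.allclose(b, beta * sig / y.std(ddof=1))   # Eq.9 holds exactly
```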



Log-Linear Model

i. The log-linear equation is (all logs are natural logs):

log(y) = β0 + β1 x   (10)

ii. Both its slope and elasticity change at each point and have
the same sign as β1 .
iii. Taking the antilogarithm of Eq.10, we have

exp [log(y)] = y = exp(β0 + β1 x)   (11)

which is an exponential function. The function requires
y > 0. Fig.2 plots Eq.11.



Log-Linear Model

Figure 2: A log-linear function: y = exp(β0 + β1 x).



Log-Linear Model

iv. The slope at any point is (taking the derivative of Eq.11):

∆y/∆x = exp(β0 + β1 x) × β1 = yβ1   (12)

For β1 > 0, the marginal effect increases for larger values of
y (the function increases at an increasing rate).
v. Remember the elasticity formula (elasticity measures the
percentage change in y given a 1% increase in x)?

ε = [(∆y/y) × 100] / [(∆x/x) × 100] = (∆y/∆x) × (x/y)   (13)



Log-Linear Model

Thus for Eq.11, the elasticity is:

ε = yβ1 × (x/y) = β1 x   (14)

vi. Besides the elasticity, we can also determine the
semi-elasticity (it measures the percentage change in y
given a 1-unit increase in x). The formula is

εsemi = [(∆y/y) × 100] / ∆x = (∆y/∆x) × (1/y) × 100   (15)



Log-Linear Model

From Eq.12, the semi-elasticity of Eq.11 is:

εsemi = yβ1 × (1/y) × 100 = 100β1   (16)

Thus:

%∆y = (100β1 )∆x   (17)

vii. Notice that all the calculations above refer to Eq.11, where
y is the dependent variable. But our original model is Eq.10,
where the dependent variable is log(y).



Log-Linear Model

viii. However, we can rely on the log approximation (then
multiply by 100 to express both sides in percentages):

%∆y ≈ 100 × ∆ log(y)   (18)

This approximation only works well if %∆y is small; by
Eq.17, that depends on ∆x and β1 . If it is not small, compute
the exact percentage change with this formula:

%∆y = 100 × [exp(β1 ∆x) − 1]   (19)

So when x changes by 1, we have

%∆y = 100 × [exp(β1 ) − 1]   (20)
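Eqs.18–20 in numbers, as a quick sketch: the approximation 100β1 and the exact change 100[exp(β1 ) − 1] agree closely for a small coefficient but diverge badly for a large one (the coefficient values below are illustrative).

```python
# Log approximation (Eq.18) vs. exact percentage change (Eq.20),
# for a small and a large beta1 (illustrative values).
import math

for b1 in (0.02, 0.50):
    approx = 100 * b1                      # Eq.18 with delta x = 1
    exact = 100 * (math.exp(b1) - 1)       # Eq.20
    print(f"beta1={b1}: approx {approx:.2f}%, exact {exact:.2f}%")

# The approximation is good only when beta1 * delta x is small:
assert abs(100 * (math.exp(0.02) - 1) - 2.0) < 0.03   # 2.02% vs 2%
assert 100 * (math.exp(0.50) - 1) > 60                # 64.9% vs "50%"
```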



Log-Linear Model

ix. About logarithms: when to use them, and more:

a. When y > 0, models using log(y) as the dependent variable
often satisfy the CLM assumptions more closely than models
using the level of y. Strictly positive variables often have
conditional distributions that are heteroskedastic or skewed;
taking the log mitigates, if not eliminates, both problems.
b. Taking the log of a variable often narrows its range. This is
particularly true of variables that take large monetary values,
such as firms' annual sales or baseball players' salaries.
Population variables also tend to vary widely. Narrowing the
range of the dependent and independent variables can make
OLS less sensitive to outliers.
c. But a logarithmic transformation may create extreme values
when a variable y is between zero and one (such as a
proportion) and takes on values close to zero. In this case,
log(y) (which is necessarily negative) can be very large in
magnitude even though the original variable y is bounded
between zero and one.
Log-Linear Model

d. Some rules of thumb: when a variable is a positive dollar
amount, the log is often taken. Examples are wages, salaries,
firm sales, and firm market value. Logarithms are also often
applied to variables taking large integer values, such as
population, total number of employees, and school enrollment.
e. Another rule of thumb: variables measured in years, such as
education, experience, tenure, and age, usually appear in their
original form.
f. Yet another: variables that are proportions or percentages,
such as the unemployment rate, the participation rate in a
pension plan, or the percentage of students passing, can
appear in either level or logarithmic form, but the tendency is
toward levels because they then have a percentage-point
change interpretation.



Log-Linear Model

g. A limitation: logs cannot be used when the variable takes
zero or negative values. If y is nonnegative but can take the
value zero, log(1 + y) is sometimes used. Percentage-change
interpretations are still approximately valid except for
changes beginning at y = 0. However, log(1 + y) cannot be
normally distributed.
h. Never compare R2 between a model using y and a model
using log(y) as the dependent variable (even when everything
else is the same).



Model with Quadratics

i. Quadratics are used in economics to capture decreasing or
increasing marginal effects. In the simplest case, y depends
on a single observed factor x (notice: a single factor):

y = β0 + β1 x + β2 x2 + u   (21)

ii. β1 does not measure the partial effect of x on y, since we
cannot hold x2 fixed while changing x; the two move
together.



Model with Quadratics

iii. If we estimate Eq.21, we have:

ŷ = β̂0 + β̂1 x + β̂2 x2   (22)

with the approximation (taking the derivative of Eq.22 with
respect to x):

∆ŷ/∆x ≈ β̂1 + 2β̂2 x   (23)

In many applications β̂1 is positive and β̂2 is negative.
iv. Using Eq.23 needs some explanation:
a. At x = 0, β̂1 approximates the slope going from x = 0 to x = 1.
b. At x = 1, β̂1 + 2β̂2 approximates the slope going from x = 1
to x = 2.
c. At x = 10, β̂1 + 20β̂2 approximates the slope going from
x = 10 to x = 11.



Model with Quadratics

Example: using the wage data in WAGE1, we obtain:

ŵage = 3.73 + .298 exper − .0061 exper2
       (.35)   (.041)      (.0009)
n = 526, R2 = .093

which implies that exper has a diminishing effect on wage.
The first year of experience is worth roughly 30 cents per
hour; the second year is worth about 28.6 cents; and going
from 10 to 11 years is worth about 17.6 cents per hour.
Notice the diminishing pattern?



Model with Quadratics

v. When β̂1 > 0 and β̂2 < 0, the quadratic has a parabolic
(inverted-U) shape, as in our example.
vi. The turning point is:

x∗ = −β̂1 /(2β̂2 )   (24)

vii. In our example, the turning point is x∗ = 24.4 years: the
return to experience becomes zero at about 24.4 years.
Refer to Fig.3.
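The marginal effects in the example and the turning point in Eq.24 follow directly from the reported coefficients (.298 and −.0061); a few lines of arithmetic reproduce them:

```python
# Marginal effect of exper (Eq.23) and turning point (Eq.24), using the
# coefficients reported in the WAGE1 example above.
b1, b2 = 0.298, -0.0061

slope = lambda x: b1 + 2 * b2 * x   # Eq.23: d(wage)/d(exper) at exper = x
print(slope(0))    # ~0.298: first year worth about 30 cents/hour
print(slope(1))    # 0.2858: second year worth about 28.6 cents
print(slope(10))   # 0.176: year 10 -> 11 worth about 17.6 cents

turning_point = -b1 / (2 * b2)      # Eq.24
assert abs(turning_point - 24.4) < 0.1   # return peaks at ~24.4 years
```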



Model with Quadratics

Figure 3: Quadratic relationship between ŵage and exper.



Model with Quadratics

viii. Summary of the quadratic model:

a. β̂1 > 0 and β̂2 < 0: inverted U-shape with a maximum point.
b. β̂1 < 0 and β̂2 > 0: U-shape with a minimum point.
c. β̂1 > 0 and β̂2 > 0: no turning point for values x > 0; the
smallest expected value of y is at x = 0 (x nonnegative).
d. β̂1 < 0 and β̂2 < 0: no turning point for values x > 0; the
largest expected value of y is at x = 0 (x nonnegative).



Model with Interaction Terms

i. The purpose of this model is to allow the effect of one
explanatory variable on the dependent variable to depend
on the magnitude of another explanatory variable.
ii. For example:

price = β0 + β1 sqrft + β2 bdrms + β3 sqrft · bdrms + u   (25)

The partial effect of bdrms on price (holding all other
variables fixed) is

∆price/∆bdrms = β2 + β3 sqrft   (26)

iii. If β3 > 0 in Eq.26, an additional bedroom (∆bdrms = 1)
yields a larger increase in housing price for larger houses
(higher sqrft). There is an interaction effect between square
footage (sqrft) and number of bedrooms (bdrms).
Model with Interaction Terms

iv. What value of sqrft should we use in Eq.26? Perhaps its
mean, or the lower and upper quartiles in the sample (any
interesting value of sqrft). Whether β3 is significant is
easily tested with a t test on Eq.25; if it is significant, we
have an interaction effect.
v. Most of the time it is better to re-parameterize the model in
Eq.25. Why? Remember, β2 is the effect of bdrms on price
for a home with zero square feet! (The house does not
exist, yet it has bedrooms?)



Model with Interaction Terms

vi. We can re-parameterize Eq.25 as:

price = α0 + δ1 sqrft + δ2 bdrms
        + β3 (sqrft − µsqrft )(bdrms − µbdrms ) + u   (27)

where now δ2 is the partial effect of bdrms on price at the
mean of sqrft, i.e. δ2 = β2 + β3 µsqrft (see the Interaction
Term Proof notes for a detailed discussion).
vii. Now the coefficients have a useful interpretation. In fact,
you can center at values other than the means shown above.
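A numerical sketch of the reparameterization, on simulated data (the data-generating numbers are illustrative and centering uses sample means in place of the population means µ): fitting Eq.25 and Eq.27 shows the centered model's bdrms slope equals β̂2 + β̂3 × mean(sqrft), while the interaction coefficient is unchanged.

```python
# Check delta2 = beta2 + beta3 * mean(sqrft) by fitting Eq.25 and Eq.27
# on synthetic housing data (illustrative values; means, not population mus).
import numpy as np

rng = np.random.default_rng(2)
n = 300
sqrft = rng.uniform(800, 3500, n)
bdrms = rng.integers(1, 6, n).astype(float)
price = (20 + 0.12 * sqrft + 10 * bdrms + 0.005 * sqrft * bdrms
         + rng.normal(0, 30, n))

def fit(y, *cols):
    """OLS with intercept on the given regressor columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b = fit(price, sqrft, bdrms, sqrft * bdrms)                # Eq.25
d = fit(price, sqrft, bdrms,
        (sqrft - sqrft.mean()) * (bdrms - bdrms.mean()))   # Eq.27

assert np.isclose(d[3], b[3])                         # same beta3
assert np.isclose(d[2], b[2] + b[3] * sqrft.mean())   # delta2 identity
```

Because the centered interaction is a linear combination of the original regressors plus a constant, the two fits are exact reparameterizations of each other.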



Computing Average Partial Effects

i. Often we want a single value to describe the relationship
between the dependent variable y and each explanatory
variable.
ii. One popular measure is the average partial effect (APE),
also called the average marginal effect.



Computing Average Partial Effects

iii. Say we have a model explaining the standardized outcome
on a final exam (stndfnl) in terms of the percentage of
classes attended (atndrte), prior college grade point average
(priGPA), and ACT score:

stndfnl = β0 + β1 atndrte + β2 priGPA + β3 ACT
          + β4 priGPA2 + β5 ACT2 + β6 priGPA · atndrte + u

The partial effect of priGPA is:

∆stndfnl/∆priGPA = β2 + 2β4 priGPA + β6 atndrte   (28)

and the partial effect of atndrte is:

∆stndfnl/∆atndrte = β1 + β6 priGPA   (29)
Computing Average Partial Effects

The APE for Eq.28 is:

APEpriGPA = β̂2 + 2β̂4 priGPA + β̂6 atndrte   (30)

and for Eq.29:

APEatndrte = β̂1 + β̂6 priGPA   (31)

where priGPA and atndrte in Eqs.30 and 31 denote the
sample averages of priGPA and atndrte, respectively.
iv. Why do this? In Eq.31, for example, we need not report the
partial effect for each student in the sample; instead we
average these partial effects into a single number.
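The APE idea can be sketched in code on simulated data (the data-generating values are illustrative): compute the per-student partial effect from Eq.28, then average. Because Eq.28 is linear in the data, averaging the per-student effects equals plugging the sample means into Eq.30.

```python
# Average partial effect of priGPA (Eqs.28 and 30) on synthetic data
# (illustrative coefficients; variable names follow the slides).
import numpy as np

rng = np.random.default_rng(3)
n = 400
atndrte = rng.uniform(40, 100, n)
priGPA = rng.uniform(1.5, 4.0, n)
ACT = rng.uniform(15, 33, n)
stndfnl = (-2 + 0.01 * atndrte + 1.0 * priGPA + 0.05 * ACT
           - 0.1 * priGPA**2 - 0.0005 * ACT**2
           + 0.005 * priGPA * atndrte + rng.normal(0, 1, n))

X = np.column_stack([np.ones(n), atndrte, priGPA, ACT,
                     priGPA**2, ACT**2, priGPA * atndrte])
b = np.linalg.lstsq(X, stndfnl, rcond=None)[0]

# Per-student partial effect of priGPA (Eq.28), then averaged:
pe = b[2] + 2 * b[4] * priGPA + b[6] * atndrte
ape_priGPA = pe.mean()

# Linearity: the average of the effects equals the effect at the means (Eq.30).
assert np.isclose(ape_priGPA,
                  b[2] + 2 * b[4] * priGPA.mean() + b[6] * atndrte.mean())
```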



Goodness-of-Fit and Selection of Regressors

i. Beginning students of econometrics tend to put too much
weight on R2 .
ii. Choosing a set of explanatory variables based on the size
of R2 can lead to nonsensical models.
iii. Furthermore, in time series we might obtain an artificially
high R2 , with misleading results.
iv. The CLM assumptions impose no minimum value of R2 .
v. One thing we can say about a low R2 is that the error
variance is large relative to the variance of y, which means
we may have a hard time precisely estimating the βj . But
this can be dealt with by a larger sample size.



Adjusted R2

i. We rewrite R2 as:

R2 = 1 − (SSR/n)/(SST/n)   (32)

The only difference is that we have introduced n (the n's
cancel, leaving the standard R2 ).
ii. The population R-squared is defined as (remember the
relationship between R2 and correlation?):

ρ2 = 1 − σu2 /σy2   (33)



Adjusted R2

iii. The estimator SSR/n of σu2 in Eq.33 is biased; instead,
SSR/(n − k − 1) is the unbiased estimator. Likewise SST/n
is a biased estimator of σy2 ; the unbiased estimator is
SST/(n − 1). Using the unbiased estimators, we obtain the
adjusted R-squared (sometimes called the corrected
R-squared):

R̄2 = 1 − [SSR/(n − k − 1)]/[SST/(n − 1)]
    = 1 − [(n − 1)/(n − k − 1)] × (SSR/SST)   (34)

iv. It is tempting to say that R̄2 corrects the bias in R2 for
estimating the population R-squared, ρ2 , but it does not:
the ratio of two unbiased estimators is not an unbiased
estimator (which is what we have in Eq.34).
Adjusted R2

v. However, the primary advantage of R̄2 is that it imposes a
penalty for adding independent variables to the model.
vi. By contrast, when you add more x's, R2 never falls and
usually rises, because SSR never increases (remember?).
vii. The penalty for adding x's is the k in Eq.34. When a
variable is added, SSR falls, which raises R̄2 in Eq.34;
however, the factor (n − 1)/(n − k − 1) rises too. Whether
R̄2 increases or decreases depends on these offsetting
effects.
viii. Another interesting fact: since [(n − 1)/(n − k − 1)] > 1
(for k ≥ 1), R̄2 < R2 , always.
ix. R̄2 can be negative if the regressors, taken together, reduce
SSR too little to offset the factor (n − 1)/(n − k − 1) in
Eq.34.
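Eq.34 in code, illustrating points viii and ix with made-up SSR/SST values: for these numbers the regressors barely reduce SSR, so R2 is small but positive while the adjusted version goes negative, and R̄2 is always below R2.

```python
# Adjusted R-squared (Eq.34) vs. ordinary R-squared, on illustrative
# SSR/SST values chosen so that R2bar turns negative.
def r2(ssr, sst):
    return 1 - ssr / sst

def r2_adj(ssr, sst, n, k):
    return 1 - (ssr / (n - k - 1)) / (sst / (n - 1))   # Eq.34

ssr, sst, n, k = 98.0, 100.0, 30, 5   # regressors explain very little
assert r2(ssr, sst) > 0               # R2 = .02, small but positive
assert r2_adj(ssr, sst, n, k) < 0     # point ix: R2bar can be negative
assert r2_adj(ssr, sst, n, k) < r2(ssr, sst)   # point viii: R2bar < R2
```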
Adjusted R2

x. The adjusted R2 is also related to the t and F statistics: if
the |t| ratio of an explanatory variable is less than 1,
dropping that variable will increase the adjusted R2 (see
the Adjusted R-Squared notes for a proof).



Choosing Nonnested Models

i. The adjusted R2 , in some cases, allows us to choose a
model without redundant independent variables.
ii. Using the major league baseball salary example:

log(salary) = β0 + β1 years + β2 gamesyr + β3 bavg
              + β4 hrunsyr + u   (35)

log(salary) = β0 + β1 years + β2 gamesyr + β3 bavg
              + β4 rbisyr + u   (36)

These two equations are nonnested models, because neither
equation is a special case of the other.
iii. The F statistic, by contrast, tests nested models: one model
(the restricted model) is a special case of the other (the
unrestricted model).
Choosing Nonnested Models

iv. In your textbook, R̄2 = .6211 for Eq.35 and R̄2 = .6226 for
Eq.36. Based on this, there is a slight preference for the
model in Eq.36 over the model in Eq.35; however, the
difference is practically small.
v. Using R̄2 to compare nonnested sets of independent
variables with different functional forms is also valuable.
vi. For example, consider models relating R&D intensity to
firm sales:

rdintens = β0 + β1 log(sales) + u   (37)

rdintens = β0 + β1 sales + β2 sales2 + u   (38)

Both models capture diminishing returns. Based on R2 ,
Eq.37 gives .061 while Eq.38 gives .148. But this comparison
is unfair, since Eq.37 has fewer parameters.
Choosing Nonnested Models

vii. Based on R̄2 , however, Eq.37 gives .030 while Eq.38 gives
.090. Thus the quadratic model Eq.38 is preferable for
measuring the diminishing return in this application.
viii. Caution: we cannot use R̄2 or R2 to choose between
nonnested models with different functional forms of the
dependent variable, for example y versus log(y).
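The adjusted values in point vii can be recovered from the ordinary R2 values via R̄2 = 1 − (1 − R2)(n − 1)/(n − k − 1). The sample size n = 32 used below is an assumption (it is not stated on these slides), but it reproduces the reported .030 and .090.

```python
# Recover the adjusted R-squared comparison of Eqs.37 and 38 from the
# reported R2 values; n = 32 is an assumed sample size, not given here.
def r2_bar(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 32
r2bar_log = r2_bar(0.061, n, k=1)    # Eq.37: rdintens on log(sales)
r2bar_quad = r2_bar(0.148, n, k=2)   # Eq.38: rdintens on sales, sales^2

assert abs(r2bar_log - 0.030) < 0.001
assert abs(r2bar_quad - 0.090) < 0.002
assert r2bar_quad > r2bar_log        # the quadratic model is preferred
```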

