Multiple Regression Analysis: Further Issues


Department of Finance & Banking, University of Malaya

Multiple Regression
Analysis: Further Issues

Dr. Aidil Rizal Shahrin


[email protected]

October 10, 2020


Contents

1 Effects of Data Scaling on OLS Statistics


1.1 Standardized Coefficients

2 More on Functional Form


2.1 Logarithmic Functional Forms
2.1.1 Log-Linear Model
2.2 Model with Quadratics
2.3 Model with Interaction Terms
2.4 Computing Average Partial Effects

3 Goodness-of-Fit and Selection of Regressors


3.1 Adjusted R2
3.2 Choosing Nonnested Models

2/40 Aidil Rizal Shahrin University of Malaya Unofficial Beamer Theme


Effects of Data Scaling on OLS Statistics

i. Below is an equation relating infant birth weight in ounces
(bwght) to the number of cigarettes smoked per day (cigs)
and annual family income in thousands of dollars (faminc):

b̂wght = β̂0 + β̂1 cigs + β̂2 faminc   (1)



Effects of Data Scaling on OLS Statistics

Figure 1: Effects of Data Scaling

ii. The first column of Fig.1 reports the result of Eq.1.
Remember: bwght is in ounces, cigs is the number of
cigarettes per day, and faminc is in thousands of dollars.
Effects of Data Scaling on OLS Statistics

iii. Now change the unit of measurement of birth weight to
pounds (1 lb = 16 oz, so bwght/16 converts ounces to
pounds). Eq.1 with bwght converted to pounds is:

b̂wght/16 = β̂0 /16 + (β̂1 /16) cigs + (β̂2 /16) faminc   (2)

as reported in column 2 of Fig.1.



Effects of Data Scaling on OLS Statistics
iv. The difference between Eq.1 and Eq.2 in terms of
interpretation (one example only):

∆b̂wght = −.4634 ∆cigs   vs.   ∆b̂wghtlbs = −.0289 ∆cigs

A one-unit increase in cigs reduces b̂wght by 0.464 ounces
in the first equation, while in the second it reduces
b̂wghtlbs by 0.0289 pounds (0.0289 lbs × 16 = 0.464 oz,
the same answer).
v. How about statistical significance? Not affected. The
standard error for β̂1 in Eq.2 is also divided by 16, so the t
statistic for cigs is the same in Eq.1 and Eq.2: t = −5.058.
vi. The same goes for the CI in Eq.2: for cigs, the lower and
upper bounds are 16 times smaller than in the CI for cigs
in Eq.1 (just divide by 16).
vii. R2 is the same for Eq.1 and Eq.2; did you notice that?
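The scaling facts in points iii–vii can be sketched numerically. The snippet below uses simulated data (the variable names bwght, cigs, and faminc follow the slides, but the numbers are synthetic, not the textbook dataset): rescaling the dependent variable by 1/16 divides every coefficient by 16, divides SSR by 16² = 256, and leaves R² unchanged.

```python
# Sketch of data-scaling effects on OLS, with simulated data (assumed
# names bwght, cigs, faminc mirror the slides; values are synthetic).
import numpy as np

rng = np.random.default_rng(0)
n = 200
cigs = rng.integers(0, 20, n).astype(float)
faminc = rng.uniform(10, 60, n)
bwght = 116 - 0.46 * cigs + 0.09 * faminc + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), cigs, faminc])

def ols(y, X):
    """Return OLS coefficients, SSR, and R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    return beta, ssr, 1 - ssr / sst

b_oz, ssr_oz, r2_oz = ols(bwght, X)        # Eq.1: bwght in ounces
b_lb, ssr_lb, r2_lb = ols(bwght / 16, X)   # Eq.2: bwght in pounds

assert np.allclose(b_lb, b_oz / 16)        # every coefficient divided by 16
assert np.isclose(ssr_lb, ssr_oz / 256)    # SSR divided by 16^2
assert np.isclose(r2_lb, r2_oz)            # R^2 unchanged
```

The same logic explains point ix: SER is the square root of SSR (over fixed degrees of freedom), so it is divided by 16, not 256.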
Effects of Data Scaling on OLS Statistics

viii. Why do SSR and SER differ? Focus first on SSR. Let ûi be
the i-th residual from Eq.1. The corresponding residual
from Eq.2 is ûi /16, so the squared residual in Eq.2 is
(ûi /16)2 = û2i /256. That is why SSREq.2 = SSREq.1 /256.
ix. Since SER = σ̂ = √(SSR/(n − k − 1)) = √(SSR/1,385), we
have SEREq.2 = SEREq.1 /16.
x. Now we change the unit of the independent variable cigs to
packs, where 1 pack = 20 cigs (so packs = cigs/20). Eq.1
under this transformation is:

b̂wght = β̂0 + (20β̂1 )(cigs/20) + β̂2 faminc
       = β̂0 + (20β̂1 ) packs + β̂2 faminc   (3)

xi. The intercept and the slope coefficient on faminc are
unchanged. The results are in column 3 of Fig.1.
Effects of Data Scaling on OLS Statistics

xii. Why do we drop cigs in column 3? To avoid perfect
multicollinearity: packs is an exact linear function of cigs.
xiii. se(20β̂1 ) is 20 × se(β̂1 ) from Eq.1, so
tcigs = tpacks = −5.059; there is no effect on statistical
significance.
xiv. Do you notice that the SSR and SER for Eq.1 and Eq.3 are
the same? Why? Remember ûi = yi − ŷi : the residuals
involve only y, which is unchanged.



Standardized Coefficients

i. If the variables are measured on arbitrary scales, the best
way to interpret the regression is with standardized
coefficients.
ii. This means all variables have been standardized in the
sample by subtracting off their means and dividing by their
standard deviations.
iii. Let k = 3; then we have

yi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂3 xi3 + ûi   (4)

Taking the average of Eq.4 over the sample, we have

ȳ = β̂0 + β̂1 x̄1 + β̂2 x̄2 + β̂3 x̄3   (5)



Standardized Coefficients

iv. Eq.5 holds because the difference between ȳ and all the
terms on the RHS is the average residual, (Σni=1 ûi )/n = 0.
Now subtract Eq.5 from Eq.4 to get:

yi − ȳ = β̂1 (xi1 − x̄1 ) + β̂2 (xi2 − x̄2 ) + β̂3 (xi3 − x̄3 ) + ûi   (6)

v. Let σ̂y be the sample s.d. of y, σ̂1 the sample s.d. of x1 , and
so on. Then, with a little algebra, we have

(yi − ȳ)/σ̂y = (σ̂1 /σ̂y ) β̂1 [(xi1 − x̄1 )/σ̂1 ] + (σ̂2 /σ̂y ) β̂2 [(xi2 − x̄2 )/σ̂2 ]
            + (σ̂3 /σ̂y ) β̂3 [(xi3 − x̄3 )/σ̂3 ] + ûi /σ̂y   (7)

Each variable in Eq.7 has been standardized by replacing it
with its z-score.
Standardized Coefficients

vi. Rewriting Eq.7 and dropping the i subscript, we have:

zy = b̂1 z1 + b̂2 z2 + b̂3 z3 + error   (8)

where zy denotes the z-score of y, z1 the z-score of x1 , and
so on.
vii. The new coefficients in Eq.8,

b̂j = (σ̂j /σ̂y ) β̂j   for j = 1, 2, 3   (9)

are called standardized coefficients or beta coefficients.
viii. Interpretation of Eq.8: if x1 increases by one standard
deviation, then ŷ changes by b̂1 standard deviations.
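A small numerical check of Eq.9, on simulated data (the data-generating numbers here are illustrative assumptions): regressing the z-score of y on the z-scores of the x's reproduces b̂j = (σ̂j /σ̂y ) β̂j exactly.

```python
# Verify Eq.9 on synthetic data: z-score regression slopes equal the
# level-regression slopes rescaled by sigma_j / sigma_y.
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Three mildly correlated regressors (illustrative construction):
X = rng.normal(size=(n, 3)) @ np.array([[1, .3, 0], [0, 1, .2], [0, 0, 1.]])
y = 2 + X @ np.array([1.5, -0.7, 0.4]) + rng.normal(size=n)

def fit(y, X):
    """OLS with intercept; returns all coefficients."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

beta = fit(y, X)[1:]                      # slopes from the level regression
z = lambda a: (a - a.mean(0)) / a.std(0, ddof=1)
b = fit(z(y), z(X))[1:]                   # slopes from the z-score regression

sig = X.std(0, ddof=1)
assert np.allclose(b, beta * sig / y.std(ddof=1))   # Eq.9 holds exactly
```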



Log-Linear Model

i. The log-linear equation is (all logs are natural logs):

log(y) = β0 + β1 x   (10)

ii. Both its slope and elasticity change at each point and have
the same sign as β1 .
iii. Taking the antilogarithm of Eq.10, we have

exp [log(y)] = y = exp(β0 + β1 x)   (11)

which is an exponential function. The function requires
y > 0. Fig.2 plots Eq.11.



Log-Linear Model

Figure 2: A log-linear function: y = exp(β0 + β1 x).



Log-Linear Model

iv. The slope at any point is (taking the derivative of Eq.11):

∆y/∆x = exp(β0 + β1 x) × β1 = yβ1   (12)

For β1 > 0, the marginal effect increases for larger values of
y (the function increases at an increasing rate).
v. Remember the elasticity formula (elasticity measures the
percentage change in y given a 1% increase in x)?

ε = [(∆y/y) × 100] / [(∆x/x) × 100] = (∆y/∆x) × (x/y)   (13)



Log-Linear Model

Thus for Eq.11, the elasticity is:

ε = yβ1 × (x/y) = β1 x   (14)

vi. Besides the elasticity, we can also determine the
semi-elasticity (it measures the percentage change in y
given a 1-unit increase in x). The formula is

εsemi = [(∆y/y) × 100] / ∆x = (∆y/∆x) × (1/y) × 100   (15)



Log-Linear Model

From Eq.12, the semi-elasticity of Eq.11 is:

εsemi = yβ1 × (1/y) × 100 = 100β1   (16)

Thus:

%∆y = (100β1 )∆x   (17)

vii. Notice that all the calculations above refer to Eq.11, where
y is the dependent variable. But our original model is Eq.10,
where the dependent variable is log(y).



Log-Linear Model

viii. However, we can rely on the log approximation (then
multiply by 100 to express both sides in percentages):

%∆y ≈ 100 × ∆ log(y)   (18)

This approximation only works well if %∆y is small; by
Eq.17, that depends on ∆x and β1 . If it is not small, compute
the exact percentage change with this formula:

%∆y = 100 × [exp(β1 ∆x) − 1]   (19)

So when x changes by 1, we have

%∆y = 100 × [exp(β1 ) − 1]   (20)
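Eqs.18–20 in numbers, as a quick sketch: the approximation 100β1 and the exact change 100[exp(β1 ) − 1] agree closely for a small coefficient but diverge badly for a large one (the coefficient values below are illustrative).

```python
# Log approximation (Eq.18) vs. exact percentage change (Eq.20),
# for a small and a large beta1 (illustrative values).
import math

for b1 in (0.02, 0.50):
    approx = 100 * b1                      # Eq.18 with delta x = 1
    exact = 100 * (math.exp(b1) - 1)       # Eq.20
    print(f"beta1={b1}: approx {approx:.2f}%, exact {exact:.2f}%")

# The approximation is good only when beta1 * delta x is small:
assert abs(100 * (math.exp(0.02) - 1) - 2.0) < 0.03   # 2.02% vs 2%
assert 100 * (math.exp(0.50) - 1) > 60                # 64.9% vs "50%"
```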



Log-Linear Model

ix. About logarithms: when to use them, and more:

a. When y > 0, models using log(y) as the dependent variable
often satisfy the CLM assumptions more closely than models
using the level of y. Strictly positive variables often have
conditional distributions that are heteroskedastic or skewed;
taking the log mitigates, if not eliminates, both problems.
b. Taking the log of a variable often narrows its range. This is
particularly true of variables that take large monetary values,
such as firms' annual sales or baseball players' salaries.
Population variables also tend to vary widely. Narrowing the
range of the dependent and independent variables can make
OLS less sensitive to outliers.
c. But a logarithmic transformation may create extreme values
when a variable y is between zero and one (such as a
proportion) and takes on values close to zero. In this case,
log(y) (which is necessarily negative) can be very large in
magnitude even though the original variable y is bounded
between zero and one.
Log-Linear Model

d. Some rules of thumb: when a variable is a positive dollar
amount, the log is often taken. Examples are wages, salaries,
firm sales, and firm market value. Logarithms are also often
applied to variables taking large integer values, such as
population, total number of employees, and school enrollment.
e. Another rule of thumb: variables measured in years, such as
education, experience, tenure, and age, usually appear in their
original form.
f. Yet another: variables that are proportions or percentages,
such as the unemployment rate, the participation rate in a
pension plan, or the percentage of students passing, can
appear in either level or logarithmic form, but the tendency is
toward levels because they then have a percentage-point
change interpretation.



Log-Linear Model

g. A limitation: logs cannot be used when the variable takes
zero or negative values. If y is nonnegative but can take the
value zero, log(1 + y) is sometimes used. Percentage-change
interpretations are still approximately valid except for
changes beginning at y = 0. However, log(1 + y) cannot be
normally distributed.
h. Never compare R2 between a model using y and a model
using log(y) as the dependent variable (even when everything
else is the same).



Model with Quadratics

i. Quadratics are used in economics to capture decreasing or
increasing marginal effects. In the simplest case, y depends
on a single observed factor x (notice: a single factor):

y = β0 + β1 x + β2 x2 + u   (21)

ii. β1 does not measure the partial effect of x on y, since we
cannot hold x2 fixed while changing x; the two move
together.



Model with Quadratics

iii. If we estimate Eq.21, we have:

ŷ = β̂0 + β̂1 x + β̂2 x2   (22)

with the approximation (taking the derivative of Eq.22 with
respect to x):

∆ŷ/∆x ≈ β̂1 + 2β̂2 x   (23)

In many applications β̂1 is positive and β̂2 is negative.
iv. Using Eq.23 needs some explanation:
a. At x = 0, β̂1 approximates the slope going from x = 0 to x = 1.
b. At x = 1, β̂1 + 2β̂2 approximates the slope going from x = 1
to x = 2.
c. At x = 10, β̂1 + 20β̂2 approximates the slope going from
x = 10 to x = 11.



Model with Quadratics

Example: using the wage data in WAGE1, we obtain:

ŵage = 3.73 + .298 exper − .0061 exper2
       (.35)   (.041)      (.0009)
n = 526, R2 = .093

which implies that exper has a diminishing effect on wage.
The first year of experience is worth roughly 30 cents per
hour; the second year is worth about 28.6 cents; and going
from 10 to 11 years is worth about 17.6 cents per hour.
Notice the diminishing pattern?



Model with Quadratics

v. When β̂1 > 0 and β̂2 < 0, the quadratic has a parabolic
(inverted-U) shape, as in our example.
vi. The turning point is:

x∗ = −β̂1 /(2β̂2 )   (24)

vii. In our example, the turning point is x∗ = 24.4 years: the
return to experience becomes zero at about 24.4 years.
Refer to Fig.3.
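The marginal effects in the example and the turning point in Eq.24 follow directly from the reported coefficients (.298 and −.0061); a few lines of arithmetic reproduce them:

```python
# Marginal effect of exper (Eq.23) and turning point (Eq.24), using the
# coefficients reported in the WAGE1 example above.
b1, b2 = 0.298, -0.0061

slope = lambda x: b1 + 2 * b2 * x   # Eq.23: d(wage)/d(exper) at exper = x
print(slope(0))    # ~0.298: first year worth about 30 cents/hour
print(slope(1))    # 0.2858: second year worth about 28.6 cents
print(slope(10))   # 0.176: year 10 -> 11 worth about 17.6 cents

turning_point = -b1 / (2 * b2)      # Eq.24
assert abs(turning_point - 24.4) < 0.1   # return peaks at ~24.4 years
```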



Model with Quadratics

Figure 3: Quadratic relationship between ŵage and exper.



Model with Quadratics

viii. Summary of the quadratic model:

a. β̂1 > 0 and β̂2 < 0: inverted U-shape with a maximum point.
b. β̂1 < 0 and β̂2 > 0: U-shape with a minimum point.
c. β̂1 > 0 and β̂2 > 0: no turning point for values x > 0; the
smallest expected value of y is at x = 0 (x nonnegative).
d. β̂1 < 0 and β̂2 < 0: no turning point for values x > 0; the
largest expected value of y is at x = 0 (x nonnegative).



Model with Interaction Terms

i. The purpose of this model is to allow the effect of one
explanatory variable on the dependent variable to depend
on the magnitude of another explanatory variable.
ii. For example:

price = β0 + β1 sqrft + β2 bdrms + β3 sqrft · bdrms + u   (25)

The partial effect of bdrms on price (holding all other
variables fixed) is

∆price/∆bdrms = β2 + β3 sqrft   (26)

iii. If β3 > 0 in Eq.26, an additional bedroom (∆bdrms = 1)
yields a larger increase in housing price for larger houses
(higher sqrft). There is an interaction effect between square
footage (sqrft) and number of bedrooms (bdrms).
Model with Interaction Terms

iv. What value of sqrft should we use in Eq.26? Perhaps its
mean, or the lower and upper quartiles in the sample (any
interesting value of sqrft). Whether β3 is significant is
easily tested with a t test on Eq.25; if it is significant, we
have an interaction effect.
v. Most of the time it is better to re-parameterize the model in
Eq.25. Why? Remember, β2 is the effect of bdrms on price
for a home with zero square feet! (The house does not
exist, yet it has bedrooms?)



Model with Interaction Terms

vi. We can re-parameterize Eq.25 as:

price = α0 + δ1 sqrft + δ2 bdrms
        + β3 (sqrft − µsqrft )(bdrms − µbdrms ) + u   (27)

where now δ2 is the partial effect of bdrms on price at the
mean of sqrft, i.e. δ2 = β2 + β3 µsqrft (see the Interaction
Term Proof notes for a detailed discussion).
vii. Now the coefficients have a useful interpretation. In fact,
you can center at values other than the means shown above.
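A numerical sketch of the reparameterization, on simulated data (the data-generating numbers are illustrative and centering uses sample means in place of the population means µ): fitting Eq.25 and Eq.27 shows the centered model's bdrms slope equals β̂2 + β̂3 × mean(sqrft), while the interaction coefficient is unchanged.

```python
# Check delta2 = beta2 + beta3 * mean(sqrft) by fitting Eq.25 and Eq.27
# on synthetic housing data (illustrative values; means, not population mus).
import numpy as np

rng = np.random.default_rng(2)
n = 300
sqrft = rng.uniform(800, 3500, n)
bdrms = rng.integers(1, 6, n).astype(float)
price = (20 + 0.12 * sqrft + 10 * bdrms + 0.005 * sqrft * bdrms
         + rng.normal(0, 30, n))

def fit(y, *cols):
    """OLS with intercept on the given regressor columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b = fit(price, sqrft, bdrms, sqrft * bdrms)                # Eq.25
d = fit(price, sqrft, bdrms,
        (sqrft - sqrft.mean()) * (bdrms - bdrms.mean()))   # Eq.27

assert np.isclose(d[3], b[3])                         # same beta3
assert np.isclose(d[2], b[2] + b[3] * sqrft.mean())   # delta2 identity
```

Because the centered interaction is a linear combination of the original regressors plus a constant, the two fits are exact reparameterizations of each other.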



Computing Average Partial Effects

i. Often we want a single value to describe the relationship
between the dependent variable y and each explanatory
variable.
ii. One popular measure is the average partial effect (APE),
also called the average marginal effect.



Computing Average Partial Effects

iii. Say we have a model explaining the standardized outcome
on a final exam (stndfnl) in terms of the percentage of
classes attended (atndrte), prior college grade point average
(priGPA), and ACT score:

stndfnl = β0 + β1 atndrte + β2 priGPA + β3 ACT
          + β4 priGPA2 + β5 ACT2 + β6 priGPA · atndrte + u

The partial effect of priGPA is:

∆stndfnl/∆priGPA = β2 + 2β4 priGPA + β6 atndrte   (28)

and the partial effect of atndrte is:

∆stndfnl/∆atndrte = β1 + β6 priGPA   (29)
Computing Average Partial Effects

The APE for Eq.28 is:

APEpriGPA = β̂2 + 2β̂4 priGPA + β̂6 atndrte   (30)

and for Eq.29:

APEatndrte = β̂1 + β̂6 priGPA   (31)

where priGPA and atndrte in Eqs.30 and 31 denote the
sample averages of priGPA and atndrte, respectively.
iv. Why do this? In Eq.31, for example, we need not report the
partial effect for each student in the sample; instead we
average these partial effects into a single number.
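The APE idea can be sketched in code on simulated data (the data-generating values are illustrative): compute the per-student partial effect from Eq.28, then average. Because Eq.28 is linear in the data, averaging the per-student effects equals plugging the sample means into Eq.30.

```python
# Average partial effect of priGPA (Eqs.28 and 30) on synthetic data
# (illustrative coefficients; variable names follow the slides).
import numpy as np

rng = np.random.default_rng(3)
n = 400
atndrte = rng.uniform(40, 100, n)
priGPA = rng.uniform(1.5, 4.0, n)
ACT = rng.uniform(15, 33, n)
stndfnl = (-2 + 0.01 * atndrte + 1.0 * priGPA + 0.05 * ACT
           - 0.1 * priGPA**2 - 0.0005 * ACT**2
           + 0.005 * priGPA * atndrte + rng.normal(0, 1, n))

X = np.column_stack([np.ones(n), atndrte, priGPA, ACT,
                     priGPA**2, ACT**2, priGPA * atndrte])
b = np.linalg.lstsq(X, stndfnl, rcond=None)[0]

# Per-student partial effect of priGPA (Eq.28), then averaged:
pe = b[2] + 2 * b[4] * priGPA + b[6] * atndrte
ape_priGPA = pe.mean()

# Linearity: the average of the effects equals the effect at the means (Eq.30).
assert np.isclose(ape_priGPA,
                  b[2] + 2 * b[4] * priGPA.mean() + b[6] * atndrte.mean())
```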



Goodness-of-Fit and Selection of Regressors

i. Beginning students of econometrics tend to put too much
weight on R2 .
ii. Choosing a set of explanatory variables based on the size
of R2 can lead to nonsensical models.
iii. Furthermore, in time series we might obtain an artificially
high R2 , with misleading results.
iv. The CLM assumptions impose no minimum value of R2 .
v. One thing we can say about a low R2 is that the error
variance is large relative to the variance of y, which means
we may have a hard time precisely estimating the βj . But
this can be dealt with by a larger sample size.



Adjusted R2

i. We rewrite R2 as:

R2 = 1 − (SSR/n)/(SST/n)   (32)

The only difference is that we have introduced n (the n's
cancel, leaving the standard R2 ).
ii. The population R-squared is defined as (remember the
relationship between R2 and correlation?):

ρ2 = 1 − σu2 /σy2   (33)



Adjusted R2

iii. The estimator SSR/n of σu2 in Eq.33 is biased; instead,
SSR/(n − k − 1) is the unbiased estimator. Likewise SST/n
is a biased estimator of σy2 ; the unbiased estimator is
SST/(n − 1). Using the unbiased estimators, we obtain the
adjusted R-squared (sometimes called the corrected
R-squared):

R̄2 = 1 − [SSR/(n − k − 1)]/[SST/(n − 1)]
    = 1 − [(n − 1)/(n − k − 1)] × (SSR/SST)   (34)

iv. It is tempting to say that R̄2 corrects the bias in R2 for
estimating the population R-squared, ρ2 , but it does not:
the ratio of two unbiased estimators is not an unbiased
estimator (which is what we have in Eq.34).
Adjusted R2

v. However, the primary advantage of R̄2 is that it imposes a
penalty for adding independent variables to the model.
vi. By contrast, when you add more x's, R2 never falls and
usually rises, because SSR never increases (remember?).
vii. The penalty for adding x's is the k in Eq.34. When a
variable is added, SSR falls, which raises R̄2 in Eq.34;
however, the factor (n − 1)/(n − k − 1) rises too. Whether
R̄2 increases or decreases depends on these offsetting
effects.
viii. Another interesting fact: since [(n − 1)/(n − k − 1)] > 1
(for k ≥ 1), R̄2 < R2 , always.
ix. R̄2 can be negative if the regressors, taken together, reduce
SSR too little to offset the factor (n − 1)/(n − k − 1) in
Eq.34.
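Eq.34 in code, illustrating points viii and ix with made-up SSR/SST values: for these numbers the regressors barely reduce SSR, so R2 is small but positive while the adjusted version goes negative, and R̄2 is always below R2.

```python
# Adjusted R-squared (Eq.34) vs. ordinary R-squared, on illustrative
# SSR/SST values chosen so that R2bar turns negative.
def r2(ssr, sst):
    return 1 - ssr / sst

def r2_adj(ssr, sst, n, k):
    return 1 - (ssr / (n - k - 1)) / (sst / (n - 1))   # Eq.34

ssr, sst, n, k = 98.0, 100.0, 30, 5   # regressors explain very little
assert r2(ssr, sst) > 0               # R2 = .02, small but positive
assert r2_adj(ssr, sst, n, k) < 0     # point ix: R2bar can be negative
assert r2_adj(ssr, sst, n, k) < r2(ssr, sst)   # point viii: R2bar < R2
```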
Adjusted R2

x. The adjusted R2 is also related to the t and F statistics: if
the |t| ratio of an explanatory variable is less than 1,
dropping that variable will increase the adjusted R2 (see
the Adjusted R-Squared notes for a proof).



Choosing Nonnested Models

i. The adjusted R2 , in some cases, allows us to choose a
model without redundant independent variables.
ii. Using the major league baseball salary example:

log(salary) = β0 + β1 years + β2 gamesyr + β3 bavg
              + β4 hrunsyr + u   (35)

log(salary) = β0 + β1 years + β2 gamesyr + β3 bavg
              + β4 rbisyr + u   (36)

These two equations are nonnested models, because neither
equation is a special case of the other.
iii. The F statistic, by contrast, tests nested models: one model
(the restricted model) is a special case of the other (the
unrestricted model).
Choosing Nonnested Models

iv. In your textbook, R̄2 = .6211 for Eq.35 and R̄2 = .6226 for
Eq.36. Based on this, there is a slight preference for the
model in Eq.36 over the model in Eq.35; however, the
difference is practically small.
v. Using R̄2 to compare nonnested sets of independent
variables with different functional forms is also valuable.
vi. For example, consider models relating R&D intensity to
firm sales:

rdintens = β0 + β1 log(sales) + u   (37)

rdintens = β0 + β1 sales + β2 sales2 + u   (38)

Both models capture diminishing returns. Based on R2 ,
Eq.37 gives .061 while Eq.38 gives .148. But this comparison
is unfair, since Eq.37 has fewer parameters.
Choosing Nonnested Models

vii. Based on R̄2 , however, Eq.37 gives .030 while Eq.38 gives
.090. Thus the quadratic model Eq.38 is preferable for
measuring the diminishing return in this application.
viii. Caution: we cannot use R̄2 or R2 to choose between
nonnested models with different functional forms of the
dependent variable, for example y versus log(y).
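The adjusted values in point vii can be recovered from the ordinary R2 values via R̄2 = 1 − (1 − R2)(n − 1)/(n − k − 1). The sample size n = 32 used below is an assumption (it is not stated on these slides), but it reproduces the reported .030 and .090.

```python
# Recover the adjusted R-squared comparison of Eqs.37 and 38 from the
# reported R2 values; n = 32 is an assumed sample size, not given here.
def r2_bar(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 32
r2bar_log = r2_bar(0.061, n, k=1)    # Eq.37: rdintens on log(sales)
r2bar_quad = r2_bar(0.148, n, k=2)   # Eq.38: rdintens on sales, sales^2

assert abs(r2bar_log - 0.030) < 0.001
assert abs(r2bar_quad - 0.090) < 0.002
assert r2bar_quad > r2bar_log        # the quadratic model is preferred
```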

