Lecture 5
(Model 1)
For illustration, consider the model with one continuous predictor x, a dummy
variable z = 1{x ≥ c} for some constant c, and regression function given by

E(y) = β0 + β1 x + β2 z(x − c)
     = β0 + β1 x + β2 (x − c)+
     = β0 + β1 x,                      if x < c,
       (β0 − β2 c) + (β1 + β2)x,       if x ≥ c,

where (·)+ is the positive part function defined by x+ = max(x, 0) for all x ∈ R.
Such a model can be viewed as an MLR model with two predictors, x and
(x − c)+, and all the fitting, statistical inference, and prediction procedures
can be carried out as usual.
Unlike the interaction model discussed earlier, which gives rise to two separate
regression lines, this model gives rise to a single regression function comprising
two straight-line segments joined continuously at x = c; the join point is called
a kink.
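As a quick illustration (not from the lecture), the sketch below simulates data from Model 1 and recovers the parameters by treating x and (x − c)+ as two predictors in an ordinary least-squares fit. The sample size, the kink location c, and the true coefficients are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 200, 5.0                                    # assumed sample size and kink location
x = rng.uniform(0, 10, n)
pos = np.maximum(x - c, 0.0)                       # (x - c)_+, the positive part
y = 1 + 2 * x + 3 * pos + rng.normal(0, 0.1, n)    # assumed true beta = (1, 2, 3)

# Treat x and (x - c)_+ as two predictors in an ordinary MLR fit.
X = np.column_stack([np.ones(n), x, pos])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the positive-part term is just another column of the design matrix, every tool from the MLR chapter (t-ratios, confidence intervals, prediction) applies unchanged.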
Example 1
Suppose that we are trying to model an individual’s charitable contributions (y)
in terms of his or her wages (x).
In 2007, individuals paid 7.65% of their income for Social Security taxes up to
$97,500, but no Social Security taxes were charged on wages in excess of $97,500.
Thus, one theory is that, for wages in excess of $97,500, individuals have more
disposable income per dollar and thus should be more willing to make charitable
contributions. To capture this phenomenon, the piecewise linear regression
model can be used with c = 97,500. (The slopes of the regression line before
and after $97,500 should differ.)
Fig.: The marginal change in E(y) is lower below $97,500; the parameter β2
represents the difference in slopes.
Example 2
You are given the following model:

ln y_t = β0 + β1 ln x_t1 + β2 ln x_t2 + β3 (ln x_t1 − ln x_t0,1) D_t + β4 (ln x_t2 − ln x_t0,2) D_t + ε_t.

Fitting the model yields β̂ = (4.00, 0.60, −0.10, −0.07, −0.01)^T.
Estimate the elasticity of frequency with respect to employment level for
1992.
(Solution: in class)
(Model 2)

E(y) = β0 + β1 x + β2 z + β3 zx
     = β0 + β1 x,                    if z = 0 (i.e., x < c),
       (β0 + β2) + (β1 + β3)x,       if z = 1 (i.e., x ≥ c).
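A companion sketch for Model 2, again with assumed data and coefficients: here the dummy z and the interaction zx enter the design matrix directly, so the two fitted lines need not meet at x = c; the fitted function generally jumps by β2 + β3 c there, rather than bending at a kink.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 200, 5.0                                   # assumed sample size and cutoff
x = rng.uniform(0, 10, n)
z = (x >= c).astype(float)                        # dummy variable z = 1{x >= c}
y = 1 + 2 * x + 4 * z + 1.5 * z * x + rng.normal(0, 0.1, n)  # assumed coefficients

X = np.column_stack([np.ones(n), x, z, z * x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unlike Model 1, the two fitted segments need not join at x = c:
# the discontinuity there is b2 + b3 * c.
jump = b[2] + b[3] * c
```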
Variable selection
• In practice we have many potential predictors. Suppose we have predictors
x1, x2, . . . , xk associated with a response variable y.
• We will study the tools and techniques which help to select the variables to
enter into the linear regression model.
This iterative process provides a useful recipe for structuring the task of
specifying a model to represent a set of data.
• The second step in the iteration is based on the assumptions of the speci-
fied model. These assumptions must be consistent with the data to make
valid use of the model.
• The third step, diagnostic checking, is also known as data and model
criticism; the data and model must be consistent with one another before
additional inferences can be made. Diagnostic checking is an important
part of the model formulation; it can reveal mistakes made in previous
steps and provide ways to correct them.
- Forward selection: in the first stage, out of all the candidate variables, the
one that is most statistically significant is added to the model.
- At the next stage, with the first-stage variable already included, the next
most statistically significant variable is added.
- This procedure is repeated until all statistically significant variables have
been added.
- Backward selection starts with all variables in the model, drops the one
with the least significance, one at a time, and repeats.
i. Consider all possible regressions using one explanatory variable. For each
of the k regressions, compute t(bj), the t-ratio for the slope. Choose the
variable with the largest t-ratio. If the t-ratio does not exceed a pre-
specified t-value (e.g., two), then do not choose any variables and halt the
procedure.
ii. Add a variable to the model from the previous step. The variable to enter
is the one that makes the largest significant contribution. To determine
the size of contribution, use the absolute value of the variable’s t-ratio.
To enter, the t-ratio must exceed a specified t-value in absolute value.
iii. Delete a variable from the model fit in the previous step. The variable to be
removed is the one that makes the smallest contribution. To determine the
size of contribution, use the absolute value of the variable’s t-ratio. To be
removed, the t-ratio must be less than a specified t-value in absolute value.
iv. Repeat steps (ii) and (iii) until all possible additions and deletions are
performed.
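The forward portion of the algorithm above (steps i and ii) can be sketched as follows. The simulated data, the cutoff t-value of two, and the helper names are assumptions for illustration; the deletion step (iii) is omitted for brevity.

```python
import numpy as np

def t_ratios(X, y):
    """OLS t-ratios b_j / se(b_j) for every column of the design matrix X."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - p)
    return b / np.sqrt(s2 * np.diag(XtX_inv))

def forward_select(cands, y, t_cut=2.0):
    """Greedy forward selection (steps i-ii); stops when no |t| exceeds t_cut."""
    n = len(y)
    chosen = []
    while True:
        best, best_t = None, t_cut
        for j in range(cands.shape[1]):
            if j in chosen:
                continue
            X = np.column_stack([np.ones(n)] + [cands[:, m] for m in chosen + [j]])
            t = abs(t_ratios(X, y)[-1])       # t-ratio of the entering variable
            if t > best_t:
                best, best_t = j, t
        if best is None:                      # no candidate clears the cutoff: halt
            return chosen
        chosen.append(best)

rng = np.random.default_rng(2)
n = 100
cands = rng.normal(size=(n, 4))
y = 3 * cands[:, 0] - 2 * cands[:, 2] + rng.normal(0, 0.5, n)  # only x1, x3 matter
selected = forward_select(cands, y)
```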
Drawbacks
This algorithm is useful in that it quickly searches through a number of
candidate models. However, there are several drawbacks:
1. The procedure “snoops” through a large number of models and may fit
the data “too well”.
2. There is no guarantee that the selected model is the best. The algorithm
does not consider models that are based on nonlinear combinations of
explanatory variables. It also ignores the presence of outliers and high
leverage points.
3. In addition, the algorithm does not even search all 2^k possible linear
regressions.
4. The algorithm uses one criterion, a t-ratio, and does not consider other
criteria such as s, R^2, R_a^2, and so on.
6. By considering each variable separately, the algorithm does not take into
account the joint effect of explanatory variables.
• (7) Many statistical software routines have options for forcing variables
into a model equation.
• (5) Use a cutoff smaller than you ordinarily might.
• (3, 4) To address these two drawbacks, we introduce the best regressions
routine. Suppose that there are four possible explanatory variables, x1, x2,
x3, and x4, and the user would like to know the best two-variable model.
The best regressions algorithm searches over all six models of the form
E(y) = β0 + β1 xi + β2 xj. Typically, a best regressions routine recommends
one or two models for each p-coefficient model, where p is user specified.
Because the number of coefficients to enter the model has been specified, it
does not matter which of the criteria we use: R^2, R_a^2, or s.
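A minimal sketch of the best regressions idea for the four-predictor case above: enumerate all six two-variable models and rank them by R^2. The simulated data and helper names are assumptions for the example.

```python
from itertools import combinations
import numpy as np

def r2(X, y):
    """Coefficient of determination for an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(3)
n = 100
cands = rng.normal(size=(n, 4))
y = 3 * cands[:, 0] - 2 * cands[:, 2] + rng.normal(0, 0.5, n)

# All six two-variable models E(y) = b0 + b1*x_i + b2*x_j.
best_pair = max(
    combinations(range(4), 2),
    key=lambda pair: r2(np.column_stack([np.ones(n), cands[:, list(pair)]]), y),
)
```

Because every compared model has the same number of coefficients, ranking by R^2, R_a^2, or s yields the same winner, as the text notes.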
Residuals
• If the model is an adequate representation of the data, then the residuals
should closely approximate the random errors.
There are at least four types of patterns that can be uncovered through the
residual analysis.
When examining residuals, it is usually easier to work with a standardized resid-
ual, a residual that has been rescaled to be dimensionless. There are a number
of ways to define a standardized residual. Using ei = yi − ŷi as the ith residual,
here are three commonly used definitions:

(a) ei / s,   (b) ei / (s √(1 − hii)),   and   (c) ei / (s(i) √(1 − hii)),

where s = sqrt( (1/(n − (k+1))) Σ_{i=1}^n ei^2 ) and hii = xi^T (X^T X)^{−1} xi.
Similarly, define s(i) to be the residual standard deviation when running a
regression after having deleted the ith observation.
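The three definitions can be computed as follows; rather than running n separate deleted regressions for s(i), the sketch uses the standard identity (n−k−2) s(i)^2 = (n−k−1) s^2 − ei^2/(1−hii). The data and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)   # assumed true model

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
h = np.diag(H)                             # leverages h_ii
e = y - H @ y                              # residuals e_i = y_i - yhat_i
s = np.sqrt(e @ e / (n - (k + 1)))

std_a = e / s                              # definition (a)
std_b = e / (s * np.sqrt(1 - h))           # definition (b)

# Definition (c): replace s by s_(i), via the deletion identity above,
# which avoids refitting the model n times.
s_i = np.sqrt(((n - k - 1) * s**2 - e**2 / (1 - h)) / (n - k - 2))
std_c = e / (s_i * np.sqrt(1 - h))
```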
Hat Matrix
In Lecture 3, we had b = β̂ = (X^T X)^{−1} X^T Y, so that Ŷ = Xb = X(X^T X)^{−1} X^T Y.
This equation suggests defining H = X(X^T X)^{−1} X^T, so that Ŷ = HY. We
can show that,
• H^T = H
• HH = H
• (I − H)(I − H) = I − H
From the ith row of the equation Ŷ = HY we have,
ŷi = hi1 y1 + hi2 y2 + · · · + hii yi + · · · + hin yn.
Here, hij is the number in the ith row and the jth column of the matrix H.
Thus, the greater hii is, the greater is the effect that the ith response (yi) has
on the corresponding fitted value (ŷi). We therefore call hii the leverage of
the ith observation. Since hii is the ith diagonal element of H, a direct
expression for hii is hii = xi^T (X^T X)^{−1} xi.
Options for Handling Outliers
• Delete the outlier with care. If the outlier is not due to a recording error
or a misspecification of the model, then it may be that the outlying obser-
vation simply exists by chance and is unrepresentative of the population
from which the observed data are drawn. Including the observation in the
analysis may unduly skew the model results without bringing much addi-
tional information.
• Retain the outlier in the model analysis, but make a note of its potential
effects.
Example-Stock Market Liquidity
An investor’s decision to purchase a stock is generally made with a number of
criteria in mind.
iii. Third, many investors are concerned with the length of time that they are
committing their capital with the purchase of a security.
iv. Fourth, investors are concerned with the ability to sell the stock at any
time convenient to the investor. We refer to this fourth criterion as the
liquidity of the stock.
To measure the liquidity, in this study, we use the number of shares traded
on an exchange over a specified period of time (called the VOLUME). We are
interested in studying the relationship between the volume and other financial
characteristics of a stock.
To begin to understand the liquidity measure VOLUME, we first fit a regression
model using NTRAN as an explanatory variable. The fitted regression model
has R^2 = 83.4% and s = 4.35. Note that the t-ratio for the slope associated
with NTRAN is t(b1) = b1/se(b1) = 0.00183/0.000074 = 24.7, indicating strong
statistical significance.
- The correlation between the residuals and AVGT, together with the scatter
plot (not given here), indicates that there may be some information in the
variable AVGT remaining in the residuals.
- Thus, it seems sensible to use AVGT directly in the regression model.
Remember that we are interpreting the residual as the value of VOLUME
having controlled for the effect of NTRAN.
The following table provides correlations between the model residuals and
other potential explanatory variables and indicates that there does not seem to
be much additional information in the explanatory variables. This is reaffirmed
by the corresponding table of scatter plots.
Influential Points
-A point that has a significant impact on the results of a model.
-Can drastically affect the model’s fit or results.
- Can greatly change the regression line or model parameters.
-Identified by influence measures such as leverage and Cook’s Distance.
Leverage
The hat matrix H is a positive semidefinite matrix. Large leverage values indicate
that an observation may exhibit a disproportionate effect on the fit, essentially
because it is distant from the other observations (when looking at the space of
explanatory variables). How large is large? Some guidelines are available from
matrix algebra, where we have that
1/n ≤ hii ≤ 1   and   h̄ = (1/n) Σ_{i=1}^n hii = (k+1)/n.

We use a widely adopted convention and declare an observation to be a
high leverage point if the leverage exceeds three times the average, that is, if
hii > 3(k + 1)/n.
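The rule above can be sketched directly: compute the leverages from the hat matrix and flag observations with hii > 3(k+1)/n. The data, including one deliberately distant point, are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 2
x = rng.normal(size=(n, k))
x[0] = [8.0, -8.0]                        # one point far from the others
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                            # leverages h_ii

cutoff = 3 * (k + 1) / n                  # three times the average leverage
high_leverage = np.where(h > cutoff)[0]   # indices of high leverage points
```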
• Include the observation in the summary statistics but comment on its ef-
fect. For example, an observation may barely exceed a cutoff and its effect
may not be important in the overall analysis.
• Delete the observation from the dataset. Again, the basic rationale for do-
ing so is that the observation is deemed not representative of some larger
population. An intermediate course of action between these two options is
to present the analysis both with and without the high leverage point. In
this way, the impact of the point is fully demonstrated, and the reader of
your analysis may decide which option is more appropriate.
Cook’s Distance
Cook’s Distance is a measure that considers both the response and the
explanatory variables. This distance, Di, is defined as
Di = Σ_{j=1}^n (ŷj − ŷj(i))^2 / ((k+1) s^2) = (ei / se(ei))^2 · hii / ((k+1)(1 − hii))
• ŷj is the fitted value of yj computed when the model is fitted to the whole
dataset; ŷj(i) is the prediction of the jth observation, computed leaving
the ith observation out of the regression fit. To measure the impact of
the ith observation, we compare the fitted values with and without the ith
observation. Each difference is then squared and summed over all obser-
vations to summarize the impact.
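Cook's distance can be computed from a single fit using the second form of the formula, with se(ei) = s√(1 − hii). The simulated data with one deliberately influential point are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 30, 1
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(0, 0.5, n)     # assumed true line
x[0], y[0] = 4.0, -5.0                    # make observation 0 influential

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - (k + 1))

# D_i = (e_i / se(e_i))^2 * h_ii / ((k+1)(1 - h_ii)), with se(e_i) = s*sqrt(1 - h_ii);
# this needs only the full-data fit, no n deleted regressions.
D = (e / np.sqrt(s2 * (1 - h)))**2 * h / ((k + 1) * (1 - h))
```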
Example 3: Outliers and High Leverage Points
• Observations: n = 20 points
– Points B and C are high leverage points. Note that their leverage
values are the same. This is because the values of the explanatory
variables for B and C are the same, and only the response variable
has been changed.