Lecture 5

University of Windsor, Winter 2025

Actuarial Regression and Time Series


ACSC-4200 / ACSC-8200

Dr. Poonam S. Malakar

Piecewise Linear Regression Model


In some applications, we expect the response to have some abrupt changes in
behaviour at certain values of an explanatory variable, even if the variable is
continuous.

(Model 1)
For illustration, consider the model with one continuous predictor x, a dummy
variable z = 1{x ≥ c} for some constant c, and regression function given by

E(y) = β0 + β1 x + β2 z(x − c)
     = β0 + β1 x + β2 (x − c)+
     = { β0 + β1 x,                    if x < c,
       { (β0 − β2 c) + (β1 + β2) x,    if x ≥ c,

where (·)+ is the positive part function defined by x+ = max(x, 0) for all x ∈ R.

The slope of the regression function changes abruptly from β1 to β1 + β2
beyond x = c, but the regression function itself remains continuous (because
(·)+ is a continuous function).

Such a model can be viewed as an MLR model with two predictors, x and
(x − c)+, and all the fitting, statistical inference, and prediction procedures can
be carried out as usual.

Unlike the interaction model discussed earlier, which gives rise to two separate
regression lines, this model produces a single regression function comprising two
straight lines connected continuously at x = c; the point of connection is called a kink.
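
Because the model is an ordinary MLR in x and (x − c)+, it can be fit with
standard regression software. Below is a minimal sketch in Python (numpy and
statsmodels) using simulated data with an assumed kink at c = 5; all names and
values are illustrative, not part of the lecture examples.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    c = 5.0                                   # assumed (known) kink location
    x = rng.uniform(0, 10, 200)
    y = 1.0 + 0.5 * x + 2.0 * np.maximum(x - c, 0.0) + rng.normal(0, 1, 200)

    # Design matrix: intercept, x, and the hinge term (x - c)+
    X = sm.add_constant(np.column_stack([x, np.maximum(x - c, 0.0)]))
    fit = sm.OLS(y, X).fit()
    print(fit.params)     # estimates of beta0, beta1, beta2
    # Fitted slope is beta1 for x < c and beta1 + beta2 for x >= c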

Example 1
Suppose that we are trying to model an individual’s charitable contributions (y)
in terms of his or her wages (x).
In 2007, individuals paid 7.65% of their wages in Social Security taxes up to
$97,500, but no Social Security taxes were charged on wages in excess of $97,500.
Thus, one theory is that, for wages in excess of $97,500, individuals have more
disposable income per dollar and thus should be more willing to make charitable
contributions. To capture this phenomenon, the piecewise linear regression
model can be used with c = 97,500. (The slopes of the regression line before and
after $97,500 should differ.)

The regression function is Ey = β0 + β1 x + β2 z(x − 97,500). This can be
written as

Ey = { β0 + β1 x,                              x < 97,500,
     { β0 − β2 (97,500) + (β1 + β2) x,         x ≥ 97,500.
The following figure illustrates this relationship, known as piecewise linear re-
gression or sometimes a “broken-stick” model. The sharp break in this figure
at x = 97,500 is called a kink. We have linear relationships above and below
the kink and have used a binary variable to put the two pieces together. We
are not restricted to one kink.

Fig. The marginal change in Ey is lower below $97,500; the parameter β2
represents the difference in slopes.

Example 2
You are given the following model:
ln yt = β0 + β1 ln xt1 + β2 ln xt2 + β3 (ln xt1 − ln xt0,1) Dt + β4 (ln xt2 − ln xt0,2) Dt + εt

where

• t indexes the years 1979-93 and t0 is 1990.

• y is a measure of workers’ compensation frequency.

• x1 is a measure of employment level.

• x2 is a measure of the unemployment rate.

• Dt is 0 for t ≤ t0 and 1 for t > t0 .

 
Fitting the model yields β̂ = (4.00, 0.60, −0.10, −0.07, −0.01)ᵀ.
Estimate the elasticity of frequency with respect to employment level for
1992.
(Solution: in class; a sketch of the computation follows.)
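
Sketch (assuming β̂ lists the coefficients in the order β0, β1, β2, β3, β4): in a
log-log model, the elasticity of frequency with respect to employment level is
the partial derivative

∂ ln yt / ∂ ln xt1 = β1 + β3 Dt.

For 1992 we have t > t0 = 1990, so Dt = 1, and the estimated elasticity is
β̂1 + β̂3 = 0.60 − 0.07 = 0.53.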

(Model 2)

The regression function in a piecewise linear regression model need not be
continuous. A regression function with jumps will result from “interacting” a
continuous predictor x with the dummy variable z = 1{x ≥ c} defined above. The
regression function is

E(y) = β0 + β1 x + β2 z + β3 z x
     = { β0 + β1 x,                   if z = 0 (i.e., x < c),
       { (β0 + β2) + (β1 + β3) x,     if z = 1 (i.e., x ≥ c),

where the two straight lines generally do not connect at x = c.

Fig. Piecewise linear model corresponding to Model 2.

(Study the example from the text, page 100.)
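
A matching sketch for Model 2 (again in Python with simulated, illustrative
data): interacting x with z produces both a jump (β2) and a slope change (β3)
at c.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    c = 5.0
    x = rng.uniform(0, 10, 200)
    z = (x >= c).astype(float)            # dummy z = 1{x >= c}
    # Simulate a jump of +3 and a slope change of +1 beyond c
    y = 1.0 + 0.5 * x + 3.0 * z + 1.0 * z * x + rng.normal(0, 1, 200)

    X = sm.add_constant(np.column_stack([x, z, z * x]))
    fit = sm.OLS(y, X).fit()
    print(fit.params)    # beta0, beta1, beta2 (jump at c), beta3 (slope change)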

Variable Selection
• In practice we have many potential predictors. Suppose we have predictors
x1, x2, . . . , xk associated with a response variable y.

• Which of these k variables should we use to construct our model?

• How do we decide on the best regression model?

• We will study tools and techniques that help select the variables to enter
into the linear regression model.

An Iterative Approach to Data Analysis and Modelling

This iterative process provides a useful recipe for structuring the task of
specifying a model to represent a set of data.

• The first step, the model formulation stage, is accomplished by examining
the data graphically and using prior knowledge of relationships, such as
from economic theory or standard industry practice.

• The second step in the iteration is based on the assumptions of the speci-
fied model. These assumptions must be consistent with the data to make
valid use of the model.

• The third step, diagnostic checking, is also known as data and model crit-
icism; the data and model must be consistent with one another before
additional inferences can be made. Diagnostic checking is an important
part of the model formulation; it can reveal mistakes made in previous
steps and provide ways to correct them.

Automatic Variable Selection Procedures


Business and economics relationships are complicated. There are typically many
variables that can serve as useful predictors of the dependent variable. In
searching for a suitable relationship, there is a large number of potential models
based on linear combinations of explanatory variables, and an infinite number
that can be formed from nonlinear combinations. To search among models based
on linear combinations, several automatic procedures are available to select the
variables to be included in the model.

• Stepwise regression refers to procedures that employ t-tests to check the
“significance” of explanatory variables entered into, or deleted from, the
model.

• Forward selection version of stepwise

- In the forward selection version of stepwise regression, variables are added
one at a time.

- In the first stage, out of all the candidate variables, the one that is most
statistically significant is added to the model.

- At the next stage, with the first-stage variable already included, the next
most statistically significant variable is added.

6
- This procedure is repeated until all statistically significant variables have
been added.

- Here, statistical significance is typically assessed using a variable’s t-ratio;
the cutoff for statistical significance is typically a predetermined t-value
(e.g., two, corresponding to an approximate 5% significance level).

• Backward elimination version of stepwise

- The backward elimination version works in a similar manner.

- It starts with all variables in the model and drops the least significant
variable, one at a time, repeating until only significant variables remain.

- More generally, an algorithm that adds and deletes variables at each
stage is sometimes known as the stepwise regression algorithm.

Stepwise Regression Algorithm


Suppose that the analyst has identified one variable as the response, y, and k
potential explanatory variables, x1, x2, . . . , xk.

i. Consider all possible regressions using one explanatory variable. For each
of the k regressions, compute t(bj), the t-ratio for the slope. Choose the
variable with the largest t-ratio. If the t-ratio does not exceed a pre-
specified t-value (e.g., two), then do not choose any variables and halt the
procedure.

ii. Add a variable to the model from the previous step. The variable to enter
is the one that makes the largest significant contribution. To determine
the size of contribution, use the absolute value of the variable’s t-ratio.
To enter, the t-ratio must exceed a specified t-value in absolute value.

iii. Delete a variable from the model of the previous step. The variable to be
removed is the one that makes the smallest contribution. To determine the
size of the contribution, use the absolute value of the variable’s t-ratio. To be
removed, the t-ratio must be less than a specified t-value in absolute value.

iv. Repeat steps (ii) and (iii) until all possible additions and deletions are
performed.
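
A minimal sketch of this loop in Python (pandas and statsmodels); the function
name, the single cutoff, and the data layout are illustrative assumptions, and
real implementations typically use a stricter entry than removal threshold to
avoid cycling.

    import pandas as pd
    import statsmodels.api as sm

    def stepwise(y, X, t_cut=2.0):
        # Greedy add/drop of columns of the DataFrame X by absolute t-ratio.
        chosen = []
        while True:
            changed = False
            # Step (ii): add the candidate with the largest |t|, if above t_cut
            best, best_t = None, t_cut
            for col in X.columns.difference(chosen):
                fit = sm.OLS(y, sm.add_constant(X[chosen + [col]])).fit()
                if abs(fit.tvalues[col]) > best_t:
                    best, best_t = col, abs(fit.tvalues[col])
            if best is not None:
                chosen.append(best)
                changed = True
            # Step (iii): drop the included variable with the smallest |t|, if below t_cut
            if chosen:
                fit = sm.OLS(y, sm.add_constant(X[chosen])).fit()
                tvals = fit.tvalues.drop("const").abs()
                if tvals.min() < t_cut:
                    chosen.remove(tvals.idxmin())
                    changed = True
            if not changed:   # Step (iv): stop when no addition or deletion occurs
                return chosen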

Drawbacks
This algorithm is useful in that it quickly searches through a number of candidate
models. However, there are several drawbacks:
1. The procedure “snoops” through a large number of models and may fit
the data “too well”.

2. There is no guarantee that the selected model is the best. The algorithm
does not consider models that are based on nonlinear combinations of
explanatory variables. It also ignores the presence of outliers and high
leverage points.

3. In addition, the algorithm does not even search all 2^k possible linear
regressions.

4. The algorithm uses one criterion, a t-ratio, and does not consider other
criteria such as s, R2, Ra2, and so on.

5. There is a sequence of significance tests involved. Thus, the significance
level that determines the t-value is not meaningful.

6. By considering each variable separately, the algorithm does not take into
account the joint effect of explanatory variables.

7. Purely automatic procedures may not take into account an investigator’s
special knowledge.

Responses to Drawbacks (in reverse order)


Many of the criticisms of the basic stepwise regression algorithm can be ad-
dressed with modern computing software.

• (7) Many statistical software routines have options for forcing variables into
a model equation.

• (6) Combinations of variables that are jointly important may not be detected
with the forward algorithm but will be detected with the backward elimination
algorithm. Because the backward procedure starts with all variables, it will
detect, and retain, variables that are jointly important.

• (5) Use a significance cutoff smaller than you ordinarily might.

• (3, 4) To address these two drawbacks, we introduce the best regressions rou-
tine. Suppose that there are four possible explanatory variables, x1, x2, x3,
and x4, and the user would like to know the best two-variable model. The
best regressions algorithm searches over all six models of the form
Ey = β0 + β1 xi + β2 xj. Typically, a best regressions routine recommends
one or two models for each number of coefficients p, where p is user specified.
Because the number of coefficients entering the model is fixed, it does not
matter which of the criteria we use: R2, Ra2, or s.

Stepwise regression and best regressions are examples of automatic variable
selection procedures.

Residuals
• If the model is an adequate representation of the data, then the residuals
should closely approximate the random errors.

• Random errors represent natural variation in data.

• There should be no discernible pattern in residuals.

• Patterns in the residuals indicate the presence of additional information
that we hope to incorporate into the model. A lack of pattern in the
residuals indicates that the model seems to account for the primary rela-
tionships in the data.

There are at least four types of patterns that can be uncovered through
residual analysis:

i. residuals that are unusual;
ii. residuals that are related to other explanatory variables;
iii. residuals that display a heteroscedastic pattern (having different variances);
iv. residuals that display patterns through time.

When examining residuals, it is usually easier to work with a standardized resid-
ual, a residual that has been rescaled to be dimensionless. There are a number
of ways to define a standardized residual. Using ei = yi − ŷi as the ith residual,
here are three commonly used definitions:

(a) ei / s,    (b) ei / (s √(1 − hii)),    (c) ei / (s(i) √(1 − hii)),

where s = √[ (1/(n − (k+1))) Σ_{i=1}^n ei² ] and hii = xiᵀ(XᵀX)⁻¹xi. Similarly,
define s(i) to be the residual standard deviation when running the regression
after having deleted the ith observation.
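
These quantities are built into standard software. A minimal sketch in Python
(statsmodels, simulated data; all names and values are illustrative) that checks
definition (b) against the package’s internally studentized residuals:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(50, 2)))
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)

    fit = sm.OLS(y, X).fit()
    infl = fit.get_influence()
    h = infl.hat_matrix_diag                 # leverages h_ii
    s = np.sqrt(fit.mse_resid)               # s, with n - (k+1) in the denominator
    b = fit.resid / (s * np.sqrt(1 - h))     # definition (b)
    print(np.allclose(b, infl.resid_studentized_internal))   # True
    c = infl.resid_studentized_external      # definition (c), based on s_(i)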

Hat Matrix
In Lecture 3, we had b = β̂ = (XᵀX)⁻¹XᵀY, so that Ŷ = Xb = X(XᵀX)⁻¹XᵀY.
This equation suggests defining H = X(XᵀX)⁻¹Xᵀ, so that Ŷ = HY. We can
show that

• Hᵀ = H
• HH = H
• (I − H)(I − H) = I − H

From the ith row of the equation Ŷ = HY we have

ŷi = hi1 y1 + hi2 y2 + · · · + hii yi + · · · + hin yn,

where hij is the entry in the ith row and jth column of the matrix H. Thus,
the greater hii is, the greater is the effect that the ith response yi has on the
corresponding fitted value ŷi, and so we call hii the leverage of the ith
observation. Since hii is the ith diagonal element of H, a direct expression is

hii = xiᵀ(XᵀX)⁻¹xi,

where xi = (1, xi1, . . . , xik)ᵀ.

Using Residuals to Identify Outliers


An important role of residual analysis is to identify outliers. An outlier is an
observation that is not well fit by the model; these are observations where the
residual is unusually large. A rule of thumb used by many statistical packages is
that an observation is marked as an outlier if the standardized residual exceeds
two in absolute value.

Options for Handling Outliers
• Delete the outlier with care. If the outlier is not due to a recording error
or a misspecification of the model, then it may be that the outlying obser-
vation simply exists by chance and is unrepresentative of the population
from which the observed data are drawn. Including the observation in the
analysis may unduly skew the model results without bringing much addi-
tional information.

• Retain the outlier in the model analysis, but make a note of its potential
effects.

• An intermediate solution between the first two options is to present the
analysis of the model both with and without the outlying observation. This
helps emphasize the impact of the observation.

Using Residuals to Select Explanatory Variables


• Another important role of residual analysis is to help identify additional
explanatory variables that may be used to improve the formulation of the
model.

• It is a good idea to reinforce the correlation between the residuals and a
candidate variable with a scatter plot. A plot of residuals versus an
explanatory variable not only reinforces the correlation statistic graphically
but also serves to detect potential nonlinear relationships.

• For example, a quadratic relationship can be detected using a scatter plot,
not a correlation statistic. To summarize, after a preliminary model fit,
you should do the following (sketched in code below):

– Calculate summary statistics and display the distribution of (standardized)
residuals to identify outliers.

– Calculate the correlation between the (standardized) residuals and
additional explanatory variables to search for linear relationships.

– Create scatter plots between the (standardized) residuals and additional
explanatory variables to search for nonlinear relationships.
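
A small self-contained sketch of this workflow in Python (simulated data; all
variable names are illustrative). The preliminary fit deliberately omits a
quadratic term so the residual plot, but not the correlation, reveals it:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n = 100
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1 + 2 * x1 + 0.5 * x2**2 + rng.normal(size=n)   # x2 enters quadratically

    fit = sm.OLS(y, sm.add_constant(x1)).fit()          # preliminary model: x1 only
    r = fit.get_influence().resid_studentized_internal  # standardized residuals

    print(pd.Series(r).describe())    # summary statistics; |r| > 2 suggests outliers

    candidates = pd.DataFrame({"x2": x2})
    for col in candidates:
        print(col, np.corrcoef(r, candidates[col])[0, 1])  # near zero: misses x2^2
        plt.scatter(candidates[col], r)                    # but the plot shows a U-shape
        plt.xlabel(col)
        plt.ylabel("standardized residual")
        plt.show()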

Example: Stock Market Liquidity
An investor’s decision to purchase a stock is generally made with a number of
criteria in mind.

i. First, investors usually look for a high expected return.

ii. A second criterion is the riskiness of a stock, which can be measured
through the variability of the returns.

iii. Third, many investors are concerned with the length of time that they are
committing their capital with the purchase of a security.

iv. Fourth, investors are concerned with the ability to sell the stock at any
time convenient to the investor. We refer to this fourth criterion as the
liquidity of the stock.

To measure the liquidity, in this study, we use the number of shares traded
on an exchange over a specified period of time (called the VOLUME). We are
interested in studying the relationship between the volume and other financial
characteristics of a stock.

For the trading activity variables, we examine

• The three-month total trading volume (VOLUME, in millions of shares)

• The three-month total number of transactions (NTRAN)

• The average time between transactions (AVGT, measured in minutes)


For the firm size variables, we use:

• The opening stock price on January 2, 1985 (PRICE)

• The number of outstanding shares on December 31, 1984 (SHARE, in
millions of shares)

• The market equity value (VALUE, in billions of dollars), obtained by
taking the product of PRICE and SHARE.

To begin to understand the liquidity measure VOLUME, we first fit a regression
model using NTRAN as an explanatory variable. The fitted regression model is

VOLUME = 1.65 + 0.00183 NTRAN

with R2 = 83.4% and s = 4.35. Note that the t-ratio for the slope associated
with NTRAN is t(b1) = b1 / se(b1) = 0.00183 / 0.000074 = 24.7, indicating strong
statistical significance.

Residuals were computed using this estimated model. To determine whether
the residuals are related to the other explanatory variables, the following table
shows correlations.

First table of correlations

- The correlation between the residual and AVGT, together with the scatter
plot (not given here), indicates that there may be some information in the
residual that can be explained by the variable AVGT.

- Thus, it seems sensible to use AVGT directly in the regression model. Re-
member that we are interpreting the residual as the value of VOLUME having
controlled for the effect of NTRAN.

We next fit a regression model using NTRAN and AVGT as explanatory
variables. The fitted regression model is

VOLUME = 4.41 − 0.322 AVGT + 0.00167 NTRAN

with R2 = 84.2% and s = 4.26. Given the t-ratio t(bAVGT) = (−0.322)/0.135 =
−2.39, it seems that AVGT is a useful explanatory variable in the model. Note
also that s has decreased, indicating that Ra2 has increased.

The following table provides correlations between the model residuals and
other potential explanatory variables and indicates that there does not seem to
be much additional information in the explanatory variables. This is reaffirmed
by the corresponding table of scatter plots.

Second table of correlations

Influential Points

• A point that has a significant impact on the results of a model.
• Can drastically affect the model’s fit or results.
• Can greatly change the regression line or model parameters.
• Identified by influence measures such as leverage and Cook’s distance.

Leverage
The hat matrix H is a positive semidefinite matrix. Large leverage values indicate
that an observation may exhibit a disproportionate effect on the fit, essentially
because it is distant from the other observations (when looking at the space of
explanatory variables). How large is large? Some guidelines are available from
matrix algebra, where we have that

1/n ≤ hii ≤ 1  and  h̄ = (1/n) Σ_{i=1}^n hii = (k+1)/n.

We use a widely adopted convention and declare an observation to be a
high leverage point if its leverage exceeds three times the average, that is, if
hii > 3(k+1)/n.
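
A minimal sketch of this check in Python (statsmodels, simulated data; values
are illustrative):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(20, 1)))    # n = 20, k = 1
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=20)

    fit = sm.OLS(y, X).fit()
    h = fit.get_influence().hat_matrix_diag
    n, p = X.shape                        # p = k + 1
    print(h.mean(), p / n)                # average leverage equals (k+1)/n
    print(np.where(h > 3 * p / n)[0])     # indices flagged as high leverage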

Options for Handling High Leverage Points

• Include the observation in the summary statistics but comment on its ef-
fect. For example, an observation may barely exceed a cutoff and its effect
may not be important in the overall analysis.

• Delete the observation from the dataset. Again, the basic rationale for do-
ing so is that the observation is deemed not representative of some larger
population. An intermediate course of action between the first two options
is to present the analysis both with and without the high leverage point. In
this way, the impact of the point is fully demonstrated, and the reader of
your analysis may decide which option is more appropriate.

• Choose another variable to represent the information. In some instances,
another explanatory variable will be available to serve as a replacement.
For example, in an apartment rents example, we could use the number of
bedrooms to replace a square footage variable as a measure of apartment
size. Although an apartment’s square footage may be unusually large,
thus causing it to be a high leverage point, it may have one, two, or three
bedrooms, depending on the sample examined.
• Use a nonlinear transformation of an explanatory variable.

Cook’s Distance
Cook’s distance is a measure that considers both the response and the explana-
tory variables. This distance, Di, is defined as

Di = [ Σ_{j=1}^n (ŷj − ŷj(i))² ] / ((k+1) s²) = ( ei / se(ei) )² × hii / ((k+1)(1 − hii)),

where se(ei) = s √(1 − hii).

• ŷj is the jth fitted value computed when the model is fit to the whole
dataset, and ŷj(i) is the prediction of the jth observation, computed leaving
the ith observation out of the regression fit. To measure the impact of
the ith observation, we compare the fitted values with and without the ith
observation. Each difference is then squared and summed over all obser-
vations to summarize the impact.

• The second equation provides another interpretation: the distance Di is
composed of a measure for outliers times a measure for leverage. In this
way, Cook’s distance accounts for both the response and the explanatory
variables.
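
A minimal sketch verifying the two forms agree, in Python (statsmodels,
simulated data; everything here is illustrative):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(20, 1)))
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=20)

    infl = sm.OLS(y, X).fit().get_influence()
    d, _ = infl.cooks_distance                    # statsmodels returns (D_i, p-values)
    # Check against the second formula: an outlier part times a leverage part
    h = infl.hat_matrix_diag
    r = infl.resid_studentized_internal           # e_i / (s * sqrt(1 - h_ii))
    p = X.shape[1]                                # p = k + 1
    print(np.allclose(d, r**2 * h / (p * (1 - h))))   # True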

Example 3: Outliers and High Leverage Points

To judge the size of the leverages:

• n = 20 observations

• The leverages are bounded by 0.05 and 1.00 (1/n ≤ hii ≤ 1)

• The average leverage is h̄ = (k+1)/n = 2/20 = 0.10

• Using 0.3 (= 3 × h̄) as a cutoff:

– Points B and C are high leverage points. Note that their leverage values
are the same; this is because the values of the explanatory variables for B
and C are the same and only the response variable has been changed.

– Although the typical value of Di is about 1/n = 0.05, Cook’s distance
provides one statistic to alert us to the fact that each point is unusual in
one respect or another.

– Point C has a very large Di, reflecting the fact that it is both an
outlier and a high leverage point.

– The 95th percentile of an F-distribution with df1 = 2 and df2 = 18 is
3.555. The fact that point C has a value of Di well exceeding this
cutoff indicates the substantial influence of this point.
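
This percentile is easy to verify with software; for instance, in Python with
scipy (an illustrative check, not part of the example):

    from scipy.stats import f
    print(f.ppf(0.95, dfn=2, dfd=18))   # approximately 3.555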
