
Chapter 5

Transformation and Weighting to Correct Model Inadequacies

Graphical methods help in detecting violations of the basic assumptions in regression analysis. We now consider methods and procedures for building models through data transformation when some of these assumptions are violated.

Variance-stabilizing transformations


In regression analysis, it is assumed that the variance of the disturbances is constant, i.e., $\text{Var}(\varepsilon_i) = \sigma^2$, $i = 1, 2, \ldots, n$. Suppose this assumption is violated. A common reason for such a violation is that the study variable follows a probability distribution in which the variance is functionally related to the mean.

For example, if the study variable $y$ in a simple linear regression model is a Poisson random variable, then its variance equals its mean. Since the mean of $y$ is related to the explanatory variable $x$, the variance of $y$ will be proportional to $x$. In such cases, variance-stabilizing transformations are useful.

As another example, if $y$ is a proportion, i.e., $0 \le y_i \le 1$, then the variance of $y$ is proportional to $E(y)[1 - E(y)]$, and again a variance-stabilizing transformation is useful.

Some commonly used variance-stabilizing transformations, in increasing order of their strength, are as follows:

Relation of $\sigma^2$ to $E(y)$          Transformation
$\sigma^2 \propto$ constant               $y^* = y$ (no transformation)
$\sigma^2 \propto E(y)$                   $y^* = \sqrt{y}$ (Poisson data)
$\sigma^2 \propto E(y)[1 - E(y)]$         $y^* = \sin^{-1}(\sqrt{y})$ (binomial proportions, $0 \le y_i \le 1$)
$\sigma^2 \propto [E(y)]^2$               $y^* = \ln(y)$
$\sigma^2 \propto [E(y)]^3$               $y^* = y^{-1/2}$
$\sigma^2 \propto [E(y)]^4$               $y^* = y^{-1}$

After making a suitable transformation, use $y^*$ as the study variable in the respective case.
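As a quick illustration, the following is a minimal Python sketch (assuming NumPy; the data vectors are hypothetical) of applying these transformations:

```python
import numpy as np

y = np.array([1.2, 3.5, 0.8, 7.1, 2.4])   # hypothetical positive responses

y_sqrt  = np.sqrt(y)       # sigma^2 proportional to E(y): Poisson-type counts
y_log   = np.log(y)        # sigma^2 proportional to [E(y)]^2
y_isqrt = y**(-0.5)        # sigma^2 proportional to [E(y)]^3
y_recip = 1.0 / y          # sigma^2 proportional to [E(y)]^4

p = np.array([0.05, 0.40, 0.75, 0.98])     # hypothetical binomial proportions
p_star = np.arcsin(np.sqrt(p))             # sigma^2 proportional to E(y)[1 - E(y)]
```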
The strength of a transformation depends on the amount of curvature in the relationship between the study variable and the explanatory variable. The transformations listed here range from relatively mild to relatively strong: the square root transformation is relatively mild, while the reciprocal transformation is relatively strong.

In general, a mild transformation is applied when the minimum and maximum values of $y$ do not differ much (e.g., $y_{\max}/y_{\min} < 2$ or $3$); such a transformation has little effect on the curvature. On the other hand, when the minimum and maximum differ substantially, a strong transformation is needed, and it will have a substantial impact on the analysis.

In the presence of non-constant variance, the OLSE remains unbiased but loses the minimum variance property.

When the study variable has been transformed to $y^*$, the predicted values are on the transformed scale. It is often necessary to convert the predicted values back to the original units of $y$.

When the inverse transformation is applied directly to the predicted values, it gives an estimate of the median of the distribution of the study variable rather than the mean. So one needs to be careful when doing so.

Confidence intervals and prediction intervals may be directly converted from one metric to the other, because the interval endpoints are percentiles of a distribution and percentiles are unaffected by a monotone transformation. Note, however, that the resulting intervals may not remain the shortest possible intervals.

Transformations to linearize the model


The basic assumption in linear regression analysis is that the relationship between the study variable and the explanatory variables is linear. Suppose this assumption is violated. Such a violation can be checked by the scatter plot matrix, scatter diagrams, partial regression plots, lack-of-fit tests, etc.

In some cases, a nonlinear model can be linearized by using a suitable transformation. Such nonlinear models are called intrinsically or transformably linear. The advantage of transforming a nonlinear function into a linear one is that the statistical tools are developed for the case of a linear regression model. For example, exact hypothesis tests, confidence interval estimation, etc., are developed for the linear regression model. Once the nonlinear function is transformed to a linear function, all such tools can be readily applied, and there is no need to develop them separately.

Some linearizable functions are as follows:


1. If the curve between $y$ and $x$ has the shape shown in the corresponding figure (omitted here), then the possible linearizable function is of the form

   $y = \beta_0 x^{\beta_1}.$

   Using the transformation $y^* = \ln y$, $x^* = \ln x$, i.e., by taking logarithms on both sides, the model becomes

   $\ln y = \ln \beta_0 + \beta_1 \ln x$

   or $y^* = \beta_0^* + \beta_1 x^*$

   where $\beta_0^* = \ln \beta_0$, and the model becomes a linear model. Note that the parameter $\beta_0$ changes to $\ln \beta_0$ in the transformed model.

2. If the curve between $y$ and $x$ has the shape shown in the corresponding figure (omitted here), then the possible linearizable function is of the form

   $y = \beta_0 \exp(\beta_1 x).$

   Taking $\log_e$ (ln) on both sides,

   $\ln y = \ln \beta_0 + \beta_1 x$

   or $y^* = \beta_0^* + \beta_1 x$

   where $y^* = \ln y$ and $\beta_0^* = \ln \beta_0$. So $y^* = \ln y$ is the transformation needed in this case. The intercept term $\beta_0$ becomes $\ln \beta_0$ in the transformed model.

3. If the curve between $y$ and $x$ has the shape shown in the corresponding figure (omitted here), then the possible linearizable function is of the form

   $y = \beta_0 + \beta_1 \ln x,$

   which can be written as

   $y = \beta_0 + \beta_1 x^*$

   using the transformation $x^* = \ln x$.

4. If the curve between $y$ and $x$ has the shape shown in the corresponding figure (omitted here), then the possible linearizable function is of the form

   $y = \dfrac{x}{\beta_0 x - \beta_1},$

   which can be written as

   $\dfrac{1}{y} = \beta_0 - \beta_1 \dfrac{1}{x}$

   or $y^* = \beta_0 - \beta_1 x^*,$

   which becomes a linear model using the transformation $y^* = \dfrac{1}{y}$, $x^* = \dfrac{1}{x}$.
- Based on the observed behaviour of the plots, one can choose an appropriate curve and use the linearized form of the function.
- When such transformations are used, the form of the error term $\varepsilon$ often changes as well. For example, in the case of

  $y = \beta_0 \exp(\beta_1 x)\,\varepsilon,$

  $\ln y = \ln \beta_0 + \beta_1 x + \ln \varepsilon$

  or $y^* = \beta_0^* + \beta_1 x + \varepsilon^*.$

  This implies that the multiplicative error in the original model must be log-normally distributed for the errors in the transformed model to be normal. Many times we ignore this aspect and continue to assume that the random errors are still normally distributed; in such cases, the residuals from the transformed model should be checked for the validity of the assumptions.
- When such transformations are used, the OLSE has the desired properties with respect to the transformed data, not the original data.
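To make the first case above concrete, here is a minimal sketch (assuming NumPy) of fitting the power model $y = \beta_0 x^{\beta_1} \varepsilon$ by regressing $\ln y$ on $\ln x$; the data are simulated so the recovered estimates can be checked against the true values:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 20.0, 100)
eps = np.exp(rng.normal(0.0, 0.2, x.size))   # multiplicative log-normal error
y = 2.5 * x**0.7 * eps                       # y = beta0 * x^beta1 * eps

# Linearize: ln y = ln(beta0) + beta1 * ln x + ln(eps)
X = np.column_stack([np.ones(x.size), np.log(x)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
beta0_hat = np.exp(coef[0])                  # back-transform the intercept
beta1_hat = coef[1]
print(beta0_hat, beta1_hat)                  # close to 2.5 and 0.7
```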

Analytical methods for selecting a transformation of the study variable


The Box-Cox method
Suppose the normality and/or constant variance of the study variable $y$ can be corrected through a power transformation on $y$. This means $y$ is to be transformed as $y^\lambda$, where $\lambda$ is a parameter to be determined. For example, if $\lambda = 0.5$, then the transformation is the square root, and $\sqrt{y}$ is used as the study variable in place of $y$.

Now the linear regression model has parameters $\beta$, $\sigma^2$ and $\lambda$. The Box-Cox method tells how to estimate $\lambda$ and the other parameters of the model simultaneously using the method of maximum likelihood.

Note that as $\lambda$ approaches zero, $y^\lambda$ approaches 1. So there is a problem at $\lambda = 0$, because this makes all the observations on $y$ equal to unity, and it is meaningless for all observations on the study variable to be constant. So there is a discontinuity at $\lambda = 0$. One approach to resolve this difficulty is to use $(y^\lambda - 1)/\lambda$ as the study variable. Note that as $\lambda \to 0$, $(y^\lambda - 1)/\lambda \to \ln y$. So a possible solution is to use the transformed study variable

$W = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \text{for } \lambda \neq 0 \\ \ln y & \text{for } \lambda = 0. \end{cases}$

So the family $W$ is continuous in $\lambda$. Still, it has a drawback: as $\lambda$ changes, the values of $W$ change dramatically, which makes it difficult to obtain the best value of $\lambda$. If different analysts obtain different values of $\lambda$, they will fit different models, and it may then not be appropriate to compare models with different values of $\lambda$. So it is preferable to use the alternative form

$y^{(\lambda)} = V = \begin{cases} \dfrac{y^\lambda - 1}{\lambda\, y_*^{\lambda - 1}} & \text{for } \lambda \neq 0 \\ y_* \ln y & \text{for } \lambda = 0 \end{cases}$

where $y_*$ is the geometric mean of the $y_i$'s, $y_* = (y_1 y_2 \cdots y_n)^{1/n}$, which is constant. For calculation purposes, we can use

$\ln y_* = \dfrac{1}{n} \sum_{i=1}^{n} \ln y_i.$

When $V$ is applied to each $y_i$, we get $V = (V_1, V_2, \ldots, V_n)'$ as the vector of observations on the transformed study variable, and we use it to fit the linear model

$V = X\beta + \varepsilon$

using the least squares or maximum likelihood method.
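A minimal sketch of this scaled transformation, assuming NumPy (the helper name `boxcox_v` is hypothetical):

```python
import numpy as np

def boxcox_v(y, lam):
    """Scaled Box-Cox variable V = y^(lambda) for positive y."""
    y = np.asarray(y, dtype=float)
    gm = np.exp(np.mean(np.log(y)))      # geometric mean via ln y_* = (1/n) sum ln y_i
    if abs(lam) < 1e-8:
        return gm * np.log(y)            # limiting case lambda = 0
    return (y**lam - 1.0) / (lam * gm**(lam - 1.0))

v = boxcox_v([1.0, 2.0, 4.0], 0.5)       # V is then regressed on X as usual
```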

The quantity $y_*^{\lambda - 1}$ in the denominator is related to the Jacobian of the transformation: its $n$th power equals the Jacobian of the full transformation. To see how, we want to convert $y_i$ into $y_i^{(\lambda)}$ as

$y_i^{(\lambda)} = W_i = \dfrac{y_i^\lambda - 1}{\lambda}, \qquad \lambda \neq 0.$

Let $y = (y_1, y_2, \ldots, y_n)'$ and $W = (W_1, W_2, \ldots, W_n)'$.

Note that if $W_1 = \dfrac{y_1^\lambda - 1}{\lambda}$, then

$\dfrac{\partial W_1}{\partial y_1} = y_1^{\lambda - 1}, \qquad \dfrac{\partial W_1}{\partial y_2} = 0.$

In general,

$\dfrac{\partial W_i}{\partial y_j} = \begin{cases} y_i^{\lambda - 1} & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$

For a single observation, the Jacobian of the transformation is

$J(y_i \to W_i) = \dfrac{\partial y_i}{\partial W_i} = \dfrac{1}{\partial W_i / \partial y_i} = \dfrac{1}{y_i^{\lambda - 1}}.$

W1 W1 W1

y1 y2 yn
y1 1 0 0  0
W2 W2 W2  1
 0 y2 0  0
J (W  y )  y1 y2 yn 
    
     1
0 0 0  yn
Wn Wn Wn

y1 y2 yn
n
  yi 1
i 1
 1
 n 
   yi 
 i 1 
 1
 
1  1 
J(y W)   n  .
J (W  Y )  
  yi 
 i 1 
This is the Jacobian for transforming the whole vector $y$ into the whole vector $W$. If an individual $y_i$ is to be transformed into $W_i$, we take the corresponding geometric-mean factor:

$J(y_i \to W_i) = \left[ \left( \prod_{i=1}^{n} y_i \right)^{1/n} \right]^{-(\lambda - 1)} = \dfrac{1}{y_*^{\lambda - 1}}.$
The quantity $J(y \to W) = \dfrac{1}{\prod_{i=1}^{n} y_i^{\lambda - 1}}$ ensures that unit volume is preserved in moving from the set of the $y_i$ to the set of the $V_i$. This is a scaling factor which ensures that the residual sums of squares obtained for different values of $\lambda$ can be compared.
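The diagonal structure of this Jacobian can be checked numerically; a small sketch, assuming NumPy, with arbitrary sample values:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.uniform(0.5, 4.0, 5)
lam = 0.5

# Analytic Jacobian determinant of y -> W: product of y_i^(lambda - 1).
analytic = np.prod(y**(lam - 1.0))

# Numerical check: W_i depends only on y_i, so the Jacobian matrix is diagonal.
h = 1e-6
W = lambda v: (v**lam - 1.0) / lam
diag = (W(y + h) - W(y - h)) / (2.0 * h)   # central finite differences
numeric = np.prod(diag)                    # determinant of a diagonal matrix

print(analytic, numeric)                   # agree to high precision
```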

To find the appropriate value of $\lambda$, consider

$y^{(\lambda)} = V = X\beta + \varepsilon$

where $y^{(\lambda)} = \dfrac{y^\lambda - 1}{\lambda\, y_*^{\lambda - 1}}$ and $\varepsilon \sim N(0, \sigma^2 I)$.
Applying the method of maximum likelihood, the likelihood function for $y^{(\lambda)}$ is

$L\left( y^{(\lambda)} \right) = \left( \dfrac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\dfrac{\sum_{i=1}^{n} \varepsilon_i^2}{2\sigma^2} \right) = \left( \dfrac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\dfrac{\varepsilon'\varepsilon}{2\sigma^2} \right) = \left( \dfrac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\dfrac{\left( y^{(\lambda)} - X\beta \right)'\left( y^{(\lambda)} - X\beta \right)}{2\sigma^2} \right)$

so that, ignoring the constant,

$\ln L\left( y^{(\lambda)} \right) = -\dfrac{n}{2} \ln \sigma^2 - \dfrac{\left( y^{(\lambda)} - X\beta \right)'\left( y^{(\lambda)} - X\beta \right)}{2\sigma^2}.$
Solving

$\dfrac{\partial \ln L\left( y^{(\lambda)} \right)}{\partial \beta} = 0, \qquad \dfrac{\partial \ln L\left( y^{(\lambda)} \right)}{\partial \sigma^2} = 0$

gives the maximum likelihood estimators

$\hat\beta(\lambda) = (X'X)^{-1} X' y^{(\lambda)},$

$\hat\sigma^2(\lambda) = \dfrac{1}{n}\, y^{(\lambda)\prime} \left[ I - X(X'X)^{-1}X' \right] y^{(\lambda)} = \dfrac{y^{(\lambda)\prime} H y^{(\lambda)}}{n}, \qquad H = I - X(X'X)^{-1}X',$

for a given value of $\lambda$.

Substituting these estimates into the log-likelihood function $\ln L\left( y^{(\lambda)} \right)$ gives, up to a constant,

$L(\lambda) = -\dfrac{n}{2} \ln \hat\sigma^2(\lambda) = -\dfrac{n}{2} \ln\left[ SS_{res}(\lambda) \right]$

where $SS_{res}(\lambda)$ is the residual sum of squares, which is a function of $\lambda$. Now maximize $L(\lambda)$ with respect to $\lambda$. It is difficult to obtain any closed form of the estimator of $\lambda$, so we maximize it numerically.

The function $-\dfrac{n}{2} \ln\left[ SS_{res}(\lambda) \right]$ is called the Box-Cox objective function.

Let max be the value of  which minimizes the Box-Cox objective function. Then under fairly general

conditions, for any other 


n ln  SS r e s ( )   n ln  SS r e s (max ) 

has approximately  2 (1) distribution. This result is based on the large sample behaviour of the likelihood
ratio statistic. This is explained as follows:

The likelihood ratio test statistic in our case is

$\Lambda_n = \dfrac{\max_{\omega} L}{\max_{\Omega} L} = \dfrac{\left( \dfrac{1}{\hat\sigma^2(\lambda)} \right)^{n/2}}{\left( \dfrac{1}{\hat\sigma^2(\lambda_{\max})} \right)^{n/2}} = \left( \dfrac{1/SS_{res}(\lambda)}{1/SS_{res}(\lambda_{\max})} \right)^{n/2} = \left( \dfrac{SS_{res}(\lambda_{\max})}{SS_{res}(\lambda)} \right)^{n/2}$

(the restricted and unrestricted maxima, respectively). Then

$\ln \Lambda_n = \dfrac{n}{2} \ln\left( \dfrac{SS_{res}(\lambda_{\max})}{SS_{res}(\lambda)} \right) = -\dfrac{n}{2} \ln\left[ SS_{res}(\lambda) \right] + \dfrac{n}{2} \ln\left[ SS_{res}(\lambda_{\max}) \right] = L(\lambda) - L(\lambda_{\max})$

where

$L(\lambda) = -\dfrac{n}{2} \ln\left[ SS_{res}(\lambda) \right], \qquad L(\lambda_{\max}) = -\dfrac{n}{2} \ln\left[ SS_{res}(\lambda_{\max}) \right].$

Since, under certain regularity conditions, $-2 \ln \Lambda_n$ converges in distribution to $\chi^2(1)$ when the null hypothesis is true, we have

$-2 \ln \Lambda_n \sim \chi^2(1)$

or $-\ln \Lambda_n \sim \dfrac{\chi^2(1)}{2}$

or $L(\lambda_{\max}) - L(\lambda) \sim \dfrac{\chi^2(1)}{2}.$

Computational procedure
The maximum likelihood estimate of $\lambda$ corresponds to the value of $\lambda$ for which the residual sum of squares from the fitted model, $SS_{res}(\lambda)$, is minimum. To determine such a $\lambda$, we proceed computationally as follows (a sketch in code follows this list):

- Fit $y^{(\lambda)}$ for various values of $\lambda$. For example, start with values in $(-1, 1)$, then take values in $(-2, 2)$, and so on. About 15 to 20 values of $\lambda$ are expected to be sufficient for estimating the optimum value.
- Plot $SS_{res}(\lambda)$ versus $\lambda$.
- From the graph, find the value of $\lambda$ which minimizes $SS_{res}(\lambda)$.
- A second iteration can be performed using a finer mesh of values if desired.
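A minimal sketch of this grid search, assuming NumPy; the data are simulated so that the true $\lambda$ is near 0.5:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = (1.0 + 0.5 * x + rng.normal(0.0, 0.1, n))**2   # true lambda near 0.5

def ss_res(lam):
    # Scaled transform V so that SS_res is comparable across lambda values.
    gm = np.exp(np.mean(np.log(y)))
    if abs(lam) < 1e-8:
        v = gm * np.log(y)
    else:
        v = (y**lam - 1.0) / (lam * gm**(lam - 1.0))
    resid = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return resid @ resid

# Coarse mesh first, then a finer mesh around the coarse minimizer.
coarse = np.linspace(-2.0, 2.0, 17)
lam0 = coarse[np.argmin([ss_res(l) for l in coarse])]
fine = np.linspace(lam0 - 0.25, lam0 + 0.25, 21)
lam_hat = fine[np.argmin([ss_res(l) for l in fine])]
print(lam_hat)   # should be near 0.5
```

In practice one would also plot $SS_{res}(\lambda)$ versus $\lambda$ and then settle on a simple nearby value of $\lambda$, as noted next.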

Note that the value of $\lambda$ cannot be selected by directly comparing the residual sums of squares from the regressions of $y^\lambda$ on $x$, because for each $\lambda$ the residual sum of squares is measured on a different scale.

It is better to use simple values of $\lambda$. For example, the practical difference between $\lambda = 0.5$ and $\lambda = 0.58$ is likely to be small, but $\lambda = 0.5$ is much easier to interpret.

Once $\lambda$ is selected, use

- $y^\lambda$ as the study variable if $\lambda \neq 0$,
- $\ln y$ as the study variable if $\lambda = 0$.

It is entirely acceptable to use $y^{(\lambda)}$ as the response in the final model. This model will have a scale difference and an origin shift in comparison with the model using $y^\lambda$ (or $\ln y$) as the response.

An approximate confidence interval for $\lambda$
We can find an approximate confidence interval for the transformation parameter $\lambda$. This interval helps in selecting the final value of $\lambda$. For example, suppose $\hat\lambda = 0.58$ is the value of $\lambda$ minimizing the residual sum of squares. If $\lambda = 0.5$ lies in the confidence interval, then one may use the square root transformation because it is easier to explain. Furthermore, if $\lambda = 1$ lies in the confidence interval, then it may be concluded that no transformation is necessary.

In applying the method of maximum likelihood to the regression model, we are essentially maximizing

$L(\lambda) = -\dfrac{n}{2} \ln\left[ SS_{res}(\lambda) \right]$

or, equivalently, minimizing $SS_{res}(\lambda)$.

An approximate $100(1 - \alpha)\%$ confidence interval for $\lambda$ consists of those values of $\lambda$ that satisfy

$L(\hat\lambda) - L(\lambda) \le \dfrac{\chi^2_\alpha(1)}{2}$

where $\chi^2_\alpha(1)$ is the upper $\alpha$ percentage point of the chi-square distribution with one degree of freedom.

The approximate confidence interval is constructed using the following steps:

- Draw a plot of $L(\lambda)$ versus $\lambda$.
- Draw a horizontal line at the height

  $L(\hat\lambda) - \dfrac{\chi^2_\alpha(1)}{2}$

  on the vertical scale.
- This line cuts the curve of $L(\lambda)$ at two points.
- The locations of these two points on the $\lambda$-axis define the two endpoints of the approximate confidence interval.
- If the residual sum of squares is minimized instead and $SS_{res}(\lambda)$ versus $\lambda$ is plotted, then the line must be drawn at the height

  $SS^* = SS_{res}(\hat\lambda)\, \exp\left( \dfrac{\chi^2_\alpha(1)}{n} \right)$

where $\hat\lambda$ is the value of $\lambda$ which minimizes the residual sum of squares. To see why:

 2 (1) n  2 (1)
L(ˆ )     ln  SSr e s (ˆ )   
2 2 2
n  (1) 
 
2
  ln SS Re s (ˆ )   
2 n 
n    2 (1)  
2 
 
  ln SS r e s (ˆ )  ln exp    
  n  
n   ˆ  2 (1)  
  ln  SSr e s ( ).exp   
2    n  
n
  ln SS * .
2
Using the expansion of the exponential function,

$\exp(t) = 1 + t + \dfrac{t^2}{2!} + \cdots \approx 1 + t,$

we can approximate $\exp\left( \dfrac{\chi^2_\alpha(1)}{n} \right)$ by $1 + \dfrac{\chi^2_\alpha(1)}{n}$. So, in place of $\exp\left( \dfrac{\chi^2_\alpha(1)}{n} \right)$ in the confidence interval procedure, we can use any of the following:

$1 + \dfrac{Z^2_{\alpha/2}}{\nu} \quad \text{or} \quad 1 + \dfrac{Z^2_{\alpha/2}}{n},$

$1 + \dfrac{t^2_{\alpha/2,\,\nu}}{\nu} \quad \text{or} \quad 1 + \dfrac{t^2_{\alpha/2,\,\nu}}{n},$

$1 + \dfrac{\chi^2_\alpha(1)}{\nu} \quad \text{or} \quad 1 + \dfrac{\chi^2_\alpha(1)}{n},$

where $\nu$ is the degrees of freedom associated with the residual sum of squares. These expressions are based on the fact that $\chi^2_\alpha(1) = Z^2_{\alpha/2} \approx t^2_{\alpha/2,\,\nu}$ unless $\nu$ is small. It is debatable whether to use $\nu$ or $n$, but practically the difference between the resulting confidence intervals is very little.
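A minimal sketch of the $SS^*$ cutoff method, assuming NumPy and SciPy (it uses the exact exponential factor rather than the $1 + t$ approximation; the data are simulated as in the earlier sketch):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = (1.0 + 0.5 * x + rng.normal(0.0, 0.1, n))**2

def ss_res(lam):
    gm = np.exp(np.mean(np.log(y)))
    if abs(lam) < 1e-8:
        v = gm * np.log(y)
    else:
        v = (y**lam - 1.0) / (lam * gm**(lam - 1.0))
    resid = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return resid @ resid

grid = np.linspace(-1.0, 2.0, 301)
ss = np.array([ss_res(l) for l in grid])
lam_hat = grid[np.argmin(ss)]

# Horizontal cut at SS* = SS_res(lam_hat) * exp(chi2_alpha(1)/n); the lambdas
# with SS_res(lambda) <= SS* form the approximate 95% confidence interval.
ss_star = ss.min() * np.exp(chi2.ppf(0.95, df=1) / n)
inside = grid[ss <= ss_star]
print(lam_hat, inside.min(), inside.max())
```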

The Box-Cox transformation was originally introduced to reduce non-normality in the data; it also helps in reducing nonlinearity. The approach is to find a transformation that attempts to reduce the residuals associated with outliers and also to reduce the problem of non-constant error variance, provided there was no acute nonlinearity to begin with.

Transformation on explanatory variables: the Box-Tidwell procedure
Suppose the relationship between $y$ and one or more of the explanatory variables is nonlinear, while the other usual assumptions (normally and independently distributed study variable with constant variance) are at least approximately satisfied.

We want to select an appropriate transformation on the explanatory variables so that the relationship between $y$ and the transformed explanatory variables is as simple as possible.

Box and Tidwell describe a general analytical procedure for determining the form of the transformation on $x$.

Suppose the study variable $y$ is related to powers of the explanatory variables. The Box-Tidwell procedure chooses the transformed explanatory variables as

$z_{ij} = \begin{cases} \dfrac{x_{ij}^{\alpha_j} - 1}{\alpha_j} & \text{when } \alpha_j \neq 0 \\ \ln x_{ij} & \text{when } \alpha_j = 0, \end{cases} \qquad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, k.$

We need to estimate the $\alpha_j$'s. Since the dependent variable is not being transformed, we need not worry about changes of scale, and we minimize

$\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 z_{i1} - \cdots - \beta_k z_{ik} \right)^2$

using nonlinear least-squares techniques.

We consider this for the simple linear regression model instead of a nonlinear regression model.

Assume $y$ is related to $\xi = x^\alpha$ as

$E(y) = f(\xi, \beta_0, \beta_1) = \beta_0 + \beta_1 \xi$

where

$\xi = \begin{cases} x^\alpha & \text{if } \alpha \neq 0 \\ \ln x & \text{if } \alpha = 0 \end{cases}$

and $\beta_0$, $\beta_1$ and $\alpha$ are the unknown parameters.

Suppose $\alpha_0$ is the initial guess of the constant $\alpha$. Usually, the first guess is $\alpha_0 = 1$, so that $\xi = x$ and no transformation is applied in the first iteration.
Expanding about the initial guess in a Taylor series and ignoring terms of order higher than one gives

$E(y) = f(\alpha_0, \beta_0, \beta_1) + (\alpha - \alpha_0) \left[ \dfrac{df(\xi, \beta_0, \beta_1)}{d\alpha} \right]_{\alpha = \alpha_0} = \beta_0 + \beta_1 x + (\alpha - 1) \left[ \dfrac{df(\xi, \beta_0, \beta_1)}{d\alpha} \right]_{\alpha = \alpha_0 = 1}.$

If the term $\left[ \dfrac{df(\xi, \beta_0, \beta_1)}{d\alpha} \right]_{\alpha = \alpha_0}$ were known, it could be treated just like an additional explanatory variable, and the parameters $\beta_0$, $\beta_1$ and $\alpha$ could then be estimated by the least-squares method.

The estimate of $\alpha$ so obtained can be considered an improved estimate of the transformation parameter.

The derivative term can be written as

$\left[ \dfrac{df(\xi, \beta_0, \beta_1)}{d\alpha} \right]_{\alpha = \alpha_0} = \left[ \dfrac{df(\xi, \beta_0, \beta_1)}{d\xi} \right]_{\alpha = \alpha_0} \left[ \dfrac{d\xi}{d\alpha} \right]_{\alpha = \alpha_0}.$

Since the form of the transformation is known, i.e., $\xi = x^\alpha$, we have

$\dfrac{d\xi}{d\alpha} = x^\alpha \ln x.$

Furthermore, at $\alpha_0 = 1$ (so that $\xi = x$),

$\left[ \dfrac{df(\xi, \beta_0, \beta_1)}{d\xi} \right]_{\alpha = \alpha_0} = \dfrac{d(\beta_0 + \beta_1 x)}{dx} = \beta_1.$

So $\beta_1$ can be estimated by fitting the model

$\hat{y} = \hat\beta_0 + \hat\beta_1 x$

by the least-squares method. Then an "adjustment" to the initial guess $\alpha_0 = 1$ is computed by defining a second regression variable

$w = x \ln x$

and estimating the parameters in

$E(y) = \beta_0^* + \beta_1^* x + (\alpha - 1)\beta_1 w = \beta_0^* + \beta_1^* x + \gamma w$

by least squares.

This gives the following:

$\hat{y} = \hat\beta_0^* + \hat\beta_1^* x + \hat\gamma w$

$\hat\gamma = (\hat\alpha - 1)\hat\beta_1$

or

$\hat\alpha_1 = \dfrac{\hat\gamma}{\hat\beta_1} + 1$

as the revised estimate of $\alpha$.

Note that $\hat\beta_1$ is obtained from $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, while $\hat\gamma$ is obtained from $\hat{y} = \hat\beta_0^* + \hat\beta_1^* x + \hat\gamma w$. Generally, $\hat\beta_1$ and $\hat\beta_1^*$ will differ.

This procedure may be repeated, using a new regressor $x^* = x^{\alpha_1}$ in the calculations. It generally converges rapidly, and usually the first-stage result $\alpha_1$ is a satisfactory estimate of $\alpha$. Round-off error is a potential problem: if enough decimal places are not carried, the successive values of $\alpha$ may oscillate badly. If the standard deviation of the errors ($\sigma$) is large, or if the range of the explanatory variable is very small relative to its mean, the estimator may face convergence problems; this situation implies that the data do not support the need for any transformation. A sketch of the iteration follows.
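A minimal sketch of the full iteration, assuming NumPy; the data are simulated from $E(y) = 2 + 3\sqrt{x}$, so the procedure should converge to $\alpha \approx 0.5$. Repeating the stage with the new regressor $x^{\alpha_1}$ amounts to composing each stage's adjustment multiplicatively into the overall exponent:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(1.0, 10.0, n)
y = 2.0 + 3.0 * np.sqrt(x) + rng.normal(0.0, 0.1, n)   # true alpha = 0.5

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

alpha = 1.0                                   # initial guess: no transformation
for _ in range(10):
    z = x**alpha                              # current regressor x^alpha
    b1 = ols(np.column_stack([np.ones(n), z]), y)[1]       # fit y on z: beta1-hat
    w = z * np.log(z)                                      # second regressor w = z ln z
    g = ols(np.column_stack([np.ones(n), z, w]), y)[2]     # gamma-hat
    delta = g / b1 + 1.0                      # revised exponent on z
    alpha *= delta                            # compose into the overall exponent on x
    if abs(delta - 1.0) < 1e-6:               # stop when the adjustment is negligible
        break

print(alpha)                                  # close to 0.5
```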

