Heteroscedastic Regression Models
8, 2023-08
Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
[email protected]
doi: 10.13140/RG.2.2.31538.58562
Abstract
One of the most important tools for data analysis is statistical regression. This technique
consists of identifying the best parameters of a given mathematical model describing a
particular set of experimental observations. The method implicitly assumes that the model
error has a constant variance (homoscedasticity) over the whole range of observations.
However, this is not always the case, and neglecting the changing variance
(heteroscedasticity) leads to inadequate or incomplete models. In this report, a method is
proposed for describing the heteroscedastic behavior of the regression model residuals. The
method uses weighted least squares minimization to fit the confidence intervals of the
regression from a model of the standard error. The weights used are related to the confidence
level considered. In addition, a test of heteroscedasticity is proposed, based on the coefficient
of variation of the model of standard error obtained by optimization. Several practical
examples are presented to illustrate the proposed method.
Keywords
1. Introduction
Regression analysis allows us to obtain the best possible parameter values of a pre-defined
mathematical model, providing the minimum squared deviations from a set of experimental
observations. The regression procedure involves several implicit assumptions [1], which might
easily be overlooked, for example:
Cite as: Hernandez, H. (2023). Heteroscedastic Regression Models. ForsChem Research Reports, 8, 2023-
08, 1 - 29. Publication Date: 31/05/2023.
- The model error is linear. That is, it is simply added to the value predicted by the
regression equation.
- The error term has a mean value of zero.
- All independent variables, as well as other external factors (such as observation
sequence), are uncorrelated with the error term.
- The error term has a constant variance (that is, it is homoscedastic).
- The error term is normally distributed. While this last assumption is not strictly
necessary for regression, it is implicitly assumed when analyzing the significance of the
model.
Considering the previous assumptions, we may write the general regression model as follows:
$$y = f(\mathbf{x}, \boldsymbol{\beta}) + \sigma_\varepsilon \varepsilon$$
(1.1)
where $y$ represents the observed variable, $\mathbf{x}$ is the vector of independent variables considered
in the model, $\boldsymbol{\beta}$ is the vector of parameter values considered in the model, $f(\mathbf{x}, \boldsymbol{\beta})$ is the
regression equation or prediction model, $\sigma_\varepsilon$ is a constant parameter representing the standard
deviation of the model error (or standard error), and $\varepsilon$ is the standard normal random variable.
The standard error is commonly estimated from the differences between predictions and
actual observations using the following expression:
$$\hat{\sigma}_\varepsilon = \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \right)^2}{n - p}}$$
(1.2)
where the circumflex accent ($\hat{\ }$) indicates a point estimation of the parameter, $n$ is the
number of observations considered, $y_i$ and $\mathbf{x}_i$ represent the values of the response variable
and independent variables (respectively) for the $i$-th observation, $p$ is the number of
parameters considered in the regression equation (including the intercept), and $n - p$
represents the degrees of freedom available for estimating the standard error of the model.
The sum considered in Eq. (1.2) is denoted as the Sum of Squared Errors or $SSE$:
$$SSE(\hat{\boldsymbol{\beta}}) = \sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \right)^2$$
(1.3)
Notice that the sum of squared errors is a function of the experimental observations and the
estimated parameter values.
Ideally, the best estimation of the parameter values of a model should guarantee the minimum
estimated standard error, represented by the following optimization problem:
$$\min_{\hat{\boldsymbol{\beta}}} \hat{\sigma}_\varepsilon$$
(1.4)
or equivalently,
$$\min_{\hat{\boldsymbol{\beta}}} \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \right)^2}{n - p}}$$
(1.5)
Now, since the number of observations ($n$) and the number of parameters ($p$) in a regression
equation are fixed for a typical regression analysis, the number of degrees of freedom
($n - p$) is a constant having no effect on the optimal values of the parameters. In addition,
considering Eq. (1.3), we may simply transform the optimization problem into:
$$\min_{\hat{\boldsymbol{\beta}}} SSE(\hat{\boldsymbol{\beta}})$$
(1.6)
which is commonly known as a least squares minimization problem.
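As a minimal sketch of Eqs. (1.2)-(1.6), the following snippet fits a simple linear model by ordinary least squares and then estimates the standard error from the residual degrees of freedom. The data is synthetic, for illustration only:

```python
import numpy as np

# Ordinary least squares for f(x, b) = b0 + b1*x (Eq. 1.6), followed by the
# standard error estimate of Eq. (1.2). Synthetic data, illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 + 2.0 * x + rng.normal(0.0, 0.5, size=x.size)

X = np.column_stack([np.ones_like(x), x])      # design matrix (intercept + slope)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sse = np.sum(residuals**2)                     # Eq. (1.3)
n, p = x.size, X.shape[1]
sigma_hat = np.sqrt(sse / (n - p))             # Eq. (1.2)
print(beta_hat, sigma_hat)
```

With the simulated noise level of 0.5, the recovered `sigma_hat` should be close to that value.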
Unfortunately, not all real situations satisfy all implicit assumptions of conventional regression.
In this report, we will particularly consider the case where the variance of the error term is not
constant. This condition is known as heteroscedasticity. In this case, the general representation
of a heteroscedastic model is:
$$y = f(\mathbf{x}, \boldsymbol{\beta}) + \sigma_\varepsilon(\mathbf{x})\, \varepsilon$$
(1.7)
This means that heteroscedastic regression requires fitting not only the conventional regression
equation $f(\mathbf{x}, \boldsymbol{\beta})$ but also the standard error model $\sigma_\varepsilon(\mathbf{x})$, while at the same time
identifying the standard random distribution model $\varepsilon$.
In this report, a strategy for performing heteroscedastic regression is proposed based on the
minimization of weighted sums of squared deviations. Some illustrative examples are
presented for clarity.
2. Weighted Least Squares Regression
In weighted least squares regression, the least squares minimization problem of Eq. (1.6) is
generalized as follows:
$$\min_{\hat{\boldsymbol{\beta}}} \sum_{i=1}^{n} w_i \left( y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \right)^2$$
(2.1)
where $w_i$ represents the weight of the $i$-th observation. The sum in Eq. (2.1) will be denoted as
a weighted $SSE$ (or $WSSE$).
In an ordinary least squares ($OLS$) minimization, all observations have the same unit weight,
that is, $w_i = 1$ for $i = 1, 2, \ldots, n$.
In theory, the optimal weights for weighted least squares ($WLS$) regression are [3]:
$$w_i = \frac{1}{\sigma_i^2}$$
(2.2)
where $\sigma_i^2$ is the variance of the $i$-th observation (allowing for heteroscedasticity).
Unfortunately, $\sigma_i^2$ is also unknown, but there are various estimation methods available [3].
However, the approach considered here is different.
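For reference, the classical WLS estimate of Eq. (2.1) with the theoretical weights of Eq. (2.2) can be sketched as follows, solving the weighted normal equations directly. The heteroscedastic data is synthetic and the variances are assumed known, for illustration only:

```python
import numpy as np

# Weighted least squares with w_i = 1/sigma_i^2 (Eq. 2.2), via the weighted
# normal equations (X^T W X) beta = X^T W y. Synthetic heteroscedastic data.
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 200)
sigma_i = 0.2 * x                              # standard deviation grows with x
y = 1.0 + 3.0 * x + rng.normal(0.0, sigma_i)

X = np.column_stack([np.ones_like(x), x])
w = 1.0 / sigma_i**2                           # Eq. (2.2)

XtW = X.T * w                                  # broadcasts weights over columns
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)
print(beta_wls)
```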
The idea presented in this work is the identification of functions describing the limits of a
bilateral confidence interval of the regression model, as illustrated in Figure 1 (for a simple
linear regression model). Black markers represent a particular set of observations. The green
line represents the prediction model obtained by regression. The blue and red lines
indicate the lower and upper limits of the confidence interval for the prediction of individual
observations.
$$LL_{1-\alpha}(\mathbf{x}, \hat{\boldsymbol{\beta}}, \hat{\sigma}_\varepsilon) = f(\mathbf{x}, \hat{\boldsymbol{\beta}}) - k_{LL}(\alpha)\, \hat{\sigma}_\varepsilon$$
(2.3)
$$UL_{1-\alpha}(\mathbf{x}, \hat{\boldsymbol{\beta}}, \hat{\sigma}_\varepsilon) = f(\mathbf{x}, \hat{\boldsymbol{\beta}}) + k_{UL}(\alpha)\, \hat{\sigma}_\varepsilon$$
(2.4)
where $LL$ is the function describing the lower limit of the confidence interval, $UL$ is the
function describing the upper limit of the confidence interval, $\alpha$ is the significance level
(complement of the confidence level), and $k_{LL}$ and $k_{UL}$ are coefficients representing the
standardized location of the confidence interval limits in terms of the number of standard error
deviations about the prediction, which are also dependent on the type of random error
model considered ($\varepsilon$).
Figure 1. Linear limits of a confidence interval for a simple linear regression model
Let us now consider that the confidence interval limits can be represented by the general
functions (independent of the random error model):
$$LL_{1-\alpha}(\mathbf{x}, \hat{\boldsymbol{\beta}}, \hat{\sigma}_\varepsilon) = LL(\mathbf{x}, \hat{\boldsymbol{\beta}}_{LL})$$
(2.5)
$$UL_{1-\alpha}(\mathbf{x}, \hat{\boldsymbol{\beta}}, \hat{\sigma}_\varepsilon) = UL(\mathbf{x}, \hat{\boldsymbol{\beta}}_{UL})$$
(2.6)
In principle, the limit functions represent functions that minimize the deviation of the different
observations with respect to the corresponding limit, while at the same time maintaining a
certain proportion of observations at each side of the line. This behavior can be obtained with
the following weights:
$$w_{LL,i} = \begin{cases} \dfrac{\alpha}{2}, & y_i \geq LL(\mathbf{x}_i, \hat{\boldsymbol{\beta}}_{LL}) \\[4pt] 1 - \dfrac{\alpha}{2}, & y_i < LL(\mathbf{x}_i, \hat{\boldsymbol{\beta}}_{LL}) \end{cases}$$
(2.7)
$$w_{UL,i} = \begin{cases} 1 - \dfrac{\alpha}{2}, & y_i > UL(\mathbf{x}_i, \hat{\boldsymbol{\beta}}_{UL}) \\[4pt] \dfrac{\alpha}{2}, & y_i \leq UL(\mathbf{x}_i, \hat{\boldsymbol{\beta}}_{UL}) \end{cases}$$
(2.8)
where $w_{LL,i}$ and $w_{UL,i}$ are the weights for the lower and upper limit, respectively.
Thus, the confidence interval limit functions can be obtained by minimization as follows:
$$\min_{\hat{\boldsymbol{\beta}}_{LL}} \sum_{i=1}^{n} w_{LL,i} \left( y_i - LL(\mathbf{x}_i, \hat{\boldsymbol{\beta}}_{LL}) \right)^2$$
(2.9)
$$\min_{\hat{\boldsymbol{\beta}}_{UL}} \sum_{i=1}^{n} w_{UL,i} \left( y_i - UL(\mathbf{x}_i, \hat{\boldsymbol{\beta}}_{UL}) \right)^2$$
(2.10)
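A hedged sketch of this limit-fitting idea for linear limit functions: on my reading of the weighting scheme, observations on the far side of a limit receive weight $\alpha/2$ and observations on the near side receive $1 - \alpha/2$, and since the weights depend on the current limit, the weighted least squares fit is iterated until it stabilizes. The data is synthetic:

```python
import numpy as np

# Iteratively reweighted least squares for one confidence interval limit.
# Weights (my reading of the scheme): for the lower limit, points above the
# current line get alpha/2 and points below get 1 - alpha/2, so roughly an
# alpha/2-like fraction of observations ends up below the fitted limit.
def fit_limit(x, y, alpha=0.05, lower=True, n_iter=50):
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]       # start from the OLS fit
    for _ in range(n_iter):
        above = y >= X @ b
        if lower:
            w = np.where(above, alpha / 2, 1 - alpha / 2)
        else:
            w = np.where(above, 1 - alpha / 2, alpha / 2)
        XtW = X.T * w
        b = np.linalg.solve(XtW @ X, XtW @ y)      # weighted least squares step
    return b

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 500)
y = 1.5 + 1.2 * x + rng.normal(0.0, 1.0, size=x.size)
b_low = fit_limit(x, y, alpha=0.05, lower=True)
b_up = fit_limit(x, y, alpha=0.05, lower=False)
```

Note that, as discussed at the end of this section, asymmetric squared-deviation weighting does not land the limits exactly on the corresponding quantiles, which is why a correction factor is introduced later.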
As an example, let us consider the data presented in Figure 1, and summarized in Table 1.
Assuming a linear model for the limits of the confidence interval, we simply have:
$$LL(x, \hat{b}_{0,LL}, \hat{b}_{1,LL}) = \hat{b}_{0,LL} + \hat{b}_{1,LL}\, x$$
(2.11)
$$UL(x, \hat{b}_{0,UL}, \hat{b}_{1,UL}) = \hat{b}_{0,UL} + \hat{b}_{1,UL}\, x$$
(2.12)
The results obtained for different confidence levels are summarized in Figure 2 and Table 2.
Figure 2. Evolution of confidence interval limits for different significance levels using simple
weighted linear regression. Horizontal axes: Independent variable . Vertical axes: Response
variable .
Table 2. Coefficients of the confidence interval limits linear model for different significance
levels, estimated by weighted least squares minimization
Confidence Level   Significance Level   $\hat{b}_{0,LL}$   $\hat{b}_{1,LL}$   $\hat{b}_{0,UL}$   $\hat{b}_{1,UL}$
0% 100% 1.5709 1.2599 1.5709 1.2599 42.20 42.20
50% 50% 1.3159 0.9014 1.8405 1.6547 62.03 71.63
70% 30% 1.1971 0.7539 1.9933 1.8648 71.59 89.47
80% 20% 1.1120 0.6888 2.1416 1.9614 77.41 101.30
90% 10% 1.0093 0.6294 2.3450 2.0503 85.79 117.58
95% 5% 0.9137 0.5664 2.4571 2.1428 93.57 132.04
98% 2% 0.7711 0.5307 2.5056 2.3291 100.68 145.34
99% 1% 0.7012 0.5227 2.5527 2.3851 103.49 150.64
99.9% 0.1% 0.6320 0.5162 2.5984 2.4395 106.22 155.81
99.99999% 0.00001% 0.6255 0.5076 2.5293 2.6317 107.10 170.72
While this strategy is not rigorous in the sense that the confidence intervals obtained do not
necessarily provide the exact confidence level required, there is a clear relation between the
confidence level and the amplitude of the corresponding confidence intervals. This effect will
now be used to provide an initial estimate of the standard error function, which can be
corrected later.
3. Heteroscedastic Regression
In the previous section, it was mentioned that the type of random error model ($\varepsilon$) is unknown
before performing the regression procedure. However, it is possible to check whether a
particular random model fits the standardized residuals of the regression equation obtained.
For example, if a normal distribution is considered, it is possible to check the normality of the
standardized model residuals using a wide variety of tests [4]. For the previous example,
assuming a normal model of standardized residuals, the linear confidence interval limits (for
simple linear regression) can now be described as follows:
$$LL(x, \hat{b}_0, \hat{b}_1, \hat{s}_0, \hat{s}_1) = \hat{b}_0 + \hat{b}_1 x - z_{1-\alpha/2} \left( \hat{s}_0 + \hat{s}_1 x \right)$$
(3.1)
$$UL(x, \hat{b}_0, \hat{b}_1, \hat{s}_0, \hat{s}_1) = \hat{b}_0 + \hat{b}_1 x + z_{1-\alpha/2} \left( \hat{s}_0 + \hat{s}_1 x \right)$$
(3.2)
where $\hat{b}_0$ and $\hat{b}_1$ are the coefficients of the regression line, $\hat{s}_0$ and $\hat{s}_1$ are the coefficients of a
linear model of the standard error, and $z_{1-\alpha/2}$ is the corresponding quantile of the standard
normal distribution.
The confidence level might also be included in an optimization problem for maximizing the
normality of the standardized residuals.
While the terms $\hat{b}_0$ and $\hat{b}_1$ are obtained by $OLS$ minimization, the coefficients $\hat{s}_0$ and $\hat{s}_1$
can be estimated after solving the following minimization problem (using the suggested
confidence level in this case):
$$\min_{\hat{s}_0, \hat{s}_1} \left[ \sum_{i=1}^{n} w_{LL,i} \left( y_i - \hat{b}_0 - \hat{b}_1 x_i + z_{1-\alpha/2} \left( \hat{s}_0 + \hat{s}_1 x_i \right) \right)^2 + \sum_{i=1}^{n} w_{UL,i} \left( y_i - \hat{b}_0 - \hat{b}_1 x_i - z_{1-\alpha/2} \left( \hat{s}_0 + \hat{s}_1 x_i \right) \right)^2 \right]$$
(3.3)
where
$$w_{LL,i} = \begin{cases} \dfrac{\alpha}{2}, & y_i \geq LL(x_i, \hat{b}_0, \hat{b}_1, \hat{s}_0, \hat{s}_1) \\[4pt] 1 - \dfrac{\alpha}{2}, & y_i < LL(x_i, \hat{b}_0, \hat{b}_1, \hat{s}_0, \hat{s}_1) \end{cases}$$
(3.4)
$$w_{UL,i} = \begin{cases} 1 - \dfrac{\alpha}{2}, & y_i > UL(x_i, \hat{b}_0, \hat{b}_1, \hat{s}_0, \hat{s}_1) \\[4pt] \dfrac{\alpha}{2}, & y_i \leq UL(x_i, \hat{b}_0, \hat{b}_1, \hat{s}_0, \hat{s}_1) \end{cases}$$
(3.5)
Considering again the previous example, the solution of Eq. (3.3) yields the complete
regression model (including the error model) shown in Eq. (3.6). The first two terms represent
the regression equation (model of the mean value), while the last expression is the error
model, where $\kappa$ is a correction factor compensating for the error in the estimation of the
confidence intervals. Figure 3 shows the normality plot of the model residuals standardized
with respect to the uncorrected model of standard error. The correction factor for this
particular example was obtained from the linear fit of the uncorrected standardized residuals
to the normal quantiles. Thus, the regression model can alternatively be expressed as in
Eq. (3.7).
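The estimation of the correction factor from the normality plot can be sketched as follows: the ordered standardized residuals are regressed on the corresponding standard normal quantiles with zero intercept, and the slope plays the role of the correction factor. The residuals and the fitted standard error model below are synthetic assumptions (their true scale is deliberately 1.3 times the assumed model):

```python
import numpy as np
from scipy import stats

# Zero-intercept linear fit of ordered standardized residuals against normal
# quantiles; the slope is the correction factor. Synthetic data: the true
# residual scale is 1.3x the assumed model, so the slope should be near 1.3.
rng = np.random.default_rng(3)
sigma_model = np.linspace(0.5, 2.0, 300)        # assumed fitted sigma_eps(x_i)
residuals = rng.normal(0.0, 1.3 * sigma_model)  # synthetic model residuals

z = np.sort(residuals / sigma_model)            # uncorrected standardized residuals
q = stats.norm.ppf((np.arange(z.size) + 0.5) / z.size)  # normal quantiles

kappa = np.sum(q * z) / np.sum(q * q)           # zero-intercept slope
print(kappa)
```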
Of course, not only the regression equation but also the standard error equation can be
nonlinear. In fact, there are many practical situations where the model residuals show
hourglass, rhomboid, or more complex shapes. In most cases, a quadratic model of standard
error may suffice to describe heteroscedasticity. The heteroscedastic regression model can be
expressed, in general, for a normal error model as follows:
$$y = f(\mathbf{x}, \boldsymbol{\beta}) + \left| s_0 + \mathbf{s}_1^T \mathbf{x} + \sum_{j} \sum_{k \leq j} s_{2,jk}\, x_j x_k \right| \varepsilon$$
(3.8)
where $s_0$ is the scalar intercept, $\mathbf{s}_1$ is the column vector of coefficients for the linear terms,
$S_2$ is a lower triangular matrix of coefficients for the quadratic terms and interactions
between independent variables, $s_{2,jk}$ is the corresponding element at the $j$-th row and
$k$-th column, and $x_j$ represents the $j$-th independent variable. The standard error function is
taken in absolute value to guarantee only positive values.
The parameters of the standard deviation equation can be determined using the following
procedure:
1. Estimate the parameters of the regression equation ($\hat{\boldsymbol{\beta}}$) using ordinary least squares,
by solving the minimization problem stated in Eq. (1.6).
2. Calculate the model residuals ($y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$), where $\mathbf{x}_i$ represents the vector of
values of the independent variables for the $i$-th observation, and test the normality of the
residuals.
3. Assume a normal model for the residuals and solve the following minimization
problem, where $\hat{\sigma}_\varepsilon(\mathbf{x}_i) = \left| \hat{s}_0 + \hat{\mathbf{s}}_1^T \mathbf{x}_i + \sum_{j} \sum_{k \leq j} \hat{s}_{2,jk}\, x_{ij} x_{ik} \right|$ denotes the standard
error model of Eq. (3.8) evaluated at the $i$-th observation:
$$\min_{\hat{s}_0, \hat{\mathbf{s}}_1, \hat{S}_2} \left[ \sum_{i=1}^{n} w_{LL,i} \left( y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) + z_{1-\alpha/2}\, \hat{\sigma}_\varepsilon(\mathbf{x}_i) \right)^2 + \sum_{i=1}^{n} w_{UL,i} \left( y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) - z_{1-\alpha/2}\, \hat{\sigma}_\varepsilon(\mathbf{x}_i) \right)^2 \right]$$
(3.9)
where
$$w_{LL,i} = \begin{cases} \dfrac{\alpha}{2}, & y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \geq -z_{1-\alpha/2}\, \hat{\sigma}_\varepsilon(\mathbf{x}_i) \\[4pt] 1 - \dfrac{\alpha}{2}, & \text{otherwise} \end{cases}$$
(3.10)
$$w_{UL,i} = \begin{cases} \dfrac{\alpha}{2}, & y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \leq z_{1-\alpha/2}\, \hat{\sigma}_\varepsilon(\mathbf{x}_i) \\[4pt] 1 - \dfrac{\alpha}{2}, & \text{otherwise} \end{cases}$$
(3.11)
Then, solve a new minimization problem, generalizing Eq. (3.9) to the quantiles of the random
model considered (the set of parameters obtained under the normal assumption can be used
as a starting point for the new optimization):
$$\min_{\hat{s}_0, \hat{\mathbf{s}}_1, \hat{S}_2} \left[ \sum_{i=1}^{n} w_{LL,i} \left( y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) - Q_\varepsilon\!\left(\tfrac{\alpha}{2}\right) \hat{\sigma}_\varepsilon(\mathbf{x}_i) \right)^2 + \sum_{i=1}^{n} w_{UL,i} \left( y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) - Q_\varepsilon\!\left(1 - \tfrac{\alpha}{2}\right) \hat{\sigma}_\varepsilon(\mathbf{x}_i) \right)^2 \right]$$
(3.15)
where $Q_\varepsilon$ is the quantile function of the random model considered, $\hat{\sigma}_\varepsilon(\mathbf{x}_i)$ is the standard
error model of Eq. (3.8) evaluated at the $i$-th observation, and
$$w_{LL,i} = \begin{cases} \dfrac{\alpha}{2}, & y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \geq Q_\varepsilon\!\left(\tfrac{\alpha}{2}\right) \hat{\sigma}_\varepsilon(\mathbf{x}_i) \\[4pt] 1 - \dfrac{\alpha}{2}, & \text{otherwise} \end{cases}$$
(3.16)
$$w_{UL,i} = \begin{cases} \dfrac{\alpha}{2}, & y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}}) \leq Q_\varepsilon\!\left(1 - \tfrac{\alpha}{2}\right) \hat{\sigma}_\varepsilon(\mathbf{x}_i) \\[4pt] 1 - \dfrac{\alpha}{2}, & \text{otherwise} \end{cases}$$
(3.17)
and estimate $\kappa$ from the linear fit to the corresponding quantiles of the random model
considered (assuming a zero intercept).
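The minimization of step 3 can be sketched with a generic optimizer. This is a hedged simplification: the standard error model is reduced to a linear function of a single independent variable (the quadratic model of Eq. 3.8 only adds coefficients), the data is synthetic, and a significance level of 0.05 is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

# Weighted-SSE fit of a linear standard error model sigma_eps(x) = |s0 + s1*x|
# about an OLS regression line, assuming normal standardized residuals.
rng = np.random.default_rng(4)
x = np.linspace(1.0, 10.0, 400)
y = 1.5 + 1.2 * x + rng.normal(0.0, 0.1 + 0.2 * x)   # heteroscedastic noise

X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]   # step 1: OLS regression line
resid = y - (b0 + b1 * x)                       # step 2: model residuals
z = 1.959964                                    # z_{1-alpha/2} for alpha = 0.05

def wsse(s):
    sig = np.abs(s[0] + s[1] * x)               # candidate standard error model
    w_ll = np.where(resid >= -z * sig, 0.025, 0.975)  # lower-limit weights
    w_ul = np.where(resid <= z * sig, 0.025, 0.975)   # upper-limit weights
    return (np.sum(w_ll * (resid + z * sig) ** 2)
            + np.sum(w_ul * (resid - z * sig) ** 2))

s_hat = minimize(wsse, x0=[0.1, 0.1], method="Nelder-Mead").x
print(s_hat)
```

The fitted model tracks the growth of the residual spread with x; its absolute scale is then adjusted by the correction factor, since the weighted limits are not exact quantiles.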
4. Heteroscedasticity Test
After fitting the parameters of the standard error equation using the $WSSE$ minimization
approach, a highly relevant question may arise: Can the model obtained be truly considered
heteroscedastic, or is it simply homoscedastic?
The answer to such a question will usually be: it depends. Particularly, it depends on the range
of values considered for the independent variables, but it also depends on the number of
observations performed.
First of all, let us recall that the standard deviation of normal populations can be approximated
by the chi ($\chi$) or Helmert distribution [5]. Thus, if we compare the behavior of the standard
error model with that of the $\chi$ distribution, we can assess whether the residuals are
homoscedastic or heteroscedastic.
The decision criterion can simply be stated as follows: the model residuals can be considered
heteroscedastic if the coefficient of variation of the standard error model, over the range of
values of the independent variables considered, is larger than the coefficient of variation of
the $\chi$ distribution with $n - p$ degrees of freedom. That is, the model residuals can be
considered heteroscedastic if:
$$CV_{\sigma_\varepsilon}(\mathbf{x}) > CV_\chi(n-p) = \sqrt{\frac{n-p}{2} \left( \frac{\Gamma\!\left(\frac{n-p}{2}\right)}{\Gamma\!\left(\frac{n-p+1}{2}\right)} \right)^2 - 1}$$
(4.1)
which can be approximated by:
$$CV_\chi(n-p) \approx \frac{1}{\sqrt{2(n-p)}}$$
(4.2)
Table 3 summarizes the behavior of the exact (Eq. 4.1) and approximated (Eq. 4.2) coefficient
of variation of the distribution for different sample sizes. These results are also illustrated in
Figure 4.
Table 3. Exact and approximated coefficients of variation of the or Helmert distribution for
different sample sizes
Sample size ($n$)   Exact (Eq. 4.1)   Approximated (Eq. 4.2)
2 0.52272 0.52223
3 0.42202 0.42206
4 0.36300 0.36328
5 0.32321 0.32356
6 0.29410 0.29444
7 0.27164 0.27194
8 0.25362 0.25388
9 0.23876 0.23897
10 0.22624 0.22639
15 0.18404 0.18398
20 0.15907 0.15888
25 0.14211 0.14184
30 0.12963 0.12929
40 0.11215 0.11175
50 0.10025 0.09981
100 0.07080 0.07034
200 0.05003 0.04962
500 0.03163 0.03131
1000 0.02236 0.02212
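The exact column of Table 3 can be reproduced from the moments of the chi distribution, using the log-gamma function for numerical stability. For instance:

```python
import numpy as np
from scipy.special import gammaln

# Exact coefficient of variation of the chi (Helmert) distribution:
# mean of chi_n = sqrt(2) * Gamma((n+1)/2) / Gamma(n/2); variance = n - mean^2.
def cv_chi(n):
    log_ratio = gammaln((n + 1) / 2) - gammaln(n / 2)
    mean = np.sqrt(2.0) * np.exp(log_ratio)
    return np.sqrt(n - mean**2) / mean

print(round(cv_chi(2), 5))    # 0.52272, matching the Table 3 entry for n = 2
print(round(cv_chi(10), 5))   # 0.22624, matching the Table 3 entry for n = 10
```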
On the other hand, the coefficient of variation of the standard error model is obtained as
follows:
$$CV_{\sigma_\varepsilon}(\mathbf{x}) = \frac{\sqrt{\dfrac{\int \left( \sigma_\varepsilon(\mathbf{x}) - \bar{\sigma}_\varepsilon \right)^2 d\mathbf{x}}{\int d\mathbf{x}}}}{\bar{\sigma}_\varepsilon}, \qquad \bar{\sigma}_\varepsilon = \frac{\int \sigma_\varepsilon(\mathbf{x})\, d\mathbf{x}}{\int d\mathbf{x}}$$
(4.3)
where the integrals are evaluated over the range of values of the independent variables
considered.
The previous expression tests the overall heteroscedasticity of the model residuals. However, it
is also possible to test heteroscedasticity for each independent variable $x_j$ in the model by
keeping all other variables at a constant level:
$$\bar{\sigma}_{\varepsilon,j} = \frac{\int \sigma_\varepsilon(\mathbf{x})\, dx_j}{\int dx_j}$$
(4.4)
$$\overline{\sigma^2_{\varepsilon,j}} = \frac{\int \sigma_\varepsilon^2(\mathbf{x})\, dx_j}{\int dx_j}$$
(4.5)
$$CV_{\sigma_\varepsilon}(x_j) = \frac{\sqrt{\overline{\sigma^2_{\varepsilon,j}} - \bar{\sigma}_{\varepsilon,j}^2}}{\bar{\sigma}_{\varepsilon,j}}$$
(4.6)
Now, for the example considered, the coefficient of variation of the standard error model is
given by Eq. (4.7), and since $CV_{\sigma_\varepsilon}(x) > CV_\chi(n-p)$, we can conclude that the model
residuals are heteroscedastic.
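The complete test can be sketched numerically: the integrals of Eq. (4.3) are approximated by averages over a dense uniform grid of x values, and the result is compared with the chi-distribution coefficient of variation of Eq. (4.1). The standard error model and the sample size below are illustrative assumptions:

```python
import numpy as np
from scipy.special import gammaln

# Heteroscedasticity test: CV of an assumed standard error model over the
# observed x range (Eq. 4.3, grid approximation) vs. CV of the chi
# distribution with n - p degrees of freedom (Eq. 4.1).
def cv_chi(nu):
    mean = np.sqrt(2.0) * np.exp(gammaln((nu + 1) / 2) - gammaln(nu / 2))
    return np.sqrt(nu - mean**2) / mean

x = np.linspace(0.0, 3.0, 2001)        # range of the independent variable
sigma = np.abs(0.5 + 0.8 * x)          # assumed linear standard error model

mean_sigma = sigma.mean()              # grid average approximates the integral
cv_model = np.sqrt(((sigma - mean_sigma) ** 2).mean()) / mean_sigma

n, p = 40, 2                           # assumed observations and parameters
heteroscedastic = bool(cv_model > cv_chi(n - p))
print(cv_model, heteroscedastic)
```

For this assumed model the CV is about 0.41, well above the chi-distribution reference, so the residuals would be declared heteroscedastic.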
5. Illustrative Examples
Zeng and Zhang [6] report high-performance liquid chromatography (HPLC) calibration data
for acetaldehyde-2,4-dinitrophenylhydrazone (DNPH) solutions. The calibration data is
summarized in Table 4.
Table 4. HPLC calibration data for acetaldehyde-DNPH reported by Zeng and Zhang [6]
Acetaldehyde (µg) Area (arb.units) Acetaldehyde (µg) Area (arb.units)
3.700E-04 1.420E+01 1.539E-01 3.893E+03
3.730E-03 8.700E+01 1.539E-01 3.974E+03
9.230E-03 2.377E+02 3.080E-01 7.808E+03
1.524E-02 3.751E+02 3.080E-01 7.800E+03
1.537E-02 3.713E+02 6.156E-01 1.601E+04
1.539E-02 3.736E+02 6.156E-01 1.557E+04
1.539E-02 3.755E+02 1.231E+00 3.138E+04
1.539E-02 3.875E+02 1.231E+00 3.094E+04
1.539E-02 3.793E+02 1.538E+00 4.130E+04
1.539E-02 3.718E+02 1.539E+00 4.062E+04
1.539E-02 3.903E+02 1.847E+00 4.717E+04
1.693E-02 4.320E+02 1.847E+00 4.830E+04
3.078E-02 7.744E+02 2.462E+00 6.130E+04
3.078E-02 7.697E+02 2.462E+00 6.223E+04
3.950E-02 9.676E+02 3.078E+00 7.436E+04
7.633E-02 2.007E+03 3.078E+00 7.401E+04
7.694E-02 1.955E+03
The simple linear regression model obtained by ordinary least squares for this data set,
assuming an intercept of zero, is given in Eq. (5.1).
Figure 5 shows the regression line in original and logarithm scales, and the behavior of the
model residuals. Despite the high determination coefficient of the model, and its excellent
performance in both original and logarithm scales, we also notice that the model residuals
(under the homoscedastic assumption) are not normal, and that the model residuals seem
heteroscedastic.
Figure 5. Graphical summary of the linear model given in Eq. (5.1). Top left: Fit of regression
model in original scale. Top right: Fit of regression model in logarithm scale. Bottom left:
Heteroscedasticity plot (log-scale). Bottom right: Normality plot.
Assuming that the residuals are heteroscedastic, a quadratic standard error model is obtained
(Eq. 5.2), where $\varepsilon$ represents the random model of standardized residuals, and $\kappa$ is the
correction factor.
The model residuals are standardized with the uncorrected standard deviation model and the
results are summarized in Figure 6.
Figure 6. Uncorrected standardized model residuals plots considering Eq. (5.2). Left:
Heteroscedasticity plot (log-scale). Right: Normality plot.
Assuming that the normal model can be considered valid for the standardized residuals, the full
regression model can be expressed as in Eq. (5.3), where $\varepsilon$ is the standard normal random
variable.
The regression model plot and model residuals plot, including the limits of the
confidence interval, are shown in Figure 7.
Figure 7. Acetaldehyde calibration model (Eq. 5.3) plot (left) and model residuals plot (right)
including confidence interval limits for a quadratic standard deviation model and a
normal model of standardized residuals.
Tamhane and Logan [7] reported the results obtained in a 90-day routine rat study used to
evaluate the toxicity of a crop protection compound. Different doses of the compound were
added to the diet of different rodents, and their final kidney weight to body weight ratio is
determined, as a measure of the toxic impact of the compound. The reported data is
summarized in Table 5, and also illustrated in Figure 8. The authors suggest fitting the data
using a quadratic model, and found that the data was heteroscedastic using both Bartlett’s and
Levene’s heteroscedasticity tests.
Table 5. Toxicity data reported by Tamhane and Logan [7]. Measured variable: Kidney weight to
body weight ratio of rats after a 90-day exposure to a crop protection compound in their diet.
Dose [a.u.]: 0 1 2 3
0.006593 0.007062 0.007006 0.009569
0.007480 0.007347 0.008706 0.009362
0.006930 0.007733 0.007257 0.010911
0.005662 0.007396 0.007743 0.009961
0.006789 0.008173 0.007026 0.009497
0.007268 0.006983 0.008561 0.009911
0.006647 0.006988 0.007674 0.008544
0.006443 0.006621 0.007450 0.010404
0.006713 0.007508 0.008188 0.010421
0.006057 0.006657 0.008150 0.010065
0.006253 0.007787 0.007619 0.009670
0.007045 0.006537 0.008722 0.008194
0.006552 0.007369 0.007387 0.008989
0.005668 0.006623 0.006798 0.007347
0.006354 0.006456 0.007617 0.007260
0.006511 0.006507 0.008071 0.009017
0.007111 0.006154 0.007020 0.008847
0.006015 0.005934 0.007821 0.008723
0.006909 0.007063
0.007252
In this example, a model for the standard deviation will be obtained from the reported data,
and heteroscedasticity will be tested with the procedure proposed in Section 4.
The best quadratic model obtained by ordinary least squares minimization, together with its
coefficient of determination, is given in Eq. (5.4).
Figure 8. Toxicity data reported by Tamhane and Logan [7] with the corresponding suggested
regression model (Eq. 5.4)
The model residuals are used to fit a quadratic standard deviation model, assuming a normal
model of standardized residuals, resulting in Eq. (5.5).
Figure 9. Uncorrected standardized model residuals plots considering Eq. (5.5). Left:
Heteroscedasticity plot (log-scale). Right: Normality plot.
The coefficient of variation of the standard error model obtained from Eq. (5.5), over the
range of doses from 0 to 3, is given in Eq. (5.6). Since
$$CV_{\sigma_\varepsilon}(x) > CV_\chi(n-p)$$
(5.7)
we observe that the residuals are heteroscedastic, consistent with Bartlett's and Levene's
tests.
Figure 10. Toxicity model (Eq. 5.4 and 5.5) plot (left) and model residuals plot (right) including
confidence interval limits for quadratic regression equation, quadratic standard
deviation model and a normal model of standardized residuals.
Liu et al. [8] reported the combined effect of Gaminitrib ($x_1$) and BKM120 ($x_2$) on the fractional
growth inhibition ($y$) of cancer cells. The data obtained is summarized in Table 6.
The multiple quadratic regression model obtained by least squares minimization is given in
Eq. (5.8). Assuming a normal distribution of standardized residuals, and minimizing the
weighted sum of squares, the model error can be expressed as the quadratic function given in
Eq. (5.9), with the correction factor $\kappa$ obtained from the normality plot.
A graphical summary of the model prediction and model residuals obtained from Eq. (5.8) and
(5.9) is presented in Figure 11.
The expected value of the standard error model shown in Eq. (5.9), for the dose ranges
considered for both substances, is given in Eq. (5.10).
Figure 11. Left: Predicted vs. observed fractional growth inhibition for the combined dose
response model (Eq. 5.8 and 5.9), considering a quadratic regression equation, quadratic
standard deviation model, a normal model of standardized residuals, and confidence
interval limits. Right: Normality plot for the uncorrected standardized model residuals.
The corresponding variance of the standard error model is given in Eq. (5.11). Thus, the overall
coefficient of variation is given in Eq. (5.12). Since
$$CV_{\sigma_\varepsilon}(\mathbf{x}) > CV_\chi(n-p)$$
(5.13)
the overall data is heteroscedastic.
In the absence of BKM120, the coefficient of variation of the standard error model due to
Gaminitrib only is:
$$CV_{\sigma_\varepsilon}(x_1 | x_2 = 0) = \frac{\sqrt{\dfrac{\int \left( \sigma_\varepsilon(x_1, 0) - \bar{\sigma}_\varepsilon(x_1 | x_2 = 0) \right)^2 dx_1}{\int dx_1}}}{\bar{\sigma}_\varepsilon(x_1 | x_2 = 0)}, \qquad \bar{\sigma}_\varepsilon(x_1 | x_2 = 0) = \frac{\int \sigma_\varepsilon(x_1, 0)\, dx_1}{\int dx_1}$$
(5.14)
Whereas the coefficient of variation due to BKM120 in the absence of Gaminitrib is:
$$CV_{\sigma_\varepsilon}(x_2 | x_1 = 0) = \frac{\sqrt{\dfrac{\int \left( \sigma_\varepsilon(0, x_2) - \bar{\sigma}_\varepsilon(x_2 | x_1 = 0) \right)^2 dx_2}{\int dx_2}}}{\bar{\sigma}_\varepsilon(x_2 | x_1 = 0)}, \qquad \bar{\sigma}_\varepsilon(x_2 | x_1 = 0) = \frac{\int \sigma_\varepsilon(0, x_2)\, dx_2}{\int dx_2}$$
(5.15)
It can be concluded that the data is also heteroscedastic for each independent variable
analyzed individually.
Goethals and Cho [9] report the experimental results obtained in an experimental design with
three factors, each with three levels, used to optimize a wire bonding process in the
semiconductor manufacturing industry. The factors considered are flow rate ($x_1$), flow
temperature ($x_2$), and block temperature ($x_3$). The response variable is the bond temperature
($y$). A variable number of observations is reported for each treatment. The data is
summarized in Table 7.
Table 7. Experimental design used for optimizing a wire bonding process, as reported by
Goethals and Cho [9]
$x_1$   $x_2$   $x_3$   Observed bond temperatures   Mean ($\bar{y}$)
-1 -1 -1 110 110 110
1 -1 -1 125 126 125.5
-1 1 -1 184 151 133 147 140 151
1 1 -1 210 176 169 199 169 171 182.33
-1 -1 1 130 129 129.5
1 -1 1 130 134 132
-1 1 1 175 151 153 143 155.5
1 1 1 180 152 154 152 150 171 159.83
-1 -1 -1 103 101 102
1 -1 -1 206 143 138 176 141 135 156.5
-1 1 -1 157 139 148
1 1 -1 181 180 184 175 190 182
-1 -1 1 172 135 133 155 148.75
0 0 0 190 149 145 161 161.25
0 0 0 180 141 139 153.33
The bond temperature is modeled using a quadratic model in terms of the three factors
considered. The regression model obtained is given in Eq. (5.16). Considering a linear model
for the standard error, and normal standardized residuals, the resulting standard error
equation is given in Eq. (5.17). These results are graphically summarized in Figure 12, along
with the correction factor $\kappa$ found.
Considering a target value for the bond temperature of 145 [9], several predicted optimal
operating conditions can be identified within the search region considered for each factor. A
sample of possible optimal conditions, obtained from different random initial starting points in
the search region, is summarized in Table 8 in no particular order. Notice that the predicted
variability of each potential solution changes greatly from one solution to another, according
to Eq. (5.17).
Thus, a different optimization strategy is considered, where the standard error is minimized
while the bond temperature is constrained to the target value.
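This constrained strategy can be sketched with a generic solver. The model coefficients below are hypothetical placeholders (the fitted values of Eqs. 5.16-5.17 are not reproduced here), so only the mechanics of the optimization are illustrated:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize the predicted standard error subject to a predicted bond
# temperature of 145, inside the coded search region [-1, 1]^3.
def temp_model(v):                      # hypothetical quadratic regression model
    x1, x2, x3 = v
    return 150.0 + 8.0 * x1 + 20.0 * x2 + 5.0 * x3 + 4.0 * x1 * x2 - 3.0 * x3**2

def sigma_model(v):                     # hypothetical linear standard error model
    x1, x2, x3 = v
    return abs(15.0 + 4.0 * x1 + 3.0 * x2 - 2.0 * x3)

result = minimize(sigma_model, x0=np.zeros(3), method="SLSQP",
                  bounds=[(-1.0, 1.0)] * 3,
                  constraints=[{"type": "eq",
                                "fun": lambda v: temp_model(v) - 145.0}])
print(result.x, sigma_model(result.x))
```

The solver returns a single operating point that meets the target while driving the predicted variability toward its constrained minimum, instead of leaving the choice among many equally on-target solutions open.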
Figure 12. Left: Predicted vs. observed bond temperature using the model given by Eq. 5.16 and
5.17, considering a quadratic regression equation, linear standard deviation model, a normal
model of standardized residuals, and confidence interval limits. Right: Normality plot
for the uncorrected standardized model residuals.
Optimum #   $x_1$   $x_2$   $x_3$   Predicted bond temperature   Predicted standard error
1 1 -0.2478 1 145.00 18.21
2 -1 -0.5671 0.9933 145.00 14.39
3 -0.6063 -0.8714 0.5649 145.00 18.45
4 -0.7234 -0.8844 0.8066 145.00 17.28
5 -1 0.2901 -0.2886 145.00 15.30
6 -0.4591 -0.8213 0.1581 145.00 20.13
7 0.0459 -1 -0.6645 145.00 25.39
8 1 -0.6012 0.3773 145.00 22.18
9 0.1299 -1 -1 145.00 26.87
10 0.9461 -0.9271 -0.3277 145.00 26.17
11 -0.1474 -1 -0.1309 145.00 22.86
12 0.1167 -1 -0.94 145.00 26.61
13 -0.1920 -1 -0.0304 145.00 22.37
14 -0.2035 -0.8470 -0.4734 145.00 23.31
15 -0.2406 -0.8593 -0.3191 145.00 22.68
6. Conclusion
A new strategy for modeling the heteroscedastic error in regression analysis is presented. The
strategy consists of estimating the limits of a confidence interval by solving a weighted least
squares minimization problem after assuming a suitable distribution model for the residual
error. The squared deviations with respect to each limit of the confidence interval are weighted
as a function of the confidence level considered. Since the confidence limits are not exact, a
correction factor ($\kappa$) is introduced, which can easily be determined by checking the validity of
the distribution model initially assumed. In addition, while any general heteroscedastic model
can be used, most problems can be satisfactorily solved using a quadratic model. Finally, a
heteroscedasticity test is proposed as an alternative to classical statistical tests, based on the
coefficient of variation of the heteroscedastic model over the range of values of the
independent variables considered. Different practical examples were included in Section 5 to
illustrate the proposed strategy.
The complete procedure can be summarized as follows:
1. Estimate the parameters of the regression equation ($\hat{\boldsymbol{\beta}}$) using ordinary least squares,
by solving the minimization problem stated in Eq. (1.6).
2. Calculate the model residuals ($y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\beta}})$), where $\mathbf{x}_i$ represents the vector of
values of the independent variables for the $i$-th observation, and test the normality of the
residuals.
3. Assume a normal model for the residuals and solve the minimization problem stated in
Eq. (3.9), using the weights defined in Eqs. (3.10) and (3.11). Any significance level could be
used for this step. Then, solve the new minimization problem stated in Eq. (3.15), with the
weights given by Eqs. (3.16) and (3.17) (the set of parameters obtained under the normal
assumption can be used as a starting point for the new optimization), and estimate $\kappa$ from
the linear fit to the corresponding quantiles of the random model considered (assuming a
zero intercept).
7. For testing heteroscedasticity, first determine the coefficient of variation of the
standard error model obtained, using Eq. (4.3), and compare it with the coefficient of
variation of the $\chi$ distribution, which can be approximated by:
$$CV_\chi(n-p) \approx \frac{1}{\sqrt{2(n-p)}}$$
(6.1)
The data can then be considered homoscedastic if:
$$CV_{\sigma_\varepsilon}(\mathbf{x}) \leq CV_\chi(n-p)$$
(6.2)
This report provides data, information and conclusions obtained by the author(s) as a result of original
scientific research, based on the best scientific knowledge available to the author(s). The main purpose
of this publication is the open sharing of scientific knowledge. Any mistake, omission, error or inaccuracy
published, if any, is completely unintentional.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-
for-profit sectors.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC
4.0). Anyone is free to share (copy and redistribute the material in any medium or format) or adapt
(remix, transform, and build upon the material) this work under the following terms:
Attribution: Appropriate credit must be given, providing a link to the license, and indicating if
changes were made. This can be done in any reasonable manner, but not in any way that
suggests endorsement by the licensor.
NonCommercial: This material may not be used for commercial purposes.
References
[1] Frost, J. (2019). Regression Analysis. An Intuitive Guide for Using and Interpreting Linear Models.
Statistics by Jim Publishing. https://statisticsbyjim.com/regression/regression-analysis-intuitive-
guide/.
[2] Hernandez, H. (2018). Multidimensional Randomness, Standard Random Variables and Variance
Algebra. ForsChem Research Reports, 3, 2018-02, 1-35. doi: 10.13140/RG.2.2.11902.48966.
[3] Strutz, T. (2011). Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares
and Beyond. 1st Edition. Vieweg+Teubner, Wiesbaden (Germany). Chapter 3: Weights and Outliers,
pp. 47-104. https://link.springer.com/book/9783658114558.
[4] Hernandez, H. (2021). Testing for Normality: What is the Best Method? ForsChem Research
Reports, 6, 2021-05, 1-38. doi: 10.13140/RG.2.2.13926.14406.
[5] Hernandez, H. (2023). Probability Distribution and Bias of the Sample Standard Deviation.
ForsChem Research Reports, 8, 2023-02, 1 - 26. doi: 10.13140/RG.2.2.22144.51205.
[6] Zeng, Q. C., Zhang, E., & Tellinghuisen, J. (2008). Univariate calibration by reversed regression of
heteroscedastic data: A case study. Analyst, 133 (12), 1649-1655. doi: 10.1039/b808667b.
[7] Tamhane, A. C., & Logan, B. R. (2004). Finding the maximum safe dose level for heteroscedastic
data. Journal of Biopharmaceutical Statistics, 14 (4), 843-856. doi: 10.1081/BIP-200035413.
[8] Liu, Q., Yin, X., Languino, L. R., & Altieri, D. C. (2018). Evaluation of drug combination effect using a
bliss independence dose–response surface model. Statistics in Biopharmaceutical Research, 10 (2),
112-122. doi: 10.1080/19466315.2018.1437071.
[9] Goethals, P. L., & Cho, B. R. (2011). Solving the optimal process target problem using response
surface designs in heteroscedastic conditions. International Journal of Production Research, 49
(12), 3455-3478. doi: 10.1080/00207543.2010.484556.