Our Blog: Solving The Problem of Heteroscedasticity Through Weighted Regression
goal, one first needs to understand the factors affecting web traffic. The vast majority of small businesses try to increase website hits or visits via advertisements.
We took a look at small business website statistics and saw how important advertising is. Let us review the artificially generated data. The summary of the dataset is presented below.
    Company           Budget           Visits                 AdType
 1st Qu.: 250.8   1st Qu.: 299.8   1st Qu.:4228   Outdoor Ads      :199
 Mean   : 500.5   Mean   : 549.5   Mean   :4554   Social Media Ads :187
 3rd Qu.: 750.2   3rd Qu.: 799.2   3rd Qu.:4799   Video Ads        :204
The data consists of 4 variables and 1000 observations without any missing values. The variable Company shows the unique number of the company whose website is being examined, and the variable Visits is the number of website visits per week. The variables AdType and Budget show the main type of advertising done by the company and the average monthly amount spent on this advertising, respectively. There are five types of advertisement in the data: Radio and Podcasts, Direct Mail, Video Ads, Social Media Ads, and Outdoor Ads.
The left graph indicates that there is a positive correlation between the money spent on advertising and the number of website visits. The plot is colored by the variable AdType, and the result suggests that there is no interaction effect of the two explanatory variables on the popularity of the website. In general, website owners spend approximately equal amounts of money on the different types of advertisement, so there is roughly no multicollinearity between the explanatory variables. Based on the second graph, as the medians and spread of the data are approximately the same, we can claim that the way one chooses to increase the visibility of a website plays no significant role.
To understand the effect of advertising, let us consider the following multiple linear regression model:
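A fit of this kind can be sketched in R as follows. The names `web` and `model` follow the code used later in this post; the original dataset is not reproduced here, so the sketch simulates a stand-in with the same four columns (an assumption), and its numbers will not match the reported output.

```r
# Simulated stand-in for the blog's data (assumption: the real `web` data
# frame is not available, so these values are illustrative only).
set.seed(1)
n <- 1000
ad_levels <- c("Direct Mail", "Outdoor Ads", "Radio and Podcasts",
               "Social Media Ads", "Video Ads")
web <- data.frame(
  Company = seq_len(n),
  Budget  = runif(n, min = 50, max = 1050),
  AdType  = factor(sample(ad_levels, n, replace = TRUE), levels = ad_levels)
)
# Noise whose standard deviation grows with Budget (heteroscedastic by design).
web$Visits <- 4000 + web$Budget + rnorm(n, sd = 0.5 * web$Budget)

# Multiple linear regression of weekly visits on budget and ad type.
model <- lm(Visits ~ Budget + AdType, data = web)
coef(summary(model))
```

With five ad types, the first level (here Direct Mail) acts as the reference category, so the fit has one intercept, one Budget slope and four AdType contrasts.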
## Video Ads -10.368
## (28.737)
##
## Constant 3,995.437***
## (26.096)
##
## -----------------------------------------------
## Observations 1,000
## R2 0.506
## Adjusted R2 0.504
## Residual Std. Error 293.017 (df = 994)
## F Statistic 203.633*** (df = 5; 994)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
It is not surprising that the coefficients for the levels of the variable AdType are not significant, because the ad type has no effect on the response variable Visits. However, the coefficient for the variable Budget is statistically significant and positive (see the graph). So, the multiple regression analysis shows that a $100 increase in the amount of money spent on advertising raises the number of visitors by 102 on average. Thus, the number of visitors can be predicted based on the ad budget.
And yet, this is not a reliable result, since an important factor has been omitted. We will now briefly discuss the concept of heteroscedasticity, the causes and effects of nonconstant variance, and the ways of solving this problem.
THE PROBLEM
NONCONSTANT VARIANCE
One of the Gauss-Markov conditions states that the variance of the disturbance term should be the same in every observation. This assumption, however, is violated in many models, which results in heteroscedasticity. Mathematically, homoscedasticity and heteroscedasticity may be defined as:
Homoscedasticity: $\sigma^2_{\epsilon_i} = \sigma^2_{\epsilon}$, the same for all observations.
Heteroscedasticity: $\sigma^2_{\epsilon_i}$ is not the same for all observations.
See the visual demonstration of homoscedasticity and heteroscedasticity below:
The left picture illustrates homoscedasticity. Let us start with the first observation, where X has the value $X_1$. If there were no disturbance term in the model, the observation would be represented by the circle lying on the line $Y = \beta_1 + \beta_2 X$. The effect of the disturbance term is to shift the observation upwards or downwards vertically (downwards in the case of $X_1$). The potential distribution of the disturbance term, before the observation was generated, is shown by the normal distribution.
Although homoscedasticity is often taken for granted in regression analysis, the distribution of the disturbance term frequently differs across the observations in the sample. Suppose the variance of the distribution of the disturbance term rises as X increases (right picture). This does not mean that the disturbance term will necessarily have a particularly large (positive or negative) value in an observation where X is large, but it does mean that the a priori probability of having an erratic value will be relatively high.
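The two situations described above can be imitated with a small simulation. This is only a sketch on made-up data: the intercept, slope and noise levels are arbitrary assumptions chosen to make the contrast visible.

```r
set.seed(2)
n <- 1000
x <- runif(n, min = 1, max = 10)

# Homoscedastic: the disturbance has the same standard deviation everywhere.
y_homo <- 2 + 3 * x + rnorm(n, sd = 2)
# Heteroscedastic: the standard deviation of the disturbance grows with x.
y_hetero <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

# Standard deviation of the OLS residuals for the lower and upper half of x.
spread_by_half <- function(y) {
  res <- resid(lm(y ~ x))
  tapply(res, x > median(x), sd)
}
spread_homo   <- spread_by_half(y_homo)
spread_hetero <- spread_by_half(y_hetero)
rbind(homoscedastic = spread_homo, heteroscedastic = spread_hetero)
```

The column labelled TRUE holds the spread for the larger values of x; under heteroscedasticity of this kind it should be clearly larger than the FALSE column, while the homoscedastic row should show two similar values.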
The first graph of the relationship between the budget and visitors is a typical scatter diagram of heteroscedastic data: the dispersion tends to rise as X increases. It means that even though there is a positive relationship between the variables, beyond a certain point a large amount of money no longer implies a large number of visitors. In other words, one can spend huge sums without any guarantee of large traffic.
Heteroscedasticity is more likely to occur, for example, when
The values of the variables in the sample vary substantially across observations.
The response tends to diverge as the explanatory variable increases. For example, families with low incomes will spend relatively little on luxury goods, and the variations in expenditures across such families will be small. But for families with large incomes, the amount of discretionary income will be higher, so their spending on luxury goods can vary much more.
The model is misspecified (for example, using the response instead of the log of the response, or X instead of X^2), or important variables are omitted from the model.
Why does heteroscedasticity matter? As a matter of fact, the proof that the OLS regression coefficients are unbiased does not use this condition. So we can be sure that the coefficients are still unbiased.
The variances of the regression coefficients: if there is no heteroscedasticity, the OLS regression coefficients have the lowest variances of all the unbiased estimators that are linear functions of the observations of Y. If heteroscedasticity is present, the OLS estimators are inefficient because it is possible to find other estimators that have smaller variances and are still unbiased.
The estimators of the standard errors of the regression coefficients will be wrong and, as a consequence, the t-tests as well as the usual F tests will be invalid. It is quite likely that the standard errors will be underestimated, so the t statistics will be overestimated and you will have a misleading impression of the precision of your regression coefficients. You may be led to believe that a coefficient is significantly different from 0, at a given significance level, when, in fact, it is not.
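These consequences can be checked with a small Monte Carlo experiment. The sketch below rests on assumptions that are not part of the original analysis: a single regressor, a true slope of 2 and a disturbance standard deviation equal to x^2. It compares the actual sampling variability of the OLS slope with the standard error that OLS reports.

```r
set.seed(3)
n <- 200
x <- runif(n)          # the design is held fixed across replications
reps <- 1000

one_fit <- function() {
  # Disturbance standard deviation rises sharply with x (assumed pattern).
  y <- 1 + 2 * x + rnorm(n, sd = x^2)
  s <- coef(summary(lm(y ~ x)))
  c(estimate = s["x", "Estimate"], reported_se = s["x", "Std. Error"])
}
sims <- replicate(reps, one_fit())   # 2 x reps matrix

mean_slope  <- mean(sims["estimate", ])     # average estimate of the slope
actual_sd   <- sd(sims["estimate", ])       # true sampling variability
reported_se <- mean(sims["reported_se", ])  # what OLS reports on average
c(mean_slope = mean_slope, actual_sd = actual_sd, reported_se = reported_se)
```

Under this design the average estimate should stay close to the true slope, in line with unbiasedness, while the reported standard error should fall short of the actual variability. The direction and size of the gap depend on how the variance is related to x, so other designs can behave differently.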
HOW TO DETECT
Since there is no limit to the possible variety of heteroscedasticity, a large number of different statistical tests, appropriate for different circumstances, have been proposed to check whether heteroscedasticity is present. The list includes but is not limited to the following:
Despite the large number of available tests, we will opt for a simple technique to detect heteroscedasticity: looking at the residual plot of our model. We can diagnose heteroscedasticity by plotting the residuals against the predicted values of the response variable.
library(ggResidpanel)
# Residuals against predicted values, and residuals against observation index
resid_auxpanel(residuals = resid(model),
               predicted = fitted(model),
               plots = c("resid", "index"))
In our case we can conclude that as budget increases, the website visits tend to diverge.
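A visual impression like this can be backed by a formal test. As one possible illustration, the sketch below computes the studentized (Koenker) version of the Breusch-Pagan statistic by hand in base R: the squared OLS residuals are regressed on the explanatory variable, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The data are simulated (an assumption) and the model is reduced to a single regressor for brevity.

```r
set.seed(4)
n <- 500
Budget <- runif(n, 50, 1050)
Visits <- 4000 + Budget + rnorm(n, sd = 0.5 * Budget)   # simulated stand-in
fit <- lm(Visits ~ Budget)

# Auxiliary regression: do the squared residuals depend on the regressor?
e2  <- resid(fit)^2
aux <- lm(e2 ~ Budget)

lm_stat <- n * summary(aux)$r.squared                   # n * R^2
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)  # df = regressors in aux
c(statistic = lm_stat, p.value = p_value)
```

A small p-value speaks against constant variance. With several explanatory variables, the degrees of freedom equal the number of regressors in the auxiliary model.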
THE SOLUTION
The two most common strategies for dealing with the possibility of heteroscedasticity are heteroscedasticity-consistent standard errors (or robust errors), developed by White, and Weighted Least Squares.
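The first strategy is not pursued further in this post, but its idea fits in a few lines. The sketch below computes White's basic heteroscedasticity-consistent covariance matrix (often labelled HC0) by hand with base R matrix algebra, on simulated data with a single regressor (both are assumptions made for illustration). Dedicated packages offer refined variants.

```r
set.seed(5)
n <- 500
Budget <- runif(n, 50, 1050)
Visits <- 4000 + Budget + rnorm(n, sd = 0.5 * Budget)   # simulated stand-in
fit <- lm(Visits ~ Budget)

X <- model.matrix(fit)
e <- resid(fit)
bread <- solve(crossprod(X))            # (X'X)^{-1}
meat  <- crossprod(X * e)               # X' diag(e^2) X: row i of X scaled by e_i
vcov_hc0 <- bread %*% meat %*% bread    # sandwich estimator

cbind(conventional = sqrt(diag(vcov(fit))),
      robust_hc0   = sqrt(diag(vcov_hc0)))
```

The coefficient estimates are left unchanged; only the standard errors are replaced, which is why this approach corrects inference without improving efficiency.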
WLS
OLS does not discriminate between the quality of the observations, giving equal weight to each, irrespective of whether they are good or poor guides to the location of the line. Thus, it may be concluded that if we can find a way of assigning more weight to high-quality observations and less to the unreliable ones, we are likely to obtain a better fit. In other words, our estimators of $\beta_1$ and $\beta_2$ will be more efficient. WLS works by incorporating extra nonnegative constants (weights) associated with each data point into the fitting criterion. We shall see how to do this below.
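Suppose the true relationship is
$$Y_i = \beta_1 + \beta_2 X_i + \epsilon_i$$
and
$$\operatorname{var}(\epsilon_i) = \sigma^2_{\epsilon_i}.$$
So we have a heteroscedastic model. We could eliminate the heteroscedasticity by dividing each observation by its value of $\sigma_{\epsilon_i}$. The model becomes
$$\frac{Y_i}{\sigma_{\epsilon_i}} = \beta_1 \frac{1}{\sigma_{\epsilon_i}} + \beta_2 \frac{X_i}{\sigma_{\epsilon_i}} + \frac{\epsilon_i}{\sigma_{\epsilon_i}}.$$
The disturbance term $\epsilon_i / \sigma_{\epsilon_i}$ is homoscedastic because
$$E\left[\left(\frac{\epsilon_i}{\sigma_{\epsilon_i}}\right)^2\right] = \frac{1}{\sigma^2_{\epsilon_i}} E(\epsilon_i^2) = \frac{1}{\sigma^2_{\epsilon_i}} \sigma^2_{\epsilon_i} = 1.$$
Therefore, every observation will have a disturbance term drawn from a distribution with population variance 1, and the model will be homoscedastic. By rewriting the model, we will have
$$Y_i' = \beta_1 h_i + \beta_2 X_i' + \epsilon_i',$$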
where $Y_i' = \frac{Y_i}{\sigma_{\epsilon_i}}$, $h_i = \frac{1}{\sigma_{\epsilon_i}}$, $X_i' = \frac{X_i}{\sigma_{\epsilon_i}}$, $\epsilon_i' = \frac{\epsilon_i}{\sigma_{\epsilon_i}}$.
Note that there should not be a constant term in the equation. By regressing $Y'$ on $h$ and $X'$, we will obtain efficient estimates of $\beta_1$ and $\beta_2$ with unbiased standard errors. The general solution to this is
$$\hat{\beta} = (X^T W X)^{-1} (X^T W Y),$$
where $W$ is the diagonal matrix with diagonal entries equal to the weights and $\operatorname{Var}(\epsilon) = W^{-1} \sigma^2$.
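The equivalence of these formulations can be verified numerically. The sketch below uses simulated data with a known disturbance standard deviation (an assumption made purely for illustration) and computes the WLS coefficients in three ways: with the weights argument of lm(), with the closed-form matrix expression, and by OLS on the transformed variables without a constant term.

```r
set.seed(6)
n <- 300
x <- runif(n, 1, 10)
sigma <- 0.5 * x                       # assumed known disturbance sd
y <- 2 + 3 * x + rnorm(n, sd = sigma)
w <- 1 / sigma^2                       # weights = inverse variances

# (a) lm() with the weights argument
b_lm <- coef(lm(y ~ x, weights = w))

# (b) closed form (X'WX)^{-1} X'Wy, without building the n x n matrix W
X <- cbind(1, x)
b_closed <- drop(solve(crossprod(X, w * X), crossprod(X, w * y)))

# (c) OLS on the transformed variables, with no constant term
y_t <- y / sigma
h   <- 1 / sigma
x_t <- x / sigma
b_trans <- coef(lm(y_t ~ 0 + h + x_t))

rbind(b_lm, b_closed, b_trans)         # the three rows should agree
```

Multiplying the rows of X by w instead of forming the full diagonal matrix keeps the computation cheap for large n.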
In some cases, the values of the weights may be based on theory or prior research. In our model, the standard deviations tend to increase as the value of Budget increases, so the weights should decrease as the value of Budget increases; in this sense the pattern of the weights is known. Where the weights are unknown, we can try different models and choose the best one based on, for instance, the distribution of the error term. The following situations and weights are common:
When the i-th value of y is an average of $n_i$ observations, $\operatorname{var}(y_i) = \frac{\sigma^2}{n_i}$, and thus we set $w_i = n_i$ (this situation often occurs in cluster surveys).
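The case of averaged responses can be illustrated with simulated data (group sizes, coefficients and noise level below are arbitrary assumptions). When every unit in a group shares the same x, a regression of the group means with weights n_i reproduces the coefficients of the regression on the individual observations.

```r
set.seed(7)
groups <- 40
n_i <- sample(2:30, groups, replace = TRUE)   # observations per group
x_g <- runif(groups, 0, 10)                   # one x value per group

# Individual-level data: every unit in a group shares the group's x.
g <- rep(seq_len(groups), times = n_i)
x <- x_g[g]
y <- 1 + 2 * x + rnorm(length(x), sd = 3)

# Aggregated data: group means, whose variance is sigma^2 / n_i.
y_bar <- as.vector(tapply(y, g, mean))

fit_individual <- lm(y ~ x)
fit_means      <- lm(y_bar ~ x_g, weights = n_i)
rbind(individual = coef(fit_individual), group_means = coef(fit_means))
```

The agreement is exact rather than approximate, because the individual-level sum of squares splits into a within-group part that does not involve the coefficients and a between-group part weighted by n_i.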
If the structure of the weights is unknown, we have to perform a two-stage estimation procedure. We first estimate an ordinary least squares regression; the i-th squared residual then serves as an estimate of $\sigma_i^2$, and the i-th absolute residual as an estimate of the standard deviation $\sigma_i$ (useful in the case of outliers). Thus, we can have different weights depending on $\sigma_i^2$. Often the weights are determined by fitted values rather than by the independent variable. Let us show these different models in the statistical package R. Fortunately, the R function lm(), which is used to perform ordinary least squares, provides the argument weights to perform WLS. By default the value of weights in lm() is NULL and ordinary least squares is used; otherwise weighted least squares is used with weights weights, minimizing the sum of $w \cdot e^2$.
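One common variant of the two-stage procedure smooths the residuals before turning them into weights instead of using each raw residual directly. The sketch below is such a variant, not the procedure used for the models in this post: it assumes that the disturbance standard deviation is proportional to Budget, works on simulated data, and drops AdType for brevity.

```r
set.seed(8)
n <- 500
Budget <- runif(n, 50, 1050)
Visits <- 4000 + Budget + rnorm(n, sd = 0.5 * Budget)   # simulated stand-in

# Stage 1: ordinary least squares, keeping the residuals.
ols <- lm(Visits ~ Budget)
abs_res <- abs(resid(ols))

# Stage 2: model the absolute residuals to estimate sigma_i
# (assumption: the disturbance sd is proportional to Budget).
sd_fit <- lm(abs_res ~ 0 + Budget)
sigma_hat <- fitted(sd_fit)

# Refit with weights equal to the inverse of the estimated variances.
wls <- lm(Visits ~ Budget, weights = 1 / sigma_hat^2)
rbind(ols = coef(summary(ols))["Budget", c("Estimate", "Std. Error")],
      wls = coef(summary(wls))["Budget", c("Estimate", "Std. Error")])
```

Only the relative size of the weights matters, so the fact that absolute residuals estimate the standard deviation up to a constant factor does not affect the coefficients. Under the proportionality assumption the estimated weights are proportional to 1/Budget^2.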
Suppose we do not know the pattern of the weights, and we want to fit the models with the following weights: $w_i = \frac{1}{x_i}$, $w_i = \frac{1}{x_i^2}$, $w_i = \frac{1}{\hat{y}_i}$, $w_i = \frac{1}{\hat{y}_i^2}$, $w_i = \frac{1}{\sigma_i^2}$, $w_i = \frac{1}{|\sigma_i|}$.
wols1 <- lm(Visits ~ Budget + AdType, data = web, weights = 1/Budget)
wols2 <- lm(Visits ~ Budget + AdType, data = web, weights = 1/Budget^2)
wols3 <- lm(Visits ~ Budget + AdType, data = web, weights = 1/fitted(model))
wols4 <- lm(Visits ~ Budget + AdType, data = web, weights = 1/fitted(model)^2)
wols5 <- lm(Visits ~ Budget + AdType, data = web, weights = 1/resid(model)^2)
wols6 <- lm(Visits ~ Budget + AdType, data = web, weights = 1/abs(resid(model)))
The results of the fitted models are:
##
## WOLS Results
## ===========================================================================================================================
## Dependent variable:
## --------------------------------------------------------------------------------------------
## Visits
## - 1/Budget 1/Budget^2 1/y 1/y^2 1/e^2 1/|e|
## (1) (2) (3) (4) (5) (6) (7)
## ---------------------------------------------------------------------------------------------------------------------------
## Budget 1.017*** 1.014*** 1.018*** 1.015*** 1.014*** 1.018*** 1.014***
## (0.032) (0.024) (0.022) (0.031) (0.031) (0.001) (0.008)
##
## Ad Type: Outdoor Ads 17.623 9.016 1.778 17.291 16.927 18.380*** 16.810**
## (28.957) (19.540) (10.354) (28.251) (27.531) (1.405) (8.426)
##
## Ad Type: Radio and Podcasts 31.784 15.184 1.457 30.884 29.894 31.647*** 28.276***
## (29.003) (19.823) (10.732) (28.302) (27.591) (1.562) (9.309)
##
## Ad Type: Social Media Ads -40.288 -10.390 -0.402 -36.504 -32.869 -39.380*** -36.515***
## (29.366) (19.315) (10.069) (28.470) (27.571) (1.498) (9.223)
##
## Ad Type: Video Ads -10.368 3.876 11.703 -7.915 -5.622 -8.910*** -8.182
## (28.737) (20.532) (12.335) (27.977) (27.217) (1.597) (9.493)
##
## Constant 3,995.437*** 3,993.525*** 3,992.827*** 3,995.256*** 3,995.106*** 3,994.459*** 3,996.948***
## (26.096) (14.600) (7.216) (24.978) (23.908) (1.388) (7.472)
##
## ---------------------------------------------------------------------------------------------------------------------------
## Observations 1,000 1,000 1,000 1,000 1,000 1,000 1,000
## R2 0.506 0.645 0.691 0.517 0.528 1.000 0.940
## Adjusted R2 0.504 0.644 0.689 0.515 0.526 1.000 0.939
## Residual Std. Error (df = 994) 293.017 11.263 0.492 4.242 0.061 1.000 14.521
## F Statistic (df = 5; 994) 203.633*** 361.792*** 444.545*** 213.209*** 222.603*** 585,907.100*** 3,091.199***
## ===========================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Weighted least squares estimates of the coefficients will usually be nearly the same as the "ordinary" unweighted estimates. Among the models whose weights are based on the explanatory variable, weights = 1/Budget^2 produces the smallest standard errors. The summary of the models shows that the fitted equations are again highly similar. Overall, the smallest standard errors are produced by the model with weights = 1/resid(model)^2.
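A comparison of this kind can also be assembled directly from the fitted objects. The sketch below is an illustration on simulated data (an assumption), with AdType dropped for brevity and hypothetical list names; it collects the Budget coefficient and its standard error from several weighted fits into one small table.

```r
set.seed(9)
n <- 1000
web <- data.frame(Budget = runif(n, 50, 1050))          # simulated stand-in
web$Visits <- 4000 + web$Budget + rnorm(n, sd = 0.5 * web$Budget)

model <- lm(Visits ~ Budget, data = web)
fits <- list(
  ols         = model,
  inv_budget  = lm(Visits ~ Budget, data = web, weights = 1 / Budget),
  inv_budget2 = lm(Visits ~ Budget, data = web, weights = 1 / Budget^2),
  inv_fitted2 = lm(Visits ~ Budget, data = web, weights = 1 / fitted(model)^2)
)

# One row per model: slope estimate and its reported standard error.
se_table <- t(sapply(fits, function(f) {
  coef(summary(f))["Budget", c("Estimate", "Std. Error")]
}))
se_table
```

Because the simulated standard deviation is proportional to Budget, the weights 1/Budget^2 match the true variance pattern here, so that row should show a smaller standard error than the unweighted fit.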
However, since we know the pattern of the weights, let us examine the residual plots for the first two weighted LS models.
Apparently, the residuals still show nonconstant variance. The issue is that the plots above use unweighted residuals, whereas with weighted least squares we need to use weighted residuals to evaluate the suitability of the model, since these take into account the weights, which change the variance. The usual residuals fail to do this and will maintain the same nonconstant variance pattern regardless of the weights used in the analysis.
# Weighted residuals: each residual is multiplied by the square root of its weight
resid_auxpanel(residuals = sqrt(1/web$Budget^2)*resid(wols2),
               predicted = fitted(wols2),
               plots = c("resid", "index"))
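The weighted residuals used above can also be obtained with the base R helper weighted.residuals(). The sketch below checks this on simulated data (an assumption, with AdType dropped for brevity) and compares the spread of raw and weighted residuals in the lower and upper half of Budget. It avoids plotting so that it runs with base R alone.

```r
set.seed(10)
n <- 500
web <- data.frame(Budget = runif(n, 50, 1050))          # simulated stand-in
web$Visits <- 4000 + web$Budget + rnorm(n, sd = 0.5 * web$Budget)
wols2 <- lm(Visits ~ Budget, data = web, weights = 1 / Budget^2)

# Weighted residuals by hand: sqrt(w_i) * e_i ...
manual <- sqrt(weights(wols2)) * resid(wols2)
# ... and via the base R helper.
builtin <- weighted.residuals(wols2)
same <- isTRUE(all.equal(unname(manual), unname(builtin)))

# Spread of raw vs weighted residuals for low and high budgets.
upper <- web$Budget > median(web$Budget)
spread <- rbind(raw      = tapply(resid(wols2), upper, sd),
                weighted = tapply(manual, upper, sd))
spread
```

When the weights match the variance pattern, the weighted row should show similar values in both columns, while the raw row keeps the original fan shape.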
It seems that the second WLS model, with the weights $w_i = \frac{1}{x_i^2}$, is adequate, because the variability of the residuals is the same for all predicted values.
We can now be more confident in the results and state that with every $100 increase in the amount of money spent on advertising the number of website visitors will rise by, on average, 102. The absence of heteroscedasticity and the fact that the standard error of the coefficient is smaller than in the original model allow us to make predictions with a higher level of certainty.
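As a rough worked check of this statement, take the Budget coefficient (1.018) and its standard error (0.022) reported for the 1/Budget^2 model in the results table, and assume an approximately normal sampling distribution (an assumption added here, not a result from the original analysis):

```latex
\begin{aligned}
\Delta \widehat{\text{Visits}} &= 100 \times \hat{\beta}_{\text{Budget}} = 100 \times 1.018 \approx 102, \\
\text{95\% CI for } 100\,\beta_{\text{Budget}} &\approx 100 \times (1.018 \pm 1.96 \times 0.022) \approx (97.5,\ 106.1).
\end{aligned}
```

The same calculation with the OLS standard error of 0.032 gives a wider interval, which is the sense in which the weighted fit is more precise.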
CONCLUSION
Overall, weighted least squares is a popular method of solving the problem of heteroscedasticity in regression models, and it is a special case of the more general concept of generalized least squares. WLS implementation in R is quite simple because lm() has a dedicated argument for weights. Weights can also be estimated directly from sample variances of the response variable at each combination of predictor variables. WLS can sometimes be used where different observations have been measured with instruments of differing importance or accuracy, and where weights are used to take these circumstances into account.
The disadvantage of weighted least squares is that the theory behind this method is based on the assumption that the exact weights are known. In practice, however, it can be quite difficult to determine the weights or estimates of the error variances. Note that WLS is neither the only nor necessarily the best method of addressing the issue of heteroscedasticity. The alternative methods include estimating heteroscedasticity-consistent standard errors and other types of WLS (e.g. iteratively reweighted least squares).
©2018 Datamotus. All rights reserved.