CHAPTER 3
LITERATURE SURVEY
We have studied several papers related to our research work.
In 2015 Michael Olusegun Akinwande et al proposed “Variance Inflation Factor: As a Condition for the Inclusion of Suppressor Variable(s) in Regression Analysis”. They reviewed several studies that treat the concept and types of suppressor variables. They highlighted systematic ways to identify suppression effects in multiple regression using statistics such as R², sums of squares, and regression weights, and by comparing zero-order correlations with the Variance Inflation Factor (VIF). They also established that the suppression effect is a function of multicollinearity; however, a suppressor variable should only be allowed into a regression analysis if its VIF is less than five. To avoid underestimating the regression coefficient of a particular independent variable, it is important to understand the nature of its relationship with the other independent variables. The concept of suppression provokes researchers to think about the presence of outcome-irrelevant variation in an independent variable that may mask that variable’s genuine relationship with the outcome variable. Only when a predictor variable that is uncorrelated with the other predictors is included in a multiple regression will the regression weights of the other predictor variables remain stable and not change [1].
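The VIF screening rule described above can be illustrated with a short Python sketch. The simulated data, variable names and helper function below are illustrative assumptions, not the authors’ material; VIF_j is computed as 1 / (1 − R²_j), where R²_j comes from regressing the j-th predictor on the remaining predictors.

import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j of X on the remaining columns (with an intercept)."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        y_j = X[:, j]
        X_rest = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(X_rest, y_j, rcond=None)
        resid = y_j - X_rest @ beta
        r2 = 1.0 - resid.var() / y_j.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=200)   # correlated with x1
x3 = rng.normal(size=200)                         # roughly independent
X = np.column_stack([x1, x2, x3])

for name, v in zip(["x1", "x2", "x3"], vif(X)):
    print(f"{name}: VIF = {v:.2f}", "-> acceptable as suppressor" if v < 5 else "-> exclude")

In this simulated example the correlated predictors have VIFs above one but still below the cut-off of five used in the paper, so they would not be excluded by the rule.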
In 2016 Sudhanshu K. Mishra et al proposed “Shapley value regression and the resolution of multicollinearity”. Multicollinearity in empirical data violates the assumption of independence among the regressors in a linear regression model; by inflating the standard errors of the estimated regression coefficients, it often leads to failure in rejecting a false null hypothesis of ineffectiveness of a regressor on the regressand (a type II error), and it very frequently also assigns the wrong sign to coefficients. Shapley value regression is perhaps the best method to combat this problem. They made two contributions: first, simplifying the algorithm of the Shapley value decomposition of R² (allocating R² as fair shares to the individual regressor variables), and second, developing a Fortran computer program that executes it and also retrieves the regression coefficients from the Shapley value. Shapley value regression becomes increasingly impracticable as the number of regressor variables exceeds 10, although, in practice, a good regression model may not have more than ten regressors [4].
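The decomposition itself can be sketched directly in Python (the original work used a Fortran program; the code below is an illustrative re-implementation on simulated data, not the authors’ code). Each regressor’s share is the average of its marginal contributions to R² over all subsets of the other regressors, and the shares sum to the full-model R².

from itertools import combinations
from math import factorial
import numpy as np

def r_squared(X, y, cols):
    if not cols:
        return 0.0
    Z = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - resid.var() / y.var()

def shapley_r2(X, y):
    p = X.shape[1]
    shares = np.zeros(p)
    for j in range(p):
        others = [c for c in range(p) if c != j]
        for k in range(p):
            for S in combinations(others, k):
                w = factorial(k) * factorial(p - k - 1) / factorial(p)
                shares[j] += w * (r_squared(X, y, list(S) + [j]) - r_squared(X, y, list(S)))
    return shares  # the shares sum to the full-model R^2

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
X[:, 1] += 0.7 * X[:, 0]                       # induce collinearity between the first two regressors
y = 1.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=300)
print("Shapley shares of R^2:", shapley_r2(X, y))
print("Full-model R^2:       ", r_squared(X, y, [0, 1, 2]))

Because the number of subsets grows as 2^p, this direct enumeration quickly becomes impracticable once the number of regressors exceeds about ten, which is exactly the limitation noted above.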
In 2016 Hanan Duzan et al proposed “Solution to the Multicollinearity Problem by adding some
Constant to the Diagonal”. Ridge regression is an alternative to ordinary least-squares (OLS)
regression. It is believed to be superior to least-squares regression in the presence of
multicollinearity. They investigated the robustness of this method and compared it with the least squares method through simulation studies. They showed that the system stabilizes in a region of k, where k is a positive quantity less than one whose value depends on the degree of correlation between the independent variables. The results also illustrate that k is a linear
function of the correlation between the independent variables. The main goal of this study was
to identify the most relevant k value for ridge regression in a four-variable regression model. Since it is not possible to achieve this mathematically, a simulation study was conducted to study the behavior of ridge regression in such a case. It was assumed that both the form of the model and the nature of the errors were known, and the y values were generated using a set of predetermined parameter values. One thousand random data sets were used in each simulation study. The p-variable linear regression model was fitted by the least-squares method. Thereupon, in each simulation study, one thousand ridge regression estimates, together with the VIF and SSE, were computed for different values of k and Rj² [6].
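The “constant added to the diagonal” can be made concrete with a brief Python sketch on simulated data (an illustrative assumption, not the authors’ simulation design): the ridge estimator is β(k) = (X'X + kI)⁻¹X'y, and scanning k over (0, 1) shows the region in which the coefficients stabilize.

import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)           # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)

Xc = X - X.mean(axis=0)                            # centre the data so the intercept drops out
yc = y - y.mean()

for k in [0.0, 0.01, 0.1, 0.5, 0.9]:
    beta_k = np.linalg.solve(Xc.T @ Xc + k * np.eye(Xc.shape[1]), Xc.T @ yc)
    print(f"k = {k:4.2f}  coefficients = {np.round(beta_k, 3)}")

At k = 0 the estimates are the unstable OLS coefficients; as k grows the estimates settle into a much narrower range, which is the stabilization behavior the simulation study examined.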
A further reviewed study examined the factors that affect the unemployment rate in Iraq, using the ridge regression method as one of the most widely used methods for solving the multicollinearity problem. The results were compared with those obtained with the OLS method, in order to produce the best possible model that expresses the studied phenomenon. After applying indicators such as the condition number (CN) and the variance inflation factor (VIF) to detect the multicollinearity problem, and after using R packages for simulations and computations, the authors showed that in Iraq, as a developing Arab economy, unemployment appears to be significantly affected by investment, working-population size and inflation. The solution adopted in their research is the ridge regression model, which was tested for identifying the factors that could explain the unemployment rate in an Arab developing country, namely Iraq. The study showed that using the ridge regression method when the explanatory variables are affected by multicollinearity is one of the successful ways to solve this issue. Therefore, applying the ridge regression method in other studies is recommended, since it provides better estimators than the ordinary least squares method when the explanatory variables are related, without omitting any of the explanatory variables [8].
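The condition number used for detection in that study can be computed as the ratio of the largest to the smallest singular value of the standardized design matrix. The Python sketch below uses simulated data and hypothetical variable names (investment, working population, inflation) purely to mirror the description above; it is not the Iraqi data set.

import numpy as np

rng = np.random.default_rng(3)
n = 120
investment = rng.normal(size=n)
working_pop = 0.9 * investment + rng.normal(scale=0.2, size=n)  # hypothetical, strongly correlated
inflation = rng.normal(size=n)
X = np.column_stack([investment, working_pop, inflation])

Xs = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize before diagnosing
singular_values = np.linalg.svd(Xs, compute_uv=False)
cn = singular_values.max() / singular_values.min()
print(f"condition number = {cn:.1f}")              # values above ~30 are a common multicollinearity warning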
In 2017 Gary H. McClelland et al proposed “Multicollinearity is a red herring in the search for moderator variables”. Multicollinearity is like the red herring in a mystery novel that distracts the statistical detective from the pursuit of a true moderator relationship. They showed that multicollinearity is completely irrelevant for tests of moderator variables. They noted common errors and, more positively, described a variety of methods researchers might use to test and interpret their moderated multiple regression models, including two-stage testing, mean-centering, spotlighting, orthogonalizing, and floodlighting, without regard to putative issues of multicollinearity. They cited a number of recent studies in the psychological literature in which the researchers used these methods appropriately to test, interpret, and report their moderated multiple regression models. They concluded with a set of recommendations for the analysis and reporting of moderated multiple regression that should help researchers better understand their models and facilitate generalizations across studies. Researchers need not use mean-centering, the orthogonal transformation, or anything else to avoid the purported problems of multicollinearity.
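The claim that mean-centering is unnecessary for testing a moderator can be checked with a small Python sketch on simulated data (the variable names and coefficients are assumptions for illustration): the t-statistic of the interaction term is identical whether or not the predictors are centered, even though centering changes how strongly the product term correlates with its components.

import numpy as np

def t_stats(X, y):
    """OLS t-statistics for each column of X (X should include an intercept column)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta / se

rng = np.random.default_rng(4)
n = 500
x = rng.normal(loc=5.0, size=n)
z = rng.normal(loc=3.0, size=n)
y = 1.0 + 0.5 * x + 0.4 * z + 0.3 * x * z + rng.normal(size=n)

raw = np.column_stack([np.ones(n), x, z, x * z])            # uncentered predictors and product
xc, zc = x - x.mean(), z - z.mean()
centered = np.column_stack([np.ones(n), xc, zc, xc * zc])   # mean-centered predictors and product

print("t for interaction, raw      :", round(t_stats(raw, y)[3], 3))
print("t for interaction, centered :", round(t_stats(centered, y)[3], 3))
# The two t-values coincide: mean-centering does not change the moderator test.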
A further reviewed study showed that the ridge regression (RR) methodology performs better than the multiple linear regression (MLR) model in terms of the MSE criterion. The approach does not require any additional survey or extra crop cutting experiments (CCE) for crop production estimation at the district level. They demonstrated an application of small area estimation using the RR model. The least squares (LS) method is the oldest technique for estimating the parameters of a linear regression model under certain assumptions; when these assumptions are violated, the LS method does not assure desirable results. Multicollinearity is one such problem, and it occurs when the number of explanatory variables is relatively large in comparison with the sample size or when the variables are almost collinear. The ridge regression (RR) method is used to deal with this problem. Time series data on rice production, area under rice, irrigated area under rice and fertilizer consumption, pertaining to the period 1990-91 to 2002-03 for the Uttarakhand state of India, were taken from the Bulletin of Agricultural Statistics published by the Government of Uttarakhand, India [11].
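The MSE comparison underlying that conclusion can be sketched in Python. The data below are simulated stand-ins for collinear agricultural predictors (area, irrigated area, fertilizer), not the Uttarakhand series, and the ridge constant k = 0.5 is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(5)
n = 60
area = rng.normal(size=n)
irrigated = 0.95 * area + rng.normal(scale=0.1, size=n)     # nearly collinear with area
fertilizer = rng.normal(size=n)
X = np.column_stack([np.ones(n), area, irrigated, fertilizer])
y = 2.0 + 1.5 * area + 1.0 * irrigated + 0.5 * fertilizer + rng.normal(size=n)

train, test = slice(0, 40), slice(40, None)                 # simple hold-out split

def fit(X, y, k=0.0):
    penalty = k * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                                     # do not penalize the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

for label, k in [("OLS  ", 0.0), ("ridge", 0.5)]:
    beta = fit(X[train], y[train], k)
    mse = np.mean((y[test] - X[test] @ beta) ** 2)
    print(f"{label}  test MSE = {mse:.3f}")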
The observed effect was attributed to valuable players acting more cautiously in order not to receive any penalty. Overall, the study could be improved by a more careful collection of data [12].
In 2019 Jong Hae Kim et al proposed “Multicollinearity and misleading statistical results”. Multicollinearity represents a high degree of linear intercorrelation between the explanatory variables in a multiple regression model and leads to incorrect results of regression analyses. Diagnostic tools for multicollinearity include the variance inflation factor (VIF), the condition index and the condition number.
Another reviewed study addressed multicollinearity among independent variables that are significantly correlated not only with the dependent variable but also with each other. Multicollinearity can make some of the significant variables under study appear statistically insignificant. The authors discussed three primary techniques for detecting multicollinearity, using questionnaire survey data on customer satisfaction. The first two techniques are the correlation coefficients and the variance inflation factor, while the third is the eigenvalue method. It was observed that product attractiveness is a more rational cause of customer satisfaction than the other predictors. Furthermore, advanced regression procedures such as principal components regression, weighted regression and ridge regression can be used in the presence of multicollinearity. The relationships between customer satisfaction and the major factors product quality, brand experience, product feature, product attractiveness and product price were found to be statistically significant [19].
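The eigenvalue method mentioned above can be illustrated with a short Python sketch: small eigenvalues of the predictor correlation matrix (equivalently, a large condition index) indicate near-linear dependence among the predictors. The survey-style variable names and the simulated data below are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(6)
n = 250
quality = rng.normal(size=n)
brand = 0.85 * quality + rng.normal(scale=0.3, size=n)      # hypothetical correlated predictors
attractiveness = rng.normal(size=n)
price = rng.normal(size=n)
X = np.column_stack([quality, brand, attractiveness, price])

corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)
condition_index = np.sqrt(eigenvalues.max() / eigenvalues.min())
print("eigenvalues     :", np.round(eigenvalues, 3))
print("condition index :", round(condition_index, 2))       # values above ~30 flag a serious problem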
Another reviewed study applied principal component analysis to transform the correlated predictors into uncorrelated ‘components’ and then fitted a regression model using these components; the resulting model could explain 95 % of the variability [21].
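A minimal Python sketch of this principal components regression idea is given below, using simulated data (the 95 % variance threshold is taken from the summary above; everything else is an illustrative assumption): the standardized predictors are rotated onto their principal components, enough components are kept to explain about 95 % of the predictor variance, and the response is regressed on them.

import numpy as np

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=n)      # strongly correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 + 1.0 * x1 + 1.0 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]                  # largest-variance components first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / eigvals.sum()
m = int(np.searchsorted(explained, 0.95) + 1)      # keep enough components for ~95 % of the variance
components = Xs @ eigvecs[:, :m]                   # uncorrelated by construction

Z = np.column_stack([np.ones(n), components])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta
print(f"components kept = {m}, model R^2 = {1 - resid.var() / y.var():.3f}")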
Another reviewed study discussed the considerations that should guide decisions on keeping or omitting a variable in a model. When statistical modeling is used to pursue a predictive aim, two highly correlated independent variables will lead to high variance in the predictions, even if both variables are relevant for prediction. In small samples it may then be beneficial to omit one of the pair in order to decrease that variance, even if this incurs some new bias in the predictions [23].
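This bias-variance trade-off can be demonstrated with a small Python simulation (all sample sizes, coefficients and the prediction point are illustrative assumptions): with two nearly collinear but jointly relevant predictors, the full model predicts without bias but with high variance at points away from the collinear direction, while dropping one predictor shrinks the variance at the cost of a systematic bias.

import numpy as np

rng = np.random.default_rng(8)
n, reps = 25, 2000                                  # deliberately small samples, repeated many times
x_new = np.array([1.0, 1.0, 0.5])                   # prediction point (intercept, x1, x2), off the x1 = x2 line
preds_full, preds_reduced = [], []

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)        # nearly collinear, both relevant
    y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

    X_full = np.column_stack([np.ones(n), x1, x2])
    b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    preds_full.append(x_new @ b_full)

    X_red = np.column_stack([np.ones(n), x1])       # omit the second of the correlated pair
    b_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)
    preds_reduced.append(x_new[:2] @ b_red)

true_value = 1.0 + 1.0 * 1.0 + 1.0 * 0.5
print(f"true mean response           : {true_value:.3f}")
print(f"full model    mean, variance : {np.mean(preds_full):.3f}, {np.var(preds_full):.4f}")
print(f"reduced model mean, variance : {np.mean(preds_reduced):.3f}, {np.var(preds_reduced):.4f}")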
In 2022 Katrina I. Sundus, Bassam et al. proposed “Solving the multicollinearity problem to improve the stability of machine learning algorithms applied to a fully annotated breast cancer dataset”. In this study, the authors presented a novel, fully annotated national breast cancer dataset built from the cancer database registry of King Hussein Cancer Center, a medical center in Amman, Jordan, to predict recurrent breast cancer cases. Initially, the dataset had 35 attributes and 7562 instances of patients diagnosed with breast cancer between 2006 and 2021. Although this is still an ongoing project, research on breast cancer by the international community may benefit from the
JBRCA dataset. The dataset can be used in its current state to predict breast cancer and recurrent breast cancer cases. It can also be beneficial for developing new prediction tools. Recurrent breast cancer after initial treatment is prevalent, and prediction of possible recurrence is crucial for follow-up planning and treatment. They presented the JBRCA dataset construction stages and the challenges they encountered. They applied the CRISP-DM extension for the medical domain to design and construct the dataset. They experimented with the JBRCA dataset to solve many problems and issues related to the dataset’s construction during a one-year journey; data encoding, different data types, scaling, balancing and multicollinearity problems were among them. They provided solutions for the most frequently raised issues by applying the most common techniques used in DM and ML. They demonstrated that it is indispensable for a recurrent breast cancer prediction system to have an original, extensive, versatile, adequately classified and richly annotated reference dataset. At the end of the study, the JBRCA dataset has 20 attributes and 7562 instances.
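One common way such preprocessing pipelines deal with multicollinearity is to drop one feature of every highly correlated pair. The Python sketch below is a generic illustration with made-up column names; it is not the authors’ pipeline and does not use the JBRCA attributes.

import numpy as np
import pandas as pd

def drop_collinear(df, threshold=0.9):
    """Greedily drop columns so that no remaining pair has |correlation| > threshold."""
    corr = df.corr().abs()
    keep = list(df.columns)
    for i, a in enumerate(df.columns):
        for b in df.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > threshold:
                keep.remove(b)                      # keep the first column of the correlated pair
    return df[keep]

# Hypothetical numeric features standing in for dataset attributes.
rng = np.random.default_rng(9)
tumor_size = rng.normal(size=200)
df = pd.DataFrame({
    "tumor_size": tumor_size,
    "tumor_size_mm": tumor_size * 10 + rng.normal(scale=0.1, size=200),  # near-duplicate feature
    "age": rng.normal(loc=50, scale=10, size=200),
})
print(drop_collinear(df).columns.tolist())          # the near-duplicate column is removed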
In 2023 Amin Otoni Harefa, Yulisman Zega and Ratna Natalia proposed “The Application of the Least Squares Method to Multicollinear Data”. Regression analysis aims to determine whether there is a statistical dependence relationship between two kinds of variables, namely the predictor variables and the response variable. One of the methods for estimating multiple linear regression parameters is the least squares method. Careful and meticulous analysis, and the selection of appropriate techniques, are therefore required to overcome the multicollinearity problem and ensure accurate and meaningful regression analysis results. They presented a descriptive statistical table of the response and predictor variables, in which the average results are rounded. The regression equation obtained with the OLS method is Ŷ = 2.037 + 0.302X1 + 0.206X2 + 0.172X3 + 0.342X4. It is therefore important to use special techniques such as regularization or PCA to overcome the multicollinearity problem in the data before applying the least squares method; in this way, more stable and accurate regression coefficient estimates and a more reliable linear regression model can be obtained. Multicollinearity can cause problems in estimating accurate regression coefficients. The least squares method determines the coefficient values of a straight-line equation by minimizing the sum of the squared differences between the observed values and the values estimated by the mathematical model. However, when the data are subject to multicollinearity, these coefficient estimates become unstable.
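For completeness, a least squares fit that yields an equation of this form can be reproduced with a few lines of Python; the data and the true coefficients below are simulated assumptions, not the study’s data, so the fitted numbers will only roughly resemble the equation quoted above.

import numpy as np

rng = np.random.default_rng(10)
n = 150
X = rng.normal(size=(n, 4))
y = 2.0 + 0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.17 * X[:, 2] + 0.34 * X[:, 3] + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])               # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)       # minimizes the sum of squared residuals
b0, b1, b2, b3, b4 = beta
print(f"Y-hat = {b0:.3f} + {b1:.3f}*X1 + {b2:.3f}*X2 + {b3:.3f}*X3 + {b4:.3f}*X4")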