
CHAPTER 3
LITERATURE SURVEY

We have studied some of the papers related to our research work.
In 2015 Michael Olusegun Akinwande et al proposed “Variance Inflation Factor: As a Condition for the Inclusion of Suppressor Variable(s) in Regression Analysis”. They reviewed several studies that treat the concept and types of suppressor variables. They highlighted systematic ways to identify suppression effects in multiple regression using statistics such as R², sums of squares, regression weights, and comparisons of zero-order correlations with the Variance Inflation Factor (VIF). They also established that the suppression effect
is a function of multicollinearity; however, a suppressor variable should only be allowed in a
regression analysis if its VIF is less than five. To avoid underestimating the regression
coefficient of a particular independent variable, it is important to understand the nature of its
relationship with other independent variables. The concept of suppression provokes researchers
to think about the presence of outcome-irrelevant variation in an independent variable that may
mask that variable’s genuine relationship with the outcome variable. Only when a predictor
variable that is uncorrelated with other predictors is included in a multiple regression, will the
regression weight of other predictor variables remain stable and not change [1].
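To make the VIF rule above concrete, the following is a minimal sketch (with synthetic data and hypothetical variable names, not taken from the paper) of how the VIF of each predictor can be computed by regressing it on the remaining predictors:

    import numpy as np

    def vif(X):
        # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
        # column j of X on all of the other columns (plus an intercept).
        n, p = X.shape
        out = np.empty(p)
        for j in range(p):
            yj = X[:, j]
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(Z, yj, rcond=None)
            resid = yj - Z @ beta
            r2 = 1.0 - resid @ resid / np.sum((yj - yj.mean()) ** 2)
            out[j] = 1.0 / (1.0 - r2)
        return out

    # Synthetic example: x3 is nearly a linear combination of x1 and x2,
    # so the VIFs should exceed the commonly used cut-off of five.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = rng.normal(size=200)
    x3 = 0.8 * x1 + 0.6 * x2 + 0.05 * rng.normal(size=200)
    print(vif(np.column_stack([x1, x2, x3])))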

In 2015 Ahmad A. Suleiman et al proposed “Analysis of Multicollinearity in Multiple Regressions”. They concentrated on residual analysis to check the assumptions of a multiple linear regression model using graphical methods. Specifically, they plotted the residuals and standardized residuals from the model against the predicted values of the dependent variable, along with a normal probability plot, a histogram of the residuals, and a quantile plot of the residuals. They introduced the concept of multicollinearity to check whether the assumption of the linear regression model that there is no multicollinearity among the explanatory variables is satisfied. They gave an example that indicated the presence of multicollinearity in the regression model using statistical software [2].
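The residual diagnostics listed above can be reproduced in outline as follows; this is a minimal sketch on synthetic data using matplotlib and SciPy rather than the software used in the paper:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Fit a small OLS model on synthetic data and draw the residual plots
    # described above: residuals vs fitted, standardized residuals,
    # histogram of residuals, and a normal Q-Q (quantile) plot.
    rng = np.random.default_rng(1)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(scale=0.5, size=n)

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    std_resid = resid / np.sqrt(resid @ resid / (n - X.shape[1]))

    fig, ax = plt.subplots(2, 2, figsize=(9, 7))
    ax[0, 0].scatter(fitted, resid)
    ax[0, 0].set_title("Residuals vs fitted")
    ax[0, 1].scatter(fitted, std_resid)
    ax[0, 1].set_title("Standardized residuals vs fitted")
    ax[1, 0].hist(resid, bins=15)
    ax[1, 0].set_title("Histogram of residuals")
    probs = (np.arange(1, n + 1) - 0.5) / n
    ax[1, 1].scatter(stats.norm.ppf(probs), np.sort(std_resid))
    ax[1, 1].set_title("Normal Q-Q plot")
    plt.tight_layout()
    plt.show()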

In 2016 Kristina Vatcheva et al proposed “Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies”. They noted that a majority of researchers do not report multicollinearity diagnostics when analyzing data with regression models. Using simulated datasets and real-life data from the Cameron County Hispanic Cohort (CCHC), they demonstrated the adverse effects of multicollinearity in regression analysis and encouraged researchers to consider multicollinearity diagnostics as one of the major steps in the regression analysis process. Based on their simulations and the CCHC study, they recommended that, along with the bivariate correlation coefficients between the predictors and the VIFs, researchers should always examine the changes in the coefficient estimates, in their standard errors, and even in the VIFs. A VIF of less than 5 does not always indicate low multicollinearity. Caution must be taken when more than two predictors in the model have even weak pairwise correlations (r = 0.25), as they can result in a significant multicollinearity effect [3].
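In the spirit of that demonstration, the following small simulation (entirely synthetic, not the authors' design) shows how the standard errors of the OLS coefficients grow as the correlation between two predictors increases:

    import numpy as np

    def ols_se(X, y):
        # OLS coefficients and their standard errors from (X'X)^-1.
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        resid = y - X @ beta
        sigma2 = resid @ resid / (len(y) - X.shape[1])
        return beta, np.sqrt(np.diag(sigma2 * XtX_inv))

    rng = np.random.default_rng(42)
    n = 200
    for rho in (0.0, 0.5, 0.9, 0.99):
        cov = np.array([[1.0, rho], [rho, 1.0]])
        Z = rng.multivariate_normal(np.zeros(2), cov, size=n)
        X = np.column_stack([np.ones(n), Z])
        y = 1.0 + 2.0 * Z[:, 0] + 2.0 * Z[:, 1] + rng.normal(size=n)
        _, se = ols_se(X, y)
        print(f"rho={rho:4.2f}  SE(b1)={se[1]:.3f}  SE(b2)={se[2]:.3f}")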

In 2016 Sudhanshu K. Mishra et al proposed “Shapley value regression and the resolution of multicollinearity”. Multicollinearity in empirical data violates the assumption of independence among the regressors in a linear regression model and, by inflating the standard errors of the estimated regression coefficients, often leads to failure in rejecting a false null hypothesis of ineffectiveness of a regressor on the regressand (a Type II error). Very frequently, it also affects the sign of the regression coefficients. Shapley value regression is one of the best methods to combat this adversity in empirical analysis. They made two contributions: first, they simplified the algorithm of the Shapley value decomposition of R² (which allocates R² as fair shares to the individual regressor variables), and second, they developed a Fortran computer program that executes it and also retrieves regression coefficients from the Shapley values. Shapley value regression becomes increasingly impracticable as the number of regressor variables exceeds 10, although, in practice, a good regression model may not have more than ten regressors [4].
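A minimal Python sketch of the Shapley decomposition of R² is given below (the authors' actual program is in Fortran); the exhaustive enumeration over subsets of regressors is precisely why the method becomes impracticable beyond roughly ten regressors:

    import numpy as np
    from itertools import combinations
    from math import factorial

    def r_squared(X, y, cols):
        # R^2 of an OLS fit of y on an intercept plus the columns in `cols`.
        n = len(y)
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

    def shapley_r2(X, y):
        # Allocate the full-model R^2 to each predictor as its Shapley value:
        # the weighted average marginal contribution over all subsets S.
        p = X.shape[1]
        shares = np.zeros(p)
        for j in range(p):
            others = [k for k in range(p) if k != j]
            for size in range(p):
                for S in combinations(others, size):
                    w = factorial(size) * factorial(p - size - 1) / factorial(p)
                    shares[j] += w * (r_squared(X, y, list(S) + [j])
                                      - r_squared(X, y, list(S)))
        return shares

    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 3))
    X[:, 2] = 0.7 * X[:, 0] + 0.3 * rng.normal(size=300)   # induce collinearity
    y = 1.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=300)
    shares = shapley_r2(X, y)
    print(shares, "sum =", shares.sum(), "full R^2 =", r_squared(X, y, [0, 1, 2]))

The shares are non-negative in typical cases and always sum to the full-model R², which is the sense in which the decomposition is "fair".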

In 2016 Christopher Winship et al proposed “Multicollinearity and Model Misspecification”. Multicollinearity in linear regression is typically thought of as a problem of large standard
errors due to near-linear dependencies among independent variables. This problem can be
solved by more informative data, possibly in the form of a larger sample. The near collinearity
of independent variables can also increase the sensitivity of regression estimates to small errors
in model specification. They examined the classical assumption that independent
variables are uncorrelated with the errors. With collinearity, small deviations from this
assumption can lead to large changes in estimates. They presented a Bayesian estimator that
specifies a prior distribution for the covariance between the independent variables and the error
term. This estimator can be used to calculate confidence intervals that reflect sampling error and
uncertainty about the model specification. A Monte Carlo experiment indicates that the
Bayesian estimator has good frequentist properties in the presence of specification errors. They
illustrated the new method by estimating a model of the black–white gap in earnings. They
showed that parameter estimates are likely to be highly sensitive to model specification when
multicollinearity is present. They demonstrated how this methodology could be used in two
empirical situations in which multicollinearity is a potential problem [5].

In 2016 Hanan Duzan et al proposed “Solution to the Multicollinearity Problem by adding some
Constant to the Diagonal”. Ridge regression is an alternative to ordinary least-squares (OLS)
regression. It is believed to be superior to least-squares regression in the presence of
multicollinearity. The robustness of this method is investigated, and comparison is made with
the least squares method through simulation studies. They showed that the system stabilizes in a
region of k, where k is a positive quantity less than one and whose values depend on the degree
of correlation between the independent variables. The results also illustrate that k is a linear
function of the correlation between the independent variables. The main goal of this study was to identify the most relevant k value for ridge regression in a four-variable regression model.
Since it is not possible to achieve this mathematically, a simulation study was conducted to
study the behavior of ridge regression in such a case. It was assumed that both the form of the model and the nature of the errors were known, and the y values were generated using a set of predetermined parameter values. One thousand random data sets were used in each simulation study. The
p-variable linear regression model was fitted by the least-squares method. Thereupon, in each
simulation study, one thousand ridge regression estimates, together with the VIF and SSE values, were computed for different values of k and R²j [6].
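The mechanics of “adding some constant to the diagonal” can be sketched as follows; this is a minimal illustration on synthetic data in which the constant k is chosen by hand rather than by the simulation procedure used in the study:

    import numpy as np

    def ridge(X, y, k):
        # Ridge estimator: add k to the diagonal of X'X before solving.
        # Predictors are standardized and y is centered, the usual convention
        # when the shrinkage constant k is interpreted on a common scale.
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        yc = y - y.mean()
        p = Xs.shape[1]
        return np.linalg.solve(Xs.T @ Xs + k * np.eye(p), Xs.T @ yc)

    rng = np.random.default_rng(7)
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)          # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)

    for k in (0.0, 0.1, 0.5, 1.0):               # k = 0 reproduces OLS
        print(f"k={k:3.1f}  coefficients = {ridge(X, y, k)}")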

In 2017 Jamal I. Daoud et al proposed “Multicollinearity and Regression Analysis”. In regression analysis it is expected to have correlation between the response and the predictor(s), but correlation among the predictors is undesirable. Increased standard errors mean that the coefficients of some or all independent variables may be found not to be significantly different from 0. They focused on multicollinearity, its causes, and its consequences for the reliability
of the regression model. When two or more predictors are highly correlated, the relationship
between the independent variables and the dependent variables is distorted by the very strong
relationship between the independent variables, leading to the likelihood that our interpretation
of relationships will be incorrect. In the worst case, if the variables are perfectly correlated, the
regression cannot be computed. Multicollinearity is detected by examining the tolerance for
each independent variable. Tolerance is the amount of variability in one independent variable that is not explained by the other independent variables; it is in fact 1 - R². Tolerance values less than 0.10 indicate collinearity. If collinearity is discovered in the regression output, the interpretation of the relationships should be treated as unreliable until the issue is resolved. Multicollinearity
can be resolved by combining the highly correlated variables through principal component
analysis, or by omitting from the analysis a variable that is highly associated with the other variable(s). Multicollinearity is one of the serious problems that should be resolved before starting the process
of modeling the data[7].

In 2017 Bager, Ali et al proposed “Addressing multicollinearity in regression models: a ridge regression application”. They determined the most important macroeconomic factors which affect the unemployment rate in Iraq, using the ridge regression method as one of the most
widely used methods for solving the multicollinearity problem. The results are compared with
those obtained with the OLS method, in order to produce the best possible model that expresses
the studied phenomenon. After applying indicators such as the condition number (CN) and the
variance inflation factor (VIF) in order to detect the multicollinearity problem and after using R
packages for simulations and computations, they showed that in Iraq, as a developing Arab economy, unemployment seems to be significantly affected by investment, working population size, and inflation. The solution adopted in their research is the ridge regression model, which was tested for identifying the factors that could explain the unemployment rate in an
Arabic developing country, namely Iraq. The study showed that the use of the ridge regression
method in the cases when explanatory variables are affected by multicollinearity is one of the
successful ways to solve this issue. Therefore, applying the ridge regression method in other
studies is recommended, since it provides better estimators than the ordinary least square
method when the explanatory variables are related, without omitting any of the explanatory
variables[8].

In 2017 Gary H. McClelland et al proposed “Multicollinearity is a red herring in the search for
moderator variables”. Multicollinearity is like the red herring in a mystery novel that distracts
the statistical detective from the pursuit of a true moderator relationship. They showed
multicollinearity is completely irrelevant for tests of moderator variables. They noted those
errors, but, more positively, they described a variety of methods researchers might use to test and
interpret their moderated multiple regression models, including two-stage testing, mean-
centering, spotlighting, orthogonalizing, and floodlighting without regard to putative issues of
multicollinearity. They cited a number of recent studies in the psychological literature in which
the researchers used these methods appropriately to test, to interpret, and to report their
moderated multiple regression models. They concluded with a set of recommendations for the
analysis and reporting of moderated multiple regression that should help researchers better
understand their models and facilitate generalizations across studies. Researchers need not use mean-centering or the orthogonal transformation or do anything else to avoid the purported problems of multicollinearity; the only purpose of those transformations is to facilitate the understanding of MMR models [9].
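The point about mean-centering can be illustrated with a small sketch on synthetic data: centering the predictors before forming the product term reduces the correlation between the product and its components and changes the lower-order coefficients, but leaves the interaction coefficient itself untouched:

    import numpy as np

    def ols(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta

    rng = np.random.default_rng(11)
    n = 500
    x = rng.normal(loc=5.0, size=n)      # non-zero means make the contrast visible
    z = rng.normal(loc=3.0, size=n)
    y = 1.0 + 0.5 * x + 0.3 * z + 0.4 * x * z + rng.normal(size=n)

    # Raw (uncentered) moderated regression: y ~ 1 + x + z + x*z
    X_raw = np.column_stack([np.ones(n), x, z, x * z])
    # Mean-centered version of the same model
    xc, zc = x - x.mean(), z - z.mean()
    X_cen = np.column_stack([np.ones(n), xc, zc, xc * zc])

    b_raw, b_cen = ols(X_raw, y), ols(X_cen, y)
    print("raw      :", b_raw)   # the interaction coefficient (last entry) ...
    print("centered :", b_cen)   # ... is the same in both fits, up to rounding
    print("corr(x, x*z) raw vs centered:",
          np.corrcoef(x, x * z)[0, 1], np.corrcoef(xc, xc * zc)[0, 1])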

In 2017 Dr Manoj Kumar Mishra et al proposed “A Study of Multicollinearity in Estimation of Coefficients in Ridge Regression”. The traditional solution is to collect more data or to drop one
or more variables. Collecting more data may often be expensive or not practicable in many
situations and to drop one or more variables from the model to alleviate the problem of
multicollinearity may lead to the specification bias and hence the solution may be worse than
the disease in certain situations. One may be interested in squeezing out maximum information
from whatever data one has at one’s disposal. This has motivated the researchers to the
development of some very ingenious statistical methods namely ridge regression (RR), principal
component regression, partial least squares regression and generalized inverse regression. These
could fruitfully be applied to solve the problem of multicollinearity. This paper looks into RR
only to solve the problem of multicollinearity. Various methods are available in literature for
detection of multicollinearity such as examination of correlation matrix, Chi-Square test,
looking pattern of eigenvalues and others. The complete elimination of multicollinearity is not
possible but the degree of multicollinearity present in the data may be reduced. Several remedial
measures can be applied to tackle the problem of multicollinearity, but their study relates to RR only. They applied a test for multicollinearity by extracting the VIF values, and a multicollinearity problem was observed in their constructed model. The technique of RR was then used to deal with the multicollinearity in that model. Using the SYSTAT package, all coefficient values were estimated for suitable values of k, and the estimated model was discussed [10].

In 2018 Neeraj Tiwari et al proposed “Diagnostics of Multicollinearity in Multiple Regression Model for Small Area Estimation”. They discussed the multicollinearity problem in regression models for small area estimation and proposed a Ridge Regression Model (RRM) to deal with the
problem of multicollinearity. The proposed model has been empirically compared with the
existing Multiple Linear Regression (MLR) Model. Analysis of data obtained from a survey
carried out by the Directorate of Economics and Statistics, Uttarakhand, India revealed that the RR methodology performs better than the MLR model in terms of the MSE criterion. The
approach does not require any additional survey or conducting extra crop cutting experiments
(CCE) for crop production estimate at the district level. They demonstrated an application of
small area estimates by using the RR model. The least squares (LS) method is one of the oldest techniques for estimating the parameters of a linear regression model under certain assumptions. When these assumptions are violated, the LS method does not assure desirable results. The influence of multicollinearity is one of these problems; it occurs when the number of explanatory variables is relatively large in comparison to the sample size or when the variables are almost collinear.
The Ridge regression (RR) method is used to deal with this problem. Time series data on
production of rice, area under rice, irrigated area under rice and fertilizer consumption
pertaining to the period 1990-91 to 2002-03 for Uttarakhand state of India is taken from the
Bulletin of Agricultural statistics, published by the government of Uttarakhand, India[11].

In 2018 Yunus Kologlu et al proposed “A Multiple Linear Regression Approach For Estimating the Market Value of Football Players in Forward Position”. The market values of football players in forward positions were estimated using multiple linear regression, including physical and performance factors from the 2017-2018 season. Players from 4 major leagues of Europe were examined, and by applying the Breusch-Pagan test for homoscedasticity, a
reasonable regression model within 0.10 significance level is built, and the most and the least
affecting factors are explained in detail. They evaluated football players with several criteria in
different significance levels with consideration of multicollinearity. They managed to build a regression model at the 0.10 significance level with 52 attributes and a MAPE of 20%. Economically prosperous clubs gather the successful and valuable players, and mostly the young talents are valuable, such as Kylian Mbappé and Paulo Dybala. Heights between 180 and 184 cm proved to be the most successful, from which it can be assumed that taller players lack dribbling skills and shorter players lack control of the ball in the air. As another rule of thumb, the Premier League is accepted as the most challenging league in Europe, and it is no surprise that English players are more valuable. The only surprising finding in the study is that the number of cards did not affect the market value of the players, which can be explained by valuable players acting more cautiously so as not to receive penalties.
Overall, the study could be improved by a more reasonable collection of data [12].
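For reference, a minimal sketch of the Breusch-Pagan test mentioned above is given below (synthetic data, not the authors' dataset); it uses the common studentized (Koenker) form, in which the squared OLS residuals are regressed on the predictors and n·R² from that auxiliary regression is referred to a chi-square distribution:

    import numpy as np
    from scipy.stats import chi2

    def breusch_pagan(X, y):
        # Breusch-Pagan LM test for heteroscedasticity (Koenker form).
        # X must include an intercept column; returns (LM statistic, p-value).
        n, p = X.shape
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        u2 = resid ** 2
        # Auxiliary regression of the squared residuals on the same predictors.
        gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
        fitted = X @ gamma
        r2_aux = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
        lm = n * r2_aux
        return lm, chi2.sf(lm, df=p - 1)

    rng = np.random.default_rng(5)
    n = 300
    x = rng.uniform(1, 10, size=n)
    X = np.column_stack([np.ones(n), x])
    y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # error variance grows with x
    print(breusch_pagan(X, y))                      # small p-value: heteroscedastic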

In 2019 Alhassan Umar et al proposed “Detection of Collinearity Effects on Explanatory Variables and Error Terms in Multiple Regressions”. They investigated the effects and consequences of multicollinearity on both the standard errors and the explanatory variables in multiple regression. The correlations between X1 to X6 (the independent variables) were used to measure their individual effects and performance on Y (the response variable), and it was carefully observed how those explanatory variables are intercorrelated with one another and with the response variable. Many procedures are available in the literature for detecting the presence, degree, and severity of multicollinearity in multiple regression analysis; here they used correlation analysis to discover its presence, and variance inflation factors, tolerance levels, condition indices, and eigenvalues to assess the fluctuation and influence of the multicollinearity present in the model. Multicollinearity was discovered to be present in a severe proportion using an array of correlation analysis procedures; it affects the performance of the explanatory variables in the model by making them less independent and more redundant than they should be. Complete elimination of collinearity is not possible, but they reduced its degree of intensity to enhance the performance of the independent variables and the error term in the model. Multicollinearity is sometimes not problematic, especially if the aim of the analysis is to use multiple regression for prediction purposes; the predictions will be as accurate as they are supposed to be despite the presence of multicollinearity. The problem lies in checking the contribution of each individual independent variable [14].

In 2019 Katerina M. Marcoulides et al proposed “Evaluation of Variance Inflation Factors in Regression Models Using Latent Variable Modeling Methods”. A procedure that can be used to
evaluate the variance inflation factors and tolerance indices in linear regression models is
discussed. The method permits both point and interval estimation of these factors and indices
associated with explanatory variables considered for inclusion in a regression model. The
approach makes use of popular latent variable modeling software to obtain these point and
interval estimates. The procedure allows more informed evaluation of these quantities when addressing multicollinearity-related issues in empirical research using regression models. The method is illustrated on an empirical example using the popular software Mplus. Results of a
simulation study investigating the capabilities of the procedure are also presented. This article
discussed a procedure for interval estimation of VIFs and TIs in regression models. The method
is readily and widely applied using popular LVM software, such as Mplus. The approach
permits more informed decisions about potentially important near multicollinearity situations to
be sensed in empirical research. A particularly useful feature of the procedure lies in the fact
that it provides confidence intervals for the VIFs and TIs [15].

In 2019 N. A. M. R. Senaviratna et al proposed “Diagnosing Multicollinearity of Logistic Regression Model”. One of the key problems arising in a binary logistic regression model is that the explanatory variables being considered for the model are highly correlated
among themselves. Multicollinearity will cause unstable estimates and inaccurate variances that
affect confidence intervals and hypothesis tests. They discussed some diagnostic measurements
to detect multicollinearity namely tolerance, Variance Inflation Factor (VIF), condition index
and variance proportions. The adapted diagnostics are illustrated with data based on a study of
road accidents. The response variable is accident severity that consists of two levels particularly
grievous and non-grievous. Multicollinearity is identified by the correlation matrix, tolerance, and VIF values, and confirmed by the condition index and variance proportions. The range of solutions available for logistic regression includes increasing the sample size, dropping one of the correlated variables, and combining variables into an index. It is safely concluded that, without increasing the sample size, omitting one of the correlated variables can reduce multicollinearity considerably.
The problem of multicollinearity arises when one explanatory variable is an approximate linear function of other explanatory variables. The presence of multicollinearity leads to biased coefficient estimates and very large standard errors for the logistic regression coefficients [16].

In 2019 Jong Hae Kim et al proposed “Multicollinearity and misleading statistical results”.
Multicollinearity represents a high degree of linear inter correlation between explanatory
variables in a multiple regression model and leads to incorrect results of regression analyses.
Diagnostic tools for multicollinearity include the variance inflation factor, the condition index and condition number, and the variance decomposition proportion. The degree of multicollinearity can be expressed by the coefficient of determination (R²) of a multiple regression model that has one explanatory variable as its response variable and the remaining explanatory variables as its predictors. The variances of the regression coefficients constituting the final regression model are proportional to the corresponding VIFs; an increase in this R² increases the VIF, and a larger VIF produces unreliable probability values and confidence intervals of the regression coefficients. The square root of the ratio of the
maximum eigenvalue to each eigenvalue from the correlation matrix of standardized
explanatory variables is referred to as the condition index. The condition number is the
maximum condition index. Multicollinearity is present when the VIF is higher than 5 to 10 or
the condition indices are higher than 10 to 30[17].
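Following the definitions above, the condition indices and condition number can be computed from the correlation matrix of the standardized explanatory variables; a minimal sketch on synthetic data:

    import numpy as np

    def condition_indices(X):
        # Condition indices: sqrt(lambda_max / lambda_i) for the eigenvalues
        # of the correlation matrix of the standardized explanatory variables.
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        corr = np.corrcoef(Xs, rowvar=False)
        eigvals = np.linalg.eigvalsh(corr)
        return np.sqrt(eigvals.max() / eigvals)

    rng = np.random.default_rng(9)
    n = 200
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = 0.95 * x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
    ci = condition_indices(np.column_stack([x1, x2, x3]))
    print("condition indices:", np.round(ci, 2))
    print("condition number :", round(ci.max(), 2))  # above 10-30 flags multicollinearity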

In 2019 P. Sai Shankar et al proposed “Application of principal component regression analysis in agricultural studies”. They presented a combined principal component analysis-regression model to study the effects of independent variables, namely exposed branches, yield-bearing branches, total inflorescence, average fruits per cluster, and average females per cluster, on the total fruit count. Multiple linear regression was applied to develop a forecasting model for
total fruit count. Multicollinearity often causes a huge explanatory problem in multiple linear
regression analysis. In the presence of multicollinearity, the ordinary least squares (OLS) estimators are estimated inaccurately and the coefficients of the independent variables can appear with the wrong sign. They detected multicollinearity by observing the correlation matrix and the variance inflation factor (VIF). They revealed that the principal component regression method helps to solve the multicollinearity problem: the total fruit count (TFC) was measured with reference to the five independent variables, and principal component analysis resolved the multicollinearity by reducing the predictors to two principal components [18].
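A minimal sketch of principal component regression on synthetic data (not the authors' agricultural dataset): the predictors are standardized, projected onto the leading principal components, and ordinary least squares is then run on those component scores:

    import numpy as np

    def pcr_fit(X, y, n_components):
        # Principal component regression: OLS of y on the first principal
        # components of the standardized predictors.
        mu, sd = X.mean(axis=0), X.std(axis=0)
        Xs = (X - mu) / sd
        # Principal directions from the SVD of the standardized data.
        U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
        W = Vt[:n_components].T                 # loadings (p x n_components)
        T = Xs @ W                              # component scores
        Z = np.column_stack([np.ones(len(y)), T])
        gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return gamma, W, mu, sd

    def pcr_predict(Xnew, gamma, W, mu, sd):
        T = ((Xnew - mu) / sd) @ W
        return gamma[0] + T @ gamma[1:]

    rng = np.random.default_rng(13)
    n = 250
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)    # collinear pair
    x3 = rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])
    y = 2.0 + x1 + x2 + 0.5 * x3 + rng.normal(size=n)

    gamma, W, mu, sd = pcr_fit(X, y, n_components=2)
    print("component regression coefficients:", gamma)
    print("in-sample RMSE:",
          np.sqrt(np.mean((y - pcr_predict(X, gamma, W, mu, sd)) ** 2)))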

In 2020 Noora Shrestha et al proposed “Detecting Multicollinearity in Regression Analysis”. Multicollinearity occurs when the multiple linear regression analysis includes several variables that are significantly correlated not only with the dependent variable but also with each other.
Multicollinearity makes some of the significant variables under study appear statistically insignificant. They discussed three primary techniques for detecting multicollinearity using questionnaire survey data on customer satisfaction. The first two techniques are the correlation coefficients and the variance inflation factor, while the third is the eigenvalue method. It was observed that product attractiveness is a more rational cause of customer satisfaction than the other predictors. Furthermore, advanced regression procedures such as principal components regression, weighted regression, and the ridge regression method can be used in the presence of multicollinearity. The relationships between customer satisfaction and the major factors product quality, brand experience, product feature, product attractiveness, and product price were found to be significant [19].

In 2020 Solly Matshonisa et al proposed “Dealing with Multicollinearity in Regression Analysis: A Case in Psychology”. In regression analysis, the main interest is to predict the response variable using the explanatory variables by estimating the parameters of the linear model. In reality, the explanatory variables may share similar characteristics. This interdependency between the explanatory variables is called multicollinearity and causes parameter estimation in regression analysis to be unreliable. Different approaches to address the multicollinearity
problem in regression modelling include variable selection, principal component regression and
ridge regression. The performances of these techniques in handling multicollinearity in
simulated data were compared. Out of the four regression models compared, the principal component regression model produced the best model for explaining the variability, and its parameter estimates were precise while addressing multicollinearity. Multicollinearity is a common problem in research and statistical analyses. It is particularly significant when the aim is to predict a dependent variable from a set of independent variables that share some characteristics. The OLS model produced in this research had the highest R² among the models but imprecise regression coefficients. The advantage of the stepwise regression model is that it can produce many different models using variable selection. Three psychometric measures were examined, but Model 2 excluded self-esteem. PCR first removes the collinearity in the data and then fits a regression model using the uncorrelated "components". This model can explain 95% of the variability [21].

In 2021 Alhassan Umar Ahmad et al proposed “A Study of Multicollinearity Detection and Rectification under Missing Values”. They analyzed the consequences of missing observations on data-based multicollinearity. Different missing values have a
different effect on multicollinearity in the system of multiple regression models. Similarly, the
comparison was done to investigate each response of multicollinearity on each pattern of the
missing values with the same informatics data. They found that tolerance and variance inflation
factor fluctuates due to the missing of information from the sample analyzed at different
percentages of the missing values. It was observed that the more missing values there are in the sample, whether obtained from population statistics or from a survey, the more multicollinearity will be found in the system of multiple regression; as the amount of missingness increases, the tolerance level decreases drastically for both monotone and arbitrary missingness patterns. The analysis brings out categorically that no amount of missing values is too small to change the nature of the correlations, tolerances, and variance inflation factors, which in turn affects the linear relationships among the response and predictor variables and ends up producing severe multicollinearity [22].

In 2021 Mariella Gregorich et al proposed “Regression with Highly Correlated Predictors: Variable Omission Is Not the Solution”. Regression models have been in use for decades to
explore and quantify the association between a dependent response and several independent
variables in environmental sciences, epidemiology and public health. However, researchers
often encounter situations in which some independent variables exhibit high bivariate
correlation, or may even be collinear. They demonstrated, using two examples, how the diagnostic tools for collinearity and near-collinearity may fail in guiding the analyst in selecting the most appropriate way of handling collinearity. Instead, the most appropriate way of handling collinearity should be driven by the research question at hand and, in particular, by the distinction between predictive and explanatory aims. Hence, it is the aim of the analysis that should guide decisions on keeping or omitting a variable in a model. When statistical
modeling is used to pursue a predictive aim, two highly correlated independent variables will
lead to high variance in the predictions, even if both variables are relevant for prediction. In
small samples, it may then be beneficial to omit one of the pair in order to decrease that
variance, even if this incurs some new bias in the predictions[23].

In 2021 Ahmed Mohamed et al proposed “Using Different Methods to Overcome Modeling Problems”. There are many problems with the modeling process; the multicollinearity phenomenon is one of them, and it happens when there is high collinearity between the independent variables. It makes it hard to interpret the coefficients and reduces the power of the model. They tried to solve this
problem using two methods. The first one used the ridge Regression model (RRM). It is
compared with a traditional linear regression model (LRM). The second one modified the
original dataset by differencing and scaling processes. They supposed three cases of the independent variables to justify this purpose: independent, dependent, and combined linear cases. Multicollinearity occurs when the independent variables in a regression model are correlated, and this collinearity affects the fitted results. The presence of strong collinearity between the independent variables somewhat increases the VIF as well as the RP, yet strong collinearity between the independent variables does not by itself reflect the multicollinearity phenomenon. Finally, the second method for overcoming multicollinearity proved to be an effective way to eliminate the multicollinearity phenomenon and make the regression model perform well [24].

In 2022 Katrina I. Sundus, Bassam et al proposed “Solving the multicollinearity problem to improve the stability of machine learning algorithms applied to a fully annotated breast cancer dataset”. In this study, they presented a novel, fully annotated national breast cancer dataset built from the cancer database registry of King Hussein Cancer Center, a medical center in Amman, Jordan, to predict recurrent breast cancer cases. Initially, the dataset had 35 attributes and 7562 instances of patients diagnosed with breast cancer between 2006 and 2021. Although this is still an
ongoing project, research on breast cancer by the international community may benefit from the
JBRCA dataset. The dataset can be used in its current state to predict breast cancer and recurrent
breast cancer cases. It can also be beneficial for developing new prediction tools. Recurrent breast cancer after initial treatment is prevalent, and prediction of possible recurrence is crucial for follow-up planning and treatment. They presented the JBRCA dataset construction stages and the challenges they encountered, and applied the CRISP-DM extension for the medical domain methodology to design and construct the dataset. They experimented with the JBRCA dataset to solve many problems and issues related to the dataset's construction during a one-year journey; data encoding, different data types, scaling, balancing, and multicollinearity problems were among them. They provided solutions for the most frequently raised issues by applying the most common techniques used in DM and ML, and demonstrated that it is indispensable for a recurrent breast cancer prediction system to have an original, extensive, versatile, adequately classified, and richly annotated reference dataset. At the end of the study, the JBRCA dataset had 20
attributes and 7562 instances.

In 2023 Amin Otoni Harefa, Yulisman Zega, and Ratna Natalia proposed “The Application of the Least Squares Method to Multicollinear Data”. Regression analysis is an analysis that aims to determine whether there is a statistically dependent relationship between two variables, namely the predictor variable and the response variable. One of the methods for estimating multiple linear
regression parameters is the Least Squares Method. Therefore, careful and meticulous analysis
and selection of appropriate techniques are required to overcome the multicollinearity problem
and ensure accurate and meaningful regression analysis results. Descriptive statistical table of
response variables and predictor variables, where the average results are rounded. The
regression equation using the OLS method is as follows: 𝑌̂ = 2,037 + 0.302𝑋1 + 0.206𝑋2 +
0.172𝑋3 + 0.342𝑋4. Therefore, it is important to use special techniques such as regularization or
PCA to overcome the multicollinearity problem in the data before applying the least squares
method. Thus, we can obtain more stable and accurate regression coefficient estimates and a
more reliable linear regression model. Multicollinearity can cause problems in estimating
accurate regression coefficients. The least squares method is used to determine the coefficient
values in a straight line equation by minimizing the sum of the squared differences between the
observed values and the values estimated by the mathematical model. However, when the data are subject to multicollinearity, these coefficient estimates become unstable and unreliable.
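As a minimal sketch of the least squares method described above (synthetic data; the coefficients in the paper's equation are not reproduced), the coefficients can be obtained by solving the normal equations, and the quantity being minimized is the sum of squared errors:

    import numpy as np

    def least_squares(X, y):
        # Ordinary least squares via the normal equations (X'X) b = X'y.
        # If the predictors were collinear, X'X would be near-singular and
        # this solve would be unstable, which motivates the remedies above.
        Xd = np.column_stack([np.ones(len(y)), X])   # add intercept column
        beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
        resid = y - Xd @ beta
        return beta, resid @ resid                   # coefficients, SSE

    rng = np.random.default_rng(21)
    n = 150
    X = rng.normal(size=(n, 4))                      # four hypothetical predictors
    y = (2.0 + 0.3 * X[:, 0] + 0.2 * X[:, 1]
         + 0.2 * X[:, 2] + 0.3 * X[:, 3] + rng.normal(size=n))

    beta, sse = least_squares(X, y)
    print("intercept and coefficients:", np.round(beta, 3))
    print("sum of squared errors     :", round(sse, 3))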
