Identifying The Most Important Independent Variables in Regression Models - Statistics by Jim
By Jim Frost
In this blog post, I’ll help you determine which independent variable is the
most important while keeping these issues in mind. First, I’ll reveal
surprising statistics that are not related to importance. You don’t want to get
tripped up by them! Then, I’ll cover statistical and non-statistical approaches
for identifying the most important independent variables in your regression
model. I’ll also include an example regression model where we’ll try these
methods out.
Standardized coefficients
Fit the regression model using the standardized independent variables and
compare the standardized coefficients. Because they all use the same scale,
you can compare them directly. Standardized coefficients signify the mean
change of the dependent variable given a one standard deviation shift in an
independent variable, holding the other independent variables constant.
Key point: Identify the independent variable that has the largest absolute
value for its standardized coefficient.
Related post: Standardizing your variables can also help when your model
contains polynomials and interaction terms.
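The standardized-coefficient comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic, the variable names (temperature, pressure) and the helper standardized_coefficients are hypothetical, and NumPy's least-squares solver stands in for whatever statistical package you use. Only the independent variables are standardized, so each coefficient is the mean change in the dependent variable per one standard deviation shift.

```python
import numpy as np

def standardized_coefficients(X, y):
    """Fit OLS on z-scored predictors; return one coefficient per column of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Z-score each independent variable (sample standard deviation, ddof=1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Add an intercept column; the dependent variable keeps its original units.
    design = np.column_stack([np.ones(len(y)), Z])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1:]  # drop the intercept

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(200.0, 10.0, n)  # wide spread, small raw slope
pressure = rng.normal(50.0, 2.0, n)       # narrow spread, larger raw slope
y = 0.5 * temperature + 1.0 * pressure + rng.normal(0.0, 1.0, n)

names = ["temperature", "pressure"]
std_coefs = standardized_coefficients(np.column_stack([temperature, pressure]), y)

# Rank by absolute value of the standardized coefficient.
most_important = names[int(np.argmax(np.abs(std_coefs)))]
print(dict(zip(names, std_coefs)), most_important)
```

In this made-up data set, pressure has the larger raw slope, but temperature varies far more, so its standardized coefficient has the larger absolute value.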
Many statistical software packages include a very helpful analysis. They can
calculate the increase in R-squared when each variable is added to a model
that already contains all of the other variables. In other words, how much
does the R-squared increase for each variable when you add it to the model
last?
This analysis might not sound like much, but there’s more to it than is readily
apparent. When an independent variable is the last one entered into the
model, the associated change in R-squared represents the improvement in
the goodness-of-fit that is due solely to that last variable after all of the other
variables have been accounted for. In other words, it represents the unique
portion of the goodness-of-fit that is attributable only to each independent
variable.
Key point: Identify the independent variable that produces the largest
R-squared increase when it is the last variable added to the model.
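If your software does not report this statistic, one way to approximate it is to fit the full model, then refit it once per variable with that variable left out, and take the difference in R-squared. The sketch below assumes synthetic data and hypothetical helper names (r_squared, r_squared_increase_when_last); it is not the output of any particular package.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def r_squared_increase_when_last(X, y):
    """For each column, the R-squared gain from adding it after all the others."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    full = r_squared(X, y)
    increases = []
    for j in range(X.shape[1]):
        reduced = np.delete(X, j, axis=1)  # model without variable j
        increases.append(full - r_squared(reduced, y))
    return np.array(increases)

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(200.0, 10.0, n)
pressure = rng.normal(50.0, 2.0, n)
y = 0.5 * temperature + 1.0 * pressure + rng.normal(0.0, 1.0, n)

increases = r_squared_increase_when_last(np.column_stack([temperature, pressure]), y)
print(dict(zip(["temperature", "pressure"], increases)))
```

Because each reduced model is nested inside the full model, the increases are never negative. The variable with the largest increase accounts for the largest unique share of the goodness-of-fit.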
The statistical output displays the coded coefficients, which are the
standardized coefficients. Temperature has the standardized coefficient with
the largest absolute value. This measure suggests that Temperature is the
most important independent variable in the regression model.
The manner in which you obtain and measure your sample can bias these
statistics and throw off your assessment of importance.
When you collect a random sample, you can expect the sample variability of
the independent variable values to reflect the variability in the population.
Consequently, the change in R-squared values and standardized coefficients
should reflect the correct population values.
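A small simulation can make the sampling issue concrete. The sketch below rests on assumptions of my own choosing: a synthetic population with a known slope, and a comparison between a random sample and a sample restricted to a narrow range of the independent variable. The helper standardized_slope is hypothetical.

```python
import numpy as np

def standardized_slope(x, y):
    """OLS slope of y on z-scored x (simple regression with an intercept)."""
    z = (x - x.mean()) / x.std(ddof=1)
    design = np.column_stack([np.ones(len(y)), z])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1]

# Hypothetical population: y rises by 2 units per unit of x.
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 1.0, n)

# A random sample keeps the population's variability in x.
random_idx = rng.choice(n, size=500, replace=False)
# A restricted sample only observes x values near the middle of its range.
narrow_idx = np.flatnonzero(np.abs(x) < 0.5)

slope_random = standardized_slope(x[random_idx], y[random_idx])
slope_narrow = standardized_slope(x[narrow_idx], y[narrow_idx])
print(slope_random, slope_narrow)
```

The underlying relationship is identical in both samples, yet the restricted sample yields a much smaller standardized coefficient because the independent variable varies less within it.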
When the goal of your analysis is to change the mean of the dependent
variable, you must be sure that the relationships between the independent
variables and the dependent variable are causal rather than merely correlational.
If these relationships are not causal, then intentional changes in the
independent variables won’t cause the desired changes in the dependent
variable despite any statistical measures of importance.
For instance, when you want to affect the value of the dependent variable by
changing the independent variables, use your knowledge to identify the
variables that are easiest to change. Some variables can be difficult,
expensive, or even impossible to change.