Identifying The Most Important Independent Variables in Regression Models - Statistics by Jim

This document discusses different methods for identifying the most important independent variable in a regression model. It begins by explaining that neither regression coefficients nor p-values should be used to determine importance, since coefficients can vary based on variable scaling and p-values reflect statistical rather than practical significance. It recommends considering standardized coefficients and changes in R-squared value when variables are added last to identify potentially important variables. The document provides an example model and cautions that statistical measures don't determine practical importance, advising the use of subject knowledge.

Uploaded by

Hector Gutierrez
Copyright
© All Rights Reserved

Identifying the Most Important Independent Variables in Regression Models
By Jim Frost

You’ve settled on a regression model that contains independent variables that
are statistically significant. By interpreting the statistical results, you can
understand how changes in the independent variables are related to shifts in
the dependent variable. At this point, it’s natural to wonder, “Which
independent variable is the most important?”

Surprisingly, determining which variable is the most important is more
complicated than it first appears. For a start, you need to define what you
mean by “most important.” The definition should include details about your
subject area and your goals for the regression model. So, there is no
one-size-fits-all definition for the most important independent variable.
Furthermore, the methods you use to collect and measure your data can affect
the apparent importance of the independent variables.

In this blog post, I’ll help you determine which independent variable is the
most important while keeping these issues in mind. First, I’ll reveal two
statistics that, surprisingly, are not related to importance. You don’t want to get
tripped up by them! Then, I’ll cover statistical and non-statistical approaches
for identifying the most important independent variables in your regression
model. I’ll also include an example regression model where we’ll try these
methods out.

Related post: When Should I Use Regression Analysis?

Do Not Associate Regular Regression Coefficients with the Importance of Independent Variables
The regular regression coefficients that you see in your statistical output
describe the relationship between the independent variables and the
dependent variable. The coefficient value represents the mean change of the
dependent variable given a one-unit shift in an independent variable.
Consequently, you might think you can use the absolute sizes of
the coefficients to identify the most important variable. After all, a larger
coefficient signifies a greater change in the mean of the dependent variable.

However, the independent variables can have dramatically different types of
units, which make comparing the coefficients meaningless. For example,
the meaning of a one-unit change differs considerably when your variables
measure time, pressure, and temperature.

Additionally, a single type of measurement can use different units. For
example, you can measure weight in grams or kilograms. If you fit two
regression models using the same dataset, but use grams in one model and
kilograms in the other, the weight coefficient changes by a factor of a
thousand! Obviously, the importance of weight did not change at all even
though the coefficient changed substantially. The model’s goodness-of-fit
remains the same.
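
To see the unit effect concretely, here is a minimal sketch in Python that uses only the standard library. The data are simulated, and the names (`weight_g`, `response`) and the assumed relationship are made up for illustration; they are not the example dataset from this post.

```python
import random


def fit_simple(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical weights in grams and a made-up linear relationship plus noise.
weight_g = [rng.uniform(500, 5000) for _ in range(200)]
response = [3.0 + 0.002 * w + rng.gauss(0, 1) for w in weight_g]

# The same measurements expressed in kilograms.
weight_kg = [w / 1000 for w in weight_g]

slope_g, _, r2_g = fit_simple(weight_g, response)
slope_kg, _, r2_kg = fit_simple(weight_kg, response)

print(f"slope per gram:     {slope_g:.6f}")
print(f"slope per kilogram: {slope_kg:.6f}")
print(f"ratio of slopes:    {slope_kg / slope_g:.1f}")
print(f"R-squared (g, kg):  {r2_g:.6f}, {r2_kg:.6f}")
```

The two slopes differ by the unit conversion factor, while R-squared is the same for both fits, because rescaling a predictor changes nothing about how well the line fits.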

Key point: Larger coefficients don’t necessarily represent more important
independent variables.

Do Not Link P-values to Importance
You can’t use the coefficient to determine the importance of an independent
variable, but how about the variable’s p-value? Comparing p-values seems to
make sense because we use them to determine which variables to include in
the model. Do lower p-values represent more important variables?

Calculations for p-values include various properties of the variable, but
importance is not one of them. A very small p-value does not indicate that the
variable is important in a practical sense. An independent variable can have a
tiny p-value when it has a very precise estimate, low variability, or a large
sample size. The result is that effect sizes that are trivial in the practical sense
can still have very low p-values. Consequently, when assessing statistical
results, it’s important to determine whether an effect size is practically
significant in addition to being statistically significant.
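
A quick simulation can make this tangible. The sketch below is only an illustration under stated assumptions: the data are simulated, the effect size (`true_slope`) is one I picked to be negligible, and the p-value uses a normal approximation to the t distribution, which is reasonable only because the sample is very large. In practice, your statistical software reports the exact p-value.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 200_000             # deliberately large sample
true_slope = 0.02       # assumed effect size: practically negligible

x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0, 1) for xi in x]

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
syy = sum((yi - mean_y) ** 2 for yi in y)

slope = sxy / sxx
sse = syy - slope * sxy                   # residual sum of squares
std_err = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
t_stat = slope / std_err

# Two-sided p-value from the normal approximation (fine for very large n).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))
r_squared = 1 - sse / syy

print(f"estimated slope: {slope:.4f}")
print(f"t statistic:     {t_stat:.2f}")
print(f"p-value:         {p_value:.3g}")
print(f"R-squared:       {r_squared:.5f}")
```

With a sample this large, expect a very small p-value alongside an R-squared close to zero: the effect is statistically detectable but explains almost none of the variation in the dependent variable.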

Key point: Low p-values don’t necessarily represent independent variables
that are practically important.

Do Assess These Statistics to Identify Variables that Might Be Important
I showed how you can’t use several of the more notable statistics to determine
which independent variables are most important in a regression model. The
good news is that there are several statistics that you can use. Unfortunately,
they sometimes disagree because each one defines “most important”
differently.

Standardized coefficients

As I explained previously, you can’t compare the regular regression
coefficients because they use different scales. However, standardized
coefficients all use the same scale, which means you can compare them.

Statistical software calculates standardized regression coefficients by first
standardizing the observed values of each independent variable and then
fitting the model using the standardized independent variables.
Standardization involves subtracting the variable’s mean from each observed
value and then dividing by the variable’s standard deviation.

Fit the regression model using the standardized independent variables and
compare the resulting coefficients directly. Each standardized coefficient
signifies the mean change of the dependent variable given a one-standard-deviation
shift in an independent variable, holding the other variables constant.

Key point: Identify the independent variable that has the largest absolute
value for its standardized coefficient.
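
The standardization procedure described above can be sketched in a few lines of Python. This is a rough illustration, not the output of any particular statistical package: the data are simulated, the variable names (`temperature`, `pressure`, `duration`) and their assumed effects are hypothetical, and the small `solve` and `ols` helpers stand in for what regression software does for you.

```python
import random
import statistics


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(columns, y):
    """Fit y on the predictor columns plus an intercept; returns [intercept, b1, b2, ...]."""
    design = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(u, v)) for v in design] for u in design]
    xty = [sum(p * q for p, q in zip(u, y)) for u in design]
    return solve(xtx, xty)


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


rng = random.Random(2)  # fixed seed so the run is repeatable
n = 300
# Hypothetical predictors measured on very different scales.
temperature = [rng.gauss(150, 10) for _ in range(n)]
pressure = [rng.gauss(30, 2) for _ in range(n)]
duration = [rng.gauss(5, 0.5) for _ in range(n)]
y = [0.8 * t + 1.5 * p + 4.0 * d + rng.gauss(0, 3)
     for t, p, d in zip(temperature, pressure, duration)]

names = ["temperature", "pressure", "duration"]
columns = [temperature, pressure, duration]
raw = dict(zip(names, ols(columns, y)[1:]))
standardized = dict(zip(names, ols([standardize(c) for c in columns], y)[1:]))

for name in names:
    print(f"{name:12s} raw={raw[name]:8.3f}  standardized={standardized[name]:8.3f}")
```

In this made-up setup, `duration` has the largest regular coefficient simply because its units are small, while `temperature` has the largest standardized coefficient. That reversal is exactly why the regular coefficients can mislead you.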

Related post: Standardizing your variables can also help when your model
contains polynomials and interaction terms.

Change in R-squared for the last variable added to the model

Many statistical software packages include a very helpful analysis. They can
calculate the increase in R-squared when each variable is added to a model
that already contains all of the other variables. In other words, how much
does the R-squared increase for each variable when you add it to the model
last?

This analysis might not sound like much, but there’s more to it than is readily
apparent. When an independent variable is the last one entered into the
model, the associated change in R-squared represents the improvement in
the goodness-of-fit that is due solely to that last variable after all of the other
variables have been accounted for. In other words, it represents the unique
portion of the goodness-of-fit that is attributable only to each independent
variable.

Key point: Identify the independent variable that produces the largest
R-squared increase when it is the last variable added to the model.
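
If your software doesn’t report this statistic, you can compute it by fitting the full model and then refitting it with each variable left out in turn. The sketch below illustrates the idea under stated assumptions: the data are simulated, the variable names and effects are hypothetical, and the `solve`, `ols`, and `r_squared` helpers are stand-ins for a real regression routine.

```python
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(columns, y):
    """Fit y on the predictor columns plus an intercept; returns [intercept, b1, b2, ...]."""
    design = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(u, v)) for v in design] for u in design]
    xty = [sum(p * q for p, q in zip(u, y)) for u in design]
    return solve(xtx, xty)


def r_squared(columns, y):
    """R-squared of the least-squares fit of y on the given columns."""
    coefs = ols(columns, y)
    mean_y = sum(y) / len(y)
    sse = 0.0
    for i, yi in enumerate(y):
        fitted = coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], columns))
        sse += (yi - fitted) ** 2
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


rng = random.Random(3)  # fixed seed so the run is repeatable
n = 300
temperature = [rng.gauss(150, 10) for _ in range(n)]
# Pressure is deliberately correlated with temperature so the predictors share variance.
pressure = [15 + 0.1 * t + rng.gauss(0, 1.7) for t in temperature]
duration = [rng.gauss(5, 0.5) for _ in range(n)]
y = [0.8 * t + 1.5 * p + 4.0 * d + rng.gauss(0, 3)
     for t, p, d in zip(temperature, pressure, duration)]

predictors = {"temperature": temperature, "pressure": pressure, "duration": duration}
r2_full = r_squared(list(predictors.values()), y)

# Increase in R-squared when each variable is the last one added.
delta = {}
for name in predictors:
    others = [col for other, col in predictors.items() if other != name]
    delta[name] = r2_full - r_squared(others, y)

print(f"full-model R-squared: {r2_full:.3f}")
for name, gain in delta.items():
    print(f"{name:12s} increase when added last: {gain:.3f}")
```

Because `pressure` is built to be correlated with `temperature` here, the three increases should add up to less than the full-model R-squared. The remainder is variance that the predictors explain jointly, which no single variable gets credit for under this measure.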

Example of Identifying the Most Important Independent Variables in a Regression Model
The example output below shows a regression model that has three
independent variables. You can download the CSV data file to try it yourself:
ImportantVariables.

The statistical output displays the coded coefficients, which are the
standardized coefficients. Temperature has the standardized coefficient with
the largest absolute value. This measure suggests that Temperature is the
most important independent variable in the regression model.

The graphical output below shows the incremental impact of each
independent variable. This graph displays the increase in R-squared
associated with each variable when it is added to the model last. Temperature
uniquely accounts for the largest proportion of the variance. For our example,
both statistics suggest that Temperature is the most important variable in the
regression model.

Cautions for Using Statistics to Pinpoint Important Variables
Standardized coefficients and the change in R-squared when a variable is
added to the model last can both help identify the more important
independent variables in a regression model—from a purely statistical
standpoint. Unfortunately, these statistics can’t determine the practical
importance of the variables. For that, you’ll need to use your knowledge of the
subject area.

The manner in which you obtain and measure your sample can bias these
statistics and throw off your assessment of importance.

When you collect a random sample, you can expect the sample variability of
the independent variable values to reflect the variability in the population.
Consequently, the change in R-squared values and standardized coefficients
should reflect the correct population values.

However, if the sample contains a restricted range (less variability) for a
variable, both statistics tend to underestimate the importance. Conversely, if
the variability of the sample is greater than the population variability, the
statistics tend to overestimate the importance of that variable.
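
Here is a small simulation of the restricted-range problem. It is a sketch under assumptions I chose for illustration: one hypothetical predictor, a made-up linear relationship, and a "restricted" sample that keeps only the observations whose predictor value falls between 40 and 60.

```python
import random
import statistics


def summarize(x, y):
    """Return (raw slope, standardized slope, R-squared) for a one-predictor fit."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    # Standardizing x multiplies the slope by the sample standard deviation of x.
    return slope, slope * statistics.stdev(x), sxy ** 2 / (sxx * syy)


rng = random.Random(4)  # fixed seed so the run is repeatable
x_full = [rng.uniform(0, 100) for _ in range(2000)]
y_full = [0.5 * xi + rng.gauss(0, 10) for xi in x_full]

# Keep only the middle of the range to mimic a sample with restricted variability.
pairs = [(xi, yi) for xi, yi in zip(x_full, y_full) if 40 <= xi <= 60]
x_restricted = [p[0] for p in pairs]
y_restricted = [p[1] for p in pairs]

slope_full, std_full, r2_full = summarize(x_full, y_full)
slope_restricted, std_restricted, r2_restricted = summarize(x_restricted, y_restricted)

print(f"full range:       slope={slope_full:.3f}  standardized={std_full:.2f}  R-sq={r2_full:.3f}")
print(f"restricted range: slope={slope_restricted:.3f}  standardized={std_restricted:.2f}  R-sq={r2_restricted:.3f}")
```

The underlying relationship is the same in both samples, so the regular slope should stay in the same neighborhood. The standardized coefficient and R-squared, however, drop sharply in the restricted sample because both depend on how much the predictor varies.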

Also, consider the quality of measurements for your independent variables. If
the measurement precision for a particular variable is relatively low, that
variable can appear to be less predictive than it truly is.
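
The effect of imprecise measurement is easy to simulate. In this sketch, which uses made-up data and an assumed amount of measurement noise, the same underlying variable is recorded twice: once precisely (`x_true`) and once with random error added (`x_noisy`).

```python
import random


def fit_simple(x, y):
    """Least-squares fit of y on x; returns (slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sxx, sxy ** 2 / (sxx * syy)


rng = random.Random(5)  # fixed seed so the run is repeatable
n = 2000
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x_true]

# The same variable, recorded with an imprecise instrument (assumed noise level).
x_noisy = [xi + rng.gauss(0, 1) for xi in x_true]

slope_precise, r2_precise = fit_simple(x_true, y)
slope_noisy, r2_noisy = fit_simple(x_noisy, y)

print(f"precise measurement: slope={slope_precise:.3f}  R-sq={r2_precise:.3f}")
print(f"noisy measurement:   slope={slope_noisy:.3f}  R-sq={r2_noisy:.3f}")
```

The noisy version of the variable yields a smaller slope and a lower R-squared even though the true relationship is unchanged, so a poorly measured variable can look less important than it is.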

When the goal of your analysis is to change the mean of the dependent
variable, you must be sure that the relationships between the independent
variables and the dependent variable are causal rather than merely correlational.
If these relationships are not causal, then intentional changes in the
independent variables won’t cause the desired changes in the dependent
variable despite any statistical measures of importance.

Typically, you need to perform a randomized experiment to determine
whether the relationships are causal.

Non-Statistical Issues to Help Find Important Variables
The definition of “most important” should depend on your goals and the
subject area. Practical issues can influence which variable you consider to be
the most important.

For instance, when you want to affect the value of the dependent variable by
changing the independent variables, use your knowledge to identify the
variables that are easiest to change. Some variables can be difficult,
expensive, or even impossible to change.

“Most important” is a subjective, context-sensitive quality. Statistics can
highlight candidate variables, but you still need to apply your subject-area
expertise.

If you’re learning regression, check out my Regression Tutorial!

Note: I wrote a different version of this post that appeared
elsewhere. I’ve completely rewritten and updated it for my blog
site.
