MULTIPLE LINEAR REGRESSION
JUST A LITTLE POKE
GERÓNIMO MALDONADO-MARTÍNEZ, RPT, MPH, PHD(C)
DIANA M. FERNÁNDEZ-SANTOS, MS, EDD
Sir Francis Galton
Widely promoted regression techniques.
Cousin of Charles Darwin.
Making Sense of Regression
My emphasis here is on understanding the key elements of regression:
Requirements
Application
Limitations
Regression Is a Powerful Analytical Technique
Enables researchers to do two things:
1. Determine the strength of the relationship
The r-squared value
Regression Is a Powerful Analytical Technique
2. Determine the impact of the independent
variable(s) on the dependent variable
The regression coefficient is the predicted
change in the dependent variable for every one
unit of change in the independent variable
Collectively, the regression coefficients enable
researchers to estimate how the dependent
variable will change under different scenarios
for the independent variables
Assumptions
Variables are normally distributed.
Variables are continuous in nature.
Assumption of a linear relationship between the
independent and dependent variables.
Assumption of homoscedasticity (constant error
variance across levels of the independent variables).
A quick diagnostic sketch follows below.
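A minimal diagnostic sketch in Python (the data, variable names, and choice of tests here are illustrative assumptions, not from the slides): it fits a simple model, then checks residual normality and homoscedasticity.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fabricated data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)

X = sm.add_constant(x)            # add the intercept column
resid = sm.OLS(y, X).fit().resid  # residuals of the fitted line

# Normality of the residuals: Shapiro-Wilk test
stat, p_norm = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", p_norm)

# Homoscedasticity: Breusch-Pagan test
lm_stat, p_homo, f_stat, f_p = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", p_homo)
```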
Multiple Regression Equation
Y = a + b1X1 + b2X2 + ... + bkXk + e
Where:
Y = predicted value of the dependent variable
a = the constant or Y intercept (where the
regression line crosses the Y axis)
b1 ... bk = the regression coefficients
X1 ... Xk = the independent variables
e = error
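A small computational sketch of the equation (the intercept and coefficients below are hypothetical, not from the slides):

```python
def predict(a, coefficients, x_values):
    """Return Y-hat = a + b1*X1 + ... + bk*Xk for one observation."""
    return a + sum(b * x for b, x in zip(coefficients, x_values))

# Hypothetical example: intercept 1.5, two predictors
print(predict(1.5, [0.8, -0.3], [10, 4]))  # 1.5 + 8.0 - 1.2 = 8.3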
Theoretical Linear Model
Linear Regression
Model types (example: X = size of house, Y = cost of house)
Deterministic Model: an equation or set of equations
that allow us to fully determine the value of the dependent
variable from the values of the independent variables.
Probabilistic Model: a method used to capture the
randomness that is part of a real-life process.
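A brief sketch contrasting the two model types using the house example (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
size = np.array([1200.0, 1500.0, 2000.0])  # hypothetical house sizes (sq ft)

# Deterministic model: cost (in $1,000s) is fully determined by size
cost_deterministic = 50 + 0.1 * size

# Probabilistic model: the same line plus random real-life scatter
cost_probabilistic = 50 + 0.1 * size + rng.normal(0, 10, size.shape)

print(cost_deterministic)   # always the same for a given size
print(cost_probabilistic)   # varies from house to house
```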
R-square And Its Companions
r = correlation coefficient (overall fit or measure of
association; also called Pearson's r, the Pearson
Product Moment Correlation coefficient, or the
zero-order coefficient).
r-square = proportion of the explained variance of the
dependent variable (also called the coefficient of
determination)
1 minus r-square = proportion of unexplained variance
in the dependent variable
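A quick sketch computing r, r-square, and 1 minus r-square (the data is fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)

r = np.corrcoef(x, y)[0, 1]  # Pearson's r (zero-order coefficient)
r2 = r ** 2                  # coefficient of determination
print(f"r = {r:.3f}, r-square = {r2:.3f}, unexplained = {1 - r2:.3f}")
```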
Dirty Interpretation
Example: Researchers look at GRE scores and academic
performance in graduate school as measured by grade point
average
The hypothesis is that people who have high GRE scores will
also have high GPAs
From an admissions committee perspective: the belief
that GRE scores are a good predictor of future academic
success and are, therefore, a good criterion for admission
decisions
The researchers report an r-squared of .2
GRE scores explain 20 percent of the variance in GPAs
This means that the remaining 80 percent of the variance
in GPA is left to other factors.
A quick example
Don't get lost
[Scatter plot: X axis = age of planes (5, 10, 20 years); Y axis = plane maintenance costs ($500, $1,000); a line marks the predicted values if the relationship were perfect.]
How It Is Applied
Analysts collect data over the past two years and
crunch it. The computer gives these results:
Y = 100 + .020X
The constant is 100:
If they do not fly at all, the computer estimates
there is still a cost of $100
The .020 is the regression coefficient:
This gets interpreted as: for every mile flown, there
is a $.02 change in maintenance costs.
How It Is Applied
Y = 100 + .020X
Interpreting the regression coefficient:
For every mile flown, maintenance costs
go up by 2 cents.
For every 100 miles flown, costs are $2
For every 1,000 miles, the costs are $20
For every 100,000 miles, the costs are
$2,000
Making Maintenance Cost Estimates
They can then solve the equation:
Assuming 100,000 miles will be flown, how much
will they need to budget for maintenance?
100,000 multiplied by .020 = $2,000
Y = 100 + $2,000 + error
The estimated maintenance cost:
$2,100 + error
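The same arithmetic as a short sketch in Python:

```python
# Slide's result: Y = 100 + .020X, with X = miles flown
a, b = 100, 0.020
miles = 100_000
budget = a + b * miles
print(f"Estimated maintenance: ${budget:,.0f} + error")  # $2,100 + error
```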
Practicality
Simple Regression: Another Example
Hypothesis: If schools have a higher
percentage of poor children, then they
will have lower test scores.
A regression analysis shows:
A regression coefficient of -.04
An r-squared value of .25
Even More
Interpretation?
Regression coefficient: For every one-point increase in
the percent of children in poverty within a school, the
average test score goes down by .04
R-squared: 25% of the variance in test scores is explained
by the percent of children in poverty in the school
Researchers will ask: what other factors might
explain differences in test scores in the schools?
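A simulation sketch of this example (the schools and scores are synthetic; the true slope reuses the slide's -.04, and the noise level is chosen so r-square lands near the slide's .25):

```python
import numpy as np

rng = np.random.default_rng(3)
poverty_pct = rng.uniform(0, 100, 200)  # % of children in poverty per school
score = 80 - 0.04 * poverty_pct + rng.normal(0, 2, 200)

slope, intercept = np.polyfit(poverty_pct, score, 1)
r = np.corrcoef(poverty_pct, score)[0, 1]
print(f"coefficient = {slope:.3f}, r-square = {r ** 2:.3f}")
```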
Multiple Regression Equation
Y = a + b1X1 + b2X2 + b3X3 + b4X4 + e
Y = dependent variable
X1 = independent variable 1, controlling for X2, X3, X4
X2 = independent variable 2, controlling for X1, X3, X4
X3 = independent variable 3, controlling for X1, X2, X4
X4 = independent variable 4, controlling for X1, X2, X3
Multiple Regression Equation
It has the same basic structure as simple
regression
Y is still the dependent variable
There is still a constant (a) and some
amount of error (e) that the computer
calculates
But there are more Xs to represent the
multiple independent variables
Multiple Regression:
An Example
Hypothesis: Income is a function of education
and seniority
We suggest that income (the dependent
variable) will increase as both education and
seniority increase (the two independent
variables)
Y (Income) = a + b1(education) + b2(seniority) +
error
Multiple Regression: Interpretation
Results:
Y= 6000 + 400X1 (education) + 200X2 (seniority)
R square = .67
First look at the R-square: this shows a strong
relationship, so the analysis can continue
Partial regression coefficients:
For every year of education, holding seniority
constant, income increases by $400.
For every year of seniority, holding education
constant, income increases by $200.
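A simulation sketch of how such partial coefficients are estimated (the data-generating numbers simply reuse the slide's coefficients; the sample size, ranges, and noise are assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 500
education = rng.uniform(8, 20, n)   # years of education (made up)
seniority = rng.uniform(0, 30, n)   # years of seniority (made up)
income = 6000 + 400 * education + 200 * seniority + rng.normal(0, 500, n)

X = sm.add_constant(np.column_stack([education, seniority]))
fit = sm.OLS(income, X).fit()
print(fit.params)    # roughly [6000, 400, 200]: the partial coefficients
print(fit.rsquared)  # proportion of variance explained
```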
Multiple Regression: Application
Estimate the income of someone who has
10 years of education and
5 years of seniority
We solve the regression equation:
Multiply the 10 years of education by the regression
coefficient of 400: equals 4,000
Multiply the 5 years of seniority by the regression coefficient
of 200: equals 1,000
Put it together with the constant and you have
Y=6000 + 400(10) + 200(5) + error
Y = $11,000 + error
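The same calculation as a one-off sketch:

```python
a, b_education, b_seniority = 6000, 400, 200
income_hat = a + b_education * 10 + b_seniority * 5
print(f"${income_hat:,} + error")  # $11,000 + error
```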
Demystifying the monster
[Diagram: statistics turn raw data into information.]
Multivariate regression pitfalls
Multicollinearity
Residual confounding
Overfitting
Multicollinearity
Multicollinearity arises when two variables that
measure the same or similar things (e.g., weight
and BMI) are both included in a multiple regression
model; they will, in effect, cancel each other out
and generally destroy your model.
VIF: values well above 1 (common rules of thumb: >5
or >10) are bad
Tolerance (= 1/VIF): low is bad
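A sketch of computing VIF with statsmodels (the weight/BMI data is fabricated to be nearly collinear; the /2.9 is just a crude stand-in for a fixed height squared):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
weight = rng.normal(70, 10, 200)              # kg (made up)
bmi = weight / 2.9 + rng.normal(0, 0.5, 200)  # nearly a function of weight

X = sm.add_constant(np.column_stack([weight, bmi]))
for i, name in enumerate(["const", "weight", "bmi"]):
    # VIFs for weight and bmi explode, flagging the collinearity
    print(name, round(variance_inflation_factor(X, i), 1))
```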
Residual Confounding
You cannot completely wipe out confounding simply
by adjusting for variables in multiple regression
unless variables are measured with zero error (which
is usually impossible).
Example: meat eating and mortality
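A simulation sketch of residual confounding (all variables and effect sizes are invented; meat eating has no true effect here, yet adjusting for a noisily measured confounder leaves a spurious coefficient):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 5000
health = rng.normal(size=n)                    # true confounder
meat = health + rng.normal(size=n)             # exposure driven by the confounder
mortality = 2.0 * health + rng.normal(size=n)  # meat has NO true effect

health_measured = health + rng.normal(size=n)  # confounder measured with error

X = sm.add_constant(np.column_stack([meat, health_measured]))
fit = sm.OLS(mortality, X).fit()
print(fit.params[1])  # coefficient on meat stays well away from 0
```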
A clean example in PRISM
A real linear regression output