MLR Eda Model

Uploaded by

hanandeh0791
Multiple Linear Regression - EDA and Model
A Real Study

Background: cIMT (carotid intima-media thickness) is a measure of cardiovascular disease.

Question: What is the relationship between physical activity, cardiovascular fitness, perceived functional ability, and cIMT?
Conclusions: The most predictive variables of cIMT were: age (p = 0.000), gender (p =
0.001), BMI (p = 0.05), SBP (p = 0.000), total cholesterol (p = 0.000), and triglycerides (p =
0.000).

In this unit:

How do we simultaneously use many explanatory variables to understand and predict a response?
Reminder
The process of statistical analysis:

1. Identify research question and the corresponding population and parameter you are
interested in.
2. Collect data.
3. Posit a statistical model based on information in the sample.
4. Draw inference about the population using your model.
Research Objective
Research Question: What determines a person’s height?
Population: All BYU students.
Parameter of Interest:

Some number measuring the “relationship” between height and various other explanatory variables such as father’s height, mother’s height, etc.

Sample: A convenience sample of 1727 BYU students who are in Stat 121.
More Problem Definitions
Response Variable (y): The height of a student.

This is a continuous quantitative variable, meaning it can be any number (including decimals).

Explanatory Variable (x):

Lots! The goal is to relate multiple explanatory variables to a single quantitative response variable.
Variable Encoding
(Part of) Your Student Survey Data
Height  MotherHeight  FatherHeight  SportsInHS  Sex   ShoeSize
70      64            72            Yes         Male  11.0
71      67            72            Yes         Male   9.0
71      65            68            Yes         Male  10.5
70      60            69            Yes         Male  11.0
74      69            72            Yes         Male  11.5
What do we do with the “Yes/No” variables?

Encoding - the process of assigning numerical values to categorical variables.

One-hot encoding (aka dummy variable encoding) - uses 1’s and 0’s.
Yes=1, No=0 or Female=0, Male=1 (alphabetical)
Much more on this in later stats classes, but we’ll keep it simple here.
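
As a rough illustration of the encoding rule above, here is a minimal Python sketch. The function name `encode_row` and the lookup tables are made up for this example; they are not part of the course tool.

```python
# Dummy (0/1) codes, following the Yes=1/No=0 and Female=0/Male=1 rule above.
SPORTS_CODES = {"No": 0, "Yes": 1}
SEX_CODES = {"Female": 0, "Male": 1}

def encode_row(row):
    """Return a copy of a survey row with the two categorical columns as 0/1."""
    encoded = dict(row)
    encoded["SportsInHS"] = SPORTS_CODES[row["SportsInHS"]]
    encoded["Sex"] = SEX_CODES[row["Sex"]]
    return encoded

# First row of the survey excerpt shown above.
first_row = {"Height": 70, "MotherHeight": 64, "FatherHeight": 72,
             "SportsInHS": "Yes", "Sex": "Male", "ShoeSize": 11.0}
encoded_first = encode_row(first_row)
print(encoded_first["SportsInHS"], encoded_first["Sex"])
```

The quantitative columns pass through unchanged; only the Yes/No and Female/Male columns become numbers.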
EDA Tool #1 - Plots
EDA Tool #2 - Correlations

Reminder on Properties of Correlation (r):


−1 ≤ r ≤ 1
Only appropriate for LINEAR relationships
NOT impacted by scale of data (scale invariant).
Highly impacted by outliers
Cor(X, Y ) = Cor(Y , X)
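
The properties above can be checked numerically. The sketch below computes r by hand for Height and MotherHeight using only the five survey rows shown earlier (far too few rows for real conclusions); the outlier point is hypothetical and added only to show its effect.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Height and MotherHeight (inches) from the five survey rows.
height = [70, 71, 71, 70, 74]
mother = [64, 67, 65, 60, 69]

r_xy = pearson_r(mother, height)
r_yx = pearson_r(height, mother)            # symmetry: Cor(X, Y) = Cor(Y, X)
height_cm = [h * 2.54 for h in height]
r_cm = pearson_r(mother, height_cm)         # changing units does not change r

# Hypothetical outlier (not in the survey data): very tall mother, short student.
r_outlier = pearson_r(mother + [80], height + [55])

print(r_xy, r_yx, r_cm, r_outlier)
```

Adding the single hypothetical outlier flips the sign of r here, which is what “highly impacted by outliers” means in practice.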
Using the Analysis Tool
Multiple Linear Regression Model
In specifying a model for the population relationship between height and all the explanatory variables, we want to:

1. include all the variables at the same time,
2. keep it linear in all the variables at the same time,
3. account for the fact that the data do not follow a perfect relationship.
Multiple Linear Regression Model

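$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_P X_{Pi} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$

where:

$X_{1i}$ is the $i$th observation of the 1st explanatory variable
E.g., $X_{13}$ is the mother’s height for the 3rd observation in our dataset.
$P$ = total number of explanatory variables you have
$\epsilon_i$ is the random deviation of the $i$th observation from the “line”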
Multiple Linear Regression Model

How do we interpret $\beta_0, \beta_1, \ldots, \beta_5$ ($\beta_0$ is the intercept; $\beta_1, \ldots, \beta_5$ are called “slopes” or “effects”)?

$\beta_1$ (MotherHeight): Holding everything else constant (or all else being equal), as the height of the mother goes up by 1 inch, we expect height to go up by $\beta_1$ inches on average.

$\beta_2$ (FatherHeight): Holding everything else constant (or all else being equal), as the height of the father goes up by 1 inch, we expect height to go up by $\beta_2$ inches on average.

$\beta_3$ (Sports): Holding everything else constant (or all else being equal), students who played sports in high school are expected to be $\beta_3$ inches taller, on average, than those who didn’t.
Multiple Linear Regression Model

How do we interpret $\beta_0, \beta_1, \ldots, \beta_5$ ($\beta_0$ is the intercept; $\beta_1, \ldots, \beta_5$ are called “slopes” or “effects”)?

$\beta_5$ (shoe size): Holding everything else constant (or all else being equal), as shoe size increases by 1, we expect height to go up by $\beta_5$ inches on average.

$\beta_0$: For female students whose parents are 0 inches tall, who did not play sports in HS and who wear a size 0 shoe, we expect their height to be $\beta_0$ inches on average.
Assumptions of the MLR Model
Easy way to remember what we are assuming about the population in a multiple linear
regression model:

L - Linear relationship between y and all the quantitative x’s simultaneously


I - Independence (one obs. doesn’t impact the other)
N - Normal residuals (distance from “line” is normal)
E - Equal spread of residuals around the “line”

More on why these assumptions are important and how to check these in the next
subunit.
Parameter Estimation
Parameters we want to estimate: $\beta_0, \beta_1, \ldots, \beta_P$ (which define the line) and $\sigma$ (so we know how spread out things are)
Goal: Find the line whose predictions are “closest” to the data points.
Parameter Estimation
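What do we mean by “line closest to points”? We want to find $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_P$ so that:

$$\sum_{i=1}^{n} (\text{Obs}_i - \text{Pred}_i)^2 = \sum_{i=1}^{n} \left(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_P X_{Pi})\right)^2 = \sum_{i=1}^{n} (\text{residual}_i)^2$$

is as small as possible. This is called the least squares regression line.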


A few notes:

1. We “square” distances to account for “above” and “below” the line distances.
2. We sum squared residuals because we look at all the data.
3. We use “hats” to denote estimates from the sample (for example, $\hat{\beta}_1$ is our estimate of $\beta_1$).
4. We include all the explanatory variables simultaneously.
Parameter Estimation
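How do we find $\hat{\beta}_0, \ldots, \hat{\beta}_P$ that minimize

$$\sum_{i=1}^{n} (\text{Obs}_i - \text{Line}_i)^2 = \sum_{i=1}^{n} \left(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_P X_{Pi})\right)^2 = \sum_{i=1}^{n} (\text{residual}_i)^2 \; ?$$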

1. Guess and check


2. Use calculus

In both cases, we’ll let the computer do the hard work for us.
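
To make the “use calculus” route concrete: setting the derivatives of the sum of squared residuals to zero gives the normal equations, which a computer can solve directly. The sketch below is a standard-library illustration, not the course tool. It fits Height on MotherHeight and ShoeSize using only the five survey rows shown earlier, so the estimates will not match the full-data output; all function names are made up for this example.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def least_squares(X, y):
    """Return [b0, b1, ..., bP] solving the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return solve_linear_system(XtX, Xty)

def sum_sq_residuals(beta, X, y):
    """Sum of squared (observed - predicted) for a candidate beta."""
    preds = [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))

# (MotherHeight, ShoeSize) and Height from the five survey rows.
X = [(64, 11.0), (67, 9.0), (65, 10.5), (60, 11.0), (69, 11.5)]
y = [70, 71, 71, 70, 74]

beta_hat = least_squares(X, y)
best = sum_sq_residuals(beta_hat, X, y)

# "Guess and check": nudging any estimate makes the sum of squares larger.
guess = [beta_hat[0], beta_hat[1] + 0.01, beta_hat[2]]
print(best, sum_sq_residuals(guess, X, y))
```

The last two lines connect the two approaches: any guess other than the calculus solution gives a larger sum of squared residuals.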
The Fitted MLR Model
Fitted MLR Model Output
               Estimate    Std. Error  t value    Pr(>|t|)
(Intercept)    23.2568815  1.2804735   18.162720  0.0000000
MotherHeight    0.2825800  0.0162393   17.400990  0.0000000
FatherHeight    0.2104869  0.0148010   14.221140  0.0000000
SportsInHSYes   0.3482241  0.1083163    3.214883  0.0013292
SexMale         3.1944662  0.1264956   25.253583  0.0000000
ShoeSize        1.0635041  0.0365141   29.125834  0.0000000
Fitted Regression Line Equation:

$$\hat{y}_i = 23.26 + 0.28 \times \text{MotherHeight}_i + 0.21 \times \text{FatherHeight}_i + 0.35 \times \text{Sports}_i + 3.19 \times \text{Sex}_i + 1.06 \times \text{ShoeSize}_i$$
The Fitted MLR Model
How do we interpret $\hat{\beta}_0 = 23.257$?

For female students with 0-inch-tall parents who did not play sports in HS and wear a size 0 shoe, we expect their height to be 23.257 inches on average.

How do we interpret $\hat{\beta}_3 = 0.348$ (sports)?

All else being equal, students who played sports in high school are 0.348 inches taller, on average.

How do we interpret $\hat{\beta}_5 = 1.064$ (shoe size)?

Holding everything else constant (or all else being equal), as shoe size goes up by 1, we expect height to go up by 1.064 inches on average.
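
The interpretations above can be checked by plugging values into the fitted equation. This is a small sketch using the estimates from the output table; the 0/1 coding for Sports and Sex follows the encoding slide, and the function name `predict_height` is made up for this example.

```python
# Estimates copied from the fitted MLR output table.
COEF = {
    "Intercept": 23.2568815,
    "MotherHeight": 0.2825800,
    "FatherHeight": 0.2104869,
    "SportsInHS": 0.3482241,   # 1 = Yes, 0 = No
    "Sex": 3.1944662,          # 1 = Male, 0 = Female
    "ShoeSize": 1.0635041,
}

def predict_height(student):
    """Predicted height (inches) from the fitted regression equation."""
    return COEF["Intercept"] + sum(COEF[name] * value
                                   for name, value in student.items())

# Explanatory values from the first row of the survey excerpt.
student = {"MotherHeight": 64, "FatherHeight": 72,
           "SportsInHS": 1, "Sex": 1, "ShoeSize": 11.0}
predicted = predict_height(student)

# Same student, one shoe size larger, everything else held constant.
bigger_shoe = dict(student, ShoeSize=12.0)
shoe_effect = predict_height(bigger_shoe) - predicted

print(round(predicted, 2), round(shoe_effect, 3))
```

The difference between the two predictions is exactly the shoe size coefficient, which is what “holding everything else constant” means.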
Using the Analysis Tool

Fitted regression equation:

$$\hat{y} = 349.2369 - 5.495 \times \text{Lat} + 21.7976 \times \text{Ocean} + 0.1219 \times \text{Long}$$
Visualizing the Fitted MLR Model
When we only had 1 explanatory variable, we could visualize the fitted model:

But we can’t do that here because we have multiple explanatory variables that all work
together.
Visualizing the Fitted MLR Model
Added variable plots (also known as partial regression plots):

Intuition: Make a scatterplot of one x vs y AFTER “adjusting” for the other x’s (math
detail beyond this course so we’ll just let the computer do it for us).
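
For the curious, the “adjusting” step can be sketched in a few lines. Assuming just two explanatory variables (MotherHeight and ShoeSize) and only the five survey rows shown earlier, the code below removes the part of Height and the part of ShoeSize that MotherHeight explains, then relates the leftovers. Those leftovers are the two axes of the added-variable plot for ShoeSize. The helper names are made up for this example.

```python
def simple_fit(x, y):
    """Least squares intercept and slope for a one-predictor regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    """What is left of y after removing its straight-line fit on x."""
    intercept, slope = simple_fit(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

height = [70, 71, 71, 70, 74]
mother = [64, 67, 65, 60, 69]
shoe = [11.0, 9.0, 10.5, 11.0, 11.5]

height_adj = residuals(mother, height)   # vertical axis of the plot
shoe_adj = residuals(mother, shoe)       # horizontal axis of the plot

# Slope of the line through the added-variable plot.
_, partial_slope = simple_fit(shoe_adj, height_adj)
final_resid = residuals(shoe_adj, height_adj)
print(partial_slope)
```

The slope of the line through this plot equals the ShoeSize coefficient from the two-variable multiple regression on the same rows (a standard result), which is why the plot shows the effect of one x “after adjusting” for the others.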
Parameter Estimation
An estimate of $\sigma$ is more complicated to explain (take more stats courses), so for purposes of this class, the computer estimates it for us.

$\hat{\sigma} = 1.776$

How do we interpret $\hat{\sigma}$?
On average, the actual heights are about 1.776 inches away from the estimated heights.
Is this “better” or “worse” than if we just included mother’s height?
$\hat{\sigma} = 3.776$ if we only use mother’s height.
It’s hard to tell just from $\hat{\sigma}$ how good a model is. A better measure is $R^2$.
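
For reference, a common way software computes this estimate is the residual standard error: the square root of the sum of squared residuals divided by n − P − 1. Whether the course tool uses exactly this formula is an assumption. The numbers below are toy values made up to keep the arithmetic easy; they are not from the height data.

```python
from math import sqrt

def residual_std_error(observed, predicted, num_predictors):
    """sqrt( sum of squared residuals / (n - P - 1) )."""
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(sse / (n - num_predictors - 1))

# Toy values (made up): four observations, predictions from a model with P = 1.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

sigma_hat = residual_std_error(observed, predicted, num_predictors=1)
print(round(sigma_hat, 4))
```

Here the squared residuals sum to 0.10, so the estimate is the square root of 0.10 / 2, roughly 0.22 in the units of the response.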
Assessing Model Fit
Mathematical formula:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_P X_{Pi})\right)^2}{\sum_{i=1}^{n} (Y_i - \bar{y})^2} = 0.808$$

Intuition:

Formal interpretation: The percent of variability in $Y$ that is explained by all X’s simultaneously.
$R^2$ is between 0 and 1, with 1 meaning the explanatory variables perfectly explain the response.
$R^2$ is a percentage grade on how well all the X’s are doing in telling us about $Y$.
For our study, 80.8% of the variation in students’ heights can be explained by mother’s height, father’s height, whether the student played sports in HS, biological sex and shoe size.
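
The formula above translates directly into code. In this sketch the observed and predicted values are toy numbers made up for easy arithmetic, not the height data, and the function name `r_squared` is chosen for this example.

```python
def r_squared(observed, predicted):
    """1 - (sum of squared residuals) / (total variation around the mean)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst

# Toy values (made up): observed responses and a model's predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(observed, predicted)
r2_perfect = r_squared(observed, observed)       # predictions match exactly
r2_mean_only = r_squared(observed, [2.5] * 4)    # always predict the mean
print(round(r2, 2), r2_perfect, r2_mean_only)
```

Perfect predictions give an R² of 1, and always predicting the mean gives 0, which matches the “percentage grade” intuition.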
Additional MLR Practice
Measuring possum head size can be difficult. What is the relationship between possum
head size and sex, age, skull width, total length and tail length? Use a multiple linear
regression model (and the course app) to answer the following questions:

1. What is the estimated head size for a newborn, female possum with 0 skull width,
length and tail length?
2. How much should head size go up (or down) as the possum gets 1 cm bigger?
3. How much are male head sizes bigger (or smaller) than female head sizes (on average)?
4. On average, how far away are true head sizes from estimated head sizes?
5. How well do the explanatory variables explain head size?
Additional MLR Practice
1. What is the estimated head size for a newborn, female possum with 0 skull width,
length and tail length?

$\hat{\beta}_0 = 33.4974481$
2. How much should head size go up (or down) as the possum gets 1 cm bigger?

$\hat{\beta}_1 = 0.4528877$
3. How much are male head sizes bigger (or smaller) than female head sizes (on average)?

$\hat{\beta}_2 = 1.1695384$
4. On average, how far away are true head sizes from estimated head sizes?
This is $\hat{\sigma} = 2.080432$
5. How well do the explanatory variables explain head size?
This is $R^2 = 0.669$
Homework Choices for Unit 7
Same as Unit 6 but we’re going to add more variables to the regression:

1. Rate my professor - what matters in determining a rate my professor score?


2. Supervisor - what makes people like their manager?
3. Body Fat - what body measurements are predictive of your BMI?
4. Basketball Salary - what skills lead to a higher salary?
Key Terminology
EDA for MLR
Interpretation of Coefficients
Multiple linear regression model
Added-variable Plots
$R^2$
Least squares estimation
