MLR EDA Model
In this unit:
1. Identify the research question and the corresponding population and parameter you are
interested in.
2. Collect data.
3. Posit a statistical model based on information in the sample.
4. Draw inference about the population using your model.
Research Objective
Research Question: What determines a person’s height?
Population: All BYU students.
Parameter of Interest:
Some number measuring the “relationship” between height and various other
explanatory variables such as father’s height, mother’s height, etc.
Sample: A convenience sample of 1727 BYU students who are in Stat 121.
More Problem Definitions
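Response Variable (y): The height of a student.
Model:
\[
y_i = \beta_0 + \beta_1 \text{MotherHeight}_i + \beta_2 \text{FatherHeight}_i + \beta_3 \text{SportsInHSYes}_i + \beta_4 \text{SexMale}_i + \beta_5 \text{ShoeSize}_i + \epsilon_i
\]
where: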
$\beta_1$ (MotherHeight): Holding everything else constant (or all else being equal), as the
height of the mother goes up by 1 inch, we expect height to go up by $\beta_1$ inches on average.
$\beta_2$ (FatherHeight): Holding everything else constant (or all else being equal), as the
height of the father goes up by 1 inch, we expect height to go up by $\beta_2$ inches on average.
$\beta_3$ (Sports): Holding everything else constant (or all else being equal), students who
played sports in high school are expected to be $\beta_3$ inches taller, on average, than those who didn’t.
Multiple Linear Regression Model
$\beta_4$ (Sex): Holding everything else constant (or all else being equal), male students
are expected to be $\beta_4$ inches taller, on average, than female students.
$\beta_5$ (shoe size): Holding everything else constant (or all else being equal), as shoe size
increases by 1, we expect height to go up by $\beta_5$ inches on average.
$\beta_0$: For female students whose parents are 0 inches tall, who did not play sports in HS, and
who wear a size 0 shoe, we expect height to be $\beta_0$ inches on average.
Assumptions of the MLR Model
Easy way to remember what we are assuming about the population in a multiple linear
regression model:
More on why these assumptions are important and how to check these in the next
subunit.
Parameter Estimation
Parameters we want to estimate: $\beta_0, \beta_1, \ldots, \beta_P$ (which define the line) and $\sigma$ (so we
know how spread out things are)
Goal: Find the predictions that go “closest” to the data points.
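What do we mean by “line closest to points”? We want to find $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_P$ so that
the following sum is as small as possible:
\[
\sum_{i=1}^{n} (\text{Obs}_i - \text{Pred}_i)^2
= \sum_{i=1}^{n} \bigl(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_P X_{Pi})\bigr)^2
= \sum_{i=1}^{n} (\text{residual}_i)^2
\]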
1. We “square” distances to account for “above” and “below” the line distances.
2. We sum squared residuals because we look at all the data.
3. We use “hats” to denote estimates from the sample (for example, $\hat{\beta}_1$ is our estimate of $\beta_1$).
4. We include all the explanatory variables simultaneously.
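To make the four points above concrete, here is a minimal Python sketch of the quantity being minimized. The function name `sum_squared_residuals` and the toy numbers are made up for illustration; they are not the Stat 121 data.

```python
def sum_squared_residuals(y, X, coefs):
    """Sum of squared residuals for one candidate set of estimates.

    y     : observed responses, one per observation
    X     : rows of explanatory values, one row per observation (P values each)
    coefs : [b0, b1, ..., bP], intercept first
    """
    total = 0.0
    for y_i, row in zip(y, X):
        # Prediction uses all explanatory variables simultaneously.
        pred = coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))
        # Squaring treats "above" and "below" distances the same way.
        total += (y_i - pred) ** 2
    return total


# Hypothetical toy data with two explanatory variables.
X_toy = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
y_toy = [3.0, 6.0, 7.0, 10.0]

sse_good = sum_squared_residuals(y_toy, X_toy, [1.0, 2.0, 1.0])
sse_worse = sum_squared_residuals(y_toy, X_toy, [0.0, 2.0, 0.0])
print(sse_good, sse_worse)
```

A candidate set of estimates with a smaller sum is “closer” to the data in the sense used here.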
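How do we find $\hat{\beta}_0, \ldots, \hat{\beta}_P$ that minimize
\[
\sum_{i=1}^{n} (\text{Obs}_i - \text{Pred}_i)^2
= \sum_{i=1}^{n} \bigl(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_P X_{Pi})\bigr)^2
= \sum_{i=1}^{n} (\text{residual}_i)^2 \,?
\]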
In both cases, we’ll let the computer do the hard work for us.
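As a rough illustration of “letting the computer do the hard work,” the sketch below uses NumPy's `np.linalg.lstsq` to find the least-squares estimates on simulated data. The variable names, the simulated coefficients, and the choice of NumPy are assumptions for this example; the course's own analysis tool may work differently.

```python
import numpy as np

# Simulated data (NOT the Stat 121 sample): one numeric and one 0/1 variable.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(65, 3, n)              # e.g. a height-like variable
x2 = rng.integers(0, 2, n)             # e.g. a yes/no indicator
y = 20 + 0.7 * x1 + 3 * x2 + rng.normal(0, 2, n)

# Design matrix: a column of ones for the intercept, then each explanatory variable.
X = np.column_stack([np.ones(n), x1, x2])


def sse(beta):
    """Sum of squared residuals for a coefficient vector (intercept first)."""
    return float(np.sum((y - X @ beta) ** 2))


# lstsq returns the coefficients that minimize the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse_min = sse(beta_hat)

print("estimates:", beta_hat)
print("minimized sum of squared residuals:", sse_min)
```

Nudging any estimate away from `beta_hat` and calling `sse` again gives a larger sum, which is what “minimizes” means here.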
The Fitted MLR Model
Fitted MLR Model Output
                 Estimate   Std. Error    t value   Pr(>|t|)
(Intercept)    23.2568815    1.2804735  18.162720  0.0000000
MotherHeight    0.2825800    0.0162393  17.400990  0.0000000
FatherHeight    0.2104869    0.0148010  14.221140  0.0000000
SportsInHSYes   0.3482241    0.1083163   3.214883  0.0013292
SexMale         3.1944662    0.1264956  25.253583  0.0000000
ShoeSize        1.0635041    0.0365141  29.125834  0.0000000
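Fitted Regression Line Equation (coefficients from the output above, rounded to three decimals):
\[
\widehat{\text{Height}} = 23.257 + 0.283\,\text{MotherHeight} + 0.210\,\text{FatherHeight} + 0.348\,\text{SportsInHSYes} + 3.194\,\text{SexMale} + 1.064\,\text{ShoeSize}
\]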
$\hat{\beta}_0$: For female students with 0 inch tall parents who did not play sports in HS and wear a
size 0 shoe, we expect their height to be 23.26 in on average.
All else being equal, students who played sports in high school are 0.348 inches taller, on
average, than those who did not.
Holding everything else constant (or all else being equal), as shoe size goes up by 1,
we expect height to go up by 1.064 inches on average.
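The interpretations above can be checked by plugging values into the fitted model. The sketch below copies the estimates from the output table; the function name `predict_height` and the example student are hypothetical.

```python
# Estimates copied from the fitted MLR model output.
COEFS = {
    "intercept": 23.2568815,
    "mother_height": 0.2825800,
    "father_height": 0.2104869,
    "sports_yes": 0.3482241,
    "male": 3.1944662,
    "shoe_size": 1.0635041,
}


def predict_height(mother_height, father_height, played_sports, is_male, shoe_size):
    """Predicted height in inches; yes/no variables enter as 1 or 0."""
    return (
        COEFS["intercept"]
        + COEFS["mother_height"] * mother_height
        + COEFS["father_height"] * father_height
        + COEFS["sports_yes"] * (1 if played_sports else 0)
        + COEFS["male"] * (1 if is_male else 0)
        + COEFS["shoe_size"] * shoe_size
    )


# Hypothetical student: 64 in mother, 70 in father, played sports, male, size 10 shoe.
example = predict_height(64, 70, True, True, 10)
print(round(example, 2))

# Changing only the sports answer shifts the prediction by the sports estimate.
sports_gap = predict_height(64, 70, True, True, 10) - predict_height(64, 70, False, True, 10)
print(round(sports_gap, 3))
```

The second calculation shows what “all else being equal” means in practice: only one input changes, so the prediction moves by exactly that variable's estimate.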
Using the Analysis Tool
But we can’t do that here because we have multiple explanatory variables that all work
together.
Visualizing the Fitted MLR Model
Added variable plots (also known as partial regression plots):
Intuition: Make a scatterplot of one x vs y AFTER “adjusting” for the other x’s (math
detail beyond this course so we’ll just let the computer do it for us).
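For readers curious about what “adjusting” could look like, here is one common construction, sketched with NumPy on simulated data: regress y on the other x's, regress the chosen x on the other x's, and plot the two sets of residuals against each other. The helper name `residuals_after_regressing` and all of the data are assumptions for illustration only.

```python
import numpy as np


def residuals_after_regressing(target, others):
    """Residuals from a least-squares fit of `target` on an intercept plus `others`."""
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


# Simulated data (NOT the Stat 121 sample) with two correlated explanatory variables.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(64, 2.5, n)
x2 = 0.5 * x1 + rng.normal(35, 2.5, n)
y = 20 + 0.3 * x1 + 0.2 * x2 + rng.normal(0, 1.8, n)

# Full multiple regression of y on both variables.
full_design = np.column_stack([np.ones(n), x1, x2])
full_coef, *_ = np.linalg.lstsq(full_design, y, rcond=None)

# "Adjust" both y and x1 for x2; these two residual vectors are what gets plotted.
y_adj = residuals_after_regressing(y, x2)
x1_adj = residuals_after_regressing(x1, x2)

# Slope of the line through the adjusted points.
av_slope = float(np.sum(x1_adj * y_adj) / np.sum(x1_adj ** 2))
print(av_slope, full_coef[1])
```

A scatterplot of `x1_adj` against `y_adj` is the added variable plot for `x1`, and its slope matches the estimate for `x1` from the full model.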
Parameter Estimation
An estimate of σ is more complicated to explain (take more stats courses), so for purposes
of this class, the computer estimates it for us.
$\hat{\sigma} = 1.776$
How do we interpret $\hat{\sigma}$?
On average, the actual heights are about 1.776 inches away from the estimated heights.
Is this “better” or “worse” than if we just included mother’s height?
$\hat{\sigma} = 3.776$ if we only use mother’s height.
It’s hard to tell just from $\hat{\sigma}$ how good a model is. A better measure is $R^2$.
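These notes leave the formula for the estimate of σ to later courses. For the curious, one common definition (the residual standard error, assumed here rather than taken from these notes) takes the square root of the sum of squared residuals divided by n − P − 1. A minimal sketch with made-up numbers:

```python
import math


def residual_standard_error(observed, predicted, num_explanatory):
    """sqrt(SSE / (n - P - 1)), where P is the number of explanatory variables."""
    n = len(observed)
    sse = sum((obs - pred) ** 2 for obs, pred in zip(observed, predicted))
    return math.sqrt(sse / (n - num_explanatory - 1))


# Hypothetical heights and predictions (NOT the Stat 121 sample).
observed = [66.0, 70.0, 64.0, 72.0, 68.0]
predicted = [67.0, 69.0, 65.0, 71.0, 68.0]

sigma_hat = residual_standard_error(observed, predicted, num_explanatory=2)
print(round(sigma_hat, 4))
```

The result is in the same units as the response (inches for height), which is why it can be read as a typical distance between actual and estimated values.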
Assessing Model Fit
Mathematical formula:
Intuition:
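As a hedged sketch of how $R^2$ is commonly computed (the standard definition $1 - \text{SSE}/\text{SST}$ is assumed here; it is not spelled out in these notes), the code below compares the squared residuals to the total squared variation of the response around its mean. The numbers are made up.

```python
def r_squared(observed, predicted):
    """1 - SSE/SST: share of the response's variability accounted for by the predictions."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((obs - pred) ** 2 for obs, pred in zip(observed, predicted))
    sst = sum((obs - mean_obs) ** 2 for obs in observed)
    return 1 - sse / sst


# Hypothetical heights and predictions (NOT the Stat 121 sample).
observed = [66.0, 70.0, 64.0, 72.0, 68.0]
predicted = [67.0, 69.0, 65.0, 71.0, 68.0]

r2 = r_squared(observed, predicted)
print(round(r2, 3))
```

Values near 1 mean the explanatory variables account for most of the variation in the response; values near 0 mean they account for little of it.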
Additional MLR Practice
1. What is the estimated head size for a newborn, female possum with 0 skull width,
length and tail length?
$\hat{\beta}_0 = 33.4974481$
2. How much should head size go up (or down) as the possum gets 1 cm bigger?
$\hat{\beta}_1 = 0.4528877$
3. How much are male head sizes bigger (or smaller) than female head sizes (on average)?
$\hat{\beta}_2 = 1.1695384$
4. On average, how far away are true head sizes from estimated head sizes?
This is $\hat{\sigma} = 2.080432$
5. How well do the explanatory variables explain head size?
This is $R^2 = 0.669$
Homework Choices for Unit 7
Same as Unit 6 but we’re going to add more variables to the regression: