Statistics Week 3
Outline
I. Introduction to the “regression” problem
II. Simple linear regression
• Introducing Equations and Terminology
• Different Types of Regression Lines
• Estimating the Regression Coefficients
• Centring
• Inference
• Model Fit
III. Conclusions & Questions
IV. Multiple linear regression
• Introduction
• Centring
• Explained Variance
• Model Comparison
• Deriving the F-Statistic
• F-Statistic Application
• Model Building
• Continuing Model Comparison
• Continuing Model Building
• Thinking About Partialing
V. Conclusions & Questions
The distributions we consider in regression problems have conditional means: the value of Y that we expect for each observation is defined by that observation's individual characteristics. This type of distribution is called "conditional."
We want to explain the relationship between Y and X by finding the line that passes through the scatterplot as "closely" as possible to each point. This line is called the "best-fit line."
For any value of X, the corresponding point on the best-fit line is the model's best guess for the value of Y.
Residuals vs Errors:
In the population regression equation Y = β₀ + β₁X + ε, the ε term represents a vector of errors, which are typically assumed to be normally distributed.
In statistics, the notation N(0, 1) denotes the standard normal distribution: a normal distribution with a mean (μ) of 0 and a standard deviation (σ) of 1.
\hat{\beta}_1 = \frac{\sum_{n=1}^{N} (X_n - \bar{X})(Y_n - \bar{Y})}{\sum_{n=1}^{N} (X_n - \bar{X})^2}

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
These are the ordinary least squares (OLS) estimates of β₁ and β₀.
Example:
This is how we obtained the ordinary least squares estimates of our parameters for the example we saw earlier. Horsepower is the dependent variable, and price is the independent variable. If we run the model and use the coef() function to extract the parameters, this is what we get.
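A minimal sketch of that workflow in R (the data-frame name cars_df and the column names horsepower and price are assumptions, not the lecture's actual object names):

# Fit the simple linear regression of horsepower on price
fit <- lm(horsepower ~ price, data = cars_df)

# Extract the OLS estimates of the intercept (beta0-hat) and slope (beta1-hat)
coef(fit)

# Sanity check: the slope equals cov(X, Y) / var(X), matching the formula above
with(cars_df, cov(price, horsepower) / var(price))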
5. Inference
We have to ask how generalizable our results are, so we use statistical inference to account for the precision with which we've estimated β̂₀ and β̂₁.
Sampling Distributions
Our regression intercept and slope both have sampling distributions that we can use to
judge the precision of our estimates.
The mean of each sampling distribution is our estimated statistic.
In OLS regression, the sampling distributions of the coefficients are normal.
The standard deviation of these distributions is equal to the standard error (SE).
When the standard deviation (s, or σ̂) of a distribution is equal to the standard error (SE), it usually means that you are estimating a population parameter from a sample statistic.
The sample standard deviation (s, or σ̂) measures the spread or variability of the individual data points within the sample: how much the data points deviate from the sample mean.
s = \hat{\sigma} = \sqrt{\frac{\sum_{n=1}^{N} (X_n - \bar{X})^2}{N - 1}}
The standard error (SE) is a measure of the variability or precision of a statistic
(particularly the mean) when it's estimated from a sample of data. It quantifies how
much the sample mean is expected to vary from the true population mean. In other
words, it provides a way to estimate the uncertainty or margin of error associated
with a sample statistic.
The standard error is an important concept in statistics because it's often used to
calculate confidence intervals, which help researchers and analysts make inferences
about population parameters based on sample data.
The standard error is typically calculated from the estimated standard deviation (σ̂) and the sample size (n):

SE = \frac{\hat{\sigma}}{\sqrt{n}}
Standard Errors
The standard deviations of the preceding sampling distributions quantify the precision of our estimates β̂₀ and β̂₁.
SE(\hat{\beta}_0) = \hat{\sigma} \sqrt{\frac{1}{N} + \frac{\bar{X}^2}{\sum_{n=1}^{N} (X_n - \bar{X})^2}}

SE(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{n=1}^{N} (X_n - \bar{X})^2}}
CI = β̂ ± t × SE(β̂)
The t value is obtained based on the confidence level that we want and the degrees of
freedom in our data.
CI 95: We are 95% certain/confident in the sense that if we repeat this analysis an infinite number of times, 95% of the CIs that we calculate will surround the true value of β₁.
In terms of sampling distributions, our inferential task is to say something about how distinct the null and alternative distributions are.
CIs give us a plausible range for the population value of β, so we can use CIs to support inference.
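A brief sketch of how these standard errors and CIs can be obtained in R (the fit object and the hypothetical cars_df data are the same assumptions as above):

fit <- lm(horsepower ~ price, data = cars_df)

# Standard errors of beta0-hat and beta1-hat
summary(fit)$coefficients[, "Std. Error"]

# 95% confidence intervals: beta-hat +/- t * SE(beta-hat)
confint(fit, level = 0.95)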
6. Model Fit
We quantify the proportion of the outcome's variance that is explained by our model using the R² statistic:

R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}
where

TSS = \sum_{n=1}^{N} (Y_n - \bar{Y})^2 = Var(Y) \times (N - 1)
R² = 0.62 indicates that car price explains 62% of the variability in horsepower.
MSE = \frac{1}{N} \sum_{n=1}^{N} (Y_n - \hat{Y}_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \left( Y_n - \hat{\beta}_0 - \sum_{p=1}^{P} \hat{\beta}_p X_{np} \right)^2 = \frac{RSS}{N}

RMSE = 32.06: When using price as the only predictor of horsepower, we expect prediction errors with magnitudes of about 32.06 horsepower.
R-squared (R²): In simple linear regression or multiple regression, R² quantifies the proportion of the total variance in the dependent variable that is explained by the regression model. It ranges from 0 to 1, with 0 indicating that the model explains none of the variance and 1 indicating that the model explains all of it. The higher the R², the better the model fits the data.
When interpreting MSE, you're assessing the overall goodness-of-fit of a
model. A lower MSE suggests that the model's predictions are closer to the
true values.
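To make these fit measures concrete, here is a hedged sketch (still assuming the hypothetical cars_df data) that computes R², MSE, and RMSE directly from the residuals:

fit <- lm(horsepower ~ price, data = cars_df)
res <- resid(fit)                                    # Y_n - Yhat_n

tss <- sum((cars_df$horsepower - mean(cars_df$horsepower))^2)   # total sum of squares
rss <- sum(res^2)                                    # residual sum of squares

1 - rss / tss                                        # R^2 (same as summary(fit)$r.squared)
mse <- mean(res^2)                                   # MSE = RSS / N
sqrt(mse)                                            # RMSE, in horsepower units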
Question 1
In the estimated regression model, Ŷ = β̂₀ + β̂₁X + ε̂, the ε̂ term represents a vector of ____
a) errors.
b) residuals.
Question 2
Consider that we regress income (Y) on years of experience (X). Without centring X, the intercept in the regression is defined as the expected value of Y when X = __0__.
After mean-centring X (with a mean of 5 years of experience), the intercept in the regression of Y on the mean-centred version of X (X_c) corresponds to the expected value of Y when the original X = __5__.
IV. Multiple Linear Regression
Simple linear regression: A single outcome is predicted by a single independent variable.
Simple linear regression implies a 1D line in 2D space. Adding another predictor implies a 2D plane fitted to a 3D point cloud.
Partial effects
In MLR, we want to examine the partial effects of the predictors.
What is the effect of a predictor after controlling for some other set of variables?
This approach is crucial to controlling confounds and adequately modelling real-world
phenomena.
For example, when investigating the effect of income on career satisfaction, we
might want to control for tenure.
If we conducted experiments and randomization worked out well, we would not
need to control for confounds. But usually, we work with observational data and
not with experimental data.
MLR asks: What is the effect of BMI on average blood pressure, after controlling for age? (Or "above and beyond age", or "keeping age fixed".)
o We’re partialing age out of the effect of BMI on blood pressure.
o Interpretation:
The expected average blood pressure for a patient with an age of 0 and a BMI of 0 (the "unborn patient with zero weight") is 52.25.
For each year older, average blood pressure is expected to increase by 0.29 points, after controlling for BMI.
There is a 1.08-point increase in blood pressure for every unit increase in BMI, after controlling for age.
The expected average blood pressure for a 30-year-old patient with a BMI of 25 is 87.85.
For each year older, average blood pressure is expected to increase by 0.29 points, after
controlling for BMI.
For each additional point of BMI, average blood pressure is expected to increase by 1.08
points, after controlling for age.
According to the corresponding p-values, both partial effects are significant, but these individual tests say nothing about the two predictors combined.
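As a quick plausibility check, plugging the rounded coefficients into the regression equation roughly reproduces the prediction for a 30-year-old with a BMI of 25 (the small gap to 87.85 comes from rounding the reported coefficients):

\hat{Y}_{bp} \approx 52.25 + 0.29 \times 30 + 1.08 \times 25 = 52.25 + 8.70 + 27.00 = 87.95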
Explained Variance
Multiple R²
How much variation in blood pressure is explained by the two models?
Check the R² values: about 11% and 22% of the variability in blood pressure is accounted for.
Mean squares for the model: MSM = \frac{SSM}{df_M}
The bad variability is the error variability (i.e. the variability in Y not explained by our
model):
Sum of squared errors (i.e., the residual sum of squares): SSE = \sum_{n=1}^{N} (Y_n - \hat{Y}_n)^2
Mean squared error: MSE = \frac{SSE}{df_E}
F = \frac{SSM / df_M}{SSE / df_E} = \frac{MSM}{MSE}
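A sketch of how SSM, SSE, and this F-ratio can be recovered from R's ANOVA table for the two-predictor model (the object name out_age_bmi and the columns age and bmi are assumptions; dDat is the data frame used elsewhere in these notes):

out_age_bmi <- lm(bp ~ age + bmi, data = dDat)
atab <- anova(out_age_bmi)          # rows: age, bmi, Residuals

ssm <- sum(atab[["Sum Sq"]][1:2])   # model sum of squares (both predictors)
dfm <- sum(atab[["Df"]][1:2])       # model degrees of freedom
sse <- atab[["Sum Sq"]][3]          # residual (error) sum of squares
dfe <- atab[["Df"]][3]              # residual degrees of freedom

(ssm / dfm) / (sse / dfe)           # F = MSM / MSE; matches summary(out_age_bmi)$fstatistic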
4. Model Comparison
How do we quantify the additional variation explained by BMI, above and beyond
age?
o We compute the ΔR².
How do we know if ΔR² = 0.115 represents a significantly greater degree of explained variation?
o We use an F-test of the ΔR².
Although correct, the above definition is a special case that does not generalize in an
obvious way. In general, the F-statistic is defined in terms of model comparisons:
F = \frac{(E_R - E_F) / (df_R - df_F)}{E_F / df_F}

E_R and E_F are the errors (i.e., SSEs) of the restricted and full models.
df_R and df_F are the restricted and full (residual) degrees of freedom.
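A minimal sketch of this comparison in R, assuming a restricted model with age only and a full model with age and BMI (the anova() call at the end carries out the same test):

restricted <- lm(bp ~ age, data = dDat)
full       <- lm(bp ~ age + bmi, data = dDat)

e_r  <- sum(resid(restricted)^2)    # SSE of the restricted model
e_f  <- sum(resid(full)^2)          # SSE of the full model
df_r <- df.residual(restricted)     # restricted (residual) degrees of freedom
df_f <- df.residual(full)           # full (residual) degrees of freedom

((e_r - e_f) / (df_r - df_f)) / (e_f / df_f)   # F for the model comparison
anova(restricted, full)                         # same comparison via R's built-in test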
The Compared Models
The full model is always the model with more estimated parameters.
The model with more predictor variables.
The restricted model is the model with fewer estimated parameters.
The restricted model must be nested within the full model.
o If our full model is the one with age and BMI as predictors, then our restricted model cannot be one with cholesterol as a predictor, because cholesterol was not part of the full model.
When we use the F-statistic to test if R2 > 0 (i.e. when we’re not explicitly comparing two
models):
The full model is our estimated model.
The restricted model is an intercept-only model (Ŷₙ = Ȳ).
An intercept-only model assumes there is no relationship between the dependent variable and any independent variables; it only predicts the mean of the dependent variable: ŷ = β̂₀ = Ȳ.
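In R, that intercept-only comparison looks like this (a hedged sketch with the same assumed dDat columns):

null_model <- lm(bp ~ 1, data = dDat)          # intercept only: predicts mean(bp) for everyone
full_model <- lm(bp ~ age + bmi, data = dDat)
anova(null_model, full_model)                  # F-test that the full model's R^2 > 0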
Model Building
We'll take Y_bp = β₀ + β₁X_age30 + ε as our baseline model.
Next we simultaneously add predictors of LDL and HDL cholesterol.
Age, LDL, and HDL explain a combined 14.4% of the variation in blood pressure. That
proportion of variation is significantly greater than zero.
Adding LDL and HDL produces a model that explains 3.1% more variation in blood
pressure than a model with age as the only predictor.
This increase in variance explained is significantly greater than zero.
Adding LDL and HDL produces a model with lower prediction error (i.e., MSE = 163.4
vs. MSE = 169.4).
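A sketch of the ΔR² computation for this step (the column names ldl and hdl are assumptions about how the cholesterol measures are stored in dDat):

m_age      <- lm(bp ~ age, data = dDat)              # baseline model
m_age_chol <- lm(bp ~ age + ldl + hdl, data = dDat)  # baseline + LDL and HDL

summary(m_age_chol)$r.squared - summary(m_age)$r.squared   # Delta R^2
anova(m_age, m_age_chol)                                    # F-test of Delta R^2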
Further example:
So far we’ve established that age, LDL, and HDL are all significant predictors of average
blood pressure.
We’ve also established that LDL and HDL, together, explain a significant amount of
additional variation, above and beyond age.
Next, we’ll add BMI to see what additional predictive role it can play above and beyond age
and cholesterol.
1. BMI seems to have a pretty strong effect on average blood pressure, after controlling for age and cholesterol levels.
Let's take a look at what happens to the cholesterol effects when we add BMI: after controlling for BMI, cholesterol levels no longer seem to be important predictors.
2. How much additional variability in blood pressure is explained by BMI above and beyond
age and cholesterol levels?
## [1] 0.08595543
anova(out3, out4)
1. How much explained variation did we lose by removing the LDL and HDL variables?
## [1] 0.002330906
2. Is it significant? No: the cholesterol predictors add essentially nothing once BMI is in the model.
3. How do the prediction errors compare?
# MSE() is assumed to be a mean-squared-error helper (e.g., MLmetrics::MSE);
# mse2.4 is assumed to be computed the same way from the competing model out4.
mse2.4 <- MSE(y_pred = predict(out4), y_true = dDat$bp)
mse2.5 <- MSE(y_pred = predict(out5), y_true = dDat$bp)
mse2.4
## [1] 146.9918
mse2.5
## [1] 147.4367
Thinking about Partialing: Partial effect of a specific predictor, above and beyond other
predictors
V. Conclusions & Questions
Conclusion
1. Each variable in a regression model corresponds to a dimension in the data-space.
A regression model with P predictors implies a P-dimensional (hyper)plane in (P + 1)-dimensional space.
2. The coefficients in MLR are partial coefficients.
Each effect is interpreted as controlling for other predictors.
Question 1
Imagine we regress the outcome variable “satisfaction with life” (swl) on the four
predictor variables “age”, “income”, “number of facebook friends” (nr_fb_friend), and
“weekly working hours”(wwh):
lm(swl ~ age + income + nr_fb_friend + wwh, data = dDat)
Which of the following models would be nested in this "full" model (you can select more than one answer)?
a) lm(swl ~ age + wwh + nr_pets, data = dDat)
b) lm(swl ~ age + nr_fb_friend, data = dDat)
c) lm(swl ~ income + nr_fb_friend + wwh, data = dDat)
d) lm(swl ~ age + income + nr_fb_friend, data = dDat)
Question 2
Consider the following output from a regression of blood pressure on age.
True or false: Based on the output, the effect of age on blood pressure is significant.
a) True
b) False
Question 3
Consider the following output from a regression of blood pressure on age and bmi.
True or false: Based on the output, the effect of age on blood pressure is significant.
a) True
b) False