
Simple & Multiple Linear Regression Lecture 4 & 5 (Week3)

Outline
I. Introduction to the “regression” problem
II. Simple linear regression
• Introducing Equations and Terminology
• Different Types of Regression Lines
• Estimating the Regression Coefficients
• Centring
• Inference
• Model Fit
III. Conclusions & Questions
IV. Multiple linear regression
• Introduction
• Centring
• Explained Variance
• Model Comparison
• Deriving the F-Statistic
• F-Statistic Application
• Model Building
• Continuing Model Comparison
• Continuing Model Building
• Thinking About Partialing
V. Conclusions & Questions

I. Introduction to the “regression” problem


Regression Problem
Some of the most common and useful statistical models are regression models.
 Regression problems involve modelling a quantitative outcome. (As opposed to
classification problems, which involve modelling categorical outcomes)
o The regression problem begins with a random outcome variable, Y.
o We hypothesize that the mean of Y is dependent on some set of fixed
covariates, X.

Flavours of Probability Distribution


The distributions we’ve considered thus far imply a constant mean.
 Each observation is expected to have the same value of Y, regardless of their
individual characteristics.
In the example shown, the constant mean is 0: all observations have the same expected value of Y, regardless of their individual characteristics.

 This type of distribution is called “marginal” or “unconditional.”

The distributions we consider in regression problems have conditional means.
 The value of Y that we expect for each observation is defined by the observation’s
individual characteristics.
 This type of distribution is called “conditional”:

In practice, we only interact with the X-Y plane of the 3D figure.


 On the Y-axis, we plot our outcome variable.
 The X-axis represents the predictor variable upon which we condition the mean of Y.

We want to explain the relationship between Y and X by finding the line that passes through
the scatterplot as “closely” as possible to each point.
 This line is called the “best fit line.”
 For any value of X, the corresponding point on the best fit line is the model’s best
guess for the value of Y.

II. Simple Linear Regression


1. Introducing Equations and Terminology
 The best fit line is defined by a simple equation: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
$\hat{\beta}_0$ is the intercept.
• The $\hat{Y}$ value when X = 0.
• That is, the expected value of Y when X = 0.
$\hat{\beta}_1$ is the slope.
• The change in $\hat{Y}$ for a unit change in X.
• That is, the expected change in Y for a unit change in X.
 The equation only describes the best fit line. It does not fully quantify the relationship between Y and X.
 We still need to account for the estimation error: $Y = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\varepsilon}$

Residuals vs Errors:

In the population regression equation $Y = \beta_0 + \beta_1 X + \varepsilon$, the $\varepsilon$ term represents a vector of errors.
 The errors, $\varepsilon$, are the differences between Y and the true regression line, $\beta_0 + \beta_1 X$.
 The errors are unknown parameters, so we must estimate them.

In the estimated regression model $Y = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\varepsilon}$, the $\hat{\varepsilon}$ term represents a vector of residuals.
 The residuals, $\hat{\varepsilon}$, are the differences between Y and the estimated best fit line, $\hat{\beta}_0 + \hat{\beta}_1 X$.
 The residuals are sample estimates of the errors, $\varepsilon$, computed from the sample.

2. Different Types of Regression Lines


We can think about our regression model in four ways:

In the context of data science and statistics, the notation N(0, 1) typically represents a probability distribution,
specifically the standard normal distribution. A normal distribution with a mean (μ) of 0 and a standard
deviation (σ) of 1 is known as the standard normal distribution.

3. Estimating the Regression Coefficients


The purpose of regression analysis is to use a sample of N observed $\{Y_n, X_n\}$ pairs to find the best fit line defined by $\hat{\beta}_0$ and $\hat{\beta}_1$. Different methods exist to do this estimation. The most popular involves minimizing the sum of the squared residuals (i.e., the estimated errors).
 We want the sum of the squared residuals to be as small as possible.
 The $\hat{\varepsilon}_n$ are defined as the deviations between each observed $Y_n$ value and the corresponding $\hat{Y}_n$.
 Each $\hat{\varepsilon}_n$ is squared before summing to remove negative values and produce a quadratic objective function.
$RSS = \sum_{n=1}^{N} \hat{\varepsilon}_n^2 = \sum_{n=1}^{N} (Y_n - \hat{Y}_n)^2 = \sum_{n=1}^{N} (Y_n - \hat{\beta}_0 - \hat{\beta}_1 X_n)^2$
 The RSS is a very well-behaved objective function that admits closed-form solutions for the minimizing values of $\hat{\beta}_0$ and $\hat{\beta}_1$. (This means we do not need any numerical optimization methods; we can simply derive the results with algebra.)
 The values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the RSS are:

$\hat{\beta}_1 = \dfrac{\sum_{n=1}^{N} (X_n - \bar{X})(Y_n - \bar{Y})}{\sum_{n=1}^{N} (X_n - \bar{X})^2}$

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

 These are the ordinary least squares (OLS) estimates of $\beta_1$ and $\beta_0$.

Example:
This is how we obtain the ordinary least squares estimates for the example we saw earlier, where horsepower is the dependent variable and price (in $1,000s) is the independent variable. If we run the model and use the coef() function to extract the parameters, we get the following.

The estimated intercept is $\hat{\beta}_0$ = 60.45.
 A free car is expected to have 60.45 horsepower.
The estimated slope is $\hat{\beta}_1$ = 4.27.
 For every additional $1,000 in price, a car is expected to gain 4.27 horsepower.
Always be aware of the interpretation of the intercept and slope and what type of scaling we have!
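Below is a rough R sketch of how these estimates could be reproduced; the data frame carDat and its columns price (in $1,000s) and horsepower are hypothetical stand-ins for the lecture's data set.

## Sketch only: carDat, price, and horsepower are hypothetical stand-ins.
x <- carDat$price          # price in $1,000s
y <- carDat$horsepower

## Closed-form OLS estimates:
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)

## The same estimates via lm(), which minimizes the RSS internally:
coef(lm(horsepower ~ price, data = carDat))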

4. Centring - Improving Interpretation (Shifting the Scale)


We don’t care about the horsepower of a “free” car, so we may want to transform our data
in some way to improve interpretation.
 The intercept is defined as the expected value of Y when X = 0
o We can make the intercept interesting by shifting the scale of X so that X = 0
is a meaningful point.
 A common way to do this is (mean) centring. We mean-centre X by subtracting $\bar{X}$ from each of the $X_n$.
 After mean-centring, the zero point of $X_C$ corresponds to the original $\bar{X}$.
 Centring only translates the scale of the X-axis.
 Centring does not change the linear relationship in linear models.
 In general, the highest order relationship is not affected by centring.

Estimate the least squares coefficients with X mean-centred:

The estimated intercept is $\hat{\beta}_0$ = 143.83.
 A car with average price is expected to have 143.83 horsepower.
The estimated slope is $\hat{\beta}_1$ = 4.27. The slope doesn't change.
 For every additional $1,000 in price, a car is expected to gain 4.27 horsepower.
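A minimal R sketch of the centring step, again using the hypothetical carDat from above:

## Mean-centre the predictor and re-fit (carDat is a hypothetical data frame):
carDat$price_c <- carDat$price - mean(carDat$price)
coef(lm(horsepower ~ price_c, data = carDat))
## The slope is unchanged; the intercept is now the expected horsepower
## of a car with average price.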

5. Inference

We need to know how generalizable our results are, so we use statistical inference to account for the precision with which we have estimated $\hat{\beta}_0$ and $\hat{\beta}_1$.

Sampling Distributions

Our regression intercept and slope both have sampling distributions that we can use to
judge the precision of our estimates.
 The mean of the distribution is going to be our estimated statistic.
 In OLS regression, the coefficients are normally distributed.
 The standard deviation of these distributions is equal to the standard error (SE).
When the standard deviation of a distribution is equal to the standard error, we are typically estimating a population parameter from a sample statistic.
 The sample standard deviation (s, or $\hat{\sigma}$) measures the spread, or variability, of the individual data points within the sample: how much the data points deviate from the sample mean.


$s = \hat{\sigma} = \sqrt{\dfrac{\sum_{n=1}^{N} (X_n - \bar{X})^2}{N - 1}}$
 The Standard Error(SE) is a measure of the variability or precision of a statistic
(particularly the mean) when it's estimated from a sample of data. It quantifies how
much the sample mean is expected to vary from the true population mean. In other
words, it provides a way to estimate the uncertainty or margin of error associated
with a sample statistic.
 The standard error is an important concept in statistics because it's often used to
calculate confidence intervals, which help researchers and analysts make inferences
about population parameters based on sample data.
 The standard error of the mean is typically calculated from the standard deviation ($\hat{\sigma}$) and the sample size (N):

$SE = \dfrac{\hat{\sigma}}{\sqrt{N}}$
Standard Errors
The standard deviations of the preceding sampling distributions quantify the precision of our estimated $\hat{\beta}_0$ and $\hat{\beta}_1$.
 The sampling distributions shown above are theoretical entities.
 In practice, we estimate the standard deviations of these distributions as $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$:

$SE(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left[ \dfrac{1}{N} + \dfrac{\bar{X}^2}{\sum_{n=1}^{N} (X_n - \bar{X})^2} \right]}$

$SE(\hat{\beta}_1) = \sqrt{\dfrac{\hat{\sigma}^2}{\sum_{n=1}^{N} (X_n - \bar{X})^2}}$

We can use $SE(\hat{\beta})$ in combination with $\hat{\beta}$ for inference.


Inferential Tools: Wald Tests

We can construct a Wald statistic for $\hat{\beta}$ as follows:

$t = \dfrac{\hat{\beta}}{SE(\hat{\beta})}$

Inferential Tools: Confidence Intervals

$CI = \hat{\beta} \pm t \times SE(\hat{\beta})$
The t value is obtained based on the confidence level that we want and the degrees of freedom in our data.
CI 95: We are 95% certain/confident in the sense that, if we repeated this analysis an infinite number of times, 95% of the CIs that we calculate would surround the true value of $\beta_1$.
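A hedged R sketch of these inferential tools, reusing the hypothetical carDat example; the object name fit is mine:

fit <- lm(horsepower ~ price, data = carDat)   # hypothetical example model
summary(fit)$coefficients    # estimates, SEs, Wald t values, and p-values
confint(fit, level = 0.95)   # beta-hat +/- t * SE(beta-hat)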

Interpreting Confidence Intervals


Say we estimate a regression slope of $\hat{\beta}_1$ = 0.5 with an associated 95% confidence interval of CI = [0.25; 0.75].
 We cannot say that there is a 95% chance that the true value of $\beta_1$ is between 0.25 and 0.75.
 We cannot say that the true value of $\beta_1$ is between 0.25 and 0.75 with probability 0.95.
 In both cases we must speak of being 95% certain/confident, not of probability.
The true value of $\beta_1$ is fixed; it's a single quantity.
 Once the interval is calculated, $\beta_1$ is either in our interval or it is not; there is no uncertainty.
 The probability that $\beta_1$ is within our estimated interval is either exactly 1 or exactly 0.
We don't talk about 95% probabilities when interpreting CIs; instead, we talk about 95% confidence.
 If we collected a new sample (of the same size), re-estimated our model, and re-computed the 95% CI for $\hat{\beta}_1$, we would get a different interval.
 Repeating this process an infinite number of times would give us a distribution of CIs.
 95% of those CIs would surround the true value of $\beta_1$.

In terms of sampling distributions, our inferential task is to say something about how distinct the null and alternative distributions are.

CIs give us a plausible range for the population value of $\beta$, so we can use CIs to support inference.

6. Model Fit
We quantify the proportion of the outcome's variance that is explained by our model using the $R^2$ statistic:

$R^2 = \dfrac{TSS - RSS}{TSS} = 1 - \dfrac{RSS}{TSS}$

where

$TSS = \sum_{n=1}^{N} (Y_n - \bar{Y})^2 = Var(Y) \times (N - 1)$

$R^2 = 0.62$ indicates that car price explains 62% of the variability in horsepower.

Model Fit for Prediction


When assessing predictive performance, we most often use the mean squared error (MSE) as our criterion (N here is the sample size):

$MSE = \dfrac{1}{N} \sum_{n=1}^{N} (Y_n - \hat{Y}_n)^2 = \dfrac{1}{N} \sum_{n=1}^{N} \left( Y_n - \hat{\beta}_0 - \sum_{p=1}^{P} \hat{\beta}_p X_{np} \right)^2 = \dfrac{RSS}{N}$

The MSE quantifies the average squared prediction error.

Taking the square root improves interpretation:
$RMSE = \sqrt{MSE}$
The RMSE estimates the magnitude of the expected prediction error.
 RMSE = 32.06: when using price as the only predictor of horsepower, we expect prediction errors with magnitudes of about 32.06 horsepower.
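A short sketch of how these fit measures could be computed in R from the hypothetical fit object defined earlier:

rss <- sum(resid(fit)^2)
tss <- sum((carDat$horsepower - mean(carDat$horsepower))^2)
1 - rss / tss               # R^2, as also reported by summary(fit)$r.squared
mse <- rss / nrow(carDat)   # mean squared error
sqrt(mse)                   # RMSE: typical magnitude of a prediction error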

III. Conclusions & Questions


Conclusion
1. Simple linear regression allows us to model the mean of one variable, Y, as a function of another variable, X.
o The distribution of Y is a conditional distribution.
2. In regression modelling we find the best fit line, $\hat{Y}$.
o This line does not fully describe the X → Y relationship.
o We also need to consider the errors/residuals that account for the unmodelled deviations between the best-fit line and the data points.
3. Even after accounting for residuals, our estimated regression line is only an
approximation to the true regression line.
 We can think about our regression model in four different ways.
4. We can use centering to improve the model’s interpretation.
5. After fitting a regression model, we can make inferences about the nature of the X →
Y relationship.
6. We can use confidence intervals for inference.
 When interpreting CIs, we must be careful not to talk about probabilities of the interval surrounding the parameter; we speak of being certain/confident instead.
7. We can assess the fit of the model using R2 and MSE.

 R-squared ( R2): In simple linear regression or multiple regression, R2
quantifies the proportion of the total variance in the dependent variable that
is explained by the regression model. It ranges from 0 to 1, with 0 indicating
that the model explains none of the variance and 1 indicating that the model
explains all of the variance. The higher the R-squared, the better the model
fits the data.
 When interpreting MSE, you're assessing the overall goodness-of-fit of a
model. A lower MSE suggests that the model's predictions are closer to the
true values.

Question 1
In the estimated regression model, $Y = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\varepsilon}$, the $\hat{\varepsilon}$ term represents a vector of ____
a) errors.
b) residuals.

Question 2
Consider that we regress income (Y) on years of experience (X). Without centring X, the intercept in the regression is defined as the expected value of Y when X = 0.
After mean-centring X (with a mean of 5 years of experience), the intercept in the regression of Y on the mean-centred version of X (X_c) corresponds to the expected value of Y when the original X = 5.

IV. Multiple Linear Regression
Simple linear regression: A single outcome is predicted by a single independent variable.
Simple linear regression implies a 1D line in 2D space. Adding another predictor leads to a
3D point cloud.

Partial effects
In MLR, we want to examine the partial effects of the predictors.
 What is the effect of a predictor after controlling for some other set of variables?
This approach is crucial to controlling confounds and adequately modelling real-world
phenomena.
 For example, when investigating the effect of income on career satisfaction, we
might want to control for tenure.
 If we conducted experiments and randomization worked out well, we would not
need to control for confounds. But usually, we work with observational data and
not with experimental data.

 SLR asks: What is the effect of age on average blood pressure?

 MLR asks: What is the effect of BMI on average blood pressure, after controlling for age? (Or "above and beyond age", or "keeping age fixed".)
o We’re partialing age out of the effect of BMI on blood pressure.
o Interpretation:

 The expected average blood pressure for a patient with age 0 and BMI 0 (an "unborn patient with zero weight") is 52.25.
 For each additional year of age, average blood pressure is expected to increase by 0.29 points, after controlling for BMI.
 For each additional point of BMI, average blood pressure is expected to increase by 1.08 points, after controlling for age.

Centring – for improved interpretation

 The expected average blood pressure for a 30-year-old patient is 88.09.


 For each year older, average blood pressure is expected to increase by 0.35 points.

 The expected average blood pressure for a 30-year-old patient with a BMI of 25 is 87.85.
 For each year older, average blood pressure is expected to increase by 0.29 points, after
controlling for BMI.
 For each additional point of BMI, average blood pressure is expected to increase by 1.08
points, after controlling for age.
 According to the corresponding p-values, both partial predictors are significant, but we
cannot say anything about the two predictors combined.
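The two blood-pressure models discussed here could be fit as sketched below; the object names fit_age and fit_age_bmi are mine, while dDat, age30 (age minus 30), and bmi25 (BMI minus 25) are the variable names used later in these notes.

fit_age     <- lm(bp ~ age30, data = dDat)           # centred age only
fit_age_bmi <- lm(bp ~ age30 + bmi25, data = dDat)   # centred age + centred BMI
summary(fit_age_bmi)$coefficients                    # partial effects, SEs, and p-values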

Explained Variance
Multiple R2
How much variation in blood pressure is explained by the two models?

 Check the $R^2$ values.

About 11% of the variability in blood pressure is accounted for by the age-only model, and about 22% by the model with both age and BMI.

Significance Testing for R2


To judge whether our model explains a significant proportion of the variation in Y, we use an F statistic. The F statistic represents a ratio of two measures of variability:

$F = \dfrac{Var_{good}}{Var_{bad}} = \dfrac{Var_{effect}}{Var_{noise}} = \dfrac{Var_{model}}{Var_{error}}$

Computing the F-Statistic


 For MLR, the good variability is the variability in Y that is explained by our model:

Model sum of squares: $SSM = \sum_{n=1}^{N} (\hat{Y}_n - \bar{Y})^2$

Degrees of freedom for SSM (the number of predictor variables): $df_M = P$

Mean squares for the model: $MSM = \dfrac{SSM}{df_M}$

 The bad variability is the error variability (i.e., the variability in Y not explained by our model):

Sum of squared errors (i.e., residual sum of squares): $SSE = \sum_{n=1}^{N} (Y_n - \hat{Y}_n)^2$

Degrees of freedom for SSE: $df_E = N - P - 1$

Mean squared error: $MSE = \dfrac{SSE}{df_E}$

$F = \dfrac{SSM / df_M}{SSE / df_E} = \dfrac{MSM}{MSE}$

The shape of the F-stat's sampling distribution is defined by two numbers:

 The numerator degrees of freedom: $df_M = P$
 The denominator degrees of freedom: $df_E = N - P - 1$

How do we know if these F-statistics are big enough?

 Compare them to their sampling distributions to compute p-values.
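A sketch of how the overall F-test could be checked in R for the hypothetical fit_age_bmi model defined above:

fstat <- summary(fit_age_bmi)$fstatistic              # F value, df_M, df_E
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # p-value from the F distribution

## Or by hand from the sums of squares (assuming complete data in dDat):
ssm <- sum((fitted(fit_age_bmi) - mean(dDat$bp))^2)
sse <- sum(resid(fit_age_bmi)^2)
(ssm / 2) / (sse / (nrow(dDat) - 2 - 1))              # P = 2 predictors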

4. Model Comparison
 How do we quantify the additional variation explained by BMI, above and beyond age?
o We compute the $\Delta R^2$.

 How do we know if $\Delta R^2 = 0.115$ represents a significantly greater degree of explained variation?
o We use an F-test of the $\Delta R^2$.

 We can also compare models based on their prediction errors.


o For OLS regression, we usually compare MSE values.
o In this case, the MSE for the model with BMI included is smaller. We should
prefer the 2nd model.
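A brief R sketch of this comparison, using the hypothetical fit_age and fit_age_bmi objects from above:

summary(fit_age_bmi)$r.squared - summary(fit_age)$r.squared   # Delta R^2
anova(fit_age, fit_age_bmi)                                   # F-test of the Delta R^2
mean(resid(fit_age)^2); mean(resid(fit_age_bmi)^2)            # compare training MSEs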

5. Deriving the F-Statistic


General Definition of the F-Statistic

$F = \dfrac{MSM}{MSE}$

Although correct, the above definition is a special case that does not generalize in an obvious way. In general, the F-statistic is defined in terms of model comparisons:

$F = \dfrac{(E_R - E_F) / (df_R - df_F)}{E_F / df_F}$

 $E_R$ and $E_F$ are the errors (i.e., the SSEs) of the restricted and full models.
 $df_R$ and $df_F$ are the restricted and full degrees of freedom.

The Compared Models
The full model is always the model with more estimated parameters.
 The model with more predictor variables.
The restricted model is the model with fewer estimated parameters.
 The restricted model must be nested within the full model.
o If our full model is the one with age and BMI as predictors, then our restricted model cannot be one with cholesterol as a predictor, because cholesterol is not part of the full model.
When we use the F-statistic to test whether $R^2 > 0$ (i.e., when we are not explicitly comparing two models):
 The full model is our estimated model.
 The restricted model is an intercept-only model ($\hat{Y}_n = \bar{Y}$).
An intercept-only model assumes there is no relationship between the dependent variable and any independent variables; it only predicts the mean of the dependent variable: $\hat{Y} = \hat{\beta}_0 = \bar{Y}$.
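For illustration (a sketch reusing the hypothetical objects above), the overall F-test reported by summary() is equivalent to comparing the fitted model against an intercept-only model:

fit_null <- lm(bp ~ 1, data = dDat)   # intercept only: predicts Y-bar for everyone
anova(fit_null, fit_age_bmi)          # reproduces the overall F-test of fit_age_bmi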

Model Building
 We'll take $Y_{bp} = \beta_0 + \beta_1 X_{age30} + \varepsilon$ as our baseline model.
 Next we simultaneously add predictors of LDL and HDL cholesterol.

 Age, LDL, and HDL explain a combined 14.4% of the variation in blood pressure. That
proportion of variation is significantly greater than zero.
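A sketch of this step in R; out3 is presumably the object name used in the code further below, ldl100 and hdl60 are the centred cholesterol variables that appear there, and fit_age is the hypothetical baseline defined earlier:

out3 <- lm(bp ~ age30 + ldl100 + hdl60, data = dDat)
summary(out3)$r.squared    # about 0.144 according to the notes
anova(fit_age, out3)       # do LDL and HDL add explained variation beyond age?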

Continuing Model Comparison

 Adding LDL and HDL produces a model that explains 3.1% more variation in blood
pressure than a model with age as the only predictor.
 This increase in variance explained is significantly greater than zero.

 Adding LDL and HDL produces a model with lower prediction error (i.e., MSE = 163.4
vs. MSE = 169.4).

Further example:
So far we’ve established that age, LDL, and HDL are all significant predictors of average
blood pressure.
 We’ve also established that LDL and HDL, together, explain a significant amount of
additional variation, above and beyond age.
Next, we’ll add BMI to see what additional predictive role it can play above and beyond age
and cholesterol.

out4 <- lm(bp ~ age30 + ldl100 + hdl60 + bmi25, data = dDat)

1. BMI seems to have a pretty strong effect on average blood pressure, after controlling for
age and cholesterol levels.
 after controlling for BMI, cholesterol levels no longer seem to be important
predictors.
 Let’s take a look at what happens to the cholesterol effects when we add BMI.
2. How much additional variability in blood pressure is explained by BMI above and beyond
age and cholesterol levels?

r2.4 <- summary(out4)$r.squared
r2.4 - r2.3

## [1] 0.08595543

3. Is the additional 8.6% of explained variation a significant increase? Yes, as shown by:

anova(out3, out4)

4. What about the relative predictive performance?


mse2.4 <- MSE(y_pred = predict(out4), y_true = dDat$bp)
mse2.3
## [1] 163.3983
mse2.4
## [1] 146.9918
Maybe cholesterol levels are not important features once we’ve accounted for BMI. Try:

## Take out the cholesterol variables:

out5 <- lm(bp ~ age30 + bmi25, data = dDat)

1. How much explained variation did we lose by removing the LDL and HDL variables?

r2.5 <- summary(out5)$r.squared
r2.4 - r2.5

## [1] 0.002330906

2. Is it significant? No; the small loss in explained variation is not significant.

3. How do the prediction errors compare?
mse2.5 <- MSE(y_pred = predict(out5), y_true = dDat$bp)
mse2.4
## [1] 146.9918
mse2.5
## [1] 147.4367

QUESTION: Can we compare models 3 and 5 using the change-in-$R^2$ test?


ANSWER: No! The restricted model must be nested within the full model.

Thinking about Partialing: Partial effect of a specific predictor, above and beyond other
predictors

Recall the definition of the residual $\hat{\varepsilon}$:

$\hat{\varepsilon}_n = Y_n - \hat{\beta}_0 - \hat{\beta}_1 X_n$

 The $\hat{\varepsilon}$ are the "leftover" part of Y (i.e., the residue).
 The $\hat{\varepsilon}$ contain all the information in Y that is not (linearly) associated with X.
o The $\hat{\varepsilon}$ are (linearly) independent of X.
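A brief sketch of this idea in R (the Frisch-Waugh view of a partial coefficient), using the hypothetical blood-pressure variables from above:

## Partial age out of both bp and BMI, then regress residuals on residuals:
r_y <- resid(lm(bp ~ age30, data = dDat))      # part of bp not explained by age
r_x <- resid(lm(bmi25 ~ age30, data = dDat))   # part of BMI not explained by age
coef(lm(r_y ~ r_x))["r_x"]                     # equals the partial BMI slope ...
coef(lm(bp ~ age30 + bmi25, data = dDat))["bmi25"]   # ... from the two-predictor model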

V. Conclusions & Questions
Conclusion
1. Each variable in a regression model corresponds to a dimension in the data-space.
 A regression model with P predictors implies a P-dimensional (hyper)plane in (P + 1)-dimensional space.
2. The coefficients in MLR are partial coefficients.
 Each effect is interpreted as controlling for other predictors.

Question 1
Imagine we regress the outcome variable “satisfaction with life” (swl) on the four
predictor variables “age”, “income”, “number of facebook friends” (nr_fb_friend), and
“weekly working hours”(wwh):
lm(swl ~ age + income + nr_fb_friend + wwh, data = dDat)
Which of the following models would be nested in this "full" model? (You can select more than one answer.)
a) lm(swl ~ age + wwh + nr_pets, data = dDat)
b) lm(swl ~ age + nr_fb_friend, data = dDat)
c) lm(swl ~ income + nr_fb_friend + wwh, data = dDat)
d) lm(swl ~ age + income + nr_fb_friend, data = dDat)

Question 2
Consider the following output from a regression of blood pressure on age.

True or false: Based on the output, the effect of age on blood pressure is significant.
a) True
b) False

Question 3
Consider the following output from a regression of blood pressure on age and bmi.

True or false: Based on the output, the effect of age on blood pressure is significant.
a) True
b) False

20230912
