L11 LinearRegression

The document discusses the Pfizer COVID-19 vaccine trial, detailing the enrollment of 43,548 participants and the calculation of vaccine efficacy using risk ratios and p-values. It also introduces linear regression as a method for modeling relationships between variables, specifically focusing on predicting a daughter's height based on her mother's height. The document outlines the least squares method for fitting a linear model and the probabilistic approach to linear regression using maximum likelihood estimation.


Linear Regression

Foundations of Data Analysis

March 3, 2022
Pfizer COVID-19 Vaccine Trial

Pfizer enrolled 43,548 participants; half received the vaccine, half received a placebo.¹

Of the 18,508 completing vaccination, 9 got COVID-19.

Of the 18,435 completing placebo, 169 got COVID-19.

Was the vaccine effective?

¹ https://www.nejm.org/doi/full/10.1056/NEJMoa2034577
Pfizer COVID-19 Vaccine Trial

Risk ratio:

$$\mathrm{RR} = \frac{\text{risk of COVID-19 with vaccine}}{\text{risk of COVID-19 with placebo}} = \frac{9/18508}{169/18435} \approx 0.053$$

Vaccine Efficacy $= 1 - \mathrm{RR} \approx 0.947$
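The risk-ratio arithmetic above can be checked with a short script (a minimal sketch; the variable names are ours, not from the trial report):

```python
# Compute the risk ratio and vaccine efficacy from the trial counts.
cases_vaccine, n_vaccine = 9, 18508
cases_placebo, n_placebo = 169, 18435

risk_vaccine = cases_vaccine / n_vaccine
risk_placebo = cases_placebo / n_placebo

rr = risk_vaccine / risk_placebo   # risk ratio
efficacy = 1 - rr                  # vaccine efficacy

print(f"RR = {rr:.3f}, efficacy = {efficacy:.3f}")  # RR = 0.053, efficacy = 0.947
```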


Pfizer COVID-19 Vaccine Trial
Contingency table:

             Vaccine   Placebo
  Positive         9       169
  Negative    18,499    18,266
Using the hypergeometric probability, p(k), the p-value is:

$$P(X \le 9) = \sum_{k=0}^{9} p(k) < 2 \times 10^{-16}$$

This is the probability of a result this extreme, or more so, arising by random chance if the vaccine were not effective.
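The hypergeometric tail can be computed exactly with the standard library (a sketch under the null hypothesis that cases fall into the two groups at random; the variable names are ours):

```python
from math import comb

# Totals from the contingency table
M = 18508 + 18435   # participants completing the trial
K = 9 + 169         # total COVID-19 cases
n = 18508           # size of the vaccine group

# P(X <= 9): probability that at most 9 of the K cases land in the
# vaccine group purely by chance (hypergeometric tail probability).
# Summing exact integer numerators first, then dividing once, avoids
# intermediate rounding.
denom = comb(M, n)
p_value = sum(comb(K, k) * comb(M - K, n - k) for k in range(10)) / denom

print(p_value < 2e-16)   # True, matching the slide's bound
```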
[Diagram: algebra, geometry, and statistics — the three viewpoints on linear regression]
Is there a relationship between the heights of mothers
and their daughters?
If you know a mother’s height, can you predict her
daughter’s height with any accuracy?
Linear regression is a tool for answering these types of
questions.
It models the relationship as a straight line.
Regression Setup

When we are given real-valued data in pairs:

$$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) \in \mathbb{R}^2$$

Example:
$x_i$ is the height of the $i$th mother
$y_i$ is the height of the $i$th mother's daughter
Linear Regression
Model the data as a line:

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

$\alpha$ : intercept
$\beta$ : slope
$\varepsilon_i$ : error

[Figure: scatter plot of the points $(x_i, y_i)$ with a fitted line; the vertical offset of each point from the line is $\varepsilon_i$]
Geometry: Least Squares
We want to fit a line as close to the data as possible, which means we want to minimize the errors, $\varepsilon_i$.
Geometry: Least Squares

Taking the line equation: $y_i = \alpha + \beta x_i + \varepsilon_i$

Rearrange to get: $\varepsilon_i = y_i - \alpha - \beta x_i$

We want to minimize the sum-of-squared errors (SSE):

$$\mathrm{SSE}(\alpha, \beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
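The SSE objective translates directly into a function (a minimal sketch; the toy data are made up for illustration):

```python
def sse(alpha, beta, xs, ys):
    # Sum of squared errors for the line y = alpha + beta * x
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

# Toy data lying exactly on y = 1 + 2x: SSE is zero at (alpha, beta) = (1, 2)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(sse(1.0, 2.0, xs, ys))   # 0.0
print(sse(0.0, 2.0, xs, ys))   # 4.0  (each of the four residuals is 1)
```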
Least Squares: Step 1

Center the data by removing the mean:

$$\tilde{y}_i = y_i - \bar{y}, \qquad \tilde{x}_i = x_i - \bar{x}$$

Note: $\sum_{i=1}^{n} \tilde{y}_i = 0$ and $\sum_{i=1}^{n} \tilde{x}_i = 0$

We'll first get a solution $\tilde{y} = \alpha + \beta\tilde{x}$, then shift it back to the original (uncentered) data at the end.
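Centering in code, checking that both centered sums vanish (a sketch; the heights are illustrative values, not data from the slides):

```python
xs = [60.0, 62.0, 64.0, 66.0, 68.0]   # e.g. mothers' heights (inches)
ys = [61.0, 63.0, 63.0, 67.0, 66.0]   # daughters' heights

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

x_tilde = [x - x_bar for x in xs]     # centered x
y_tilde = [y - y_bar for y in ys]     # centered y

# Both centered sums are (numerically) zero
print(abs(sum(x_tilde)) < 1e-12, abs(sum(y_tilde)) < 1e-12)
```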
Least Squares: Step 2
Take the derivative of $\mathrm{SSE}(\alpha, \beta)$ with respect to $\alpha$ and set it to zero:

$$0 = \frac{\partial}{\partial \alpha} \mathrm{SSE}(\alpha, \beta) = \frac{\partial}{\partial \alpha} \sum_{i=1}^{n} (\tilde{y}_i - \alpha - \beta\tilde{x}_i)^2$$
$$= -2 \sum_{i=1}^{n} (\tilde{y}_i - \alpha - \beta\tilde{x}_i)$$
$$= -2 \sum_{i=1}^{n} \tilde{y}_i + 2n\alpha + 2\beta \sum_{i=1}^{n} \tilde{x}_i$$

Using $\sum \tilde{y}_i = \sum \tilde{x}_i = 0$, we get
$$\hat{\alpha} = 0$$
Least Squares: Step 3

With $\alpha = 0$, we are left with

$$\tilde{y}_i = \beta\tilde{x}_i + \varepsilon_i$$

Or, in vector notation:

$$\begin{pmatrix} \tilde{y}_1 \\ \tilde{y}_2 \\ \vdots \\ \tilde{y}_n \end{pmatrix} = \beta \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \\ \vdots \\ \tilde{x}_n \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
Least Squares: Step 3

In vector notation, $\tilde{y} = \beta\tilde{x} + \varepsilon$, so minimizing $\mathrm{SSE}(\alpha, \beta) = \sum \varepsilon_i^2 = \|\varepsilon\|^2$ is projection!

The solution is
$$\hat{\beta} = \frac{\langle \tilde{x}, \tilde{y} \rangle}{\|\tilde{x}\|^2}$$

[Figure: the vector $\tilde{y}$ projected orthogonally onto the line spanned by $\tilde{x}$; the projection is $\hat{\beta}\tilde{x}$]
Shifting Back to Uncentered Data

So far, we have:
$$\tilde{y}_i = \hat{\beta}\tilde{x}_i + \varepsilon_i$$
Expanding out $\tilde{x}_i$ and $\tilde{y}_i$ gives

$$(y_i - \bar{y}) = \hat{\beta}(x_i - \bar{x}) + \varepsilon_i$$

Rearranging gives

$$y_i = (\bar{y} - \hat{\beta}\bar{x}) + \hat{\beta}x_i + \varepsilon_i$$

So, for the uncentered data, $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$
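Putting the three steps together — center, compute the slope as the projection coefficient, then shift back for the intercept (a sketch; `fit_line` is our helper name, and the data are illustrative):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = alpha + beta * x via centering + projection."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    x_t = [x - x_bar for x in xs]   # centered x
    y_t = [y - y_bar for y in ys]   # centered y
    # beta_hat = <x_tilde, y_tilde> / ||x_tilde||^2
    beta = sum(a * b for a, b in zip(x_t, y_t)) / sum(a * a for a in x_t)
    # Shift back to the uncentered data: alpha_hat = y_bar - beta_hat * x_bar
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Data on an exact line y = 2 + 0.5 x is recovered exactly
alpha, beta = fit_line([1.0, 2.0, 3.0, 4.0], [2.5, 3.0, 3.5, 4.0])
print(alpha, beta)   # 2.0 0.5
```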


Probability: Maximum Likelihood
So far, we have only used geometry, but if our data is random, shouldn't we be talking about probability?

To make linear regression probabilistic, we model the errors as Gaussian:

$$\varepsilon_i \sim N(0, \sigma^2)$$

The likelihood is

$$L(\alpha, \beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\varepsilon_i^2}{2\sigma^2} \right)$$
Probability: Maximum Likelihood

The log-likelihood is then

$$\log L(\alpha, \beta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \varepsilon_i^2 + \text{const.}$$

Maximizing this is equivalent to minimizing SSE!

$$\max_{\alpha,\beta} \log L = \min_{\alpha,\beta} \sum \varepsilon_i^2 = \min_{\alpha,\beta} \mathrm{SSE}$$
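A numeric check of the equivalence: because the Gaussian log-likelihood is a constant minus a positive multiple of SSE, the grid point maximizing log L is exactly the one minimizing SSE (a sketch with made-up noisy data; σ is fixed at 1, which does not affect the argmax):

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.9, 3.1, 4.9, 7.2, 8.9]   # roughly y = 1 + 2x with small noise

def sse(a, b):
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

def log_lik(a, b, sigma=1.0):
    # Gaussian log-likelihood: constant term minus SSE / (2 sigma^2)
    n = len(xs)
    return -n / 2 * math.log(2 * math.pi * sigma ** 2) - sse(a, b) / (2 * sigma ** 2)

# Search a coarse grid of (alpha, beta) pairs
grid = [(a / 10, b / 10) for a in range(-20, 21) for b in range(0, 41)]
best_ml = max(grid, key=lambda ab: log_lik(*ab))   # maximize log-likelihood
best_ls = min(grid, key=lambda ab: sse(*ab))       # minimize SSE
print(best_ml == best_ls)   # True: same optimizer
```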
