Linear regression

• Linear regression is a simple approach to supervised
  learning. It assumes that the dependence of Y on
  X1, X2, . . . , Xp is linear:

      η(x) = β0 + β1 x1 + β2 x2 + · · · + βp xp.

  It is almost always thought of as an approximation to the
  truth; functions in nature are rarely linear.
• True regression functions are never linear!

[Figure: a nonlinear true regression function f(X) plotted against X, together with a linear approximation.]

• Although it may seem overly simplistic, linear regression is
  extremely useful both conceptually and practically.
1 / 48
Linear regression for the advertising data

Consider the advertising data shown on the next slide.


Questions we might ask:
• Is there a relationship between advertising budget and
sales?
• How strong is the relationship between advertising budget
and sales?
• Which media contribute to sales?
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?

2 / 48
Advertising data
[Figure: Sales plotted against TV, Radio, and Newspaper budgets for the Advertising data, one panel per medium, each with a least squares fit line.]

3 / 48
Simple linear regression using a single predictor X.

• We assume a model

      Y = β0 + β1 X + ε,

  where β0 and β1 are two unknown constants that represent
  the intercept and slope, also known as coefficients or
  parameters, and ε is the error term.
• Given some estimates β̂0 and β̂1 for the model coefficients,
  we predict future sales using

      ŷ = β̂0 + β̂1 x,

  where ŷ indicates a prediction of Y on the basis of X = x.
  The hat symbol denotes an estimated value.

4 / 48
Estimation of the parameters by least squares
• Let ŷi = β̂0 + β̂1 xi be the prediction for Y based on the ith
value of X. Then ei = yi − ŷi represents the ith residual
• We define the residual sum of squares (RSS) as

      RSS = e1² + e2² + · · · + en²,

  or equivalently as

      RSS = (y1 − β̂0 − β̂1 x1)² + (y2 − β̂0 − β̂1 x2)² + · · · + (yn − β̂0 − β̂1 xn)².

• The least squares approach chooses β̂0 and β̂1 to minimize
  the RSS. The minimizing values can be shown to be

      β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,
      β̂0 = ȳ − β̂1 x̄,

  where the sums run over i = 1, . . . , n, and ȳ = (1/n) Σ yi and
  x̄ = (1/n) Σ xi are the sample means.
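In code, these estimates are one line each. A minimal sketch in Python/NumPy (not from the slides; names are illustrative):

    import numpy as np

    def least_squares_fit(x, y):
        # Simple linear regression estimates from the formulas above.
        # Assumes x and y are 1-D NumPy arrays of equal length.
        x_bar, y_bar = x.mean(), y.mean()
        beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
        beta0_hat = y_bar - beta1_hat * x_bar
        return beta0_hat, beta1_hat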
5 / 48
Example: advertising data

[Figure 3.1: Sales versus TV for the Advertising data, with the least squares line; each grey line segment represents a residual.]

The least squares fit for the regression of sales onto TV. The
fit is found by minimizing the sum of squared errors; here
β̂0 = 7.03 and β̂1 = 0.0475. In this case a linear fit captures
the essence of the relationship, although it is somewhat
deficient in the left of the plot.
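As a quick worked example (units as in the textbook: budgets in thousands of dollars, sales in thousands of units), a TV budget of 100 gives a predicted value of ŷ = 7.03 + 0.0475 × 100 ≈ 11.78, i.e. roughly 11,780 units sold.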
6 / 48
Assessing the Accuracy of the Coefficient Estimates

• The standard error of an estimator reflects how it varies
  under repeated sampling. We have

      SE(β̂1)² = σ² / Σ (xi − x̄)²,
      SE(β̂0)² = σ² [ 1/n + x̄² / Σ (xi − x̄)² ],

  where σ² = Var(ε) and the sums run over i = 1, . . . , n.
• These standard errors can be used to compute confidence
intervals. A 95% confidence interval is defined as a range of
values such that with 95% probability, the range will
contain the true unknown value of the parameter. It has
the form
β̂1 ± 2 · SE(β̂1 ).
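A sketch of these formulas in NumPy, continuing least_squares_fit from slide 5 (σ² is unknown in practice, so it is estimated here by RSS/(n − 2), the square of the RSE defined later on slide 12):

    def coef_standard_errors(x, y, beta0_hat, beta1_hat):
        # Standard errors of the least squares estimates, per the formulas above.
        n = len(x)
        residuals = y - (beta0_hat + beta1_hat * x)
        sigma2_hat = np.sum(residuals ** 2) / (n - 2)   # estimate of Var(eps)
        sxx = np.sum((x - x.mean()) ** 2)
        se_beta1 = np.sqrt(sigma2_hat / sxx)
        se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x.mean() ** 2 / sxx))
        return se_beta0, se_beta1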

7 / 48
Confidence intervals — continued

That is, there is approximately a 95% chance that the interval

      [ β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1) ]

will contain the true value of β1 (under a scenario where we
got repeated samples like the present sample).
For the advertising data, the 95% confidence interval for β1 is
[0.042, 0.053]
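Using the two sketches above, this interval can be reproduced in a few lines (tv and sales stand for the advertising data arrays, assumed already loaded):

    beta0_hat, beta1_hat = least_squares_fit(tv, sales)
    _, se_beta1 = coef_standard_errors(tv, sales, beta0_hat, beta1_hat)
    ci = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
    # For the advertising data this gives approximately (0.042, 0.053).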

8 / 48
Hypothesis testing
• Standard errors can also be used to perform hypothesis
tests on the coefficients. The most common hypothesis test
involves testing the null hypothesis of
H0 : There is no relationship between X and Y
versus the alternative hypothesis
HA : There is some relationship between X and Y .

• Mathematically, this corresponds to testing

H0 : β1 = 0
versus
HA : β1 ≠ 0,
since if β1 = 0 then the model reduces to Y = β0 + , and
X is not associated with Y .
9 / 48
Hypothesis testing — continued

• To test the null hypothesis, we compute a t-statistic, given by

      t = (β̂1 − 0) / SE(β̂1).
• This will have a t-distribution with n − 2 degrees of
freedom, assuming β1 = 0.
• Using statistical software, it is easy to compute the
probability of observing any value equal to |t| or larger. We
call this probability the p-value.
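A sketch of this computation, reusing the earlier helpers (SciPy's t distribution supplies the tail probability):

    from scipy import stats

    t_stat = beta1_hat / se_beta1                          # (beta1_hat - 0) / SE
    p_value = 2 * stats.t.sf(abs(t_stat), df=len(tv) - 2)  # two-sided p-value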

10 / 48
Results for the advertising data

            Coefficient   Std. Error   t-statistic   p-value
Intercept        7.0325       0.4578         15.36   < 0.0001
TV               0.0475       0.0027         17.67   < 0.0001

11 / 48
Assessing the Overall Accuracy of the Model
• We compute the Residual Standard Error

      RSE = √( RSS / (n − 2) ) = √( (1/(n − 2)) Σ (yi − ŷi)² ),

  where the residual sum-of-squares is RSS = Σ (yi − ŷi)².
• R-squared, or the fraction of variance explained, is

      R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

  where TSS = Σ (yi − ȳ)² is the total sum of squares.
• It can be shown that in this simple linear regression setting
  R² = r², where r is the correlation between X and Y:

      r = Σ (xi − x̄)(yi − ȳ) / ( √(Σ (xi − x̄)²) · √(Σ (yi − ȳ)²) ),

  with all sums running over i = 1, . . . , n.
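These quantities are again short NumPy expressions (a sketch consistent with the helpers above):

    def rse_and_r2(x, y, beta0_hat, beta1_hat):
        # Residual standard error and fraction of variance explained.
        y_hat = beta0_hat + beta1_hat * x
        rss = np.sum((y - y_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        rse = np.sqrt(rss / (len(x) - 2))
        r2 = 1.0 - rss / tss
        return rse, r2

    # Sanity check of R² = r²: np.corrcoef(x, y)[0, 1] ** 2 should match r2.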

12 / 48
Advertising data results

Quantity                   Value
Residual Standard Error     3.26
R²                          0.612
F-statistic               312.1

13 / 48
Multiple Linear Regression

• Here our model is

      Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε.

• We interpret βj as the average effect on Y of a one unit
  increase in Xj, holding all other predictors fixed. In the
  advertising example, the model becomes

      sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε.

14 / 48
Interpreting regression coefficients

• The ideal scenario is when the predictors are uncorrelated
  — a balanced design:
- Each coefficient can be estimated and tested separately.
- Interpretations such as “a unit change in Xj is associated
with a βj change in Y , while all the other variables stay
fixed”, are possible.
• Correlations amongst predictors cause problems:
- The variance of all coefficients tends to increase, sometimes
  dramatically.
- Interpretations become hazardous — when Xj changes,
everything else changes.
• Claims of causality should be avoided for observational
data.

15 / 48
The woes of (interpreting) regression coefficients

“Data Analysis and Regression” Mosteller and Tukey 1977


• a regression coefficient βj estimates the expected change in
Y per unit change in Xj , with all other predictors held
fixed. But predictors usually change together!
• Example: Y = total amount of change in your pocket;
  X1 = # of coins; X2 = # of pennies, nickels and dimes. By
  itself, the regression coefficient of Y on X2 will be > 0. But
  how about with X1 in the model?
• Y = number of tackles by a football player in a season; W
  and H are his weight and height. The fitted regression model
  is Ŷ = b0 + .50W − .10H. How do we interpret β̂2 < 0, the
  negative coefficient on height?

16 / 48
Two quotes by famous Statisticians

“Essentially, all models are wrong, but some are useful”


George Box

“The only way to find out what will happen when a complex
system is disturbed is to disturb the system, not merely to
observe it passively”
Fred Mosteller and John Tukey, paraphrasing George Box

17 / 48
Estimation and Prediction for Multiple Regression
• Given estimates β̂0 , β̂1 , . . . β̂p , we can make predictions
using the formula

ŷ = β̂0 + β̂1 x1 + β̂2 x2 + · · · + β̂p xp .

• We estimate β0, β1, . . . , βp as the values that minimize the
  sum of squared residuals

      RSS = Σ (yi − ŷi)²
          = Σ (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − · · · − β̂p xip)²,

  with sums over i = 1, . . . , n. This is done using standard
  statistical software. The values β̂0, β̂1, . . . , β̂p that minimize
  RSS are the multiple least squares regression coefficient
  estimates.
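A minimal NumPy sketch of this fit (X is assumed to be an n × p predictor matrix; production analyses would normally use a statistics library instead):

    def multiple_least_squares(X, y):
        # Minimize RSS over (beta0, ..., betap) with a linear least squares solver.
        X1 = np.column_stack([np.ones(len(y)), X])   # prepend a column of 1s for the intercept
        beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return beta_hat                              # [beta0_hat, beta1_hat, ..., betap_hat]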
18 / 48
[Figure: the least squares regression plane over two predictors X1 and X2, fit to a cloud of observations.]
19 / 48
Results for advertising data

            Coefficient   Std. Error   t-statistic   p-value
Intercept         2.939       0.3119          9.42   < 0.0001
TV                0.046       0.0014         32.81   < 0.0001
radio             0.189       0.0086         21.89   < 0.0001
newspaper        -0.001       0.0059         -0.18     0.8599

Correlations:
            TV      radio   newspaper  sales
TV          1.0000  0.0548  0.0567     0.7822
radio               1.0000  0.3541     0.5762
newspaper                   1.0000     0.2283
sales                                  1.0000

20 / 48
