
BES - Lecture 10 - Simple Linear Regression



Lecture 10:
Simple Linear Regression
and Correlation

Outline

• Simple linear regression model


• Required conditions for the model
• Model assessment
• Confidence and prediction intervals
• Model diagnostics
• Outliers

Introduction

• In this lecture, we employ regression analysis to examine the relationship between quantitative variables.
• The technique is used to predict the value of one variable (the dependent variable – y) based on the values of other variables (the independent variables x1, x2, …, xk). In simple linear regression there is a single independent variable, x.


Predicting home prices


Recent family home sales in San Antonio provided the data displayed (partly) on the next slide (San Antonio Realty Watch website, November 2008). We wish to predict home prices using square footage.

Part of the data

What is the dependent variable?


What is the independent variable?

Linear relationship?


The model
• The first-order linear model or simple linear regression model is

    y = β₀ + β₁x + ε

where
    y = dependent variable
    x = independent variable
    β₀ = y-intercept
    β₁ = slope of the line (rise/run)
    ε = error variable

• β₀ and β₁ are unknown, therefore they are estimated from the data.

Least squares method


• Estimates of the coefficients are determined by
– drawing a sample from the population of interest
– calculating sample statistics
– producing a straight line that cuts through the data.

The question is: Which straight line fits best?

[Scatter plot: many different straight lines could be drawn through the same cloud of points.]

Least squares method

The best line is the one that minimises the sum of squared vertical differences between the points and the line.

Let us compare two lines through the four points (1, 2), (2, 4), (3, 1.5) and (4, 3.2). The first line is ŷ = x; the second line is horizontal, ŷ = 2.5.

Sum of squared differences (line 1) = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Sum of squared differences (line 2) = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
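As a quick check, a minimal R sketch that reproduces both sums of squared differences for these four points:

  # The four sample points
  x <- c(1, 2, 3, 4)
  y <- c(2, 4, 1.5, 3.2)

  # Line 1: y-hat = x;  Line 2: the horizontal line y-hat = 2.5
  sum((y - x)^2)    # 7.89
  sum((y - 2.5)^2)  # 3.99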


Least squares method

The least squares line is

    ŷ = b₀ + b₁x,   with   b₁ = s_xy / s_x²   and   b₀ = ȳ − b₁x̄,

where s_xy is the sample covariance of x and y, and s_x² is the sample variance of x.
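A sketch of these formulas in R, assuming a data frame homes with columns SqFt and Price (hypothetical names, not from the slides):

  # Least squares estimates computed directly from the formulas above
  b1 <- cov(homes$SqFt, homes$Price) / var(homes$SqFt)  # b1 = s_xy / s_x^2
  b0 <- mean(homes$Price) - b1 * mean(homes$SqFt)       # b0 = y-bar - b1 * x-bar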

R output for linear model

[Slide shows the R summary output for the fitted home-prices model.]
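Output of that form comes from code like the following sketch, continuing the hypothetical homes data frame:

  # Fit the simple linear regression of price on square footage
  model <- lm(Price ~ SqFt, data = homes)
  summary(model)  # coefficients, standard errors, t-tests, R-squared, s_e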

Interpreting estimated parameters

[Slide interprets the estimates: b₁ is the estimated change in mean price per additional square foot; b₀ is the intercept, the estimated price at zero square footage, which has no practical meaning here.]


Error variable: Required conditions


• The error ε is a critical part of the regression model.
• Five requirements involving the distribution of ε must be satisfied:
– The mean of ε is zero: E(ε) = 0.
– The standard deviation of ε is a constant (σ_ε) for all values of x.
– The errors are independent.
– The errors are independent of the independent variable x.
– The probability distribution of ε is normal.

Error variable: Required conditions

[Figure: the conditional distributions of y at x₁, x₂ and x₃, each centred at E(y|x) = β₀ + β₁x. The standard deviation remains constant, but the mean value changes with x.]

From the first three assumptions we have: y is normally distributed with mean E(y) = β₀ + β₁x and a constant standard deviation σ_ε.

Assessing the model


• The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
• Consequently, it is important to assess how
well the linear model fits the data.
• Several methods are used to assess the
model:
– using descriptive measurements
– testing and/or estimating the coefficients



Sum of squares for errors

– The sum of squares for errors is

    SSE = Σ(yᵢ − ŷᵢ)²,

the sum of the squared vertical differences between the points and the regression line.
– It can serve as a measure of how well the line fits the data.
– This statistic plays a role in every statistical technique we employ to assess the model.

Standard error of estimate

– The mean error is equal to zero.
– If σ_ε is small, the errors tend to be close to zero (close to the mean error), and the model fits the data well.
– Therefore we can use s_ε as a measure of the suitability of using a linear model.
– An unbiased estimator of σ_ε² is given by

    s_ε² = SSE / (n − 2),

and the standard error of estimate is s_ε = √(s_ε²).
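In R, s_ε is reported as the "Residual standard error" line of summary(model); a minimal sketch using the model fitted earlier:

  sigma(model)          # standard error of estimate, s_e
  summary(model)$sigma  # equivalent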

Home Prices example (cont.)

Read the standard error of estimate from the R output and describe what it tells you about the model fit.

Note: The sample mean price is $120,270; comparing s_ε with this mean helps to judge whether the standard error is small.


Coefficient of determination
• When we want to measure the strength of
the linear relationship, we use the
coefficient of determination.


Coefficient of determination

[Figure: the overall variability in y is explained in part by the regression model and remains, in part, unexplained (the error).]

Coefficient of determination

[Figure: two data points (x₁, y₁) and (x₂, y₂) of a certain sample, illustrating the decomposition below.]

Total variation in y = variation explained by the regression line + unexplained variation (error)


Coefficient of determination
• R² measures the proportion of the variation in y that is explained by the variation in x:

    R² = SSR / SST = 1 − SSE / SST,   where SST = total variation in y = SSR + SSE.

• R² takes on any value between zero and one.
– R² = 1: perfect match between the line and the data points.
– R² = 0: there is no linear relationship between x and y.
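In R, R² can be read from the fitted model object (hypothetical model from earlier):

  summary(model)$r.squared  # proportion of variation in y explained by x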

Home prices example (cont.)


Find the coefficient of determination. What does
this statistic tell you about the model?

57% of the variation in the home prices is explained by


the variation in square footage. The rest (43%) remains
unexplained by this model.


Testing the slope


When no relationship exists between two variables,
the regression line should be horizontal.

[Two scatter plots:]

Relationship: different inputs (x) yield different outputs (y); the slope is not equal to zero.
No relationship: different inputs (x) yield the same output (y); the slope is equal to zero.


Testing the slope

H₀: β₁ = 0
H_A: β₁ ≠ 0 (or < 0, or > 0)

– The test statistic is

    t = b₁ / s_b₁,   where the standard error of b₁ is   s_b₁ = s_ε / √((n − 1)s_x²).

– If the error variable is normally distributed, the statistic is Student t-distributed with d.f. = n − 2.

Testing the slope using the R output

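As a sketch, the slope test is read from the coefficients table of summary() (hypothetical homes model from earlier):

  # The 'SqFt' row gives b1, its standard error s_b1, the t statistic,
  # and the two-sided p-value for H0: beta1 = 0
  coef(summary(model))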

Coefficient of correlation

• The coefficient of correlation is used to measure the


strength of a linear association between two variables.
• The coefficient values range between –1 and 1.
– If r = –1 (perfect negative linear association) or r =
+1 (perfect positive linear association): every point
falls on the regression line.
– If r = 0: there is no linear association.
• The coefficient can be used to test for linear
relationships between two variables.



Testing the coefficient of correlation


– When there is no linear relationship between two variables, ρ = 0.
– The hypotheses are:
  H₀: ρ = 0
  H_A: ρ ≠ 0
– The test statistic is

    t = r √((n − 2) / (1 − r²)).

The statistic is Student t-distributed with d.f. = n − 2, provided the variables are bivariate normally distributed.

Home prices example (cont.)


Test the coefficient of correlation to determine if a
linear relationship exists. R output is provided below:

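A command of the following form generates output of that kind (hypothetical column names, as before):

  # Pearson correlation test; the t statistic has n - 2 d.f.
  cor.test(homes$SqFt, homes$Price)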

Using the Regression equation


• Before using the regression model, we need
to assess how well it fits the data.
• If we are satisfied with how well the model fits
the data and the model assumptions are
satisfied, we can use it to make predictions for
y.
Example
– Predict the price of a home with square footage =
1500

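A sketch of this point prediction in R (hypothetical names as before):

  predict(model, newdata = data.frame(SqFt = 1500))  # predicted price at 1500 sq ft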


Prediction interval and confidence interval

• Two intervals can be used to discover how closely the predicted value will match the true value of y:
– prediction interval – for a particular value of y
– confidence interval – for the expected value of y.

The prediction interval:

    ŷ ± t(α/2, n−2) · s_ε · √(1 + 1/n + (x_g − x̄)² / ((n − 1)s_x²))

The confidence interval:

    ŷ ± t(α/2, n−2) · s_ε · √(1/n + (x_g − x̄)² / ((n − 1)s_x²))

The prediction interval is wider than the confidence interval: it carries an extra 1 under the square root for the variation of an individual y around its mean.

Home Prices example (cont.)

– Provide a 95% prediction interval for the price of a home with square footage = 1500.
– R output:

Home Prices example (cont.)

– Provide a 95% confidence interval estimate for the mean price of homes with square footage = 1500.
– R output:
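Both intervals come from predict(); a minimal sketch with the hypothetical names used earlier:

  new_home <- data.frame(SqFt = 1500)
  predict(model, new_home, interval = "prediction", level = 0.95)  # individual price
  predict(model, new_home, interval = "confidence", level = 0.95)  # mean price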


The effect of the given value of x on the intervals

– As x_g moves away from x̄, the interval becomes longer. That is, the shortest interval is found at x_g = x̄.

[Figure: confidence intervals at three values of x_g, widening as x_g moves away from x̄.]

Regression diagnostics
• The three important conditions required for
the validity of the regression analysis are:
– The error variable is normally distributed.
– The error variance is constant for all values of x.
– The errors are independent of each other.
• How can we diagnose violations of these
conditions?


Regression diagnostics
• By examining the residuals (or standardized residuals), we can identify violations of the required conditions.

• For the details → self-study (read the textbooks for guidelines). We will give some examples in the next few slides; a sketch of the basic residual plots follows below.
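A minimal sketch of the standard residual plots in R (model as before):

  par(mfrow = c(1, 2))
  # Residuals vs fitted values: check the constant-variance condition
  plot(fitted(model), resid(model), xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  # Normal Q-Q plot of the residuals: check the normality condition
  qqnorm(resid(model)); qqline(resid(model))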


Example: Heteroscedasticity
When the requirement of a constant variance is violated, we have heteroscedasticity.

[Figure: residuals plotted against ŷ; the spread increases with ŷ.]

Example: Homoscedasticity
When the requirement of a constant variance is not violated, we have homoscedasticity.

[Figure: residuals plotted against ŷ; the spread of the data points does not change much.]

Example: Homoscedasticity
When the requirement of a constant variance is not violated, we have homoscedasticity.

[Figure: another residual plot against ŷ; as far as the even spread is concerned, this is a much better situation.]


Example: Non-independence of the error variable

• Data collected over time constitute a time series.
• If the errors are independent, no pattern should be observed when the residuals are examined over time.
• When a pattern is detected, the errors are said to be auto-correlated.
• Autocorrelation can be detected by graphing the residuals against time, as sketched below.
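A sketch of such a plot in R, assuming the observations are stored in time order:

  # Residuals in time order; look for runs or oscillation around zero
  plot(resid(model), type = "b", xlab = "Time", ylab = "Residual")
  abline(h = 0, lty = 2)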

Patterns in the appearance of the residuals over time indicate that autocorrelation exists.

[Two residual-versus-time plots: the first shows runs of positive residuals replaced by runs of negative residuals; the second shows oscillating behaviour of the residuals around zero.]

Outliers
• An outlier is an observation that is unusually
small or large.
• Several possibilities need to be investigated when
an outlier is observed:
– There was an error in recording the value.
– The point does not belong in the sample.
– The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect that an observation is an outlier if its |standardized residual| > 2 (see the sketch below).
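In R, standardized residuals are available via rstandard(); a minimal sketch:

  std_res <- rstandard(model)  # standardized residuals
  which(abs(std_res) > 2)      # observations to investigate as possible outliers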


An outlier and an influential observation

[Two scatter plots: the outlier causes a shift in the regression line, but some outliers may be very influential.]

Procedure for simple linear regression analysis
– Develop a model that has a theoretical basis.
– Gather data for the two variables in the model.
– Draw the scatter diagram to determine whether a linear
model appears to be appropriate.
– Check the required conditions for the errors.
– Assess the model fit.
– If the model fits the data and the assumptions are
satisfied, use the regression equation.


Summary

• Simple linear regression model


• Required conditions for the model
• Model assessment
• Confidence and prediction intervals
• Model diagnostics
• Outliers

