
LECTURE 8: SIMPLE LINEAR

REGRESSION (PART I)
Weihua Zhou
Department of Mathematics and Statistics
Introduction
So far we have done statistics on one variable at a time. We are now
interested in relationships between two variables, and in how to use one
variable to predict another.
 Does weight depend on height?
 Does blood pressure level predict life expectancy?
 Do SAT scores predict college performance?
 Does taking a Statistics course make you a better
person?
 Dependent and Independent Variables
Most statistical studies examine data on more than one
variable. In many of these settings, the two variables
play different roles.

Definition:
A dependent (response) variable measures an outcome
of a study. An independent (predictor) variable may
help explain or influence changes in a response variable.

Note: In many studies, the goal is to show that changes in
one or more explanatory variables actually cause changes
in a response variable. However, other predictor-response
relationships don’t involve direct causation.
 Displaying Relationships: Scatterplots
Make a scatterplot of the relationship between body weight and pack
weight. Since body weight is our predictor variable, be sure to place it
on the x-axis!

Body weight (lb)      120 187 109 103 131 165 158 116
Backpack weight (lb)   26  30  26  24  29  35  31  28

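Before reading the direction of a relationship off a plot, it can be checked numerically. The sketch below (plain Python, no plotting library required) pairs the hiker data above and looks at the sign of the sum of centered cross products: a positive sum indicates a positive association.

```python
# Direction check for the body weight / pack weight data above.
body = [120, 187, 109, 103, 131, 165, 158, 116]   # predictor (x-axis)
pack = [26, 30, 26, 24, 29, 35, 31, 28]           # response (y-axis)

n = len(body)
x_bar = sum(body) / n
y_bar = sum(pack) / n

# Sum of centered cross products: positive => positive association.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(body, pack))
print("direction:", "positive" if s_xy > 0 else "negative")
```

The actual scatterplot would be drawn with any plotting tool, with body weight on the horizontal axis as noted above.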

 Interpreting Scatterplots
To interpret a scatterplot, follow the basic strategy of data
analysis from Chapters 1 and 2. Look for patterns and
important departures from those patterns.

How to Examine a Scatterplot


As in any graph of data, look for the overall pattern and
for striking departures from that pattern.
• You can describe the overall pattern of a scatterplot by
the direction, form, and strength of the relationship.
• An important kind of departure is an outlier, an
individual value that falls outside the overall pattern of
the relationship.
 Interpreting Scatterplots
 There is one possible outlier: the hiker with the body weight of
187 pounds seems to be carrying relatively less weight than the
other group members.
 There is a moderately strong, positive, linear relationship
between body weight and pack weight.
 It appears that lighter students are carrying lighter backpacks.
Examples

[Four example scatterplots: positive linear (strong); negative linear
(weak); curvilinear; no relationship]
The Correlation Coefficient
The strength and direction of the relationship between x and y
are measured using the correlation coefficient (Pearson
product moment coefficient of correlation), r:

    r = S_xy / sqrt(S_xx · S_yy),   where

    S_xx = Σx² − (Σx)²/n
    S_yy = Σy² − (Σy)²/n
    S_xy = Σxy − (Σx)(Σy)/n
Example

The table shows the heights and weights of n = 10 randomly selected
college football players.

Player     1   2   3   4   5   6   7   8   9  10
Height, x  73  71  75  72  72  75  67  69  71  69
Weight, y 185 175 200 210 190 195 150 170 180 175

Use your calculator to find the sums and sums of squares:

    S_xy = 328    S_xx = 60.4    S_yy = 2610

    r = 328 / sqrt((60.4)(2610)) = .8261
Football Players

[Scatterplot of Weight vs Height: r = .8261, a strong positive
correlation. As the player’s height increases, so does his weight.]
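The calculator result above can be reproduced directly from the shortcut formulas. This is a sketch for checking the arithmetic, not part of the required course software:

```python
from math import sqrt

# Correlation coefficient for the football-player data, using the
# shortcut formulas S_xx, S_yy, S_xy from the previous slide.
x = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]            # height
y = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]  # weight
n = len(x)

S_xx = sum(v * v for v in x) - sum(x) ** 2 / n
S_yy = sum(v * v for v in y) - sum(y) ** 2 / n
S_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = S_xy / sqrt(S_xx * S_yy)
print(round(S_xy, 1), round(S_xx, 1), round(S_yy, 1))  # 328.0 60.4 2610.0
print(round(r, 4))  # 0.8261
```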
Interpreting r
• −1 ≤ r ≤ 1        Sign of r indicates the direction of the
                    linear relationship.
• r ≈ 0             No relationship; random scatter of points.
• r near 1 or −1    Strong relationship, either positive or
                    negative.
• r = 1 or −1       All points fall exactly on a straight line.
 Correlation Coefficient Example

Suppose an experiment is conducted to study the relationship between the
percentage of a certain drug in the bloodstream (x) and the length of
time it takes to react to a stimulus (y). The results are below.

Amount of drug (x)  2 1 4 3 5
Reaction time (y)   1 1 2 2 4

    Σx = 15,  Σy = 10,  Σxy = 37,  Σx² = 55,  Σy² = 26

Find the correlation coefficient and explain it in the context of the
problem.
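One way to carry out the computation from the given sums, as a sketch:

```python
from math import sqrt

# Correlation for the drug-reaction example, from the given sums.
n = 5
sum_x, sum_y, sum_xy = 15, 10, 37
sum_x2, sum_y2 = 55, 26

S_xx = sum_x2 - sum_x ** 2 / n     # 55 - 225/5 = 10.0
S_yy = sum_y2 - sum_y ** 2 / n     # 26 - 100/5 =  6.0
S_xy = sum_xy - sum_x * sum_y / n  # 37 - 150/5 =  7.0

r = S_xy / sqrt(S_xx * S_yy)
print(round(r, 4))  # 0.9037
```

An r of about 0.90 indicates a strong positive linear relationship: reaction time tends to increase as the amount of drug increases.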
Probabilistic Model
• Probabilistic model:
y = deterministic model + random error
• Random error represents random fluctuation from the
deterministic model
• The probabilistic model is assumed for the population
• Simple linear regression model:
y = α + βx + ε
• Without the random deviation ε, all observed (x, y) points
would fall exactly on the deterministic line. The inclusion of ε
in the model equation allows points to deviate from the line by
random amounts.
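The role of ε can be seen in a small simulation. The parameter values below (α = 1, β = 2, σ = 0.5) are made up for illustration:

```python
import random

# Simulating the model y = alpha + beta * x + eps, with eps normal
# with mean 0 and standard deviation sigma. Parameter values are
# illustrative only.
random.seed(0)
alpha, beta, sigma = 1.0, 2.0, 0.5

xs = [0.5 * i for i in range(10)]
ys = [alpha + beta * x + random.gauss(0, sigma) for x in xs]

# Without eps every point would lie exactly on the line y = 1 + 2x;
# eps moves each observation off the line by a random amount.
deviations = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
print(len(ys), any(abs(d) > 0 for d in deviations))
```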
Basic Assumptions of the Simple
Linear Regression Model
1. The distribution of ε at any particular x value has mean value 0.
2. The standard deviation of ε is the same for any particular value
of x. This standard deviation is denoted by 𝜎.
3. The distribution of ε at any particular x value is normal.
4. The random errors are independent of one another.
 The Distribution of y
The figure below shows the regression model when the conditions are met.
The line in the figure is the population regression line µy = α + βx.

For each possible value of the explanatory variable x, the mean of the
responses µ(y | x) moves along this line.

The Normal curves show how y will vary when x is held fixed at
different values. All the curves have the same standard deviation σ,
so the variability of y is the same for all values of x.

The value of σ determines whether the points fall close to the
population regression line (small σ) or are widely scattered (large σ).
Data
1. So far we have described the population probabilistic model.
2. Usually the three population parameters, α, β and σ, are unknown.
We need to estimate them from data.
3. Data: n pairs of observations of the independent and dependent
variables
(x1, y1), (x2, y2), …, (xn, yn)
4. Probabilistic model
yi = α + β xi + εi,  i = 1, …, n
where the εi are independent normal with mean 0 and standard deviation σ.
 Residuals

In most cases, no line will pass exactly through all the points in a
scatterplot. A good regression line makes the vertical distances of the
points from the line as small as possible.

Definition:
A residual is the difference between an observed value of the response
variable and the value predicted by the regression line. That is,
residual = observed y – predicted y = y − ŷ
Positive residuals lie above the line; negative residuals lie below it.
 Least-Squares Regression Line
Different regression lines produce different residuals. The regression
line we want is the one that minimizes the sum of the squared
residuals.

Definition:
The least-squares regression line of y on x is the line that makes the sum of
the squared residuals as small as possible.
Least Squares Regression
The sum of squared errors in regression is:

    SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²     SSE: sum of squared errors

The least squares regression line is the one that minimizes the SSE
with respect to the estimates a and b.

[Viewed as a function of a (with b fixed), SSE is a parabola minimized
at the least-squares a; likewise for b.]
Sums of Squares, Cross Products, and
Least Squares Estimators
Sums of squares and cross products:

    SS_xx = Σ(x − x̄)² = Σx² − (Σx)²/n
    SS_yy = Σ(y − ȳ)² = Σy² − (Σy)²/n
    SS_xy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n

Least-squares regression estimators:

    b = SS_xy / SS_xx
    a = ȳ − b x̄
    ŷ = a + b x
Example

Suppose an experiment is conducted to study the relationship between the
percentage of a certain drug in the bloodstream (x) and the length of
time it takes to react to a stimulus (y). The results are below.

Amount of drug (x)  2 1 4 3 5
Reaction time (y)   1 1 2 2 4

    Σx = 15,  Σy = 10,  Σxy = 37,  Σx² = 55,  Σy² = 26

(1) Find the least squares regression line.
(2) Interpret the meaning of the slope in the context of the problem.
(3) Predict the reaction time when the amount of drug is 3.5.
(4) Predict the reaction time when the amount of drug is 0.
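One way to check parts (1), (3) and (4), as a sketch from the given sums:

```python
# Least-squares line for the drug-reaction example.
n = 5
sum_x, sum_y, sum_xy, sum_x2 = 15, 10, 37, 55

S_xx = sum_x2 - sum_x ** 2 / n     # 10.0
S_xy = sum_xy - sum_x * sum_y / n  #  7.0

b = S_xy / S_xx                    # slope = 0.7
a = sum_y / n - b * sum_x / n      # intercept = -0.1
print(round(a, 1), round(b, 1))    # -0.1 0.7  =>  y_hat = -0.1 + 0.7 x

print(round(a + b * 3.5, 2))       # (3) prediction at x = 3.5: 2.35
print(round(a + b * 0.0, 2))       # (4) prediction at x = 0: -0.1
                                   # x = 0 lies outside the observed
                                   # x range (1 to 5): extrapolation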
 Extrapolation
We can use a regression line to predict the response ŷ for a specific
value of the explanatory variable x. The accuracy of the prediction
depends on how much the data scatter about the line.
While we can substitute any value of x into the equation of the
regression line, we must exercise caution in making predictions
outside the observed values of x.

Definition:

Extrapolation is the use of a regression line for prediction far outside


the interval of values of the explanatory variable x used to obtain the
line. Such predictions are often not accurate.

Don’t make predictions using values of x that are much larger or much
smaller than those that actually appear in your data.
The estimation of σ²
 Because σ² is a population parameter, we will rarely know its true
value. The best we can do is to estimate it!
 The mean square error estimates σ²:

    MSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) = SSE / d.f. = SSE / (n − 2)

where

    SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = SS_yy − b · SS_xy
Total Variance and Error Variance

    Total variance: Σ(y − ȳ)² / (n − 1)
    Error variance: Σ(y − ŷ)² / (n − 2)

[Left panel: what you see when looking at the total variation of Y.
Right panel: what you see when looking along the regression line at
the error variance of Y.]
The Analysis of Variance
The total variation in the experiment is measured by the total sum
of squares:

    SST = SS_yy = Σ(y − ȳ)²

The SST is divided into two parts:

SSR (sum of squares for regression): measures the variation
explained by including the independent variable x in the model.
SSE (sum of squares for error): measures the leftover
variation not explained by x.
The Analysis of Variance

For each observation, the deviation from the mean splits into two parts:

    (y − ȳ) = (y − ŷ) + (ŷ − ȳ)
    Total deviation = Unexplained deviation (error)
                      + Explained deviation (regression)

Squaring and summing over all observations:

    Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
    SST = SSE + SSR

    R² = SSR / SST = 1 − SSE / SST    Percentage of total variation
                                      explained by the regression.
R²: Coefficient of Determination
 The coefficient of determination, R², is a descriptive measure of the
strength of the regression relationship, a measure of how well the
regression line fits the data.
The coefficient of determination is defined as

    R² = SSR / SST = 1 − SSE / SST

 R² is the percentage of total variation explained by the regression.
 It is a number between zero and one, and a value close to zero
suggests a poor model.
 A very high value of R² can arise even though the relationship between
the two variables is non-linear. The fit of a model should never simply
be judged from the R² value alone.
 It can be proved that R² = r².
Example

Suppose an experiment is conducted to study the relationship between the
percentage of a certain drug in the bloodstream (x) and the length of
time it takes to react to a stimulus (y). The results are below.

Amount of drug (x)  2 1 4 3 5
Reaction time (y)   1 1 2 2 4

    Σx = 15,  Σy = 10,  Σxy = 37,  Σx² = 55,  Σy² = 26

(1) Find the MSE.
(2) Find the coefficient of determination.
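A sketch of the computation, using SSE = SS_yy − b · SS_xy from the estimation slide:

```python
# MSE and R^2 for the drug-reaction example.
n = 5
sum_x, sum_y, sum_xy = 15, 10, 37
sum_x2, sum_y2 = 55, 26

S_xx = sum_x2 - sum_x ** 2 / n     # 10.0
S_yy = sum_y2 - sum_y ** 2 / n     #  6.0  (this is SST)
S_xy = sum_xy - sum_x * sum_y / n  #  7.0

b = S_xy / S_xx                    # slope = 0.7
SSE = S_yy - b * S_xy              # 6 - 0.7 * 7 = 1.1

MSE = SSE / (n - 2)                # (1) MSE = 1.1 / 3 ~ 0.367
R2 = 1 - SSE / S_yy                # (2) R^2 ~ 0.8167, which matches
print(round(MSE, 3), round(R2, 4)) #     r^2 = (0.9037)^2 from earlier
```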
