REGRESSION (PART I)
Weihua Zhou
Department of Mathematics and Statistics
Introduction
So far we have done statistics on one variable at a time. We are now interested in relationships between two variables and in how to use one variable to predict another.
Does weight depend on height?
Does blood pressure level predict life expectancy?
Do SAT scores predict college performance?
Does taking a Statistics course make you a better person?
Dependent and Independent Variables
Most statistical studies examine data on more than one
variable. In many of these settings, the two variables
play different roles.
Definition:
A dependent (response) variable measures an outcome
of a study. An independent (predictor) variable may
help explain or influence changes in a response variable.
Outlier
There is one possible outlier: the hiker with a body weight of 187 pounds seems to be carrying relatively less weight than the other group members.
[Scatterplot examples: a curvilinear relationship and no relationship.]
The Correlation Coefficient
The strength and direction of the relationship between x and y
are measured using the correlation coefficient (Pearson
product moment coefficient of correlation), r.
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$$
where
$$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
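As a quick illustration, here is a small Python sketch of this shortcut computation. The function name and the sample data are made up for this example, not taken from the lecture.

```python
import math

def correlation(x, y):
    """Pearson correlation r computed from the shortcut sums of squares."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(xi**2 for xi in x) - sum_x**2 / n
    ss_yy = sum(yi**2 for yi in y) - sum_y**2 / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Illustrative data (not from the lecture)
heights = [67, 69, 70, 72, 74]
weights = [155, 165, 180, 190, 205]
print(correlation(heights, weights))  # close to +1: strong positive linear association
```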
Example
[Scatterplot: weight in pounds (y-axis, roughly 150 to 210) versus height in inches (x-axis, 66 to 75).]
r = .8261: strong positive correlation. As the player's height increases, so does his weight.
Interpreting r
• $-1 \le r \le 1$. The sign of r indicates the direction of the linear relationship.
Suppose an experiment is conducted to study the relationship between the percentage of a certain drug in the bloodstream (x) and the length of time it takes to react to a stimulus (y). The results are below.
$$\sum x^2 = 55, \qquad \sum y^2 = 26$$
Find the correlation coefficient and explain it in the context of the problem.
Probabilistic Model
• Probabilistic model:
y = deterministic model + random error
• Random error represents random fluctuation from the
deterministic model
• The probabilistic model is assumed for the population
• Simple linear regression model:
y = α + βx + ε
• Without the random deviation ε, all observed (x, y) points would fall exactly on the deterministic line. The inclusion of ε in the model equation allows points to deviate from the line by random amounts.
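To make the model concrete, here is a minimal simulation sketch; the parameter values (α = 2, β = 0.5, σ = 1) are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population parameters (assumed for this sketch)
alpha, beta, sigma = 2.0, 0.5, 1.0

x = rng.uniform(0, 10, size=100)
# Random errors: normal, mean 0, constant standard deviation, drawn independently
eps = rng.normal(loc=0.0, scale=sigma, size=100)
# Probabilistic model: deterministic line plus random error
y = alpha + beta * x + eps
```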
Basic Assumptions of the Simple
Linear Regression Model
1. The distribution of ε at any particular x value has mean value 0.
2. The standard deviation of ε is the same for any particular value
of x. This standard deviation is denoted by 𝜎.
3. The distribution of ε at any particular x value is normal.
4. The random errors are independent of one another.
The Distribution of y
The figure below shows the regression model when the conditions are met. The line in the figure is the population regression line $\mu_y = \alpha + \beta x$.
In most cases, no line will pass exactly through all the points in a scatterplot. A
good regression line makes the vertical distances of the points from the line
as small as possible.
Definition:
A residual is the difference between an observed value of the response
variable and the value predicted by the regression line. That is,
residual = observed y – predicted y
residual = y − ŷ
[Figure: scatterplot with a regression line; positive residuals lie above the line, negative residuals below it.]
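A small sketch of this calculation, using a hypothetical fitted line and hypothetical data:

```python
import numpy as np

# Hypothetical fitted line and observations (for illustration only)
a, b = 1.0, 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.2, 4.8, 7.1, 8.9])

y_hat = a + b * x        # predicted values from the line
residuals = y - y_hat    # observed minus predicted; positive above the line, negative below
print(residuals)
```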
Least-Squares Regression Line
Different regression lines produce different residuals. The regression
line we want is the one that minimizes the sum of the squared
residuals.
Definition:
The least-squares regression line of y on x is the line that makes the sum of
the squared residuals as small as possible.
Least Squares Regression
The sum of squared errors (residuals) in regression is
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
SSE: sum of squared errors.
The least-squares regression line is the one that minimizes the SSE with respect to the estimates a and b.
[Figure: SSE plotted against a and against b is a parabola; the least-squares estimates a and b sit at its minimum.]
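A brief sketch of how SSE behaves as the slope changes, using made-up data; the values are illustrative only.

```python
import numpy as np

def sse(a, b, x, y):
    """Sum of squared errors for the candidate line y-hat = a + b*x."""
    return float(np.sum((y - (a + b * x)) ** 2))

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# With a fixed at 0, SSE is a parabola in b; the smallest value occurs
# near the least-squares slope (about 2 for these data).
for b in [1.0, 1.5, 2.0, 2.5, 3.0]:
    print(b, sse(0.0, b, x, y))
```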
Sums of Squares, Cross Products, and
Least Squares Estimators
Sums of Squares and Cross Products:
$$SS_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
$$SS_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$SS_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
Least Squares Estimators: the fitted line is
$$\hat{y} = a + bx, \qquad b = \frac{SS_{xy}}{SS_{xx}}, \qquad a = \bar{y} - b\bar{x}$$
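A minimal sketch of these estimators computed from the shortcut sums; the data in the usage line are made up for illustration.

```python
def least_squares_fit(x, y):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x,
    computed from the shortcut sums of squares."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(xi**2 for xi in x) - sum_x**2 / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    b = ss_xy / ss_xx
    a = sum_y / n - b * (sum_x / n)   # a = y-bar - b * x-bar
    return a, b

# Illustrative data (not from the lecture)
a, b = least_squares_fit([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(a, b)
```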
Suppose an experiment is conducted to study the relationship between the percentage of a certain drug in the bloodstream (x) and the length of time it takes to react to a stimulus (y). The results are below.
$$\sum x^2 = 55, \qquad \sum y^2 = 26$$
Definition:
Extrapolation is the use of a regression line for prediction with values of x outside the range of the data. Don't make predictions using values of x that are much larger or much smaller than those that actually appear in your data.
The estimation of $\sigma^2$
Because 𝜎 2 is a population parameter, we will rarely know its true value. The
best we can do is to estimate it!
The mean square error estimates $\sigma^2$:
$$MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{SSE}{d.f.} = \frac{SSE}{n-2}$$
where n is the number of observations and n − 2 is the degrees of freedom (two parameters, α and β, are estimated from the data).
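A short sketch of this estimate; the observed and fitted values below are hypothetical.

```python
import numpy as np

# Hypothetical observed and fitted values (for illustration only)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

n = len(y)
sse = np.sum((y - y_hat) ** 2)
mse = sse / (n - 2)       # estimates sigma^2; 2 df lost to estimating alpha and beta
s = np.sqrt(mse)          # estimates sigma
print(mse, s)
```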
$$SST = SS_{yy} = \sum (y - \bar{y})^2$$
For each observation, the deviation of y from its mean splits into two pieces:
$$(y - \bar{y}) = (y - \hat{y}) + (\hat{y} - \bar{y})$$
Total Deviation = Unexplained Deviation (Error) + Explained Deviation (Regression)
[Figure: a data point, the fitted value $\hat{y}$, and the mean $\bar{y}$, with the total, unexplained, and explained deviations marked.]
Squaring and summing over all observations gives
$$\sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$$
$$SST = SSE + SSR$$
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
(the percentage of total variation explained by the regression).
$R^2$: Coefficient of Determination
The coefficient of determination, $R^2$, is a descriptive measure of the strength of the regression relationship: a measure of how well the regression line fits the data.
The coefficient of determination is defined as
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
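A small sketch that checks the decomposition SST = SSE + SSR and computes $R^2$ both ways, using made-up data.

```python
import numpy as np

# Hypothetical data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit via the shortcut formulas
n = len(x)
ss_xx = np.sum(x**2) - np.sum(x)**2 / n
ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
b = ss_xy / ss_xx
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation

print(np.isclose(sst, sse + ssr))      # SST = SSE + SSR for the least-squares line
print(ssr / sst, 1 - sse / sst)        # the two equivalent forms of R^2 agree
```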
Suppose an experiment is conducted to study the relationship between the percentage of a certain drug in the bloodstream (x) and the length of time it takes to react to a stimulus (y). The results are below.
$$\sum x^2 = 55, \qquad \sum y^2 = 26$$