
Simple Linear Regression

Material from Devore's book (Ed 8), and Cengagebrain.com
Simple Linear Regression

[Scatterplots of Rating (y-axis, ticks 20–80) versus Sugar (x-axis, ticks 0–15), repeated over three slides with annotations added.]
The Simple Linear Regression Model

The simplest deterministic mathematical relationship between two variables x and y is a linear relationship: y = β₀ + β₁x.

The objective of this section is to develop an equivalent linear probabilistic model.

If the two (random) variables are probabilistically related, then for a fixed value of x, there is uncertainty in the value of the second variable.

So we assume Y = β₀ + β₁x + ε, where ε is a random variable.

Two variables are related linearly "on average" if, for fixed x, the actual value of Y differs from its expected value by a random amount (i.e., there is random error).
A Linear Probabilistic Model

Definition: The Simple Linear Regression Model

There are parameters β₀, β₁, and σ², such that for any fixed value of the independent variable x, the dependent variable is a random variable related to x through the model equation

Y = β₀ + β₁x + ε

The quantity ε in the model equation is the "error": a random variable, assumed to be symmetrically distributed with E(ε) = 0 and V(ε) = σ²_ε = σ² (no assumption made about the distribution of ε, yet).
A Linear Probabilistic Model

X: the independent, predictor, or explanatory variable (usually known). NOT RANDOM.

Y: the dependent or response variable. For fixed x, Y is a random variable.

ε: the random deviation or random error term. For fixed x, ε is a random variable.

What exactly does ε do?
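One way to answer that is to simulate the model. A minimal sketch (the parameter values below are made up for illustration): for a fixed x, repeated draws of Y differ only through the random error ε.

```python
import random

# Hypothetical true parameters, chosen only for illustration.
beta0, beta1 = 5.0, 2.0   # true intercept and slope
sigma = 1.5               # standard deviation of the random error

def simulate_Y(x):
    """One draw from the model Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)."""
    eps = random.gauss(0.0, sigma)
    return beta0 + beta1 * x + eps

# For a fixed x, Y is still random: it scatters about its mean beta0 + beta1*x.
x = 10.0
print("E(Y | x = 10) =", beta0 + beta1 * x)            # 25.0
print("five simulated Y values:",
      [round(simulate_Y(x), 2) for _ in range(5)])
```

Each draw lands near 25 but rarely on it; that gap is exactly the random error ε.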
A Linear Probabilistic Model

The points (x₁, y₁), …, (xₙ, yₙ) resulting from n independent observations will then be scattered about the true regression line:

[Figure: observed points scattered about the true regression line.]
A Linear Probabilistic Model

How do we know simple linear regression is appropriate?

- Theoretical considerations
- Scatterplots
A Linear Probabilistic Model

If we think of an entire population of (x, y) pairs, then μ_Y|x is the mean of all y values for which x equals that fixed value, and σ²_Y|x is a measure of how much these values of y spread out about the mean value.

If, for example, x = age of a child and y = vocabulary size, then μ_Y|5 is the average vocabulary size for all 5-year-old children in the population, and σ²_Y|5 describes the amount of variability in vocabulary size for this part of the population.
A Linear Probabilistic Model

Interpreting parameters:

β₀ (the intercept of the true regression line): the average value of Y when x is zero.

β₁ (the slope of the true regression line): the expected (average) change in Y associated with a 1-unit increase in the value of x.
A Linear Probabilistic Model

What is σ²_Y|x? How do we interpret σ²_Y|x?

Homoscedasticity: we assume the variance (amount of variability) of the distribution of Y values to be the same at each different value of fixed x (i.e., the homogeneity of variance assumption).
When errors are normally distributed…

(a) distribution of ε

(b) distribution of Y for different values of x

The variance parameter σ² determines the extent to which each normal curve spreads out about the regression line.
A Linear Probabilistic Model

When σ² is small, an observed point (x, y) will almost always fall quite close to the true regression line, whereas observations may deviate considerably from their expected values (corresponding to points far from the line) when σ² is large.

Thus, this variance can be used to tell us how good the linear fit is.

But how do we define "good"?
Estimating Model Parameters

The values of β₀, β₁, and σ² will almost never be known to an investigator.

Instead, sample data consist of n observed pairs (x₁, y₁), …, (xₙ, yₙ), from which the model parameters and the true regression line itself can be estimated.

The data (pairs) are assumed to have been obtained independently of one another.
Estimating Model Parameters

Here,

Yᵢ = β₀ + β₁xᵢ + εᵢ for i = 1, 2, …, n,

and the n deviations ε₁, ε₂, …, εₙ are independent r.v.'s. (Y₁, Y₂, …, Yₙ are then independent too. Why?)
Estimating Model Parameters

The "best fit" line is motivated by the principle of least squares, which can be traced back to the German mathematician Gauss (1777–1855):

A line provides the best fit to the data if the sum of the squared vertical distances (deviations) from the observed points to that line is as small as it can be.
Estimating Model Parameters

The sum of squared vertical deviations from the points (x₁, y₁), …, (xₙ, yₙ) to the line y = b₀ + b₁x is then

f(b₀, b₁) = Σᵢ [yᵢ − (b₀ + b₁xᵢ)]²

The point estimates of β₀ and β₁, denoted by β̂₀ and β̂₁, are called the least squares estimates: they are those values that minimize f(b₀, b₁).
Estimating Model Parameters

The fitted regression line or least squares line is then the line whose equation is y = β̂₀ + β̂₁x.

The minimizing values of b₀ and b₁ are found by taking partial derivatives of f(b₀, b₁) with respect to both b₀ and b₁, equating them both to zero [analogously to f′(b) = 0 in univariate calculus], and solving the equations

∂f/∂b₀ = −2 Σ [yᵢ − (b₀ + b₁xᵢ)] = 0
∂f/∂b₁ = −2 Σ xᵢ[yᵢ − (b₀ + b₁xᵢ)] = 0
Estimating Model Parameters

The least squares estimate of the slope coefficient β₁ of the true regression line is

β̂₁ = b₁ = Sxy / Sxx

Shortcut formulas for the numerator and denominator of β̂₁ are

Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n and Sxx = Σxᵢ² − (Σxᵢ)²/n

(Typically, columns for xᵢ, yᵢ, xᵢyᵢ, and xᵢ² are constructed, and then Sxy and Sxx are calculated.)
Estimating Model Parameters

The least squares estimate of the intercept β₀ of the true regression line is

β̂₀ = b₀ = ȳ − β̂₁x̄ = (Σyᵢ − β̂₁Σxᵢ)/n

The computational formulas for Sxy and Sxx require only the summary statistics Σxᵢ, Σyᵢ, Σxᵢ², and Σxᵢyᵢ. (Σyᵢ² will be needed shortly for the variance.)
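The shortcut formulas map directly to code. A minimal sketch in plain Python (the function name and the tiny data set are ours, for illustration only):

```python
def least_squares(xs, ys):
    """Slope and intercept estimates via the shortcut formulas
    Sxy = sum(x*y) - sum(x)*sum(y)/n and Sxx = sum(x^2) - sum(x)^2/n."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    Sxy = sum_xy - sum_x * sum_y / n
    Sxx = sum_x2 - sum_x ** 2 / n
    b1 = Sxy / Sxx                      # slope: b1 = Sxy / Sxx
    b0 = sum_y / n - b1 * sum_x / n     # intercept: b0 = y-bar - b1 * x-bar
    return b0, b1

# Tiny made-up data set, just to exercise the formulas.
b0, b1 = least_squares([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
print(f"fitted line: y = {b0:.3f} + {b1:.3f} x")   # y = 0.150 + 1.940 x
```

The columns the slide mentions (xᵢ, yᵢ, xᵢyᵢ, xᵢ²) are exactly the four running sums the function accumulates.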
Example (fitted regression line)

The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine.

Determination of this number for a biodiesel fuel is expensive and time-consuming.

The article "Relating the Cetane Number of Biodiesel Fuels to Their Fatty Acid Composition: A Critical Study" (J. of Automobile Engr., 2009: 565–583) included the following data on x = iodine value (g) and y = cetane number for a sample of 14 biofuels (see next slide).
Example (fitted regression line), cont'd

The iodine value (x) is the amount of iodine necessary to saturate a sample of 100 g of oil. The article's authors fit the simple linear regression model to these data, so let's do the same.

Calculating the relevant statistics gives

Σxᵢ = 1307.5, Σyᵢ = 779.2, Σxᵢ² = 128,913.93, Σxᵢyᵢ = 71,347.30,

from which Sxx = 128,913.93 − (1307.5)²/14 = 6802.7693

and Sxy = 71,347.30 − (1307.5)(779.2)/14 = −1424.41429
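As a quick check, plugging the slide's four summary sums into the formulas reproduces Sxx, Sxy, and the estimated coefficients:

```python
# Summary statistics from the slide (n = 14 biofuels).
n = 14
sum_x, sum_y = 1307.5, 779.2
sum_x2, sum_xy = 128913.93, 71347.30

Sxx = sum_x2 - sum_x ** 2 / n        # 6802.7693, matching the slide
Sxy = sum_xy - sum_x * sum_y / n     # -1424.41429, matching the slide

b1 = Sxy / Sxx                       # approx -0.20939
b0 = sum_y / n - b1 * sum_x / n      # approx 75.212

print(f"Sxx = {Sxx:.4f}, Sxy = {Sxy:.5f}")
print(f"fitted line: cetane = {b0:.3f} + ({b1:.5f}) * iodine")
```

So the estimated line is ŷ ≈ 75.212 − 0.2094x: each additional gram of iodine value is associated with a drop of about 0.21 in predicted cetane number.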
Example (fitted regression line), cont'd

[Scatter plot with the least squares line superimposed.]
Fitted Values

Fitted values:
The fitted (or predicted) values ŷ₁, …, ŷₙ are obtained by substituting x₁, …, xₙ into the equation of the estimated regression line: ŷᵢ = β̂₀ + β̂₁xᵢ.

Residuals:
The differences yᵢ − ŷᵢ between the observed and fitted y values.

Residuals are estimates of the true error. WHY?
Sum of the residuals

When the estimated regression line is obtained via the principle of least squares, the sum of the residuals should in theory be zero, if the error distribution is symmetric, since

Σᵢ (yᵢ − ŷᵢ) = 0
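This is easy to verify numerically. A sketch with made-up data (the helper function is ours): fit the line, compute the residuals, and sum them.

```python
def fit_and_residuals(xs, ys):
    """Least squares fit, returning the fitted values and residuals."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    Sxx = sum((x - xbar) ** 2 for x in xs)
    Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = Sxy / Sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * x for x in xs]                  # y-hat_i
    residuals = [y - f for y, f in zip(ys, fitted)]     # y_i - y-hat_i
    return fitted, residuals

_, residuals = fit_and_residuals([1.0, 2.0, 3.0, 4.0, 5.0],
                                 [1.9, 4.2, 5.8, 8.3, 9.6])
print("residuals:", [round(e, 2) for e in residuals])
print("sum of residuals:", round(sum(residuals), 12))   # 0.0 up to rounding
```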
Example (fitted values)

Suppose we have the following data on filtration rate (x) versus moisture content (y):

[Data table: 20 (x, y) pairs.]

Relevant summary quantities (summary statistics) are

Σxᵢ = 2817.9, Σyᵢ = 1574.8, Σxᵢ² = 415,949.85, Σxᵢyᵢ = 222,657.88, and Σyᵢ² = 124,039.58,

from which Sxx = 18,921.8295 and Sxy = 776.434.

Calculation of residuals?
Example (fitted values), cont'd

All predicted values (fits) and residuals appear in the accompanying table.

[Table of fitted values and residuals.]
Fitted Values

We interpret the fitted value ŷᵢ as the value of y that we would predict or expect when using the estimated regression line with x = xᵢ; thus ŷᵢ is the estimated true mean for that population when x = xᵢ (based on the data).

The residual is a positive number if the point lies above the line and a negative number if it lies below the line; the point (xᵢ, ŷᵢ) falls on the line itself.

The residual can be thought of as a measure of deviation, and we can summarize the notation in the following way:

Yᵢ = β̂₀ + β̂₁xᵢ + ε̂ᵢ = Ŷᵢ + ε̂ᵢ  ⟹  ε̂ᵢ = Yᵢ − Ŷᵢ
Residual Plots

Revenue = 2.7 × Temperature − 35

Residual = Observed − Predicted

Temperature (Celsius) | Revenue (Observed) | Revenue (Predicted) | Residual (Observed − Predicted)
28.2                  | $44                | $41                 | $3
21.4                  | $23                | $23                 | $0
32.9                  | $43                | $54                 | −$11
24.0                  | $30                | $29                 | $1
etc.                  | etc.               | etc.                | etc.
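The predicted and residual columns follow mechanically from the stated line; a few lines of code reproduce them (up to the table's whole-dollar rounding):

```python
# Fitted line from the slide: Revenue = 2.7 * Temperature - 35.
def predict(temp_celsius):
    return 2.7 * temp_celsius - 35

observations = [(28.2, 44), (21.4, 23), (32.9, 43), (24.0, 30)]  # (Celsius, $)
for temp, observed in observations:
    predicted = predict(temp)
    residual = observed - predicted      # Residual = Observed - Predicted
    print(f"{temp:5.1f} C   observed ${observed}   "
          f"predicted ${predicted:.2f}   residual ${residual:+.2f}")
```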


Residual Plots (contd.)

The same regression run on two different lemonade stands: one where the model is very accurate, one where the model is not.

[Residual plots for the two stands.]

Residual Plots (contd.)
Ideally, residual plots look like these, i.e.:
1. They're pretty symmetrically distributed, tending to cluster towards the middle of the plot.
2. They're clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150).
3. In general, there aren't any clear patterns.
Residual Plots (contd.)

Some not-so-ideal residual plots:
Example Residual Plots and Their Diagnoses: Y-Axis Imbalanced

Some exceptionally high values of Y for normal values of X.
Example Residual Plots and Their Diagnoses: Heteroscedasticity

Heteroscedasticity means that the residuals get larger as the prediction moves from small to large (or from large to small).
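Heteroscedasticity is easy to manufacture in a simulation, which also shows what a residual plot would reveal. A sketch (all numbers made up): the error spread grows with x, so residuals in the high-x half come out markedly larger.

```python
import random

random.seed(1)

# Simulate y = 10 + 2x + eps, where the spread of eps grows with x.
xs = [i / 2 for i in range(2, 42)]                      # x from 1.0 to 20.5
ys = [10 + 2 * x + random.gauss(0, 0.3 * x) for x in xs]

# Ordinary least squares fit.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar
res = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Compare the typical residual size in the low-x and high-x halves.
half = n // 2
print("mean |residual|, small x:", round(sum(map(abs, res[:half])) / half, 2))
print("mean |residual|, large x:", round(sum(map(abs, res[half:])) / (n - half), 2))
# The second number is typically much larger: the fanning-out pattern
# a residual plot makes visible.
```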
Example Residual Plots and Their Diagnoses: Nonlinear

A nonlinear pattern means your model doesn't accurately represent the relationship between "Temperature" and "Revenue."
Example Residual Plots and Their Diagnoses: Outliers

• If it is a data entry error, the outlier is just wrong: delete it.
• If it is a legitimate outlier, assess the impact of the outlier.
Outliers
Data points that diverge in a big way from the overall
pattern are called outliers. There are four ways that a data
point might be considered an outlier.
• It could have an extreme X value compared to other
data points.
• It could have an extreme Y value compared to other
data points.
• It could have extreme X and Y values.
• It might be distant from the rest of the data, even
without extreme X or Y values.
Outliers (contd.)

Each type of outlier is depicted graphically in the scatterplots below.

[Four scatterplots, one per type of outlier.]
Influential Points
An influential point is an outlier that greatly affects the
slope of the regression line. One way to test the influence
of an outlier is to compute the regression equation with
and without the outlier.
Influential Points (contd.)

This type of analysis is illustrated below. The scatterplots are identical, except that one plot includes an outlier. When the outlier is present, the slope is flatter (−4.10 vs. −3.32); so this outlier would be considered an influential point.
Influential Points (contd.)
Here, one chart has a single outlier, located at the high
end of the X axis (where x = 24). As a result of that single
outlier, the slope of the regression line changes greatly,
from -2.5 to -1.6; so the outlier would be considered an
influential point.
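The compute-with-and-without test scripts naturally. A sketch with made-up points (not the chart's actual data) that mimics a single outlier at the high end of the x axis:

```python
def slope(points):
    """Least squares slope for a list of (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    Sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    Sxx = sum((x - xbar) ** 2 for x, _ in points)
    return Sxy / Sxx

# Made-up data falling near a steep negative line...
data = [(1, 98), (3, 92), (5, 88), (7, 81), (9, 77), (11, 70)]
# ...plus one outlier at the high end of the x axis.
outlier = (24, 85)

print("slope without outlier:", round(slope(data), 2))              # -2.74
print("slope with outlier:   ", round(slope(data + [outlier]), 2))  # -0.52
# The large change in slope marks the added point as influential.
```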
