Lecture 8: Regression and Correlation

The document covers the concepts of regression and correlation within the context of probability and statistics, focusing on the relationship between dependent and independent variables. It explains linear regression, least squares estimation, and the calculation of regression coefficients, along with examples and assumptions related to these methods. Additionally, it discusses the coefficient of determination (R²) as a measure of explained variation in the dependent variable by the independent variable.

IS141: Probability and Statistics

Regression and Correlation



Outline

1. Regression
   - Introduction
   - Regression Line
   - Least Squares Estimation

2. Correlation
   - Correlation Coefficient



Introduction

Many engineering and scientific problems are concerned with determining a relationship between a set of variables.

For instance, in a chemical process, we might be interested in the relationship between the output of the process, the temperature at which it occurs, and the amount of catalyst employed.

Knowledge of such a relationship would enable us to predict the output for various values of temperature and amount of catalyst.

In many situations, there is a single response variable Y (the dependent variable), which depends on the values of a set of input variables (also called independent variables) x1, x2, . . . , xr.

The simplest type of relationship between the dependent variable Y and the input variables x1, x2, . . . , xr is a linear relationship.
Regression Line

That is, for some constants β0, β1, β2, . . . , βr, the equation

Y = β0 + β1 x1 + β2 x2 + · · · + βr xr        (1)

would hold.

If this were the relationship between Y and the xi, i = 1, 2, . . . , r, then it would be possible (once the βi were learned) to predict the response exactly for any set of input values.

However, in practice such precision is almost never attainable, and the most that one can expect is that Equation (1) would be valid subject to random error. By this we mean that the explicit relationship is

Y = β0 + β1 x1 + β2 x2 + · · · + βr xr + e        (2)

where e represents the random error, assumed to be a random variable having mean 0.

Another way of expressing Equation (2) is

E[Y | x] = β0 + β1 x1 + β2 x2 + · · · + βr xr        (3)

where x = (x1, . . . , xr) is the set of independent variables, and E[Y | x] is the expected response given the inputs x.

Equation (2) is called a linear regression equation. It describes the regression of Y on the set of independent variables x1, x2, . . . , xr.

The quantities β0, β1, β2, . . . , βr are called the regression coefficients, and must usually be estimated from a set of data.

A regression equation containing a single independent variable (r = 1) is called a simple regression equation, whereas one containing many independent variables is called a multiple regression equation.


Thus, a simple linear regression model gives a linear relationship between the mean response and the value of a single independent variable. It can be expressed as

Y = β0 + β1 x + e        (4)

where x is the value of the independent variable (also called the input level), Y is the response, and e represents the random error, i.e. a random variable having mean 0.

Example
Consider the following 10 data pairs (xi, yi), i = 1, . . . , 10, relating y, the percent yield of a laboratory experiment, to x, the temperature at which the experiment was run.

i    1    2    3    4    5    6    7    8    9    10
xi   100  110  120  130  140  150  160  170  180  190
yi   45   52   54   63   62   68   75   76   92   88


A plot of yi versus xi is called a scatter diagram, as given in the figure below.

As this scatter diagram appears to reflect, subject to random error, a linear relation between y and x, it seems that a simple linear regression model would be appropriate.

Figure: Scatter diagram of percent yield y versus temperature x
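The scatter diagram can be reproduced with a minimal Python sketch (assuming matplotlib is available; the variable names are ours):

```python
# Scatter diagram for the yield-versus-temperature example (10 data pairs).
import matplotlib.pyplot as plt

x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]  # temperature
y = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]            # percent yield

plt.scatter(x, y)
plt.xlabel("Temperature x")
plt.ylabel("Percent yield y")
plt.title("Scatter diagram")
plt.show()
```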

Linear Least Squares Assumptions

- Error values (e) are statistically independent.
- Error values are normally distributed for any given value of x.
- The probability distribution of the errors has constant variance.
- The underlying relationship between the x variable and the y variable is linear.

Least Squares Estimation

Estimated Regression Model

The sample regression line ŷ = β̂0 + β̂1 x provides an estimate of the population regression line Y = β0 + β1 x + e,

where ŷ is the estimated/predicted y value, β̂0 is an estimate of the regression intercept, and β̂1 is an estimate of the regression slope.

The individual random error terms ei have mean zero and variance σ², i.e. ei ∼ N(0, σ²).

Least Squares Estimation

The estimates β̂0 and β̂1 are obtained by minimizing the sum of the squared residuals/errors. That is, we minimize the sum of the squared vertical distances of each point from the fitted line.

Figure: Fitted regression line with the vertical deviations (residuals) of the data points from the line

This vertical distance of a point from the fitted line is called a residual. The residual for observation i is denoted ei and defined by ei = yi − ŷi.

So, in least squares estimation, we wish to minimize the sum of the squared residuals (or error sum of squares SSE), i.e.

\[
\sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2 .
\]

Therefore, to minimize

\[
g(\hat{\beta}_0, \hat{\beta}_1) = \sum_i \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2 ,
\]

we take the derivatives of g with respect to β̂0 and β̂1, set them equal to zero, and solve.

\[
\frac{\partial g}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}\left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right) = 0
\quad\text{and}\quad
\frac{\partial g}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n}\left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)x_i = 0 .
\]

Simplifying the above equations gives

\[
n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i
\quad\text{and}\quad
\hat{\beta}_0\sum_{i=1}^{n} x_i + \hat{\beta}_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i .
\]

It follows that,


\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
= \frac{S_{xy}}{S_{xx}}
\quad\text{and}\quad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} .
\]

The point (x̄, ȳ) will always be on the least squares line.
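As an illustration, a minimal Python sketch of these closed-form estimates (a plain implementation of the formulas above; the function name is ours), applied to the yield-versus-temperature data from the earlier example:

```python
# Closed-form least squares estimates for simple linear regression:
# beta1_hat = Sxy / Sxx and beta0_hat = ybar - beta1_hat * xbar.

def least_squares_fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# Yield-versus-temperature data from the earlier example.
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

b0, b1 = least_squares_fit(x, y)
print(f"fitted line: y_hat = {b0:.3f} + {b1:.3f} x")

# The fitted line passes through the point (xbar, ybar).
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
assert abs((b0 + b1 * xbar) - ybar) < 1e-9
```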

Example
In this table y is the purity of oxygen produced in a chemical distillation
process, and x is the percentage of hydrocarbons that are present in the
main condenser of the distillation unit. Fit a simple linear regression model
to the oxygen purity data given below:


Hydrocarbon level (x %)    Purity (y %)


0.99 90.01
1.02 89.05
1.15 91.43
1.29 93.74
1.46 96.73
1.36 94.45
0.87 87.59
1.23 91.77
1.55 99.42
1.40 93.65
1.19 93.54
1.15 92.52
0.98 90.56
1.01 89.54
1.11 89.85
1.20 90.39
1.26 93.25
1.32 93.41
1.43 94.98
0.95 87.33

The following quantities are computed:

\[
n = 20, \quad \sum_{i=1}^{20} x_i = 23.92, \quad \sum_{i=1}^{20} y_i = 1843.21, \quad \bar{x} = 1.1960, \quad \bar{y} = 92.1605,
\]
\[
\sum_{i=1}^{20} y_i^2 = 170{,}044.5321, \quad \sum_{i=1}^{20} x_i^2 = 29.2892, \quad \sum_{i=1}^{20} x_i y_i = 2214.6566,
\]
\[
S_{xx} = 0.68088, \qquad S_{xy} = 10.17744 .
\]

Therefore, the least squares estimates of the slope and intercept are

\[
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{10.17744}{0.68088} = 14.94748
\quad\text{and}\quad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 92.1605 - (14.94748)(1.196) = 74.28331 .
\]

The fitted simple linear regression model (with the coefficients reported to three decimal places) is

ŷ = 74.283 + 14.947x.
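The hand computation can be cross-checked numerically, for instance with NumPy's np.polyfit (a degree-1 fit performs ordinary least squares and returns the slope first):

```python
# Cross-check of the oxygen purity fit using NumPy's least squares polynomial fit.
import numpy as np

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

beta1, beta0 = np.polyfit(x, y, deg=1)          # slope first, then intercept
print(f"y_hat = {beta0:.3f} + {beta1:.3f} x")   # expected: 74.283 + 14.947 x
```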

Figure: Scatter plot of oxygen purity y versus hydrocarbon level x, with the fitted regression line ŷ = 74.283 + 14.947x


Estimation of σ²

Another unknown parameter in the regression model is σ², the variance of the error term e.

The residuals ei = yi − ŷi are used to obtain an estimate of σ². The sum of squares of the residuals, often called the error sum of squares, is

\[
SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\]

Substituting ŷi = β̂0 + β̂1 xi into the equation for SSE and simplifying, we get

\[
SS_E = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 - \hat{\beta}_1 S_{xy} = SS_T - \hat{\beta}_1 S_{xy},
\]


where SST = Σ(yi − ȳ)² = Σyi² − nȳ² is the total sum of squares of the response variable y.

It can be shown that the expected value of the error sum of squares is E(SSE) = (n − 2)σ².

Therefore an unbiased estimator of σ² is

\[
\hat{\sigma}^2 = \frac{SS_E}{n-2} = MS_E .
\]

For the above example, the estimate of σ² for the oxygen purity data is

\[
\hat{\sigma}^2 = \frac{SS_E}{n-2} = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2 - \hat{\beta}_1 S_{xy}}{n-2}
= \frac{170{,}044.5321 - 20(92.1605)^2 - (14.94748)(10.17744)}{20 - 2} = 1.18 .
\]
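This arithmetic can be reproduced from the summary quantities alone; a small plain-Python sketch (the variable names are ours):

```python
# Estimate of sigma^2 for the oxygen purity data from the summary quantities.
n = 20
sum_y2 = 170_044.5321   # sum of y_i^2
ybar = 92.1605
beta1 = 14.94748
sxy = 10.17744

sst = sum_y2 - n * ybar**2   # total sum of squares
sse = sst - beta1 * sxy      # error sum of squares
mse = sse / (n - 2)          # unbiased estimate of sigma^2
print(f"SSE = {sse:.4f}, sigma2_hat = MSE = {mse:.2f}")  # MSE ≈ 1.18
```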


Explained and Unexplained Variation

Total variation is made up of two parts: the sum of squares for error and the sum of squares for regression, i.e.

\[
SS_T = SS_E + SS_R,
\]

where SST = Σ(y − ȳ)², SSE = Σ(y − ŷ)², and SSR = Σ(ŷ − ȳ)².

- SST measures the variation of the yi values around their mean ȳ.
- SSE measures the variation attributable to factors other than the relationship between x and y.
- SSR measures the explained variation attributable to the relationship between x and y.

Figure: Decomposition of the total variation into explained and unexplained parts

Coefficient of Determination, R²

The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called R-squared and is denoted R²:

\[
R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}, \qquad 0 \le R^2 \le 1 .
\]

Note: in the single-independent-variable case,

R² = r²,

where R² is the coefficient of determination and r is the simple correlation coefficient.
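For the oxygen purity example, R² follows directly from the quantities computed earlier; a brief plain-Python sketch under the same assumptions:

```python
# Coefficient of determination for the oxygen purity example.
n, sum_y2, ybar = 20, 170_044.5321, 92.1605
beta1, sxy = 14.94748, 10.17744

sst = sum_y2 - n * ybar**2   # total variation
ssr = beta1 * sxy            # explained (regression) sum of squares
r2 = ssr / sst               # equivalently 1 - SSE/SST
print(f"R^2 = {r2:.3f}")     # ≈ 0.877: about 88% of the variation in purity explained
```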
Figures: Examples of approximate R² values


Correlation

- A scatter plot (or scatter diagram) is used to show the relationship between two variables.
- Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
- It is concerned only with the strength of the relationship.
- No causal effect is implied.

Figures: Scatter plots and the nature of the relationship


Correlation Coefficient

The population correlation coefficient ρ (rho) measures the strength of the association between the variables.

The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations.

Features of ρ and r
- Unit free
- Range between −1 and 1
- The closer to −1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker the linear relationship

Figure: Examples of approximate r values

Calculating the Correlation Coefficient

Sample correlation coefficient:

\[
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n}(y_i - \bar{y})^2\right]}},
\]

which is algebraically equivalent to

\[
r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}} .
\]
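A short plain-Python sketch checking that the two forms agree (the function names are ours):

```python
import math

def corr_deviation_form(x, y):
    # r = sum((xi - xbar)(yi - ybar)) / sqrt(sum((xi - xbar)^2) * sum((yi - ybar)^2))
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - xbar) ** 2 for xi in x) *
                    sum((yi - ybar) ** 2 for yi in y))
    return num / den

def corr_computational_form(x, y):
    # Algebraically equivalent form using raw sums only.
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# The two forms agree on the yield-versus-temperature data from earlier.
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]
assert abs(corr_deviation_form(x, y) - corr_computational_form(x, y)) < 1e-12
```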


Example
Using the following data on trunk diameter and tree height for n = 8 trees, calculate the correlation coefficient r.


\[
r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}
{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}
= \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}}
= 0.886 .
\]

r = 0.886 → relatively strong positive linear association between x and y .
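The arithmetic can be verified from the given sums alone; a minimal sketch:

```python
import math

# Summary sums from the example: n = 8 observations.
n, sum_xy, sum_x, sum_y, sum_x2, sum_y2 = 8, 3142, 73, 321, 713, 14111

num = n * sum_xy - sum_x * sum_y              # 8*3142 - 73*321 = 1703
den = math.sqrt((n * sum_x2 - sum_x ** 2) *   # 8*713 - 73^2 = 375
                (n * sum_y2 - sum_y ** 2))    # 8*14111 - 321^2 = 9847
print(f"r = {num / den:.3f}")                 # r = 0.886
```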


Figure: Scatter plot of trunk diameter versus tree height
