MEFall2023 5

The document provides an overview of correlation and regression, focusing on the measurement of the relationship between random variables using Pearson's coefficient of correlation. It explains the concepts of linear correlation, regression models, and the methods for calculating regression coefficients. Additionally, it includes examples and exercises to illustrate the application of these statistical concepts.

Uploaded by Muhammad Ibrahim

Corr & Reg

Probability and Random Variables


The math, the computation, and examples.

Prof. Dr. Asad Ali

Department of Applied Mathematics and Statistics


Institute of Space Technology
Islamabad, Pakistan

Chapter 5: Correlation & Regression

Correlation:
In statistics, "correlation" is a tool which measures the degree, or strength, of the relationship between two or more random variables. Two variables are said to be correlated if a change in one of them is accompanied by a change in the other. Correlation is denoted by 'r' for sample data and by 'ρ' (rho) for population data. There are many types of correlation, such as linear, quadratic, exponential, etc. We will be concerned only with linear correlation. Different types of linear correlation between X and Y are depicted in the following scatter plots.

Given n pairs of observations (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ) taken on two rvs X and Y, their linear correlation is defined as

r = Σ(X − X̄)(Y − Ȳ) / √( Σ(X − X̄)² Σ(Y − Ȳ)² )

This is called Pearson's coefficient of correlation.

For computational purposes (calculator) the above formula can be rewritten as

r = [ΣXY − (ΣX)(ΣY)/n] / √( [ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n] )

Or take 1/n as common from the numerator and denominator and cancel it, to give

r = [n ΣXY − ΣX ΣY] / √( [n ΣX² − (ΣX)²] [n ΣY² − (ΣY)²] )

Use whichever you find easier to remember.
A few properties to remember:
−1 ≤ r ≤ +1, i.e. 0 ≤ |r| ≤ 1.
The magnitude of r indicates the strength of the relationship, whereas the sign indicates the direction of the relationship.
r = −1 indicates a perfect negative linear relationship and r = +1 a perfect positive one. This happens when X and Y are exact linear functions of each other, e.g. X = 2Y.
The correlation coefficient is a symmetric quantity, i.e. if you interchange the places of the two variables it remains the same: r_xy = r_yx.
The correlation coefficient is a pure number, independent of the units of measurement. That is, if X is measured in km and Y in kg, they can still be correlated.
The correlation is independent of changes in origin and scale. For example, if you replace X by U = (X − μ_X)/σ_X and Y by V = (Y − μ_Y)/σ_Y, then r_XY = r_UV.
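The symmetry and invariance properties above are easy to verify numerically. Below is a minimal Python sketch of Pearson's r; the data values are made up purely for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson's linear correlation coefficient of paired samples x and y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # illustrative data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                  # symmetry: r_xy = r_yx
u = [(a - 3.0) / 1.5 for a in x]        # change of origin and scale on X
r_uv = pearson_r(u, y)                  # invariance: r_uv = r_xy
print(round(r_xy, 4))
```

Any change of origin and scale with a positive scale factor, as in the standardization U = (X − μ_X)/σ_X, leaves r unchanged.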

Example 68.
Find the Pearson’s linear correlation coefficient of the following data.
X 2.4 3.4 4.6 3.7 2.2 3.3 4.0 2.1
Y 1.33 2.12 1.80 1.65 2.00 1.76 2.11 1.63
Solution:
The Pearson's linear correlation coefficient is given by

r = [ΣXY − (ΣX)(ΣY)/n] / √( [ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n] )
To get the required quantities we construct the following table.

s.no   X      Y      X²      Y²        XY
1      2.4    1.33   5.76    1.7689    3.192
2      3.4    2.12   11.56   4.4944    7.208
3      4.6    1.80   21.16   3.2400    8.280
4      3.7    1.65   13.69   2.7225    6.105
5      2.2    2.00   4.84    4.0000    4.400
6      3.3    1.76   10.89   3.0976    5.808
7      4.0    2.11   16.00   4.4521    8.440
8      2.1    1.63   4.41    2.6569    3.423
Σ      25.7   14.4   88.31   26.4324   46.856
Now putting the values in the above formula:

r = [46.856 − (25.7)(14.4)/8] / √( [88.31 − (25.7)²/8] [26.4324 − (14.4)²/8] )
  = 0.5960 / √( (5.75)(0.5124) )
  = 0.3473

This indicates a weak correlation between the two variables.
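The arithmetic of Example 68 can be checked with a short script; it evaluates both computational forms of r on the same eight pairs:

```python
import math

X = [2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1]
Y = [1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)
syy = sum(y * y for y in Y)
sxy = sum(x * y for x, y in zip(X, Y))

# Calculator form (divide the sums by n inside each bracket)
r1 = (sxy - sx * sy / n) / math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
# n-multiplied form (the 1/n factors cancelled)
r2 = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r1, 4), round(r2, 4))   # both ≈ 0.3473, matching the slide
```

The two forms are algebraically identical, so they agree to floating-point precision.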


The strength and significance of the correlation
The following general categories give a quick way of interpreting a calculated r value:
0.0 to 0.2: very weak to negligible correlation
0.2 to 0.4: weak, low correlation (not very significant)
0.4 to 0.7: moderate correlation
0.7 to 0.9: strong, high correlation
0.9 to 1.0: very strong correlation
These interpretations apply under both ± signs, since the sign merely indicates the direction of the relationship.


Regression
We often want to predict the values of one variable based on the knowledge of other variable(s).
In general, for this purpose we use certain mathematical models in which one variable depends in
one or more ways on one or more (independent) variables. These mathematical models can either
be deterministic or probabilistic. The deterministic models are those in which for each value of one
(the independent) variable there is a fixed value of the other (dependent) variable. For example,
consider the Celsius-Fahrenheit model

F = 32 + (9/5)C

For C = 37, F always equals 98.6. Obviously, for a given value of C there is a fixed
value of F , so it is a deterministic model. However, in most problems the relationship of variables
is not deterministic. For example, for a given human age, there is no fixed human body weight.
Different people with exactly the same date (and even time) of birth can have different weights.
Thus, the weight here is a random variable as we can’t predict its value for a given age. To predict
the weight corresponding to a given age in the face of uncertainty, we need a probabilistic model.
Regression provides those mathematical models. A simple linear regression model consists of a
linear deterministic model Yi = a + bXi of two variables X and Y plus a random error term ei :
Yᵢ = a + bXᵢ + eᵢ,   i = 1, 2, ..., n

where the constants a and b represent the intercept and slope, respectively, of the resulting regression line. This linear equation models the dependency of Y on X in a probabilistic manner and is called the simple linear regression model. The word simple means that there are only two variables, X and Y.
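The deterministic/probabilistic distinction can be made concrete with a small sketch. The weight-age coefficients and error variance below are invented purely for illustration, not estimates from real data:

```python
import random

def c_to_f(c):
    # Deterministic model: each value of C gives exactly one F, every time.
    return 32 + 9 * c / 5

def weight_given_age(age, a=3.0, b=1.9, sigma=4.0):
    # Hypothetical probabilistic model Y = a + b*X + e with e ~ N(0, sigma^2);
    # a, b and sigma are made-up values used only to illustrate the idea.
    return a + b * age + random.gauss(0, sigma)

print(c_to_f(37))                                  # always 98.6
random.seed(42)
print(weight_given_age(20), weight_given_age(20))  # two different values for the same age
```

Repeated calls to the deterministic model always return the same value; repeated calls to the probabilistic model scatter around the line a + b·age.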

The dependent variable Y is also called the response variable or the regressand. Similarly, the independent variable X is also called the explanatory variable, predictor variable, or regressor. The error term e, also called the residual, is what introduces randomness into this model and is assumed normally distributed with mean zero and variance σ², i.e. e ∼ N(0, σ²). A typical regression line overlaid on the scatter plot of the (X, Y) data points is shown in the following scatter plot.

What we actually need is to estimate a and b from the given values of X and Y to get an estimate of the above linear model, that is, Ŷᵢ = â + b̂Xᵢ. The residual eᵢ = Yᵢ − Ŷᵢ is then the difference between the observed and estimated responses.

Now, how do we calculate â and b̂?
The first thing we need is b̂, which can be calculated as

b̂ = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

This is almost the same formula as that for the correlation, except that in the denominator the term Σ(Y − Ȳ)² and the square root are removed. The computationally convenient forms are

b̂ = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n]   or   b̂ = [n ΣXY − ΣX ΣY] / [n ΣX² − (ΣX)²]

The intercept â can then be calculated easily, as follows:

â = Ȳ − b̂X̄

In reality, the formulas for the coefficients â and b̂ are established using the method of least squares, in which we choose those values of a and b that minimize the sum of the squared differences (the residuals) between the observed responses Yᵢ and the estimated responses Ŷᵢ, that is, Σ(Yᵢ − Ŷᵢ)² = Σe²ᵢ. Put simply, we choose the values of a and b that deviate as little as possible from their true values.
Note: You can also denote the coefficients a and b by α and β and their estimates by α̂ and β̂.

The least squares estimators (LSE) of a and b
As we said before, we choose those values of a and b that give the smallest sum of squared residuals, i.e. that minimize Σe²ᵢ.
The sum of squared residuals is given as

Σe²ᵢ = Σ(Yᵢ − Ŷᵢ)² = Σ(Yᵢ − â − b̂Xᵢ)²   ∵ Ŷᵢ = â + b̂Xᵢ

Taking the derivative with respect to a and equating it to zero gives

(d/da) Σe²ᵢ = 2 Σ(Yᵢ − â − b̂Xᵢ)(−1) = 0

Simplifying, we get

ΣYᵢ = nâ + b̂ ΣXᵢ   (1)

Similarly, differentiating with respect to b and equating to zero gives

(d/db) Σe²ᵢ = 2 Σ(Yᵢ − â − b̂Xᵢ)(−Xᵢ) = 0

Simplifying, we get

ΣXᵢYᵢ = â ΣXᵢ + b̂ ΣXᵢ²   (2)
Solving equations (1) and (2) simultaneously for â and b̂ gives the LSE of a and b:

b̂ = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n]

and

â = [ΣY ΣX² − ΣX ΣXY] / [n ΣX² − (ΣX)²]

But in practice, to estimate a (i.e. to calculate â) we use the following easier formula:

â = Ȳ − b̂X̄

Note: Calculation (or computation) and estimation are two different things. For example, multiplying 7 by 8 is a calculation; there is no specific rule or reference used in calculations. Whereas we estimate a quantity (or a parameter) by using a certain rule. For example, we estimate the population mean (µ) by the sample mean (X̄).
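The least-squares formulas above translate directly into code; a minimal sketch:

```python
def least_squares(X, Y):
    """Least-squares estimates (a_hat, b_hat) for the model Y_i = a + b*X_i + e_i."""
    n = len(X)
    sx, sy = sum(X), sum(Y)
    sxy = sum(x * y for x, y in zip(X, Y))
    sxx = sum(x * x for x in X)
    # Slope from the n-multiplied computational form
    b_hat = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    # Intercept from the easier formula a_hat = Ybar - b_hat * Xbar
    a_hat = sy / n - b_hat * sx / n
    return a_hat, b_hat

# Sanity check on exact data Y = 3 + 2X (all residuals are zero):
print(least_squares([0.0, 1.0, 2.0, 3.0], [3.0, 5.0, 7.0, 9.0]))   # → (3.0, 2.0)
```

On data generated exactly from a line, the estimates recover the true intercept and slope, which is a quick way to test any implementation.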

Example 69.
Fit a linear regression line to the data in Example 68, using Y as the response variable (or regress Y on X).
Solution:
When Y is taken as response, our estimated linear regression model is
Ŷi = â + b̂Xi

Now b̂ is given as:

b̂ = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n]
  = [46.856 − (25.7)(14.4)/8] / [88.31 − (25.7)²/8]
  = 0.1037

Also, X̄ = 3.2125 and Ȳ = 1.8. Therefore,

â = Ȳ − b̂X̄ = 1.8 − (0.1037)(3.2125) = 1.4669

Hence, the estimated regression line is

Ŷi = 1.4669 + 0.1037Xi



Putting the values of the explanatory variable X in the estimated equation we get the estimated
responses Ŷ .
Ŷi 1.7158 1.8195 1.9439 1.8506 1.6950 1.8091 1.8817 1.6847
If we plot the observed response Y and the estimated response Ŷ, both against the explanatory variable X, we get the following graph.

This straight line enables us to interpret how X and Y vary together, and also helps in predicting the values of Y corresponding to any other (missing, past, future) values of X.
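The fit of Example 69, including the row of fitted values, can be reproduced in a few lines:

```python
X = [2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1]
Y = [1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63]
n = len(X)

sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)

b_hat = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)   # slope
a_hat = sy / n - b_hat * sx / n                     # intercept: Ybar - b_hat * Xbar
y_hat = [a_hat + b_hat * x for x in X]              # fitted responses

print(round(a_hat, 4), round(b_hat, 4))             # ≈ 1.4669  0.1037
print([round(v, 4) for v in y_hat])
```

A least-squares fit always makes the residuals sum to zero, which gives a quick built-in sanity check on the computed coefficients.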
Do Exercises 10.14(b) 10.14(c), 10.15(b), 10.6, 10.7, 10.8, 10.9(a and b parts).
Class Quiz:
What is the difference between correlation and regression (google it). Exercises 10.17 and 10.7.